Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4984
Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa (Eds.)
Neural Information Processing 14th International Conference, ICONIP 2007 Kitakyushu, Japan, November 13-16, 2007 Revised Selected Papers, Part I
Volume Editors
Masumi Ishikawa, Hiroyuki Miyamoto, Takeshi Yamakawa
Kyushu Institute of Technology, Department of Brain Science and Engineering
2-4 Hibikino, Wakamatsu, Kitakyushu 808-0196, Japan
E-mail: {ishikawa, miyamo, yamakawa}@brain.kyutech.ac.jp
Kenji Doya
Okinawa Institute of Science and Technology, Initial Research Project
12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
E-mail: [email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3, C.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-69154-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-69154-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12282845 06/3180 543210
Preface
This two-volume set constitutes the post-conference proceedings of the 14th International Conference on Neural Information Processing (ICONIP 2007), held in Kitakyushu, Japan, during November 13–16, 2007. The Asia Pacific Neural Network Assembly (APNNA) was founded in 1993. The first ICONIP was held in 1994 in Seoul, Korea, sponsored by APNNA in collaboration with regional organizations. Since then, ICONIP has consistently provided a prestigious forum for presenting and exchanging ideas on neural networks and related fields. The research fields covered by ICONIP have now expanded to include bioinformatics, brain–machine interfaces, robotics, and computational intelligence. We received 288 ordinary paper submissions and 3 proposals for special organized sessions. Although the average quality of the submitted papers was exceptionally high, only 60% of them were accepted after rigorous reviews, with each paper reviewed by three reviewers. Of the three special organized session proposals, two were accepted. In addition to the ordinary submitted papers, we invited 15 special organized sessions, arranged by leading researchers in emerging fields, to promote the future expansion of neural information processing. ICONIP 2007 was held at the newly established Kitakyushu Science and Research Park in Kitakyushu, Japan. Its theme was “Towards an Integrated Approach to the Brain—Brain-Inspired Engineering and Brain Science,” which emphasizes the need for cross-disciplinary approaches to understanding brain functions and for applying that knowledge for the benefit of society. The conference was jointly sponsored by APNNA, the Japanese Neural Network Society (JNNS), and the 21st Century COE Program at Kyushu Institute of Technology. ICONIP 2007 comprised 1 keynote speech, 5 plenary talks, 4 tutorials, 41 oral sessions, 3 poster sessions, 4 demonstrations, and social events such as the Banquet and the International Music Festival.
In all, 382 researchers registered, and 355 participants from 29 countries joined the conference. Each tutorial attracted about 60 participants on average. Five best paper awards and five student best paper awards were granted to encourage outstanding researchers. To minimize the number of researchers unable to present their excellent work at the conference owing to financial difficulties, we provided travel and accommodation support of up to JPY 150,000 to six researchers and of up to JPY 100,000 to eight students. ICONIP 2007 was held jointly with the 4th BrainIT 2007, organized by the 21st Century COE Program “World of Brain Computing Interwoven out of Animals and Robots,” with the support of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS).
We would like to thank Mitsuo Kawato for his superb keynote speech, and Rajesh P.N. Rao, Frédéric Kaplan, Shin Ishii, Andrew Y. Ng, and Yoshiyuki Kabashima for their stimulating plenary talks. We would also like to thank Sven Buchholz, Eckhard Hitzer, Kanta Tachibana, Jung Wang, Nikhil R. Pal, and Tetsuo Furukawa for their enlightening tutorial lectures. We express our deepest appreciation to all the participants for making the conference truly engaging and fruitful through lively discussions, which we believe will contribute greatly to the future development of neural information processing. We also acknowledge the contributions of all the Committee members for their devoted work, especially Katsumi Tateno for his dedication as Secretary. Last but not least, we give special thanks to Irwin King and his students, Kam Tong Chan and Yi Ling Wong, for providing the submission and reviewing system; Etsuko Futagoishi for her hard secretarial work; Satoshi Sonoh and Shunsuke Sakaguchi for maintaining our conference server; and the many secretaries and graduate students in our department for their diligent work in running the conference.
January 2008
Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa
Organization
Conference Committee Chairs General Chair Organizing Committee Chair Steering Committee Chair Program Co-chairs
Tutorials Chair Exhibitions Chair Publications Chair Publicity Chair Local Arrangements Chair Web Master Secretary
Takeshi Yamakawa (Kyushu Institute of Technology, Japan) Shiro Usui (RIKEN, Japan) Takeshi Yamakawa (Kyushu Institute of Technology, Japan) Masumi Ishikawa (Kyushu Institute of Technology, Japan), Kenji Doya (OIST, Japan) Hirokazu Yokoi (Kyushu Institute of Technology, Japan) Masahiro Nagamatsu (Kyushu Institute of Technology, Japan) Hiroyuki Miyamoto (Kyushu Institute of Technology, Japan) Hideki Nakagawa (Kyushu Institute of Technology, Japan) Satoru Ishizuka (Kyushu Institute of Technology, Japan) Tsutomu Miki (Kyushu Institute of Technology, Japan) Katsumi Tateno (Kyushu Institute of Technology, Japan)
Steering Committee Takeshi Yamakawa, Masumi Ishikawa, Hirokazu Yokoi, Masahiro Nagamatsu, Hiroyuki Miyamoto, Hideki Nakagawa, Satoru Ishizuka, Tsutomu Miki, Katsumi Tateno
Program Committee Masumi Ishikawa, Kenji Doya Track Co-chairs
Track 1: Masato Okada (Tokyo Univ.), Yoko Yamaguchi (RIKEN), Si Wu (Sussex Univ.) Track 2: Koji Kurata (Univ. of Ryukyus), Kazushi Ikeda (Kyoto Univ.), Liqing Zhang (Shanghai Jiaotong Univ.)
Track 3: Yuzo Hirai (Tsukuba Univ.), Yasuharu Koike (Tokyo Institute of Tech.), J.H. Kim (Handong Global Univ., Korea) Track 4: Akira Iwata (Nagoya Institute of Tech.), Noboru Ohnishi (Nagoya Univ.), SeYoung Oh (Postech, Korea) Track 5: Hideki Asoh (AIST), Shin Ishii (Kyoto Univ.), Sung-Bae Cho (Yonsei Univ., Korea)
Advisory Board Shun-ichi Amari (Japan), Sung-Yang Bang (Korea), You-Shou Wu (China), Lei Xu (Hong Kong), Nikola Kasabov (New Zealand), Kunihiko Fukushima (Japan), Tom D. Gedeon (Australia), Soo-Young Lee (Korea), Yixin Zhong (China), Lipo Wang (Singapore), Nikhil R. Pal (India), Chin-Teng Lin (Taiwan), Laiwan Chan (Hong Kong), Jun Wang (Hong Kong), Shuji Yoshizawa (Japan), Minoru Tsukada (Japan), Takashi Nagano (Japan), Shozo Yasui (Japan)
Referees S. Akaho P. Andras T. Aonishi T. Aoyagi T. Asai H. Asoh J. Babic R. Surampudi Bapi A. Kardec Barros J. Cao H. Cateau J-Y. Chang S-B. Cho S. Choi I.F. Chung A.S. Cichocki M. Diesmann K. Doya P. Erdi H. Fujii N. Fukumura W-k. Fung T. Furuhashi A. Garcez T.D. Gedeon
S. Gruen K. Hagiwara M. Hagiwara K. Hamaguchi R.P. Hasegawa H. Hikawa Y. Hirai K. Horio K. Ikeda F. Ishida S. Ishii M. Ishikawa A. Iwata K. Iwata H. Kadone Y. Kamitani N. Kasabov M. Kawamoto C. Kim E. Kim K-J. Kim S. Kimura A. Koenig Y. Koike T. Kondo
S. Koyama J.L. Krichmar H. Kudo T. Kurita S. Kurogi M. Lee J. Liu B-L. Lu N. Masuda N. Matsumoto B. McKay K. Meier H. Miyamoto Y. Miyawaki H. Mochiyama C. Molter T. Morie K. Morita M. Morita Y. Morita N. Murata H. Nakahara Y. Nakamura S. Nakauchi K. Nakayama
K. Niki J. Nishii I. Nishikawa S. Oba T. Ogata S-Y. Oh N. Ohnishi M. Okada H. Okamoto T. Omori T. Omori R. Osu N. R. Pal P. S. Pang G-T. Park J. Peters S. Phillips
Y. Sakaguchi K. Sakai Y. Sakai Y. Sakumura K. Samejima M. Sato N. Sato R. Setiono T. Shibata H. Shouno M. Small M. Sugiyama I. Hong Suh J. Suzuki T. Takenouchi Y. Tanaka I. Tetsunari
N. Ueda S. Usui Y. Wada H. Wagatsuma L. Wang K. Watanabe J. Wu Q. Xiao Y. Yamaguchi K. Yamauchi Z. Yi J. Yoshimoto B.M. Yu B-T. Zhang L. Zhang L. Zhang
Sponsoring Institutions Asia Pacific Neural Network Assembly (APNNA) Japanese Neural Network Society (JNNS) 21st Century COE Program, Kyushu Institute of Technology
Cosponsors RIKEN Brain Science Institute Advanced Telecommunications Research Institute International (ATR) Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT) IEEE CIS Japan Chapter Fuzzy Logic Systems Institute (FLSI)
Table of Contents – Part I
Computational Neuroscience
A Retinal Circuit Model Accounting for Functions of Amacrine Cells . . .
Murat Saglam, Yuki Hayashida, and Nobuki Murayama
1
Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi
7
Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Hatori and Ko Sakai
18
Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mitsuya Soga and Yoshiki Kashimori
27
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jordan H. Boyle, John Bryden, and Netta Cohen
37
Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pei-Kuang Chao, Hsiao-Lung Chan, Tony Wu, Ming-An Lin, and Shih-Tseng Lee
48
Population Coding of Song Element Sequence in the Songbird Brain Nucleus HVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Nishikawa, Masato Okada, and Kazuo Okanoya
54
Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tamami Motomura, Yuki Hayashida, and Nobuki Murayama
64
Region-Based Encoding Method Using Multi-dimensional Gaussians for Networks of Spiking Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lakshmi Narayana Panuku and C. Chandra Sekhar
73
Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kouichi Mitsunaga, Yusuke Totoki, and Takami Matsuo
83
Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings . . . . . . . . . . . . . . . . . . . . . . . .
Akihisa Ichiki and Masatoshi Shiino
93
Spike-Timing Dependent Plasticity in Recurrently Connected Networks with Fixed External Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Matthieu Gilson, David B. Grayden, J. Leo van Hemmen, Doreen A. Thomas, and Anthony N. Burkitt
102
A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer’s Disease Based on EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Justin Dauwels, François Vialatte, and Andrzej Cichocki
112
Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Hong-Ren Su, Michelle Liou, Philip E. Cheng, John A.D. Aston, and Shang-Hong Lai
126
The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Murat Saglam, Kaoru Matsunaga, Yuki Hayashida, Nobuki Murayama, and Ryoji Nakanishi
135
Interactions between Spike-Timing-Dependent Plasticity and Phase Response Curve Lead to Wireless Clustering . . . . . . . . . . . . . . . . . . . . . . . .
Hideyuki Câteau, Katsunori Kitano, and Tomoki Fukai
142
A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Yoko Yamaguchi, Colin Molter, Wu Zhihua, Harshavardhan A. Agashe, and Hiroaki Wagatsuma
151
Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
David Colliaux, Yoko Yamaguchi, Colin Molter, and Hiroaki Wagatsuma
160
Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Fujii, Kazuyuki Aihara, and Ichiro Tsuda
170
Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yongtao Li and Shigetoshi Nara
179
A Generalised Entropy Based Associative Model . . . . . . . . . . . . . . . . . . . . . Masahiro Nakagawa
189
The Detection of an Approaching Sound Source Using Pulsed Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kaname Iwasa, Takeshi Fujisumi, Mauricio Kugler, Susumu Kuroyanagi, Akira Iwata, Mikio Danno, and Masahiro Miyaji
199
Sensitivity and Uniformity in Detecting Motion Artifacts . . . . . . . . . . . . . Wen-Chuang Chou, Michelle Liou, and Hong-Ren Su
209
A Ring Model for the Development of Simple Cells in the Visual Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Hamada and Kazuhiro Okada
219
Learning and Memory
Practical Recurrent Learning (PRL) in the Discrete Time Domain . . . . .
Mohamad Faizal Bin Samsudin, Takeshi Hirose, and Katsunari Shibata
228
Learning of Bayesian Discriminant Functions by a Layered Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshifusa Ito, Cidambi Srinivasan, and Hiroyuki Izumi
238
RNN with a Recurrent Output Layer for Learning of Naturalness . . . . . .
Ján Dolinský and Hideyuki Takagi
248
Using Generalization Error Bounds to Train the Set Covering Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zakria Hussain and John Shawe-Taylor
258
Model of Cue Extraction from Distractors by Active Recall . . . . . . . . . . . . Adam Ponzi
269
PLS Mixture Model for Online Dimension Reduction . . . . . . . . . . . . . . . . . Jiro Hayami and Koichiro Yamauchi
279
Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi Sing Leung, Pui Fai Sum, and Tien-Tsin Wong
289
Fuzzy ARTMAP with Explicit and Implicit Weights . . . . . . . . . . . . . . . . . .
Takeshi Kamio, Kenji Mori, Kunihiko Mitsubori, Chang-Jun Ahn, Hisato Fujisaka, and Kazuhisa Haeiwa
299
Neural Network Model of Forward Shift of CA1 Place Fields Towards Reward Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Adam Ponzi
309
Neural Network Models
A New Constructive Algorithm for Designing and Training Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Md. Abdus Sattar, Md. Monirul Islam, and Kazuyuki Murase
317
Effective Learning with Heterogeneous Neural Networks . . . . . . . . . . . . . .
Lluís A. Belanche-Muñoz
328
Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa
338
Effect of Spatial Attention in Early Vision for the Modulation of the Perception of Border-Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nobuhiko Wagatsuma, Ryohei Shimizu, and Ko Sakai
348
Effectiveness of Scale Free Network to the Performance Improvement of a Morphological Associative Memory without a Kernel Image . . . . . . . Takashi Saeki and Tsutomu Miki
358
Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Cheng-Hung Chuang, Jiun-Wei Liou, Philip E. Cheng, Michelle Liou, and Cheng-Yuan Liou
365
Feature Subset Selection Using Constructive Neural Nets with Minimal Computation by Measuring Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . .
Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase
374
Dynamic Link Matching between Feature Columns for Different Scale and Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Yasuomi D. Sato, Christian Wolff, Philipp Wolfrum, and Christoph von der Malsburg
385
Perturbational Neural Networks for Incremental Learning in Virtual Learning System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eiichi Inohira, Hiromasa Oonishi, and Hirokazu Yokoi
395
Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Peter Tiňo
405
Variable Selection for Multivariate Time Series Prediction with Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Ru Wei
415
Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takaaki Aoki, Kaiichiro Ota, Koji Kurata, and Toshio Aoyagi
426
A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Azusa Iwata, Yoshihisa Shinozawa, and Akito Sakurai
436
Supervised/Unsupervised/Reinforcement Learning
Unbiased Likelihood Backpropagation Learning . . . . . . . . . . . . . . . . . . . . . .
Masashi Sekino and Katsumi Nitta
446
The Local True Weight Decay Recursive Least Square Algorithm . . . . . . Chi Sing Leung, Kwok-Wo Wong, and Yong Xu
456
Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Yamazaki and Sumio Watanabe
466
Using Image Stimuli to Drive fMRI Analysis . . . . . . . . . . . . . . . . . . . . . . . .
David R. Hardoon, Janaina Mourão-Miranda, Michael Brammer, and John Shawe-Taylor
477
Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima
487
Convergence Behavior of Competitive Repetition-Suppression Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Davide Bacciu and Antonina Starita
497
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma
507
An Automatic Speaker Recognition System . . . . . . . . . . . . . . . . . . . . . . . . .
P. Chakraborty, F. Ahmed, Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase
517
Modified Modulated Hebb-Oja Learning Rule: A Method for Biologically Plausible Principal Component Analysis . . . . . . . . . . . . . . . . .
Marko Jankovic, Pablo Martinez, Zhe Chen, and Andrzej Cichocki
527
Statistical Learning Algorithms
Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Katsuyuki Hagiwara
537
A Subspace Method Based on Data Generation Model with Class Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minkook Cho, Dongwoo Yoon, and Hyeyoung Park
547
Hierarchical Feature Extraction for Compact Representation and Classification of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Schubert and Jens Kohlmorgen
556
Principal Component Analysis for Sparse High-Dimensional Data . . . . . . Tapani Raiko, Alexander Ilin, and Juha Karhunen
566
Hierarchical Bayesian Inference of Brain Activity . . . . . . . . . . . . . . . . . . . . . Masa-aki Sato and Taku Yoshioka
576
Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Byron M. Yu, John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani
586
Estimating Internal Variables of a Decision Maker’s Brain: A Model-Based Approach for Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Samejima and Kenji Doya
596
Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomohiro Shibata, Takashi Bando, and Shin Ishii
604
Bayesian System Identification of Molecular Cascades . . . . . . . . . . . . . . . . Junichiro Yoshimoto and Kenji Doya
614
Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shir Li Wang, Chen Change Loy, Chee Peng Lim, Weng Kin Lai, and Kay Sin Tan
625
Extraction of Approximate Independent Components from Large Natural Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshitatsu Matsuda and Kazunori Yamaguchi
635
Local Coordinates Alignment and Its Linearization . . . . . . . . . . . . . . . . . . . Tianhao Zhang, Xuelong Li, Dacheng Tao, and Jie Yang
643
Walking Appearance Manifolds without Falling Off . . . . . . . . . . . . . . . . . .
Nils Einecke, Julian Eggert, Sven Hellbach, and Edgar Körner
653
Inverse-Halftoning for Error Diffusion Based on Statistical Mechanics of the Spin System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yohei Saika
663
Optimization Algorithms
Chaotic Motif Sampler for Motif Discovery Using Statistical Values of Spike Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Takafumi Matsuura and Tohru Ikeguchi
673
A Thermodynamical Search Algorithm for Feature Subset Selection . . . .
Félix F. González and Lluís A. Belanche
683
Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation . . . . . . . . . . . . . Mikio Hasegawa and Ken Umeno
693
Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Xiaolin Hu and Jun Wang
703
Optimization of Parametric Companding Function for an Efficient Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Shin-ichi Maeda and Shin Ishii
713
A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jiann-Der Lee, Chung-Hsien Huang, Li-Chang Liu, Shih-Sen Hsieh, Shuen-Ping Wang, and Shin-Tseng Lee
723
Solution Method Using Correlated Noise for TSP . . . . . . . . . . . . . . . . . . . .
Atsuko Goto and Masaki Kawamura
733
Novel Algorithms
Bayesian Collaborative Predictors for General User Modeling Tasks . . . .
Jun-ichiro Hirayama, Masashi Nakatomi, Takashi Takenouchi, and Shin Ishii
742
Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Shohei Shimizu and Aapo Hyvärinen
752
Efficient Incremental Learning Using Self-Organizing Neural Grove . . . . . Hirotaka Inoue and Hiroyuki Narihisa
762
Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masato Uchida, Yousuke Maehara, and Hiroyuki Shioya
771
Sparse Super Symmetric Tensor Factorization . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Marko Jankovic, Rafal Zdunek, and Shun-ichi Amari
781
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Dacheng Tao, Jimeng Sun, Xindong Wu, Xuelong Li, Jialie Shen, Stephen J. Maybank, and Christos Faloutsos
791
Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Fumikazu Miwakeichi, Pedro A. Valdes-Sosa, Eduardo Aubert-Vazquez, Jorge Bosch Bayard, Jobu Watanabe, Hiroaki Mizuhara, and Yoko Yamaguchi
802
Flexible Component Analysis for Sparse, Smooth, Nonnegative Coding or Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Anh Huy Phan, Rafal Zdunek, and Li-Qing Zhang
811
Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Xu and Yen-Wei Chen
821
Head Pose Estimation Based on Tensor Factorization . . . . . . . . . . . . . . . . . Wenlu Yang, Liqing Zhang, and Wenjun Zhu
831
Kernel Maximum a Posteriori Classification with Error Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Kameyama
841
851
Pattern Discovery for High-Dimensional Binary Datasets . . . . . . . . . . . . .
Václav Snášel, Pavel Moravec, Dušan Húsek, Alexander Frolov, Hana Řezanková, and Pavel Polyakov
861
Expand-and-Reduce Algorithm of Particle Swarm Optimization . . . . . . . . Eiji Miyagawa and Toshimichi Saito
873
Nonlinear Pattern Identification by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture . . . . . . . . . Tadashi Kondo
882
Motor Control and Vision
Coordinated Control of Reaching and Grasping During Prehension Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Masazumi Katayama and Hirokazu Katayama
892
Computer Simulation of Vestibuloocular Reflex Motor Learning Using a Realistic Cerebellar Cortical Neuronal Network Model . . . . . . . . . . . . . .
Kayichiro Inagaki, Yutaka Hirata, Pablo M. Blazquez, and Stephen M. Highstein
902
Reflex Contributions to the Directional Tuning of Arm Stiffness . . . . . . .
Gary Liaw, David W. Franklin, Etienne Burdet, Abdelhamid Kadi-allah, and Mitsuo Kawato
913
Analysis of Variability of Human Reaching Movements Based on the
Similarity Preservation of Arm Trajectories . . . 923
Takashi Oyama, Yoji Uno, and Shigeyuki Hosoe

Directional Properties of Human Hand Force Perception in the
Maintenance of Arm Posture . . . 933
Yoshiyuki Tanaka and Toshio Tsuji

Computational Understanding and Modeling of Filling-In Process at the
Blind Spot . . . 943
Shunji Satoh and Shiro Usui

Biologically Motivated Face Selective Attention Model . . . 953
Woong-Jae Won, Young-Min Jang, Sang-Woo Ban, and Minho Lee

Multi-dimensional Histogram-Based Image Segmentation . . . 963
Daniel Weiler and Julian Eggert

A Framework for Multi-view Gender Classification . . . 973
Jing Li and Bao-Liang Lu

Japanese Hand Sign Recognition System . . . 983
Hirotada Fujimura, Yuuichi Sakai, and Hiroomi Hikawa

An Image Warping Method for Temporal Subtraction Images Employing
Smoothing of Shift Vectors on MDCT Images . . . 993
Yoshinori Itai, Hyoungseop Kim, Seiji Ishikawa,
Shigehiko Katsuragawa, Takayuki Ishida, Ikuo Kawashita,
Kazuo Awai, and Kunio Doi
Conflicting Visual and Proprioceptive Reflex Responses During
Reaching Movements . . . 1002
David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato

An Involuntary Muscular Response Induced by Perceived Visual Errors
in Hand Position . . . 1012
David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato

Independence of Perception and Action for Grasping Positions . . . 1021
Takahiro Fujita, Yoshinobu Maeda, and Masazumi Katayama
Handwritten Character Distinction Method Inspired by Human Vision
Mechanism . . . 1031
Jumpei Koyama, Masahiro Kato, and Akira Hirose

Recent Advances in the Neocognitron . . . 1041
Kunihiko Fukushima

Engineering-Approach Accelerates Computational Understanding of
V1–V2 Neural Properties . . . 1051
Shunji Satoh and Shiro Usui

Recent Studies Around the Neocognitron . . . 1061
Hayaru Shouno

Toward Human Arm Attention and Recognition . . . 1071
Takeharu Yoshizuka, Masaki Shimizu, and Hiroyuki Miyamoto

Projection-Field-Type VLSI Convolutional Neural Networks Using
Merged/Mixed Analog-Digital Approach . . . 1081
Osamu Nomura and Takashi Morie

Optimality of Reaching Movements Based on Energetic Cost under the
Influence of Signal-Dependent Noise . . . 1091
Yoshiaki Taniai and Jun Nishii

Influence of Neural Delay in Sensorimotor Systems on the Control
Performance and Mechanism in Bicycle Riding . . . 1100
Yusuke Azuma and Akira Hirose

Global Localization for the Mobile Robot Based on Natural Number
Recognition in Corridor Environment . . . 1110
Su-Yong An, Jeong-Gwan Kang, Se-Young Oh, and Doo San Baek

A System Model for Real-Time Sensorimotor Processing in Brain . . . 1120
Yutaka Sakaguchi

Perception of Two-Stroke Apparent Motion and Real Motion . . . 1130
Qi Zhang and Ken Mogi

Author Index . . . 1141
A Retinal Circuit Model Accounting for Functions of Amacrine Cells

Murat Saglam, Yuki Hayashida, and Nobuki Murayama

Graduate School of Science and Technology, Kumamoto University,
2-39-1 Kurokami, Kumamoto 860-8555, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. Previous experimental studies on vertebrates found that high-level visual processes such as object segregation and spatio-temporal pattern adaptation already begin at the retinal stage. In these visual functions, diverse subtypes of amacrine cells are believed to play essential roles by processing excitatory and inhibitory signals laterally over a wide region of the retina to shape the ganglion cell responses. Previously, a simple "linear-nonlinear" model was proposed to explain a specific function of the retina; it could capture the spiking behavior of the retinal output, although each class of retinal neurons was largely omitted from it. Here, we present a spatio-temporal computational model based on a response function for each class of retinal neurons and on the anatomical intercellular connections. This model not only reproduces the filtering properties of the outer retina but also realizes higher-order inner-retinal functions, such as the object segregation mechanism mediated by wide-field amacrine cells.

Keywords: Retina, Amacrine Cells, Model, Visual Function.
1 Introduction

The vertebrate retina is far more than a passive visual receptor. It has been reported that many high-level vision tasks, although believed to be performed in the visual cortices of the brain, begin in the retinal circuits [1]. One important task among these is the discrimination of the actual motion of an object from the global motion across the retina. Even for a perfectly stationary scene, eye movements cause retinal image drifts that prevent the retinal circuit from receiving a stationary global input at the background [2]. To handle this problem, retinal circuits distinguish object motions better when their patterns differ from the background motion. The synaptic configuration of diverse types of retinal cells plays an essential role in this function. It has been reported that wide-field polyaxonal amacrine cells can drive an inhibitory process between the surround and the object regions (receptive field) on the retina [1, 2, 4, 5, 6]. These wide-field amacrine cells are known to use inhibitory neurotransmitters such as glycine or GABA [7, 8]. A previous study reported that glycine-mediated wide-field inhibition exists in the salamander retina and proposed a simple "linear-nonlinear" model consisting of a temporal filter and a threshold function [2]. However, that model does not include the details of any retinal neurons

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1–6, 2008. © Springer-Verlag Berlin Heidelberg 2008
accounting for that inhibitory mechanism, although it is capable of predicting the spiking behavior of the retinal output for certain input patterns. On the other hand, temporal models that include the behavior of each class of retinal neurons do exist in the literature [9, 10]. Even though those models provide high temporal resolution, they lack the spatial dimension of retinal processing. Here, we present a spatio-temporal computational model that realizes wide-field inhibition between the object and surround regions via wide-field transient on/off amacrine cells. The model considers the responses of all major retinal neurons in detail.
2 Retinal Model

The retinal model parcels the stimulus input into a line of spatio-temporal computational units. Each unit consists of the main retinal elements that convey information forward (photoreceptors, bipolar and ganglion cells) and laterally (horizontal and amacrine cells). Figure 1 illustrates the organization and the synaptic connections of the retinal neurons.
Fig. 1. Three computational units of the model are depicted. Each unit includes: PR: Photoreceptor, HC: Horizontal Cell, onBC: On Bipolar Cell, offBC: Off Bipolar Cell, onAC: Sustained On Amacrine Cell, offAC: Sustained Off Amacrine Cell, on/offAC: Fast-transient Wide-field On/Off Amacrine Cell, GC: On/Off Ganglion Cell. Excitatory/inhibitory synaptic connections are represented with black/white arrowheads, respectively. Gap junctions between neighboring HCs are indicated by dotted horizontal lines. Wide-field connections are realized between wide-field on/offACs and GCs; double-s symbols point to distant connections between those neurons.
A Retinal Circuit Model Accounting for Functions of Amacrine Cells
3
Each neuron's membrane dynamics is governed by a differential equation (Eq. 1), adapted from push-pull shunting models of retinal neurons [9]:

$$\frac{dv_c(t)}{dt} = -A\,v_c(t) + [B - v_c(t)]\,e(t) - [D + v_c(t)]\,i(t) + \sum_{k=1}^{n} W_{ck}\,v_k(t) \qquad (1)$$
Here v_c(t) stands for the membrane potential of the neuron of interest. A represents the rate of passive membrane decay toward the resting potential in the dark. B and D are the saturation levels for the excitatory input e(t) and the inhibitory input i(t), respectively. These excitatory/inhibitory inputs correspond to the synaptic connections (solid lines in Fig. 1) from different neurons within a computational unit. v_k(t) is the membrane potential of a neuron belonging to another unit that makes a synapse or gap junction onto the neuron of interest, v_c(t). The efficiency of that link is determined by a weight parameter W_ck. In the current model, spatial connectivity is present within horizontal cells as gap junctions (dashed lines in Fig. 1) and between on/off amacrine cells and ganglion cells as a wide-field inhibitory process (thin solid lines in Fig. 1). For the other neurons, W_ck is fixed to zero, since we ignore lateral spatial connections among them. A compressive nonlinearity (Eq. 2) is cascaded before the photoreceptor input stage to account for the limited dynamic range of the neural elements. The photoreceptor is therefore fed by a hyperpolarizing input r(t), the compressed form of the light intensity f(t):
$$r(t) = G\left(\frac{f(t)}{f(t) + I}\right)^{n} \qquad (2)$$
Here G denotes the saturation level of the hyperpolarizing input to the photoreceptor, I represents the light intensity yielding the half-maximum response, and n is a real constant. Although the ganglion cell receptive field size varies among animals, we defined each unit to cover 500 μm, in good accordance with experiments on the salamander [2]. Thirty-two computational units are interconnected in a line, and the W_ck values are determined as a function of the distance between units. The parameter set given in [9] was calibrated to reproduce the temporal dynamics of all neuron classes, and the spatial parameters of the model were selected to match the spatial ganglion cell response profile given in [2]. All differential equations in the model are solved sequentially using the fixed-step (1 ms) Bogacki-Shampine solver of the MATLAB/SIMULINK software package (The MathWorks Inc., Natick, MA).
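The unit dynamics of Eqs. (1) and (2) can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the parameter values (A, B, D, G, I, n) are arbitrary placeholders rather than the calibrated set from [9], a single uncoupled unit is integrated, and a forward-Euler step stands in for the Bogacki-Shampine solver.

```python
def compress(f, G=1.0, I=0.5, n=1.0):
    # Compressive nonlinearity of Eq. (2): r = G * (f / (f + I))**n.
    # G, I, n are illustrative placeholders, not the paper's values.
    return G * (f / (f + I)) ** n

def step_membrane(v, e, i, A=0.1, B=1.0, D=1.0, coupled=(), dt=1e-3):
    # One forward-Euler step of the push-pull shunting equation, Eq. (1).
    # `coupled` holds (W_ck, v_k) pairs for gap-junction/synaptic neighbors.
    dv = -A * v + (B - v) * e - (D + v) * i + sum(w * vk for w, vk in coupled)
    return v + dt * dv

# With constant excitation e0 and no inhibition or coupling, Eq. (1) settles
# at v* = B*e0 / (A + e0), which always stays below the saturation level B.
v = 0.0
e0 = compress(1.0)            # light intensity 1.0 -> compressed drive
for _ in range(200_000):      # 200 time units at dt = 1e-3
    v = step_membrane(v, e=e0, i=0.0)

assert abs(v - 1.0 * e0 / (0.1 + e0)) < 1e-6   # matches the analytic fixed point
assert v < 1.0                                  # shunting bounds v below B
```

The shunting form is what bounds the response: the excitatory drive is multiplied by (B − v), so the potential saturates at B no matter how strong the input, which is the role the compressive nonlinearity and the push-pull terms play in the full model.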
3 Results

First, we confirmed that the responses of each neuron agree with physiological observations [9]. Figure 2 illustrates the responses of all neurons to a 150 ms flash of light stimulating the whole model retina. In the outer retina, the photoreceptor responds with a transient hyperpolarization followed by a plateau, and returns to the resting potential with a small overshoot. The horizontal cell essentially shows a smoothed form of the photoreceptor response, owing to its low-pass filtering property. On- and off-bipolar cells are depolarized at the onset and offset of the flash, respectively. Since
Fig. 2. Responses of retinal neurons (labeled as in Fig. 1) to a 150 ms full-field light flash. Dashed horizontal lines indicate the dark responses (resting potentials) of each neuron. Note that the timings of the on/off responses of the wide-field ACs and of the GC spike-generating potentials (GC gen. pot.) match each other. This coincidence drives the wide-field inhibition.
Fig. 3. GC spike-generating potential responses (top row) of the center unit under incoherent (left column) and coherent (right column) stimulation. Stimulation timing and position are depicted on the x and y axes, respectively (bottom row; white bars indicate the light onset). Under coherent stimulation, the off responses are significantly inhibited and the on responses disappear entirely.
those cells form negative feedback loops with the sustained on- and off-amacrine cells, their responses are more transient than those of photoreceptors and horizontal cells, as expected. Finally, bipolar cells transmit excitatory inputs, and wide-field transient on/off amacrine cells
convey inhibitory inputs to the ganglion cells. Significant inhibition at the ganglion cell level occurs only when the wide-field amacrine cell signal coincides with the excitatory input. Figure 3 demonstrates how the inhibitory process differs when the peripheral (surround) and object regions are stimulated coherently or incoherently. In both cases the object region covers 3 units (750 μm radius) and is stimulated identically. When the surround is stimulated incoherently, the depolarization peaks of the wide-field amacrine cells do not coincide with the ganglion cell peaks, so that spike-generating potentials are evident. However, when the surround region is stimulated coherently, the inhibition from the amacrine cells cancels out large portions of the ganglion cell depolarizations. This leads to maximal inhibition of the spike-generating potentials of the ganglion cells (Fig. 3, right column).
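The coincidence-based cancellation described above can be illustrated with a toy computation, not the full model: `transient` is a made-up response kernel, and the onset times and the inhibitory weight `w` are arbitrary choices, but the logic — wide-field inhibition only suppresses the ganglion cell when its transients line up in time with the excitatory drive — is the one the results describe.

```python
import numpy as np

def transient(t, onset, tau=0.02):
    # Toy transient response to a flash at `onset` (alpha-function shape,
    # peak 1 at onset + tau); an arbitrary kernel, not the model's filters.
    s = t - onset
    return np.where(s > 0, (s / tau) * np.exp(1.0 - s / tau), 0.0)

t = np.linspace(0.0, 1.0, 1000)
object_onsets = [0.2, 0.5, 0.8]                # flashes over the object region

def gc_generator(surround_onsets, w=0.9):
    exc = sum(transient(t, o) for o in object_onsets)    # bipolar-cell drive
    inh = sum(transient(t, o) for o in surround_onsets)  # wide-field AC drive
    return np.maximum(exc - w * inh, 0.0)                # rectified GC potential

coherent = gc_generator(object_onsets)         # surround flashed with the object
incoherent = gc_generator([0.35, 0.65, 0.95])  # surround timing differs

# Coincident AC transients cancel the GC excitation; shifted ones do not.
assert coherent.max() < 0.2 * incoherent.max()
```

The same comparison, run on the full spatio-temporal model, is what Fig. 3 shows: coherent surround stimulation silences the spike-generating potentials, while incoherent stimulation leaves them largely intact.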
4 Discussion

In the current study we reproduced the basic mechanism of an important retinal task: discriminating a moving object from a moving background image. The coherent stimulation (Fig. 2) can be linked to the global motion of the retinal image that takes place when the eye moves. However, when there is a moving object in the
Fig. 4. Relative generating-potential response of the GC as a function of object size. The dashed line represents the model response with the original parameter set. Triangle and square markers indicate data points for the 'wide-field AC blocked' and 'control' cases, respectively. The maximum GC response is observed when the object radius is 250 μm (1-unit stimulation, 2nd data point). For the sake of symmetry, the 3rd data point represents 3-unit stimulation (750 μm radius, as in Fig. 3); similarly, each interval after the 2nd data point corresponds to a 500 μm increment in the object radius. As the object starts to invade the surround region, the GC response decreases. When the weights of the interconnections between wide-field on/off ACs and GCs are set to zero (STR application), the inhibition process is partially disabled (solid line).
scene, its image is reflected on the receptive field as a stimulation pattern different from the global pattern (incoherent stimulation, Fig. 3). Experimental results revealed that blocking glycine-mediated inhibition with strychnine (STR) disables the wide-field process [2]; this glycinergic mechanism can therefore be attributed to the wide-field amacrine cells [7]. In our model, STR application can be realized by setting the synaptic weight parameters between wide-field amacrine cells and ganglion cells to zero. Figure 4 demonstrates how STR application affects the ganglion cell response. As the object invades the background, the ganglion cell response is inhibited as in the control case; STR, however, prevents this inhibition. This behavior of the model is in very good agreement with the experimental results in [2]. Note that the model is flexible enough to fit another wide-field inhibitory process, such as a GABAergic mechanism [8]. Spike generation of the ganglion cells is not implemented in the current model, in order to highlight the role of the wide-field amacrine cells only; a specific spike generator can be cascaded onto the model to reproduce spike responses and highlight further retinal features. Since the model covers the on/off pathways and all major retinal neurons, it can be flexibly adjusted to reproduce other functions. Although we reduced the retina to a line of spatio-temporal computational units, the model was able to reproduce a retinal mechanism. This reduction can be bypassed, and more precise results achieved, by creating a 2-D mesh of spatio-temporal computational units.
References

1. Masland, R.H.: Vision: The retina's fancy tricks. Nature 423(6938), 387–388 (2003)
2. Olveczky, B.P., Baccus, S.A., Meister, M.: Segregation of object and background motion in the retina. Nature 423(6938), 401–408 (2003)
3. Volgyi, B., Xin, D., Amarillo, Y., Bloomfield, S.A.: Morphology and physiology of the polyaxonal amacrine cells in the rabbit retina. J. Comp. Neurol. 440(1), 109–125 (2001)
4. Lin, B., Masland, R.H.: Populations of wide-field amacrine cells in the mouse retina. J. Comp. Neurol. 499(5), 797–809 (2006)
5. Solomon, S.G., Lee, B.B., Sun, H.: Suppressive surrounds and contrast gain in magnocellular pathway retinal ganglion cells of macaque. J. Neurosci. 26(34), 8715–8726 (2006)
6. van Wyk, M., Taylor, W.R., Vaney, D.: Local edge detectors: a substrate for fine spatial vision at low temporal frequencies in rabbit retina. J. Neurosci. 26(51), 250–263 (2006)
7. Hennig, M.H., Funke, K., Worgotter, F.: The influence of different retinal subcircuits on the nonlinearity of ganglion cell behavior. J. Neurosci. 22(19), 8726–8738 (2002)
8. Lukasiewicz, P.D.: Synaptic mechanisms that shape visual signaling at the inner retina. Prog. Brain Res. 147, 205–218 (2005)
9. Thiel, A., Greschner, M., Ammermuller, J.: The temporal structure of transient ON/OFF ganglion cell responses and its relation to intra-retinal processing. J. Comput. Neurosci. 21(2), 131–151 (2006)
10. Gaudiano, P.: Simulations of X and Y retinal ganglion cell behavior with a nonlinear push-pull model of spatiotemporal retinal processing. Vision Res. 34(13), 1767–1784 (1994)
Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model

Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi

Division of Electrical, Electronic and Information Engineering, Graduate School of Engineering, Osaka University, Yamada-Oka 2-1, Suita, Osaka, Japan
[email protected]
Abstract. Many mathematical models of different neurons have been proposed so far; however, the way of modeling Ca2+ regulation mechanisms has not yet been established. We therefore try to construct a physiologically plausible model that contains several regulating systems for intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. In this paper, we seek plausible parameter values by analyzing the global bifurcation structure of our preliminary model.
1 Introduction
Complex information processing in the brain is governed by the electrical activity of neurons. Neurons transmit electrical signals called action potentials to each other for information processing. An action potential takes the form of spiking or bursting and plays an important role in the information processing of the brain. In the visual system, visual signals from the retina are processed by neurons in the primary visual cortex. There are several types of neurons in the visual cortex, and pyramidal cells compose roughly 80% of them. Pyramidal cells are connected to each other and form a complex neuronal circuit. Previous physiological and anatomical studies [1] revealed the fundamental structure of this circuit. However, it is not completely understood how visual signals propagate and function in the neuronal circuit of the visual cortex. To investigate the neuronal circuit, not only physiological experiments but also simulations using a mathematical model of the neuron are necessary. Many mathematical models of neurons have been proposed so far [2]. Although there are various neuron models, the way of modeling the regulating system of intracellular calcium ions (Ca2+) has not yet been established. This regulating system is a very important element, because intracellular Ca2+ plays crucial roles in cellular processes such as hormone and neurotransmitter release, gene transcription, and the regulation of synaptic plasticity. Therefore, it is important to establish a way of modeling the regulating system of intracellular Ca2+.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 7–17, 2008. © Springer-Verlag Berlin Heidelberg 2008
In this paper, we try to construct a model of pyramidal cells based on previous physiological experimental data, focusing especially on the regulating systems of intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. To estimate the values of parameters that cannot be determined by physiological experiments alone, we analyze the global bifurcation structure based on a slow/fast decomposition of the model. We thus demonstrate the usefulness of such nonlinear analyses not only for the analysis of an established model but also for the construction of a model.
2 Cell Model
The well-known Hodgkin-Huxley (HH) equations [3] describe the temporal variation of the membrane potential of neuronal cells. Although there are many neuron models based on the HH equations, the way of modeling the regulating system of intracellular Ca2+ has not been established yet. We therefore construct a pyramidal cell model using data from several physiological experiments [4]-[11]. The model includes a Ca2+ buffer, a Ca2+ pump, and the Na+/Ca2+ exchanger in order to describe the regulating system of intracellular Ca2+ appropriately. The model also includes seven ionic currents through the ionic channels. The equations of the pyramidal cell model are as follows:

$$-C\,\frac{dV}{dt} = I_{\mathrm{total}} - I_{\mathrm{ext}}, \qquad (1a)$$

$$\frac{dy}{dt} = \frac{1}{\tau_y}\,(y_\infty - y), \quad (y = M1, \dots, M6, H1, \dots, H6), \qquad (1b)$$

$$\frac{d[\mathrm{Ca}^{2+}]}{dt} = \frac{-S \cdot I_{\mathrm{Catotal}}}{2F} + k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \qquad (1c)$$

$$\frac{d[\mathrm{Buf}]}{dt} = k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \qquad (1d)$$

$$\frac{d[\mathrm{CaBuf}]}{dt} = -\left(k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}]\right), \qquad (1e)$$
where V is the membrane potential, C is the membrane capacitance, I_total is the sum of all currents through the ionic channels and the Na+/Ca2+ exchanger, and I_ext is the current injected into the cell externally. The variable y denotes the gating variables (M1, ..., M6, H1, ..., H6) of the ionic channels, y_∞ is the steady-state function of y, and τ_y is a time constant. [Ca2+] denotes the intracellular Ca2+ concentration, I_Catotal is the sum of all Ca2+ ionic currents, S is the surface-to-volume ratio, and F is the Faraday constant. [Buf] and [CaBuf] are the concentrations of the unbound and bound buffer, and k− and k+ are the reverse and forward rate constants of the binding reaction, respectively. Details of all equations and parameter values of this model can be found in the Appendix. First we show the simulation results when a certain external stimulus current is injected into the cell model (1). Figure 1 shows an action potential waveform and the change of [Ca2+] when a narrow external stimulus current (length 1 ms,
Fig. 1. Waveforms of [A] the membrane potential and [B] [Ca2+] in the case of a 1 ms narrow pulse injection
Fig. 2. [A] A waveform of the membrane potential under a long pulse injection. [B] A typical waveform of the membrane potential of a pyramidal cell in physiological experiments [12].
density 40 μA/cm2) is injected at t = 4000 ms. The waveforms of the membrane potential and [Ca2+] do not differ qualitatively from the physiological experimental data [1]. In contrast, as shown in Fig. 2A, when a long pulse (length 1000 ms, density 40 μA/cm2) is injected, the membrane potential remains at a resting state after one action potential is generated. Although the membrane potential spikes continuously in the physiological experiment (Fig. 2B), the membrane potential of the model does not show such behavior. In general, it is well known that the membrane potential of a pyramidal cell is resting in the case of no external stimulus, and spiking or bursting when an external stimulus is applied. The aim of this paper is a reconstruction, or a parameter tuning, of the model (1) so that it can reproduce such behavior of the membrane potential.
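As a side check on the buffering part of the model, Eqs. (1c)-(1e) imply two properties that any implementation should satisfy: the total buffer [Buf] + [CaBuf] is conserved, and in the absence of Ca2+ influx the reaction relaxes to the balance k−[CaBuf] = k+[Ca2+][Buf]. A minimal numerical sketch (with illustrative rate constants and initial concentrations, not the Appendix values, and forward Euler in place of the model's solver):

```python
# Illustrative constants, not the Appendix values: k+ in 1/(mM*ms), k- in 1/ms.
k_plus, k_minus = 100.0, 1.0
dt = 0.001   # ms

def step(ca, buf, cabuf, j_ca=0.0):
    # Forward-Euler step of Eqs. (1c)-(1e); j_ca stands in for the
    # -S*I_Catotal/(2F) influx term (zero here: no Ca2+ current flowing).
    flux = k_minus * cabuf - k_plus * ca * buf   # net unbinding rate
    return (ca + dt * (j_ca + flux),
            buf + dt * flux,
            cabuf - dt * flux)

ca, buf, cabuf = 0.001, 0.05, 0.0   # mM
total_buf = buf + cabuf
for _ in range(100_000):            # 100 ms of relaxation
    ca, buf, cabuf = step(ca, buf, cabuf)

# Eqs. (1d) and (1e) are mirror images, so total buffer is conserved ...
assert abs((buf + cabuf) - total_buf) < 1e-9
# ... and without influx the reaction settles at k-[CaBuf] = k+[Ca2+][Buf].
assert abs(k_minus * cabuf - k_plus * ca * buf) < 1e-9
assert ca < 0.001                   # most free Ca2+ ends up buffer-bound
```

Because most of the injected Ca2+ ends up bound, the buffer strongly shapes how fast [Ca2+] — the slow variable of Section 3.4 — can move, which is why the buffering parameters matter for the bifurcation structure.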
3 Bifurcation Analysis
The characteristics of the membrane potential vary with the values of some parameters; we therefore investigate the bifurcation structure of the model (1) to estimate appropriate parameter values. For the bifurcation analysis in this paper, we used the bifurcation analysis software AUTO [13].
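In essence, AUTO traces an equilibrium branch while a parameter varies and monitors the Jacobian eigenvalues for stability changes. The procedure can be sketched on a small stand-in system — the FitzHugh-Nagumo model here, since the full model (1) is too large to reproduce from the paper; the parameter values and tolerances are illustrative:

```python
import numpy as np

def f(z, i_ext, a=0.7, b=0.8, tau=12.5):
    # FitzHugh-Nagumo vector field: a 2-D stand-in for the full model (1).
    v, w = z
    return np.array([v - v**3 / 3.0 - w + i_ext, (v + a - b * w) / tau])

def jacobian(z, i_ext, eps=1e-6):
    # Central-difference Jacobian, built column by column.
    return np.column_stack([
        (f(z + eps * e, i_ext) - f(z - eps * e, i_ext)) / (2.0 * eps)
        for e in np.eye(2)])

def equilibrium(i_ext, z0):
    # Newton iteration on f(z) = 0, seeded from the previous branch point.
    z = np.array(z0, dtype=float)
    for _ in range(50):
        z = z - np.linalg.solve(jacobian(z, i_ext), f(z, i_ext))
    return z

z = np.array([-1.2, -0.6])
stable = []
for i_ext in np.linspace(0.0, 1.0, 101):       # crude parameter continuation
    z = equilibrium(i_ext, z)
    eigs = np.linalg.eigvals(jacobian(z, i_ext))
    stable.append(bool(np.all(eigs.real < 0.0)))

# A stability change along the branch flags a Hopf bifurcation — the kind of
# point the HB labels mark in the one-parameter diagrams below.
assert stable[0] and not stable[-1]
```

AUTO additionally continues the periodic solutions born at such points and detects saddle-node, torus, and double-cycle bifurcations, which this equilibrium-only sweep cannot do; the sketch only conveys the core idea of continuation plus eigenvalue monitoring.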
Fig. 3. One-parameter bifurcation diagram on the parameter Iext . The solid curve denotes stable equilibria of eq. (1).
3.1 The External Stimulation Current Iext
We analyze the bifurcation structure of the model to understand why continuous spiking of the membrane potential is not generated when a long pulse is injected. In order to investigate whether spiking is generated when Iext is increased, we vary the external stimulation current Iext as a bifurcation parameter. We show the one-parameter bifurcation diagram on the parameter Iext (Fig. 3), in which the solid curve denotes the membrane potential at stable equilibria of eq. (1). The one-parameter bifurcation diagram shows the dependence of the membrane potential on the parameter Iext. There is no bifurcation point in Fig. 3. Therefore the stability of the equilibrium point does not change, and thus the membrane potential of the model remains at rest even if Iext is increased. This result means that we cannot reproduce the physiological experimental result of Fig. 2B by varying Iext; we thus have to reconsider the values of those model parameters that cannot be determined by physiological experiments alone.

3.2 The Maximum Conductance of the Ca2+-Dependent Potassium Channel GKCa
The current through the Ca2+-dependent potassium channel is involved in the generation of spiking or bursting of the membrane potential. Therefore, we select the maximum conductance of the Ca2+-dependent potassium channel, GKCa, as a bifurcation parameter and show the one-parameter bifurcation diagram for Iext = 10 (Fig. 4A). There are two saddle-node bifurcation points (SN1, SN2), three Hopf bifurcation points (HB1-HB3), and two torus bifurcation points (TR1, TR2). An unstable periodic solution bifurcating from HB1 changes its stability at the two torus bifurcation points and merges into the equilibrium point at HB3. Only in the range between HB1 and HB3 can the membrane potential oscillate. In order to investigate the dependence of this oscillatory range on the value of Iext, we show the two-parameter bifurcation diagram (Fig. 4B), in which the horizontal and vertical axes denote Iext and GKCa, respectively. The two-parameter bifurcation diagram shows the loci where a specific bifurcation occurs.
Fig. 4. [A] One-parameter bifurcation diagram on the parameter GKCa . The solid and broken curves show stable and unstable equilibria, respectively. The symbols • and ◦ denote the maximum value of V of the stable and unstable periodic solutions, respectively. [B] Two-parameter bifurcation diagram in the (Iext , GKCa )-plane.
In the diagram, the gray-colored area bounded by the HB and SN bifurcation curves corresponds to the range between HB1 and HB3 in Fig. 4A where the periodic solutions appear. As Iext increases, the gray-colored area shrinks gradually and disappears near Iext = 25. This result means that the membrane potential of the model cannot exhibit any oscillations (spontaneous spiking) for large values of Iext, even if we change the value of the parameter GKCa.

3.3 The Maximum Pumping Rate of the Ca2+ Pump Apump
The Ca2+ pump plays an important role in the regulation of intracellular Ca2+. We also investigate the effect of varying Apump, the maximum pumping rate of the Ca2+ pump, on the membrane potential. Figure 5A is the one-parameter bifurcation diagram for Iext = 10. There are two Hopf bifurcation points (HB1, HB2) and four double-cycle bifurcation points (DC1-DC4). A stable periodic solution generated at HB1 changes its stability at the four double-cycle bifurcation points and merges into the equilibrium point at HB2. Similarly to the case of GKCa, we show the two-parameter bifurcation diagram in the plane of the two parameters Iext and Apump (Fig. 5B) in order to examine the dependence of the oscillatory range between HB1 and HB2 on Iext. In Fig. 5B, the gray-colored area, where the membrane potential oscillates, shrinks and disappears as Iext increases. The result shows that, just as for GKCa, the membrane potential of the model cannot exhibit any oscillations (spontaneous spiking) for large values of Iext, even if we change the value of Apump.

3.4 Slow/Fast Decomposition Analysis
In this section, in order to investigate the dynamics of our pyramidal cell model in more detail, we use the slow/fast decomposition analysis [14].
Fig. 5. [A] One-parameter bifurcation diagram on the parameter Apump. [B] Two-parameter bifurcation diagram in the (Iext, Apump)-plane.
A system with multiple time scales can generally be written as follows:

$$\frac{dx}{dt} = f(x, y), \quad x \in \mathbb{R}^n,\ y \in \mathbb{R}^m, \qquad (2a)$$

$$\frac{dy}{dt} = \varepsilon\,g(x, y), \quad 0 < \varepsilon \ll 1. \qquad (2b)$$
Equation (2b) is called the slow subsystem, since the value of y changes slowly, while equation (2a) is called the fast subsystem. The whole of eq. (2) is called the full system. The so-called slow/fast analysis divides the full system into the slow and fast subsystems. In the fast subsystem (2a), the slow variable y is considered a constant, or a parameter. The variable x changes more quickly than y, and thus x is considered to stay close to the attractor (stable equilibrium points, limit cycle, etc.) of the fast subsystem for a fixed value of y. The variable y changes slowly with a velocity g(x, y), in which x is considered to be in the neighborhood of the attractor. The attractor of the fast subsystem may change as y is varied, and analyzing the dependence of the attractor on the parameter y is a bifurcation problem. Thus the slow/fast analysis reduces the analysis of the full system to the bifurcation problem of the fast subsystem with a slowly varying bifurcation parameter. In the case of the pyramidal cell model (1), the slow/fast analysis can be made under the assumption that the intracellular Ca2+ concentration [Ca2+] changes more slowly than the other variables. Thus, we consider [Ca2+] as a bifurcation parameter and eq. (1c) as the slow subsystem, and all the other equations, eqs. (1a,b,d,e), as the fast subsystem. We show the bifurcation diagram of the fast subsystem obtained by varying the value of [Ca2+] as a parameter (Fig. 6). The figure shows the stable and unstable equilibria of the fast subsystem with Iext = 0 (thick solid and broken curves, respectively) and the nullcline of the slow subsystem (thin curve). The intersection of the equilibrium curve of the fast subsystem with the nullcline of the slow subsystem is the equilibrium point of the full system. The stability of the full system is determined by whether the intersection point lies on the stable or the unstable branch of the equilibrium curve of the fast subsystem.
Therefore, the full system is stable in the case of
Fig. 6. Bifurcation diagram of the fast subsystem (Iext = 0) with Ca2+ as a bifurcation parameter and the slow-nullcline of the slow subsystem
Fig. 6. In addition, when Iext is increased, the bifurcation diagram (equilibrium curve) of the fast subsystem shifts upward, the full system remains stable, and no oscillation appears, as will be shown in Fig. 7. By changing a parameter of the slow subsystem, the shape of the slow nullcline changes and the intersection point shifts accordingly. First, we select some parameters of the slow subsystem. Because the Ca2+ pump is included only in the slow subsystem, we select Apump and the dissociation constant Kpump, both parameters of the Ca2+ pump, as the parameters of the slow subsystem. Second, we change the values of Apump and Kpump in order to change the shape of the nullcline. Figure 7A shows the slow nullclines (thin solid or broken curves) for varying Apump, together with the equilibria of the fast subsystem (thick solid and broken curves) for Iext = 0, 20 and 40. As Apump increases, the nullcline of the slow subsystem shifts upward, and the intersection point of the equilibrium curve of the fast subsystem (Iext = 0) with the slow nullcline is then located at an unstable equilibrium. Therefore, the membrane potential of the full system is spiking even when Iext = 0. Figure 7B is the diagram analogous to Fig. 7A, where the value of Kpump is varied (Apump is varied in Fig. 7A). Changing the Kpump value does not alter the shape of the slow nullcline much; therefore the intersection point of the equilibrium curve of the fast subsystem with the nullcline stays at stable equilibria, and the full system remains at a stable resting state. Next, in Fig. 8, we show an example of spontaneous spiking induced by an increase of Apump (Apump = 20). The gray-colored orbit in Fig. 8A is the projected trajectory of the oscillatory membrane potential, and the waveform is shown in Fig. 8B.
Because the equilibrium curve of the fast subsystem intersects the slow-nullcline at an unstable equilibrium, the membrane potential oscillates even though Iext = 0. The projected trajectory of the full system follows the stable equilibrium of the fast subsystem (the lower branch of the thick curve) for a long time, which prolongs the inter-spike interval. After the trajectory passes through the intersection point, the membrane potential produces a spike. As it passes through the intersection point, the trajectory winds around it. This winding is possibly caused by complicated nonlinear
14
T. Ishiki et al.
[Fig. 7 plots: V (mV) versus [Ca2+] (mM). Panel A: slow-nullclines for Apump = 5 (default), 50, 100 and 150; panel B: slow-nullclines for Kpump = 0.2, 0.32, 0.4 (default), 1.2, 2.0 and 3.2; each panel with equilibrium curves of the fast subsystem for Iext = 0, 20 and 40.]
Fig. 7. Variation of the equilibria of the fast subsystem and the nullcline of the slow subsystem under changes of the slow-subsystem parameters: [A] Apump, [B] Kpump
[Fig. 8 plots: panel A, V (mV) versus [Ca2+] (mM); panel B, V (mV) versus t (ms), 20000–22000 ms.]
Fig. 8. [A] An oscillatory trajectory of the full system (gray curve) with bifurcation diagram of the fast subsystem (Iext = 0, thick solid and broken curve) and the nullcline of the slow subsystem (Apump = 20, thin curve), [B] The oscillatory waveform of the membrane potential
dynamics [14], and produces the subthreshold oscillation of the membrane potential just before the spike in Fig. 8B. However, this subthreshold oscillation is not observed in physiological experiments on pyramidal cells (Fig. 2B).
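The roles of the two pump parameters can be read off the pump current defined in the Appendix, Ipump = 2F · Apump · [Ca2+]/([Ca2+] + Kpump). The following sketch (our own illustration, not the authors' code) checks that Apump rescales the whole current curve, which is what shifts the slow-nullcline, while Kpump only moves the half-saturation concentration:

```python
# Sketch of the Ca2+ pump current from the Appendix:
#   Ipump = 2*F*Apump*[Ca2+] / ([Ca2+] + Kpump)
# (an illustration of the stated formula, not the authors' code)
F = 96485.0  # Faraday constant (C/mol)

def i_pump(ca, a_pump=5.0, k_pump=0.4):
    """Pump current; ca and k_pump in the same concentration units."""
    return 2.0 * F * a_pump * ca / (ca + k_pump)

# Apump scales the entire curve (saturation level 2*F*Apump), shifting the
# slow-nullcline; Kpump only moves the concentration of half-saturation.
assert abs(i_pump(0.4) - 0.5 * 2.0 * F * 5.0) < 1e-6             # ca == Kpump -> half max
assert abs(i_pump(0.4, a_pump=20.0) - 4.0 * i_pump(0.4)) < 1e-6  # linear in Apump
```

Increasing Apump from 5 to 20, as in Fig. 8, thus quadruples the pump current at every [Ca2+], consistent with the upward shift of the slow-nullcline described above.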
4 Conclusion
In this research, we constructed a model of pyramidal cells in the visual cortex focusing on Ca2+ regulation mechanisms, and analyzed the global bifurcation structure of the model in order to find physiologically plausible values of its parameters. We analyzed the global bifurcation structure using the maximum conductance of the Ca2+-dependent potassium channel (GKCa) and the maximum pumping rate of the Ca2+ pump (Apump) as bifurcation parameters. The two-parameter bifurcation diagrams showed that the range where spontaneous spiking occurs shrinks as the external stimulation current Iext
Global Bifurcation Analysis of a Pyramidal Cell Model
15
increases. Therefore, the membrane potential of the model cannot oscillate for large values of Iext even if both GKCa and Apump are changed. We also investigated the effect of Apump and the dissociation constant Kpump on the nullcline of the slow subsystem based on slow/fast decomposition analysis. If Apump is increased, the membrane potential spikes when Iext = 0 because the nullcline shifts upward and the full system becomes unstable. When Kpump is varied, the membrane potential stays at a resting state because the full system remains stable. Unfortunately, the expected behavior was not obtained by changing the parameter values considered in this paper. We have, however, demonstrated the usefulness of nonlinear analyses such as bifurcation and slow/fast analyses for examining parameter values and constructing a physiological model. A more detailed study using other parameters is necessary to construct an appropriate model; this remains future work.
References
1. Osanai, M., Takeno, Y., Hasui, R., Yagi, T.: Electrophysiological and optical studies on the signal propagation in visual cortex slices. In: Proc. of 2005 Annu. Conf. of Jpn. Neural Network Soc., pp. 89–90 (2005)
2. Herz, A.V.M., Gollisch, T., Machens, C.K., Jaeger, D.: Modeling single-neuron dynamics and computations: a balance of detail and abstraction. Science 314, 80–85 (2006)
3. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.) 117, 500–544 (1952)
4. Brown, A.M., Schwindt, P.C., Crill, W.E.: Voltage dependence and activation kinetics of pharmacologically defined components of the high-threshold calcium current in rat neocortical neurons. J. Neurophysiol. 70, 1516–1529 (1993)
5. Peterson, B.Z., DeMaria, C.D., Yue, D.T.: Calmodulin is the Ca2+ sensor for Ca2+-dependent inactivation of L-type calcium channels. Neuron 22, 549–558 (1999)
6. Cummins, T.R., Xia, Y., Haddad, G.G.: Functional properties of rat and human neocortical voltage-sensitive sodium currents. J. Neurophysiol. 71, 1052–1064 (1994)
7. Korngreen, A., Sakmann, B.: Voltage-gated K+ channels in layer 5 neocortical pyramidal neurons from young rats: subtypes and gradients. J. Physiol. 525, 621–639 (2000)
8. Kang, J., Huguenard, J.R., Prince, D.A.: Development of BK channels in neocortical pyramidal neurons. J. Neurophysiol. 76, 188–198 (1996)
9. Hayashida, Y., Yagi, T.: On the interaction between voltage-gated conductances and Ca2+ regulation mechanisms in retinal horizontal cells. J. Neurophysiol. 87, 172–182 (2002)
10. Naraghi, M., Neher, E.: Linearized buffered Ca2+ diffusion in microdomains and its implications for calculation of [Ca2+] at the mouth of a calcium channel. J. Neurosci. 17, 6961–6973 (1997)
11. Noble, D.: Influence of Na/Ca exchanger stoichiometry on model cardiac action potentials. Ann. N.Y. Acad. Sci. 976, 133–136 (2002)
12. Yuan, W., Burkhalter, A., Nerbonne, J.M.: Functional role of the fast transient outward K+ current IA in pyramidal neurons in (rat) primary visual cortex. J. Neurosci. 25, 9185–9194 (2005)
13. Doedel, E.J., Champneys, A.R., Fairgrieve, T.F., Kuznetsov, Y.A., Sandstede, B., Wang, X.: Continuation and bifurcation software for ordinary differential equations (with HomCont). Technical Report, Concordia University (1997)
14. Doi, S., Kumagai, S.: Generation of very slow neuronal rhythms and chaos near the Hopf bifurcation in single neuron models. J. Comput. Neurosci. 19, 325–356 (2005)
Appendix

Itotal = INa + IKs + IKf + IKCa + ICaL + ICa + Ileak + Iex
ICatotal = ICaL + ICa − 2Iex + Ipump

INa = GNa · M1 · H1 · (V − ENa),  GNa = 13.0 (mS/cm²), ENa = 35.0 (mV)
τM1(V) = 1 / ( 0.182(V + 29.5) / (1 − exp[−(V + 29.5)/6.7]) − 0.124(V + 29.5) / (1 − exp[(V + 29.5)/6.7]) )
τH1(V) = 0.5 + 1 / ( exp[−(V + 124.955511)/19.76147] + exp[−(V + 10.07413)/20.03406] )

IKs = GKs · M2 · H2 · (V − EKs),  GKs = 0.66 (mS/cm²), EKs = −103.0 (mV)
τM2(V) = 1.25 + 115.0 exp[0.026V]  (V < −50 mV);  1.25 + 13.0 exp[−0.026V]  (V ≥ −50 mV)
τH2(V) = 360.0 + (1010.0 + 24.0(V + 55.0)) exp[−((V + 75.0)/48.0)²]

IKf = GKf · M3 · H3 · (V − EKf),  GKf = 0.27 (mS/cm²), EKf = EKs
τM3(V) = 0.34 + 0.92 exp[−((V + 71.0)/59.0)²]
τH3(V) = 8.0 + 49.0 exp[−((V + 37.0)/23.0)²]

IKCa = GKCa · M4 · (V − EKCa),  GKCa = 12.5 (mS/cm²), EKCa = EKs
M4∞(V, [Ca2+]) = ([Ca2+]/([Ca2+] + Kh)) · (1/(1 + exp[−(V + 12.7)/26.2])),  Kh = 0.15 (μM)
τM4(V) = 1.25 + 1.12 exp[(V + 92.0)/41.9]  (V < 40 mV);  27.0  (V ≥ 40 mV)

ICaL = PCaL · M5 · H5 · ((2F)²/(RT)) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o) / (exp[2VF/RT] − 1)
PCaL = 0.225 (cm/ms),  H5∞ = KCa⁴/(KCa⁴ + [Ca2+]⁴),  KCa = 4.0 (μM)
τM5(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)]),  τH5 = 2000.0

ICa = PCa · M6 · H6 · ((2F)²/(RT)) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o) / (exp[2VF/RT] − 1)
PCa = 0.155 (cm/ms),  τM6(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)]),  τH6 = 2000.0

Ileak = V/30.0
Iex = k ( [Na+]i³ · [Ca2+]o · exp[s·VF/RT] − [Na+]o³ · [Ca2+] · exp[−(1 − s)·VF/RT] ),  k = 6.0 × 10⁻⁵ (μA/cm²/mM⁴),  s = 0.5
Ipump = 2F · Apump · [Ca2+] / ([Ca2+] + Kpump),  Apump = 5.0 (pmol/s/cm²),  Kpump = 0.4 (μM)

Mi∞(V) = 1/(1 + exp[−(V − αMi)/βMi]),  i = 1, 2, 3, 5, 6
Hj∞(V) = 1/(1 + exp[(V − αHj)/βHj]),  j = 1, 2, 3, 6

i:         1      2     3     5      6
αMi (mV):  −29.5  −3.0  −3.0  −18.75 18.75
βMi:       6.7    10.0  10.0  7.0    7.0

j:         1      2      3      6
αHj (mV):  −65.8  −51.0  −66.0  −12.6
βHj:       7.1    12.0   10.0   18.9
C = 1.0(μF/cm2 ), S = 3.75(/cm), k− = 5.0(/ms), k+ = 500.0(/mM · ms) [Ca2+ ]o = 2.5(mM), [Na+ ]i = 7.0(mM), [Na+ ]o = 150.0(mM)
T and R denote the absolute temperature and the gas constant, respectively. In all ionic currents of the model, the powers of the gating variables (M1, ..., M6, H1, ..., H6) are set to one in order to simplify the equations. The leak current is assumed to have no ion selectivity and to follow Ohm's law; thus the reversal potential of the leak current is set to 0 (mV), though this is unusual.
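For concreteness, the sigmoidal steady-state functions Mi∞ and Hj∞ above can be evaluated directly. This short sketch (our own, using the tabulated M1 and H1 constants) confirms the half-activation at V = α:

```python
import math

# Steady-state activation/inactivation from the Appendix:
#   Mi_inf(V) = 1/(1 + exp(-(V - alphaMi)/betaMi))
#   Hj_inf(V) = 1/(1 + exp((V - alphaHj)/betaHj))
def m_inf(v, alpha, beta):
    return 1.0 / (1.0 + math.exp(-(v - alpha) / beta))

def h_inf(v, alpha, beta):
    return 1.0 / (1.0 + math.exp((v - alpha) / beta))

# Tabulated M1: alpha = -29.5 mV, beta = 6.7; H1: alpha = -65.8 mV, beta = 7.1
assert abs(m_inf(-29.5, -29.5, 6.7) - 0.5) < 1e-12  # half-activated at alpha
assert m_inf(0.0, -29.5, 6.7) > 0.98                # nearly open when depolarized
assert h_inf(0.0, -65.8, 7.1) < 0.01                # inactivated when depolarized
```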
Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells Yasuhiro Hatori and Ko Sakai Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573 Japan
[email protected] [email protected] http://www.cvs.cs.tsukuba.ac.jp/
Abstract. The representation of object shape in the visual system is one of the most crucial questions in brain science. Although we perceive figure shape correctly and quickly, without any effort, the underlying cortical mechanism is largely unknown. A physiological experiment with macaques indicated the possibility that the brain represents a surface with a Medial Axis (MA) representation. To examine whether early visual areas could provide a basis for MA representation, we constructed a physiologically realistic, computational model of the early visual cortex, and examined what constraint is necessary for the representation of MA. Our simulation results showed that simultaneous firing of Border-Ownership (BO) selective cells at the stimulus onset is a crucial constraint for MA representation. Keywords: Shape, representation, perception, vision, neuroscience.
1 Introduction
Segregation of figure from ground might be the first step in the cortex toward the recognition of shape and objects. Recent physiological studies have shown that around 60% of neurons in cortical areas V2 and V4 are selective to border ownership (BO), which tells which side of a contour owns the border, i.e., the direction of figure; even about 20% of V1 neurons showed BO selectivity [1]. These reports also give an insightful idea on the coding of shape in early- to intermediate-level vision. The coding of shape is a major question in neuroscience as well as in robot vision. Specifically, it is of great interest how the visual information in early visual areas is processed to form the representation of shape. Physiological studies in monkeys [2] suggest that shape is coded by a medial axis (MA) representation in early visual areas. The MA representation codes a surface by a set of circles inscribed along the contour of the surface; an arbitrary shape can be reproduced from the centers of the circles and their diameters. We examined whether neural circuits in early- to intermediate-level visual areas could provide a basis for MA representation. We propose that the synchronized responses of BO-selective neurons could evoke the representation of MA. The physiological study [2] showed that V1 neurons responded to figure shape around 40 ms after the stimulus onset, while the latency of the cells that responded to MA was about 200 ms after the onset. A physiological study on
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 18–26, 2008. © Springer-Verlag Berlin Heidelberg 2008
Representation of Medial Axis from Synchronous Firing of BO Selective Cells
19
BO-selective neurons reported that their latency is around 70 ms. These results give rise to the proposal that neurons detect contours first, then BO is determined from the detected local contrast, and finally the MA representation is constructed. It should be noted that BO can be determined from the local contrast surrounding the classical receptive field (CRF); thus a single neuron with surround modulation would be sufficient to yield BO selectivity [3]. To examine this proposal, we constructed a physiologically realistic firing model of Border-Ownership (BO) selective neurons. We assume that the onset of a stimulus with high contrast evokes simultaneous, strong responses of the neurons. These fast responses propagate retinotopically, so the neurons at the MA (equidistant from the contour) are activated. Our simulation results show that, because of the synchronization, even relatively small facilitation from the propagated signals yields firing of the model cells at the MA, indicating that simultaneous firing of BO-selective cells at the stimulus onset could enable the representation of MA.
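The notion of a medial axis can be made concrete with a toy computation (our own illustration, not part of the model or of [2]): an MA point is an interior point whose distance to the contour is maximal, i.e., the center of an inscribed circle.

```python
import numpy as np

# Toy sketch: the medial point of a binary figure as the peak of the
# distance-to-background transform (distance to the nearest contour point).
def distance_to_background(mask):
    ys, xs = np.nonzero(~mask)              # background pixels
    dist = np.zeros(mask.shape)
    for y, x in zip(*np.nonzero(mask)):     # each figure pixel
        dist[y, x] = np.min(np.hypot(ys - y, xs - x))
    return dist

mask = np.zeros((9, 9), dtype=bool)
mask[1:8, 1:8] = True                       # a 7x7 square on background
d = distance_to_background(mask)
cy, cx = np.unravel_index(np.argmax(d), d.shape)
assert (cy, cx) == (4, 4)                   # the distance peaks at the center
```

For the square, the single peak at the center corresponds to the MA point probed in the physiological experiment [2]; for elongated shapes, the ridge of this distance map traces the full medial axis.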
2 The Proposed Model
The model comprises three stages: (1) a contrast detection stage, (2) a BO detection stage, and (3) an MA detection stage. The first stage extracts luminance contrast similarly to V1 simple cells. The model cells in the second stage mimic BO-selective neurons, which determine the direction of BO with respect to the border at the CRF based on the modulation from surrounding contrast up to 5 deg in visual angle from the CRF. The third stage spatially pools the responses from the second stage to test the MA representation arising from simultaneous firing of BO cells. A schematic diagram of the model is given in Figure 1. The following sections describe the functions of each stage.
2.1 Model Neurons
To enable simulations with precise spatiotemporal properties, we implemented single-compartment firing neurons and their synaptic connections on the NEURON simulator [4]. The cell body of the model cell is approximated by a sphere; we set the radius and the membrane resistance to 50 μm and 34.5 Ωcm, respectively. The model neurons compute the membrane potential following the Hodgkin-Huxley equations [5] with the constant parameters shown in Table 1. We used a biophysically realistic spiking neuron model because we need to examine the exact timing of the firing of BO-selective cells as well as the propagation of the signals, which cannot be captured by an integrate-and-fire model or other abstract models.
2.2 Contrast Detection Stage
The model cells in the first stage have response properties similar to those of V1 cells, including contrast detection, dynamic contrast normalization and static compressive nonlinearity. The model cells in this stage detect luminance contrast with oriented Gabor filters of distinct orientations. We limited ourselves to four orientations for the sake of simplicity: vertical (0 and 180 deg) and horizontal (90 and 270 deg). The Gabor filter is defined as follows:
Gθ(x, y) = cos(2πω(x sin θ + y cos θ)) × gaussian(x, y, μx, μy, σx, σy),   (1)
20
Y. Hatori and K. Sakai
Fig. 1. A schematic illustration of the proposed model. Luminance contrast is detected in the first stage which is then processed to determine the direction of BO by surrounding modulation. F and S represent excitatory and inhibitory regions, respectively, for the surrounding modulation. The last stage detects MA based on the propagations from BO model cells.
where x and y represent the spatial location, μx and μy the center of the Gaussian, σx and σy its standard deviations, and θ and ω the orientation and spatial frequency, respectively. We take the convolution of an input image
with the Gabor filters, with dynamic contrast normalization [6] including a static, compressive nonlinear function. For efficient computation, the responses of the vertical pathways (0 and 180 deg) are integrated to form the vertical orientation, and likewise the horizontal pathways (90 and 270 deg), which is convenient for the computation of iso-orientation suppression and cross-orientation facilitation in the next stage:

Oθ(x, y) = (I ∗ Gθ)(x, y),   (2)
O1iso(x, y) = O0(x, y) + O180(x, y),   (3)
O1cross(x, y) = O90(x, y) + O270(x, y),   (4)
where I represents the input image, Oθ(x, y) the output of the convolution (∗), and O1iso (O1cross) the integrated responses of the vertical (horizontal) pathway.

Table 1. The constant values for the model cells used in the simulations

Parameter  Value
Cm         1 (μF/cm²)
ENa        50 (mV)
EK         −77 (mV)
El         −54.3 (mV)
gNa        0.120 (S/cm²)
gK         0.036 (S/cm²)
gl         0.0003 (S/cm²)
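The contrast detection stage of eqs. (1)–(4) can be sketched as follows (our own implementation of the stated formulas; the filter size, spatial frequency ω = 0.1 and width σ = 3.0 are illustrative choices not given in the text):

```python
import numpy as np

# Eq. (1): a cosine carrier windowed by a Gaussian envelope.
def gabor(theta, omega=0.1, size=15, sigma=3.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    carrier = np.cos(2 * np.pi * omega * (x * np.sin(theta) + y * np.cos(theta)))
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return carrier * envelope

# A vertical grating patch (luminance varies along x), same size as the filter:
half = 7
y, x = np.mgrid[-half:half + 1, -half:half + 1]
vertical_grating = np.cos(2 * np.pi * 0.1 * x)

# Filter response = sum over the patch (one sample of the convolution, eq. 2).
resp_v = np.sum(gabor(np.pi / 2) * vertical_grating)  # matched orientation
resp_h = np.sum(gabor(0.0) * vertical_grating)        # orthogonal orientation
assert resp_v > resp_h   # the matched orientation responds more strongly
```

The pathway sums of eqs. (3)–(4) then simply add the 0/180-deg (or 90/270-deg) responses pixelwise after rectification and normalization.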
2.3 BO Detection Stage
The second stage models the surround modulation reported in early- to intermediate-level vision for the determination of BO [3]. The model cells integrate surrounding contrast information up to 5 deg in visual angle from the CRF center. We modeled the surrounding region with two Gaussians, one inhibitory and one facilitatory, located asymmetrically with respect to the CRF center. If a part of the contour of an object is projected onto the excitatory (or inhibitory) region, the contrast information of the contour detected by the first stage is transmitted via a pulse to generate an EPSP (or IPSP) in the BO-selective model cell. In other words, the projection of a figure within the excitatory region facilitates the response of the BO model cell; conversely, if a figure is projected onto the inhibitory region, the response of the model cell is suppressed. Therefore, surrounding contrast signals from the excitatory and inhibitory regions modulate the activity of BO model cells depending on the direction of figure. In this way, we implemented BO model cells based on surround modulation. Note that Jones and her colleagues reported orientation dependency of surround modulation in monkeys' V1 cells [7]: suppression is limited to orientations similar to the preferred orientation of the CRF (iso-orientation suppression), and facilitation is dominant for other orientations (cross-orientation facilitation). We implemented this orientation dependency for the surround modulation.
Taking into account the EPSPs and IPSPs from the surround, we compute the membrane potential of a BO-selective model cell at time t as follows:

O2(x1, y1, t) = input(x1, y1) + c Σx,y {Eiso(x, y, t − dx1,y1(x, y)) + Ecross(x, y, t − dx1,y1(x, y))},   (5)

where x1 and y1 represent the spatial position of the BO-selective cell, input(x1, y1) is the output of the first stage (O1iso or O1cross), c is a synaptic weight, and Eiso(x, y, t − dx1,y1(x, y)) is the EPSP (or IPSP) triggered by the pulse generated at time t − dx1,y1(x, y); Ecross(x, y, t − dx1,y1(x, y)) is the same except for the input orientation. The delay dx1,y1(x, y) is proportional to the distance between the BO-selective cell at (x1, y1) and the connected cell at (x, y). We define Eiso(x, y, t − dx1,y1(x, y)) and dx1,y1(x, y) as:
Eiso(x, y, t − dx1,y1(x, y)) = gaussian(x, y, μx1, μy1, σx1, σy1) × exp[−(t − dx1,y1(x, y))/τ] × (v − e),   (6)

dx1,y1(x, y) = ctime √((x1 − x)² + (y1 − y)²),   (7)
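Equations (6) and (7) can be sketched as follows (our own code; the Gaussian spatial weight of eq. 6 is folded into a single weight parameter for brevity), using the parameter values quoted in the text (τ = 10 ms, ctime = 0.2 ms/μm):

```python
import math

TAU = 10.0     # ms, decay time constant (value from the text)
C_TIME = 0.2   # ms per um, distance-to-delay conversion (value from the text)

def delay(x1, y1, x, y):
    """Eq. (7): conduction delay proportional to Euclidean distance (um)."""
    return C_TIME * math.hypot(x1 - x, y1 - y)

def epsp(t, d, weight, v, e=0.0):
    """Eq. (6)-style PSP: zero before the delayed pulse arrives, then an
    exponentially decaying potential scaled by the driving force (v - e).
    The spatial Gaussian of eq. (6) is absorbed into `weight` here."""
    if t < d:
        return 0.0
    return weight * math.exp(-(t - d) / TAU) * (v - e)

d = delay(0.0, 0.0, 30.0, 40.0)            # a cell 50 um away -> 10 ms delay
assert abs(d - 10.0) < 1e-9
assert epsp(5.0, d, 1.0, -65.0) == 0.0     # nothing before the pulse arrives
assert abs(epsp(d, d, 1.0, -65.0) - (-65.0)) < 1e-9  # maximal at arrival time
```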
where τ represents the time constant, v the membrane potential, e the reversal potential, and ctime a constant that converts distance to time. We set c, τ, e and ctime to 0.6 (or 1.0), 10 ms, 0 mV and 0.2 ms/μm, respectively. Ecross(x, y, t − dx1,y1(x, y)) is calculated similarly to eq. 6.

2.4 MA Detection Stage
The third stage integrates BO information from the second stage to detect the MA. A MA model cell has a single excitatory surrounding region represented by a Gaussian. The membrane potential of a MA model cell is given by:

O3(x2, y2, t) = cmedial Σ(x,y)∈(x1,y1) gaussian(x, y, μx2, μy2, σx2, σy2) × O2(x, y, t − dx2,y2(x, y)),   (8)
where x2 and y2 represent the spatial position of the MA model cell, μx2 and μy2 the center of the Gaussian, cmedial a constant that we set to 1.8 (or 10.0), and dx2,y2(x, y) is calculated similarly to eq. 7. Note that MA model cells receive EPSPs only from BO model cells. When a BO model cell is activated, it transmits a pulse to MA model cells, with magnitude and time delay depending on the distance between the two. If a MA model cell is located equidistant from some parts of the contours, the pulses from the BO model cells on the contours reach the MA cell at the same time, evoking a strong EPSP that generates a spike. On the other hand, MA model cells that are not equidistant from the contours never generate a spike. Therefore, the model cells located equidistant from the contours are activated by simultaneous activation of BO model cells, and this neural population represents the medial axis of the object.
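The synchrony argument can be checked with a toy computation (our own): pulses emitted simultaneously at stimulus onset, with delays given by eq. (7), arrive together only at points equidistant from the emitting contour points.

```python
import math

C_TIME = 0.2  # ms per um, as in the text

def arrival_times(point, contour_points):
    """Arrival times at `point` of pulses emitted at t=0 from each contour
    point, with delay proportional to distance (eq. 7)."""
    return [C_TIME * math.hypot(point[0] - cx, point[1] - cy)
            for cx, cy in contour_points]

contour = [(-50.0, 0.0), (50.0, 0.0)]        # two opposite contour points (um)
mid = arrival_times((0.0, 0.0), contour)     # a point on the medial axis
off = arrival_times((20.0, 0.0), contour)    # a point off the axis
assert abs(mid[0] - mid[1]) < 1e-9   # coincident arrival -> summed, strong EPSP
assert abs(off[0] - off[1]) > 1.0    # desynchronized arrival -> weak EPSP
```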
3 Simulation Result
We carried out simulations of the model to test whether it shows the representation of MA. As typical examples, the results for three types of stimuli are
shown here: a square, a C-shaped figure, and a natural image of an eagle [8], as shown in Fig. 2. Note that the model cells described in the previous sections are distributed retinotopically to form layers of 138 × 138 cells.
Fig. 2. Three examples of stimuli used for the simulations. A square (A), a C-shaped figure (B), and a natural image of an eagle from Berkeley Segmentation Dataset [8] (C).
3.1 A Single Square
First, we tested the model with a single square similar to that used in the corresponding physiological experiment [2], as shown in Fig. 2(A). Although we carried out the simulations retinotopically in 2D, for the purpose of graphical presentation the responses along the horizontal cross section indicated in Fig. 3(A) are shown in Fig. 3(B). The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines at horizontal positions −1 and 1) and MA model cells (dotted lines at horizontal position 0). We observe a clear peak corresponding to MA at the center, similar to the results of the physiological experiments by Lee et al. [2]. Although we tested the model along the horizontal cross section, the response along a vertical cross section is identical. This result suggests that simultaneous firing of BO cells is capable of generating MA without any other particular constraints.
Fig. 3. The simulation results for a square. (A) We show the responses of the cells located along the dashed line (the number of neurons is 138). Horizontal positions -1 and 1 represent the places on the vertical edges of the square. Zero represents the center of the square. (B) The responses of the model cells in firing rate along the cross-section as a function of the horizontal location. A clear peak at the center corresponding to MA is observed.
3.2 C-Shape
We tested the model with a C-shaped figure, which has been suggested to be a difficult shape for the determination of BO. Fig. 4 shows the simulation result for the C-shape. The responses along horizontal and vertical cross sections, as indicated in Fig. 4(A), are shown in Fig. 4(B) and (C), respectively, with the same conventions as Fig. 3. The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines) and MA model cells (dotted lines). MA representation predicts a strong peak at 0 along the horizontal cross section, and distributed strong responses along the vertical cross section. Although we observe the maximum responses at the centers corresponding to MA, the peak along the horizontal section was not significant, and the distribution along the vertical section was peaky and uneven. It appears that MA cells cannot properly integrate the signals propagated from BO cells. This distributed MA response comes from complicated propagations of BO signals from the concave shape. Furthermore, as has been suggested [3], the C-shaped figure is a challenging shape for the determination of BO; therefore, the BO signals are not clear before the propagation begins. We note that the model is still capable of providing a basis for MA representation for a complicated figure.
Fig. 4. The simulation results for the C-shaped figure. (A) Positions of the analysis along horizontal and vertical cross-sections as indicated by dashed line. The responses of the model cells in firing rate along the horizontal cross section (B), and that along the vertical cross section (C). Solid and dotted lines indicate the responses of BO and MA model cells, respectively. Although the maximum responses are observed at the centers, the responses are distributed.
3.3 Natural Images The model has shown its ability to extract a basis for MA representation for not only simple but also difficult shapes. To further examine the model for arbitrary shapes, we
tested the model with natural images taken from the Berkeley Segmentation Dataset [8]. Fig. 2(C) shows an example, an eagle perched on a tree branch. Because we are interested in the representation of shape, we extracted its shape by binarizing the gray scale, as shown in Fig. 5(A). The simulation results of BO and MA model cells are shown in Fig. 5(B); we plotted the responses along the horizontal cross section indicated in Fig. 5(A). Although the shape is much more complicated, detailed and asymmetric, the results are very similar to those for a square as shown in Fig. 3(B). The BO model cells responded to the contours (horizontal positions −1 and 1), and the MA model cells exhibited a strong peak at the center (horizontal position 0). This result indicates that the model detects MA for figures with arbitrary shape. The aim of the simulations with natural images is to test a variety of stimulus shapes and configurations that are possible in natural scenes. Further simulations with a number of natural images are expected, specifically with images including occlusion, multiple objects and ambiguous figures.
Fig. 5. An example of simulation results for natural images. (A) The binary image of an eagle together with the horizontal cross-section used for the graphical presentation of the results. (B) The responses of the model cells in firing rate along the cross-section as a function of the horizontal location. Horizontal positions −1 and 1 represent the places on the vertical edges of the eagle; zero represents the center of the bird. A clear peak at the center corresponding to MA is observed. Although the shape is much more complicated, detailed and asymmetric, the results are very similar to those for a square as shown in Fig. 3(B), indicating the robustness of the model.
4 Conclusion
We studied whether early visual areas could provide a basis for MA representation, specifically what constraint is necessary for the representation of MA. Our results showed that simultaneous firing of BO-selective neurons is crucial for MA representation. We implemented physiologically realistic firing model neurons with connections from BO model cells to MA model cells. If a stimulus is presented at once so that BO cells along the contours fire simultaneously, and if a MA cell is located equidistant from some of the contours of the stimulus, then the MA cell fires because the synchronous signals from the BO cells give rise to a strong EPSP. We showed three typical examples of simulation results: a simple square, a difficult C-shaped figure, and a natural image of an eagle. The simulation results showed that the model provides a basis for MA representation for all three types of stimuli. These
results suggest that the simultaneous firing of BO cells is essential for the MA representation in early visual areas.
Acknowledgment
We thank Dr. Haruka Nishimura for her insightful comments and Mr. Satoshi Watanabe for his help with the simulations. This work was supported by a Grant-in-Aid for Scientific Research from the Brain Science Foundation, the Okawa Foundation, JSPS (19530648), and MEXT of Japan (19024011).
References
1. Zhou, H., Friedman, H.S., von der Heydt, R.: Coding of border ownership in monkey visual cortex. The Journal of Neuroscience 20, 6594–6611 (2000)
2. Lee, T.S., Mumford, D., Romero, R., Lamme, V.A.F.: The role of the primary visual cortex in higher level vision. Vision Research 38, 2429–2454 (1998)
3. Sakai, K., Nishimura, H.: Surrounding suppression and facilitation in the determination of border ownership. The Journal of Cognitive Neuroscience 18, 562–579 (2006)
4. NEURON: http://www.neuron.yale.edu/neuron/
5. Johnston, D., Wu, S. (eds.): Foundations of Cellular Neurophysiology. MIT Press, Cambridge (1999)
6. Carandini, M., Heeger, D.J., Movshon, J.A.: Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of Neuroscience 17, 8621–8644 (1997)
7. Jones, H.E., Wang, W., Sillito, A.M.: Spatial organization and magnitude of orientation contrast interactions in primate V1. Journal of Neurophysiology 88, 2796–2808 (2002)
8. The Berkeley Segmentation Dataset: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
9. Engel, A.K., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience 2, 704–716 (2001)
Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task
Mitsuya Soga1 and Yoshiki Kashimori1,2
1 Dept. of Information Network Science, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
2 Dept. of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
Abstract. The ability to group visual stimuli into meaningful categories is a fundamental cognitive process. Several experiments have investigated the neural mechanism of visual categorization. Although there is experimental evidence that prefrontal cortex (PFC) and inferior temporal (IT) cortex neurons respond sensitively in categorization tasks, little is known about the functional role of the interaction between PFC and IT in such tasks. To address this issue, we propose a functional model of the visual system and investigate the neural mechanism for a categorization task with line drawings of faces. We show here that IT represents the similarity of face images based on the information of the resolution maps of early visual stages. We also show that PFC neurons bind the information of part and location of the face image, and that PFC generates a working memory state in which only the information of face features relevant to the categorization task is sustained.
1 Introduction
Visual categorization is fundamental to the behavior of higher primates. Our raw perceptions would be useless without our classification of items such as animals and food. The visual system has the ability to categorize visual stimuli: to react similarly to stimuli even when they are physically distinct, and to react differently to stimuli that may be similar. How does the brain group stimuli into meaningful categories? Several experiments have investigated the neural mechanism of visual categorization. Freedman et al. [1] examined the responses of neurons in the prefrontal cortex (PFC) of monkeys trained to categorize computer-generated animal forms as either "doglike" or "catlike". They reported that many PFC neurons responded selectively to visual stimuli belonging to either the cat or the dog category. Sigala and Logothetis [2] recorded from inferior temporal (IT) cortex after monkeys learned a categorization task, and found that the selectivity of IT neurons was significantly increased for features critical to the task. The numerous reciprocal connections between PFC and IT could allow the interactions necessary to select the best diagnostic features of stimuli [3]. However, little is known about the role of the interaction between IT
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 27–36, 2008. © Springer-Verlag Berlin Heidelberg 2008
and PFC in the categorization task, which establishes category boundaries in relation to behavioral consequences. To address this issue, we propose a functional model of the visual system in which the categorization task is achieved based on the functional roles of IT and PFC. The functional role of IT is to represent the features of object parts, based on the different resolution maps in the early visual system, such as V1 and V4. In IT, visual stimuli are categorized by similarity based on the features of object parts. The posterior parietal cortex (PP) encodes the location of the object part to which attention is paid. The PFC neurons combine the information about the features and locations of object parts, and generate a working memory of the object information relevant to the categorization task. The synaptic connections between IT and PFC are learned so as to achieve the categorization task. The feedback signals from PFC to IT enhance the sensitivity of the IT neurons that respond to the features of object parts critical for the categorization task, thereby enabling the visual system to perform task-dependent categorization quickly and reliably. In the present study, we present a neural network model that categorizes visual objects depending on the categorization task. We investigated the neural mechanism of the categorization task with the line drawings of faces used by Sigala and Logothetis [2]. Using this model, we show that IT represents the similarity of face images based on the information in the resolution maps of V1 and V4. We also show that PFC generates a working memory state in which only the information about face features relevant to the categorization task is sustained.
2 Model
To investigate the neural mechanism of visual categorization, we constructed a neural network model of the form perception pathway from the retina to the prefrontal cortex (PFC). The model consists of eight neural networks corresponding to the retina, lateral geniculate nucleus (LGN), V1, V4, inferior temporal cortex (IT), posterior parietal cortex (PP), PFC, and premotor area, which are involved in the ventral and dorsal pathways [4,5]. The network structure of our model is illustrated in Fig. 1.

2.1 Model of Retina and LGN
The retinal network is an input layer onto which the object image is projected. The retina has a two-dimensional lattice structure containing N_R × N_R pixels. The LGN network consists of three types of neurons differing in the spatial resolution of contrast detection: fine-tuned neurons with high spatial frequency (LGNF), middle-tuned neurons with middle spatial frequency (LGNM), and broad-tuned neurons with low spatial frequency (LGNB). The output of the LGNX neuron (X=B, M, F) at site (i, j) is given by

I_{LGN}(i, j; x) = \sum_{i_R} \sum_{j_R} I_R(i_R, j_R) \, M(i, j; x),    (1)
Fig. 1. The structure of our model. The model is composed of eight modules, structured to resemble the ventral and dorsal pathways of the visual cortex: retina, lateral geniculate nucleus (LGN), primary visual cortex (V1), V4, inferior temporal cortex (IT), posterior parietal cortex (PP), prefrontal cortex (PFC), and premotor area. a1 ∼ a3 denote dynamical attractors of visual working memory.
M(i, j; x) = A \exp\!\left( -\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{1x}^2} \right) - B \exp\!\left( -\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{2x}^2} \right),    (2)

where I_R(i_R, j_R) is the gray-scale intensity of the pixel at retinal site (i_R, j_R), and M(i, j; x) is a Mexican-hat-like function representing the convergence of the retinal inputs through ON-center/OFF-surround connections between the retina and LGN. The parameter values were set to A = 1.0, B = 1.0, \sigma_{1x} = 1.0, and \sigma_{2x} = 2.0.
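Equations (1)–(2) can be rendered directly in Python as an illustrative sketch (not the authors' code). It uses the printed parameter values and a naive quadruple loop for clarity, where a practical implementation would use a convolution:

```python
import numpy as np

def dog_weight(i, j, i_R, j_R, A=1.0, B=1.0, s1=1.0, s2=2.0):
    """Mexican-hat weight M(i, j; x) of Eq. (2) between LGN site (i, j)
    and retinal site (i_R, j_R). With the printed values A = B, the
    weight vanishes at zero offset and is negative in the surround."""
    r2 = (i - i_R) ** 2 + (j - j_R) ** 2
    return A * np.exp(-r2 / s1 ** 2) - B * np.exp(-r2 / s2 ** 2)

def lgn_response(image):
    """Eq. (1): each LGN site pools retinal intensity through the
    ON-center/OFF-surround kernel (naive O(N^4) loop for clarity)."""
    n = image.shape[0]
    iR, jR = np.mgrid[0:n, 0:n]
    out = np.empty_like(image, dtype=float)
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(image * dog_weight(i, j, iR, jR))
    return out
```

A single bright pixel then produces the familiar center-surround response pattern around its location.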
2.2 Model of V1 Network
The neurons of the V1 network extract elemental features of the object image, such as the orientation and edges of bars. The V1 network consists of three networks with broad, middle, and fine spatial resolutions, V1B, V1M, and V1F, each of which receives the outputs of LGNB, LGNM, and LGNF, respectively. The V1X network (X=B, M, F) contains M_X × M_X hypercolumns, each of which contains L_X orientation columns. The neurons in V1X (X=B, M, F) have receptive fields performing a Gabor transform. The output of the V1X neuron at site (i, j) is given by

I_{V1}(i, j, \theta; x) = \sum_p \sum_q I_{LGN}(p, q; x) \, G(p, q, \theta; x),    (3)
G(p, q, \theta; x) = \frac{1}{2\pi \sigma_{Gx} \sigma_{Gy}} \exp\!\left[ -\frac{1}{2} \left( \frac{(i - p)^2}{\sigma_{Gx}^2} + \frac{(j - q)^2}{\sigma_{Gy}^2} \right) \right] \sin\!\left( 2\pi f_x p \cos\theta + 2\pi f_y q \sin\theta + \frac{\pi}{2} \right),    (4)
where f_x and f_y are the spatial frequencies along the x- and y-coordinates, respectively. The parameter values were \sigma_{Gx} = 1, \sigma_{Gy} = 1, f_x = 0.1 Hz, and f_y = 0.1 Hz.
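Equations (3)–(4) amount to projecting an LGN patch onto a Gabor kernel. The sketch below is illustrative (not the authors' code); the patch size and the centring of the kernel on the middle of the patch are assumptions:

```python
import numpy as np

def gabor(size, theta, fx=0.1, fy=0.1, sg_x=1.0, sg_y=1.0):
    """Gabor receptive field G(p, q, theta; x) of Eq. (4), with the
    Gaussian envelope centred (by assumption) on the middle of a
    size x size patch; theta is the preferred orientation."""
    c = size // 2
    p, q = np.mgrid[0:size, 0:size]
    env = np.exp(-0.5 * ((c - p) ** 2 / sg_x ** 2 + (c - q) ** 2 / sg_y ** 2))
    env = env / (2 * np.pi * sg_x * sg_y)
    carrier = np.sin(2 * np.pi * fx * p * np.cos(theta)
                     + 2 * np.pi * fy * q * np.sin(theta) + np.pi / 2)
    return env * carrier

def v1_response(lgn_patch, theta):
    """Eq. (3): inner product of an LGN patch with the Gabor kernel."""
    return float(np.sum(lgn_patch * gabor(lgn_patch.shape[0], theta)))
```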
2.3 Model of V4 Network
The V4 network consists of three networks with high, middle, and low spatial resolutions, which receive convergent inputs from the cell assemblies with the same tuning in V1F, V1M, and V1B, respectively. The convergence of the outputs of V1 neurons enables V4 neurons to respond specifically to combinations of elemental features, such as a cross or a triangle, represented on the V4 network.

2.4 Model of PP
The posterior parietal (PP) network consists of N_{PP} × N_{PP} neurons, each of which corresponds to the spatial position of one pixel of the retinal image. The functions of the PP network are to represent the spatial position of a whole object and the spatial arrangement of its parts in the retinotopic coordinate system, and to mediate the location of the object part to which attention is paid.

2.5 Model of IT
The IT network consists of three subnetworks, each of which receives the outputs of the V4F, V4M, and V4B maps, respectively. Each subnetwork has neurons tuned to various features of the object parts, depending on the resolution of the corresponding V4 map and the location of the object parts to which attention is directed. The first subnetwork detects a broad outline of the whole object, the second subnetwork detects elemental figures that represent elemental outlines of the object, and the third subnetwork represents the information about the object parts based on the fine resolution of the V4F map. Each subnetwork was made based on Kohonen's self-organizing map model. The elemental figures in the second subnetwork may play an important role in extracting similarity among the outlines of objects.

2.6 Model for Working Memory in PFC
The PFC memorizes the information about the spatial positions of the object parts as dynamical attractors. The functional role of the PFC is to reproduce a complete form of the object by binding the information about the object parts memorized in the second and third subnetworks of IT with their spatial arrangements represented by the PP network. The PFC network model was made based on the dynamical map model [6,7]. The network model consists of three types of neurons: M, R, and Q neurons. The Q neuron is connected to the M neuron with an inhibitory
synapse. M neurons are interconnected with each other through excitatory and inhibitory synaptic connections. The M neuron layer of the present model corresponds to an associative neural network. M neurons receive inputs from the three IT subnetworks. The dynamical evolution of the membrane potentials of the M, Q, and R neurons is described by

\tau_m \frac{du_{mi}}{dt} = -u_{mi} + \sum_j \sum_{\tau_{ij}=0}^{\tau_{max}} W_{mm,ij}(t, \tau_{ij}) V_{mj}(t - \tau_{ij}) + W_{mq} U_{qi} + W_{mr} U_{ri} + \sum_k W_{IT,ik} X_k^{IT} + \sum_l W_{PP,il} X_l^{PP},    (5)

\tau_{qi} \frac{du_{qi}}{dt} = -u_{qi} + W_{qm} V_{mi},    (6)

\tau_{ri} \frac{du_{ri}}{dt} = -u_{ri} + W_{rm} V_{mi},    (7)
where u_{mi}, u_{qi}, and u_{ri} are the membrane potentials of the ith M, Q, and R neurons, respectively, and \tau_{mi}, \tau_{qi}, and \tau_{ri} are the relaxation times of these membrane potentials. \tau_{ij} is the delay time of signal propagation from the jth M neuron to the ith one, and \tau_{max} is the maximum delay time. The time delays play an important role in stabilizing the temporal sequence of firing patterns. W_{mm,ij}(t, \tau_{ij}) is the strength of the axo-dendritic synaptic connection from the jth M neuron to the ith M neuron whose propagation delay time is \tau_{ij}. W_{mq}, W_{mr}, W_{qm}, and W_{rm} are the strengths of the dendro-dendritic synaptic connections from Q to M, from R to M, from M to Q, and from M to R neurons, respectively. V_{mi} is the output of the ith M neuron; U_{qi} and U_{ri} are the dendritic outputs of the ith Q and R neurons, respectively. The outputs of the Q and R neurons are given by sigmoidal functions of u_{qi} and u_{ri}, respectively. The M neuron emits spikes, the firing probability of which is determined by a sigmoidal function of u_{mi}. W_{IT,ik} and W_{PP,il} are the synaptic strengths from the kth IT neuron and from the lth PP neuron to the ith M neuron, respectively. X_k^{IT} and X_l^{PP} are the outputs of the kth IT neuron and the lth PP neuron, respectively. The parameters were set to \tau_m = 3 ms, \tau_{qi} = 1 ms, \tau_{ri} = 1 ms, W_{mq} = 1, W_{mr} = 1, W_{qm} = 1, W_{qr} = -10, and \tau_{max} = 38 ms. The dynamical evolution of the synaptic connections is described by

\tau_w \frac{dW_{mm,ij}(t, \tau_{ij})}{dt} = -W_{mm,ij}(t, \tau_{ij}) + \lambda V_{mi}(t) V_{mj}(t - \tau_{ij}),    (8)

where \tau_w is a time constant and \lambda is a learning rate. The parameter values were \tau_w = 1200 ms and \lambda = 28. The PFC connects reciprocally with the IT, PP, and premotor networks. The synaptic connections between PFC and IT and those between PFC and PP are learned by a Hebbian learning rule.
2.7 Model of Premotor Cortex
The model of the premotor area consists of neurons whose firing corresponds to the action relevant to the categorization task, that is, pressing the right or left lever. The details of the mathematical description are given in Refs. [8-10].
3 Neural Mechanism for Extracting Diagnostic Features in Visual Categorization Task

3.1 Categorization Task
We used the line drawings of faces used by Sigala and Logothetis [2] to investigate the neural mechanism of the categorization task. The face images consist of four varying features: eye height, eye separation, nose length, and mouth height. The monkeys were trained to categorize the face stimuli according to two diagnostic features, eye height and eye separation. The two diagnostic features allowed separation between the classes along a linear category boundary, as shown in Fig. 2b. The face stimuli were not linearly separable using the other two, non-diagnostic features, nose length and mouth height. On each trial, the monkeys saw one face stimulus and then pressed one of two levers to indicate the category. Thereafter, they received a reward only if they chose the correct category. After training, the monkeys were able to categorize various face stimuli based on the two diagnostic features. In our simulation, we used the four training stimuli shown in Fig. 2a and test stimuli in which all four features were varied.
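The decision rule over the two diagnostic features amounts to a linear boundary in the (eye height, eye separation) plane. As a toy illustration only (the boundary offset below is invented, not the value from Ref. [2]):

```python
def category(eye_height, eye_separation, offset=1.0):
    """Toy linear category rule for the two diagnostic features.
    The actual boundary of the task is the line shown in Fig. 2b;
    the offset here is purely illustrative."""
    return 1 if eye_height + eye_separation > offset else 2
```

Stimuli on one side of the line fall in category 1, the rest in category 2, regardless of the non-diagnostic features.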
Fig. 2. a) The training stimulus set consisted of line drawings of faces with four varying features: eye separation, eye height, nose length, and mouth height. b) In the categorization task, the monkeys were presented with one stimulus at a time. The two categories were linearly separable along the line. The test stimuli are indicated by the marks ‘x’ and ‘o’. See Ref. [2] for details of the task.
3.2 Neural Mechanism for Accomplishing Visual Categorization Task
The neural mechanism for accomplishing the visual categorization task is illustrated in Fig. 3. Object features are encoded by hierarchical processing at each stage of the ventral visual pathway from the retina to V4. The IT neurons encode information about object parts such as the eyes, nose, and mouth. The PP neurons encode the locations of the object parts to which attention should be directed. The information about an object part and its location is combined by a dynamical attractor in PFC. The output of PFC is sent to two types of premotor neurons, whose firing leads to pressing of the right or left lever, respectively. When the monkey exhibits behavior relevant to the task and then receives a reward, the attractor in PFC that represents the object information relevant to the task is gradually stabilized by the facilitation of learning across the PFC neurons and between the PFC and premotor neurons. On the other hand, when the monkey exhibits behavior irrelevant to the task and receives no reward, the attractor associated with the irrelevant behavior is destabilized and then eliminated from the PFC network. As a result, the PFC retains, as working memory, only the object information relevant to the categorization task. The feedback from PFC to IT and PP strengthens the responses of the IT and PP neurons, thereby enabling the visual system to rapidly and accurately discriminate between object features belonging to different categories. When the monkey pays attention to a local region of the face stimulus, an attention signal from another brain area, such as the frontal eye field, increases the activity of the PP neurons encoding the location of the face part to which attention is directed. The PP neurons send their outputs back to V4, thereby increasing the activity of the V4 neurons encoding the features of the attended face parts, because V4 has the same retinotopic map as PP. Thus, attention acting on PP allows V4 to send IT only the information about the face part to which attention is directed,
Fig. 3. Neural mechanism for accomplishing the visual categorization task. The information about an object part and its location is combined to generate working memory, indicated by α and β. The solid and dashed lines indicate the formation and elimination of synaptic connections, respectively.
leading to the generation of IT neurons encoding face parts. Furthermore, as training proceeds, the attention relevant to the task is fixed by the learning of the synaptic connections between PFC and PP, allowing the monkey to perform the visual task quickly.
4 Results

4.1 Information Processing of Visual Images in Early Visual Areas
Figure 4 shows the responses of neurons in the early visual areas LGN, V1, and V4 to the training stimulus face 1 shown in Fig. 2a. The visual information is processed in a hierarchical manner, because the neurons along the pathway from LGN to V4 have progressively larger receptive fields and prefer more complex stimuli. At the first stage, the contrast of the stimulus is encoded by the ON-center/OFF-surround receptive fields of the LGN neurons, as shown in Fig. 4b. Then the V1 neurons, receiving the outputs of the LGN neurons, encode the directional features of the short bars contained in the face drawing, as shown in Fig. 4c. The V4 network was made with Kohonen's self-organizing map so that the V4 neurons could respond to more complex features of the stimulus. Figure 4d shows that the V4 neurons respond to more complex features such as the eyes, nose, and mouth.
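The self-organizing-map training used for the V4 (and IT) maps can be sketched generically. The paper does not give its SOM parameters, so the learning rate, neighbourhood radius, and epoch count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, n_units, epochs=50, lr0=0.5, radius0=2.0):
    """Minimal 1-D Kohonen self-organizing map: each input pulls its
    best-matching unit (BMU) and the BMU's map neighbours toward it,
    with the learning rate and neighbourhood radius decaying over
    epochs. Parameters are illustrative, not from the paper."""
    dim = data.shape[1]
    w = rng.random((n_units, dim))          # random initial weights
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)       # decaying learning rate
        radius = radius0 * (1.0 - t / epochs) + 1e-9
        for x in data:
            bmu = int(np.argmin(np.sum((w - x) ** 2, axis=1)))
            d = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-d ** 2 / (2.0 * radius ** 2))  # neighbourhood
            w += lr * h[:, None] * (x - w)
    return w
```

Trained on feature vectors from two clusters, the map ends up with units specializing for each cluster, which is the sense in which the V4/IT maps come to respond to distinct stimulus features.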
Fig. 4. Responses of neurons in LGN, V1, and V4. The magnitude of the neuronal responses is shown on a gray scale, with darker gray indicating a larger response. a) Face stimulus. The image is 90 × 90 pixels. b) Responses of LGN neurons. c) Responses of V1 neurons tuned to four directions. d) Responses of V4 neurons. The kth neuron (k = 1–3) encodes a stimulus feature (eyes, nose, and mouth, respectively).
4.2 Information Processing of Visual Images in IT Cortex
Figure 5a shows the ability of the IT neurons encoding eye separation and eye height to categorize the test face stimuli. The test stimuli, in which the two features were varied, were categorized by the four IT neurons trained with the four training stimuli shown in Fig. 2a, suggesting that the IT neurons are capable of separating the test stimuli into categories based on the similarity of the features of face parts.
Fig. 5. a) Ability of IT neurons to categorize face stimuli with respect to the two diagnostic features. The four IT neurons were made using the four training stimuli shown in Fig. 2a, whose features are represented by four symbols (circle, square, triangle, cross). The test stimuli, represented by small symbols, are categorized by the four IT neurons; the shape of each small symbol indicates the IT neuron that categorizes that test stimulus. The solid lines indicate the category boundaries. b) Temporal variation of the dynamic state of the PFC network during the categorization task. The attractors representing the diagnostic features are denoted by α ∼ δ, and the attractor representing the non-diagnostic features is denoted by ε. A mark on a row corresponding to α ∼ ε indicates that the network activity stays in that attractor. The visual stimulus of face 1 was applied to the retina at 300 ∼ 500 ms.
Similarly, the IT neurons encoding nose length and mouth height separated the test stimuli into other categories on the basis of the similarity of those two features. However, the classification in IT is not task-dependent; it is made based on the similarity of face features.

4.3 Mechanism for Generating Working Memory Attractors in PFC
The PFC combines the information about face features with the locations of the face parts to which attention is directed, and then forms memory attractors for this information. Figure 5b shows the temporal variation of the memory attractors in PFC. The information about face parts with the two diagnostic features is represented by the attractors X (X = α, β, γ, δ), where X represents the information about the eye separation and eye height of the four training stimuli and the location around the eyes. The attractors X are dynamically linked in the PFC. As shown in Fig. 5b, the information about face parts with the diagnostic features is memorized as the working memories α ∼ δ, because the synaptic connections between PFC and the premotor area are strengthened by the reward signal given after the choice of the correct category. On the other hand, the information about face parts with non-diagnostic features is not memorized as a stable attractor, as shown by ε in Fig. 5b, because the information about non-diagnostic features does
not lead to correct categorization behavior. Thus, the PFC can retain only the information required for the categorization task, as working memory.
5 Concluding Remarks
In the present study, we have shown that IT represents the similarity of face images based on the resolution maps of V1 and V4, and that PFC generates a working memory state in which only the information about face features relevant to the categorization task is sustained. The feedback from PFC to IT and PP may play an important role in extracting the diagnostic features critical for the categorization task. The feedback from PFC increases the sensitivity of the IT and PP neurons that encode the task-relevant object feature and location, respectively. This allows the visual system to perform the categorization task rapidly and accurately. It remains to be seen how the feedback from PFC to IT and PP forms the functional connections across the three visual areas.
References

1. Freedman, D.J., Riesenhuber, M., Poggio, T., Miller, E.K.: Categorical representation of visual stimuli in the primate prefrontal cortex. Science 291, 312–316 (2001)
2. Sigala, N., Logothetis, N.K.: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)
3. Hagiwara, I., Miyashita, Y.: Categorizing the world: expert neurons look into key features. Nature Neurosci. 5, 90–91 (2002)
4. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70, 1297–1300 (1980)
5. Rolls, E.T., Deco, G.: Computational Neuroscience of Vision. Oxford University Press, Oxford (2002)
6. Hoshino, O., Inoue, S., Kashimori, Y., Kambara, T.: A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput. 13, 1781–1810 (2001)
7. Hoshino, O., Kashimori, Y., Kambara, T.: An olfactory recognition model of spatio-temporal coding of odor quality in olfactory bulb. Biol. Cybern. 79, 109–120 (1998)
8. Suzuki, N., Hashimoto, N., Kashimori, Y., Zheng, M., Kambara, T.: A neural model of predictive recognition in form pathway of visual cortex. Biosystems 79, 33–42 (2004)
9. Ichinose, Y., Kashimori, Y., Fujita, K., Kambara, T.: A neural model of visual system based on multiple resolution maps for categorizing visual stimuli. In: Proceedings of ICONIP 2005, pp. 515–520 (2005)
10. Kashimori, Y., Suzuki, N., Fujita, K., Zheng, M., Kambara, T.: A functional role of multiple spatial resolution maps in form perception along the ventral visual pathway. Neurocomputing 65-66, 219–228 (2005)
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion Jordan H. Boyle, John Bryden, and Netta Cohen School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom
Abstract. One of the most tractable organisms for the study of nervous systems is the nematode Caenorhabditis elegans, whose locomotion in particular has been the subject of a number of models. In this paper we present a first integrated neuro-mechanical model of forward locomotion. We find that a previous neural model is robust to the addition of a body with mechanical properties, and that the integrated model produces oscillations with a more realistic frequency and waveform than the neural model alone. We conclude that the body and environment are likely to be important components of the worm’s locomotion subsystem.
1 Introduction
The ultimate aim of neuroscience is to unravel and completely understand the links between animal behaviour, its neural control and the underlying molecular and genetic computation at the cellular and sub-cellular levels. This daunting challenge sets a distant goal post in the study of the vast majority of animals, but work on one animal in particular, the nematode Caenorhabditis elegans, is leading the way. This tiny worm has only 302 neurons and yet is capable of generating an impressive wealth of sensory-motor behaviours. With the first fully sequenced animal genome [1], a nearly complete wiring diagram of the nervous circuit [2], and hundreds of well characterised mutant strains, the link between genetics and behaviour never seemed more tractable. To date, a number of models have been constructed of subcircuits within the C. elegans nervous system, including sensory circuits for thermotaxis and chemotaxis [3,4], reflex control such as tap withdrawal [5], reversals (from forward to backward motion and vice versa) [6] and head swing motion [7]. Locomotion, like the overwhelming majority of known motor activity in animals, relies on the rhythmic contraction of muscles, which are controlled or regulated by neural networks. This system consists of a circuit in the head (generally postulated to initiate motion and determine direction) and an additional subcircuit along the ventral cord (responsible for propagating and sustaining undulations, and potentially generating them as well). Models of C. elegans locomotion have tended to focus on forward locomotion, and in particular, on the ability of the worm to generate and propagate undulations down its length [8,9,10,11,12]. These models have tended to study either the mechanics of locomotion [8] or the forward locomotion neural circuit [9,10,11,12]. In this paper we present simulations of an M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 37–47, 2008. c Springer-Verlag Berlin Heidelberg 2008
integrated model of the neural control of forward locomotion [12] with a minimal model of muscle actuation and a mechanical model of a body, embedded in a minimal environment. The main questions we address are (i) whether the disembodied neural model is robust to the addition of a body with mechanical properties; and (ii) how the addition of mechanical properties alters the output from the motor neurons. In particular, models of the isolated neural circuit for locomotion suffer from a common limitation: the inability to reproduce undulations with frequencies that match the observed behaviour of the crawling worm. To address this question, we have limited our integrated model to a short section of the worm, rather than modelling the entire body. We find that the addition of a mechanical framework to the neural control model of Ref. [12] leads to robust oscillations, with significantly smoother waveforms and reduced oscillation frequencies, matching observations of the worm.
2 Background

2.1 C. elegans Locomotion
Forward locomotion is achieved by propagating sinusoidal undulations along the body from head to tail. When moving on a firm substrate (e.g. agarose), the worm lies on its side, with the ventral and dorsal muscles at any longitudinal level contracting in anti-phase. With the exception of the head and neck, the worm is only capable of bending in the dorso-ventral plane. Like all nematode worms, C. elegans lacks any form of rigid skeleton. Its roughly cylindrical body has a diameter of ∼ 80 μm and a length of ∼ 1 mm. It has an elastic cuticle containing (along with its intestine and gonad) pressurised fluid, which maintains the body shape while remaining flexible. This structure is referred to as a hydrostatic skeleton. The body wall muscles responsible for locomotion are anchored to the inside of the cuticle.

2.2 The Neural Model
The neural model used here is based on the work of Bryden and Cohen [11,12,13]. Specifically, we use the model (equations and parameters) presented in [12], which is itself an extension of Refs. [11,13]. The model simplifies the neuronal wiring diagram of the worm [2,14] into a minimal neural circuit for forward locomotion. This reduced model contains a set of repeating units (one “tail” and ten “body” units), where each unit consists of one dorsal motor neuron (of class DB) and one ventral motor neuron (of class VB). A single command interneuron (representing a pair of interneurons of class AVB in the biological worm) provides the “on” signal to the forward locomotion circuit and is electrically coupled (via gap junctions) to all motor neurons of classes DB and VB. In the model, motor neurons also have a sensory function, integrating inputs from stretch receptors, or mechano-sensitive ion channels, that encode each unit's bending angle. Motor neurons receive both local and – with the exception of the tail – proximate sensory input, with proximate input received from the adjacent posterior unit.
Fig. 1. A: Schematic diagram of the physical model illustrating nomenclature (see Appendix B for details). B: The neural model, with only two units (one body, one tail). AVB is electrically coupled to each of the motor neurons via gap junctions (resistor symbols).
The sensory-motor loop for each unit gives rise to local oscillations which phase lock with adjacent units. Equations and parameters for the neural model are set out in Appendix A. This neural-only model uses a minimal physical framework to translate neuronal output to bending. Fig. 1B shows the neural model with only two units (a tail and one body unit), as modelled in this paper. In the following section, we outline a more realistic physical model of the body of the worm.
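The unit-level mechanism described above – stretch-receptor feedback that reverses the direction of bending once a threshold angle is reached – can be caricatured as a relaxation oscillator. The Python sketch below is purely illustrative; the actual model equations and parameters are those of Appendix A (Ref. [12]), and the rate and threshold here are arbitrary:

```python
def relaxation_unit(T=10.0, dt=0.001, rate=2.0, threshold=0.5):
    """Toy single-unit sensory-motor loop: the motor 'state' drives the
    bending angle up or down, and a stretch-receptor threshold flips
    the state, yielding relaxation oscillations. Purely illustrative."""
    theta, state = 0.0, +1          # bending angle; +1 ventral, -1 dorsal
    angles = []
    for _ in range(int(T / dt)):
        theta += dt * rate * state  # muscle drive bends the unit
        if state * theta > threshold:
            state = -state          # stretch receptor reverses the drive
        angles.append(theta)
    return angles
```

The oscillation period of such a loop is set jointly by the bending rate and the switching threshold, which is why coupling the same neural circuit to a body with different mechanical resistance can shift the frequency.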
3 Physical Model
Our physical model is an adaptation of Ref. [8], a 2-D model consisting of two rows of N points (representing the dorsal and ventral sides of the worm). Each point is acted on by the opposing forces of the elastic cuticle and pressure, as well as by muscle force and drag (often loosely referred to as friction or surface friction [8]). We modify this model by introducing simplifications to reduce simulation time, in part by allowing us to use a longer time step. Fig. 1A illustrates the model's structure. The worm is represented by a number of rigid beams, each connected to the adjacent beams by four springs. Two horizontal (h) springs connect points on the same side of adjacent beams and resist both elongation and compression. Two diagonal (d) springs connect the dorsal side of the ith beam to the ventral side of the (i+1)th, and vice versa. These springs strongly resist compression and have an effect analogous to that of pressure, in that they help to maintain a reasonably constant area in each unit. The model was implemented in C++, using a 4th-order Runge-Kutta method for numerical integration, with a time step of 0.1 ms. (The original model [8] required a time step of 0.001 ms with the same integration method.) Equations and parameters
of the physical model are given in Appendix B. The steps taken to interface the physical and neuronal models are described in Appendix C.
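As a minimal illustration of the integration scheme, the sketch below applies the classical 4th-order Runge-Kutta step to a toy 1-D spring-damper point mass. The paper's actual force terms (the h and d springs, muscle force, and drag) are given in Appendix B; the force law and constants here are invented for illustration:

```python
def rk4_step(f, y, t, dt):
    """One classical 4th-order Runge-Kutta step for y' = f(t, y),
    the integration scheme used for the physical model."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def spring_damper(t, y, k=1.0, c=0.5, m=1.0):
    """Toy 1-D spring with viscous drag; the real model's spring and
    drag forces enter its equations of motion in the same structural way."""
    x, v = y
    return [v, (-k * x - c * v) / m]

# damped oscillation of a single point mass, released from x = 1
y, t, dt = [1.0, 0.0], 0.0, 0.01
for _ in range(1000):
    y = rk4_step(spring_damper, y, t, dt)
    t += dt
```

With drag present, the amplitude decays over the run, mirroring how the physical model's drag term dissipates energy injected by the muscles.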
4 Results
Using our integrated model, we first simulated a single unit (the tail), and then implemented two phase-lagged units (adding a body unit). In what follows, we present these results and compare them with those of the neural model alone.

4.1 Single Oscillating Segment
The neural model alone produces robust oscillations in unit bending angle (θi ) with a roughly square waveform, as shown in Fig. 2A. The model unit oscillates at about 3.5 Hz, as compared to frequencies of about 0.5 Hz observed for C. elegans forward locomotion on an agarose substrate. It has not been possible to find parameters within reasonable electrophysiological bounds for the neural model that would slow the oscillations to the desired time scales [12]. Oscillations of the integrated neuro-mechanical model of a single unit are shown in Fig. 2B. All but four parameters of the neuronal model remain unchanged from Ref. [12]. However, parameters used for the actuation step caused a slight asymmetry in the oscillations when integrated with a physical model, and were therefore modified. As can be seen from the traces in the figure, the frequency of oscillation in the integrated model is about 0.5 Hz for typical agarose drag [8], and the waveform has a smooth, almost sinusoidal shape. Faster (and slower) oscillations are possible for lower (higher) values of drag. Fig. 2C shows a plot of oscillation frequencies as a function of drag for the integrated model.
Fig. 2. Oscillations of, A, the original neural model [12] and, B, the integrated model (with drag of 80 × 10−6 kg.s−1 ). Note the different time scales. C: Oscillation frequency as a function of drag. The zero frequency point indicates that the unit can no longer oscillate.
4.2 Two Phase-Lagged Segments
Parameters of the neural model are given in Table A-1 for the tail unit and in Table A-2 for the body unit. Fig. 3 compares bending waveforms recorded from a living worm (Fig. 3A), simulated by the neural model (Fig. 3B) and simulated by the integrated model (Fig. 3C).
Fig. 3. Phase-lagged oscillation of two units. A: Bending angles extracted from a recording of a forward-locomoting worm on an agarose substrate. The traces are of two points along the worm (near the middle, and 1/12 of a body length apart). B: Simulation of two coupled units in the neural model. C: Simulation of the integrated model. Note the faster oscillations in subplot B.
5 Discussion
C. elegans is amenable to manipulations at the genetic, molecular and neuronal levels but with such rich behaviour being produced by a system with so few components, it can often be difficult to determine the pathways of cause and effect. Mathematical and simulation models of the locomotion therefore provide an essential contribution to the understanding of C. elegans neurobiology and motor control. The inclusion of a realistic embodiment is particularly relevant to a model of C. elegans locomotion. Sensory feedback is important to the locomotion of all animals. However, in C. elegans, the postulated existence of stretch receptor inputs along the body (unpublished communication, L. Eberly and R. Russel, reported in [2]) would provide direct information about body posture to the motor neurons themselves. Thus, the neural control is likely to be tightly coupled to the shape the worm takes as it locomotes. Modelling the body physics is therefore particularly important in this organism. Here we have presented the first steps in the implementation of such an integrated model, using biologically plausible parameters for both the neural and mechanical components. One interesting effect is the smoothing of the waveform from a square-like waveform in the isolated neural model to a nearly sinusoidal waveform in the integrated model. The smoothing can be attributed to the body’s resistance to bending (modelled as a set of springs), which increases with the bending angle.
J.H. Boyle, J. Bryden, and N. Cohen
By contrast, in the original neural model, the rate of bending depends only on the neural output. The work presented here would naturally lead to an integrated neuromechanical model of locomotion for an entire worm. The next step toward this goal, extending the neural circuit to the entire ventral cord (and the corresponding motor system) is currently underway. The physical model introduces long range interactions between units via the body and environment. In a real worm, as in the physical model, for bending to occur at some point along the worm, local muscles must contract. However, such contractions also apply physical forces to adjacent units, and so on up and down the worm, giving rise to a significant persistence length. For this reason the extension of the neuro-mechanical model from two to three (or more) units will not be automatic and will require parameter changes to model an operable balance between the effects of the muscle and body properties. In fact, the worm’s physical properties (and, in particular, the existence of long range physical interactions along it) could set new constraints on the neural model, or could even be exploited by the worm to achieve more effective locomotion. Either way, the physics of the worm’s locomotion is likely to offer important insights that could not be gleaned from a model of the isolated neural subcircuit. We have shown that a neural model developed with only the most rudimentary physical framework can continue to function with a more realistic embodiment. Indeed, both the waveform and frequency have been improved beyond what was possible for the isolated neural model. We conclude that the body and environment are likely to be important components of the subsystem that generates locomotion in the worm.
Acknowledgement This work was funded by the EPSRC, grant EP/C011961. NC was funded by the EPSRC, grant EP/C011953. Thanks to Stefano Berri for movies of worms and behavioural data.
References
1. C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282, 2012–2018 (1998)
2. White, J.G., Southgate, E., Thomson, J.N., Brenner, S.: The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London, Series B 314, 1–340 (1986)
3. Ferrée, T.C., Marcotte, B.A., Lockery, S.R.: Neural network models of chemotaxis in the nematode Caenorhabditis elegans. Advances in Neural Information Processing Systems 9, 55–61 (1997)
4. Ferrée, T.C., Lockery, S.R.: Chemotaxis control by linear recurrent networks. Journal of Computational Neuroscience: Trends in Research, 373–377 (1998)
5. Wicks, S.R., Roehrig, C.J., Rankin, C.H.: A Dynamic Network Simulation of the Nematode Tap Withdrawal Circuit: Predictions Concerning Synaptic Function Using Behavioral Criteria. Journal of Neuroscience 16, 4017–4031 (1996)
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion
43
6. Tsalik, E.L., Hobert, O.: Functional mapping of neurons that control locomotory behavior in Caenorhabditis elegans. Journal of Neurobiology 56, 178–197 (2003)
7. Sakata, K., Shingai, R.: Neural network model to generate head swing in locomotion of Caenorhabditis elegans. Network: Computation in Neural Systems 15, 199–216 (2004)
8. Niebur, E., Erdős, P.: Theory of the locomotion of nematodes. Biophysical Journal 60, 1132–1146 (1991)
9. Niebur, E., Erdős, P.: Theory of the Locomotion of Nematodes: Control of the Somatic Motor Neurons by Interneurons. Mathematical Biosciences 118, 51–82 (1993)
10. Niebur, E., Erdős, P.: Modeling Locomotion and Its Neural Control in Nematodes. Comments on Theoretical Biology 3(2), 109–139 (1993)
11. Bryden, J.A., Cohen, N.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. In: Schaal, S., Ijspeert, A.J., Billard, A., Vijayakumar, S., Hallam, J., Meyer, J.A. (eds.) Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior, pp. 183–192. MIT Press/Bradford Books (2004)
12. Bryden, J.A., Cohen, N.: Neural control of C. elegans forward locomotion: The role of sensory feedback (submitted, 2007)
13. Bryden, J.A.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. Master's thesis, University of Leeds (2003)
14. Chen, B.L., Hall, D.H., Chklovskii, D.B.: Wiring optimization can relate neuronal structure and function. Proceedings of the National Academy of Sciences USA 103, 4723–4728 (2006)
Appendix A: Neural Model

Neurons are assumed to have graded potentials [11,12,13]. In particular, the motor neurons (VB and DB) are modelled by leaky integrators with a transmembrane potential V(t) following:

C dV/dt = −G(V − E_rev) − I^shape + I^AVB ,   (A-1)
where C is the cell's membrane capacitance; E_rev is the cell's effective reversal potential; and G is the total effective membrane conductance. The sensory input

I^shape = Σ_{j=1}^{n} G^stretch_j (V − E^stretch_j) σ^stretch_j(θ_j)

is the stretch receptor input from the shape of the body, where E^stretch_j is the reversal potential of the ion channels, θ_j is the bending angle of unit j and σ^stretch_j is a sigmoid response function of the stretch receptors to the local bending. The stretch receptor activation function is given by σ^stretch(θ) = 1/[1 + exp(−(θ − θ_0)/δθ)], where the steepness parameter δθ and the threshold θ_0 are constants. The command input current I^AVB = G^AVB(V_AVB − V) models gap junctional coupling with AVB (with coupling strength G^AVB and denoting the AVB voltage by V_AVB). Note that in the model, AVB is assumed to have a sufficiently high capacitance, so that the gap junctional currents have a negligible effect on its membrane potential.
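As a concrete illustration, the single-neuron dynamics of Eq. (A-1) can be integrated with a forward-Euler step. This is a minimal sketch under our own naming, not the authors' implementation; units are taken as mV, pS, pF and s, the stimulus bending angle is held fixed, and the demo parameter dictionary is loosely based on the tail VB entries of Table A-1.

```python
import math

def sigma_stretch(theta, theta0, dtheta):
    # Sigmoidal stretch-receptor activation: 1 / (1 + exp(-(theta - theta0)/dtheta))
    return 1.0 / (1.0 + math.exp(-(theta - theta0) / dtheta))

def dV_dt(V, theta, p):
    # Right-hand side of Eq. (A-1) for one motor neuron with a single
    # stretch-receptor input: C dV/dt = -G(V - Erev) - I_shape + I_AVB.
    I_shape = p["G_stretch"] * (V - p["E_stretch"]) * \
        sigma_stretch(theta, p["theta0"], p["dtheta"])
    I_AVB = p["G_AVB"] * (p["V_AVB"] - V)
    return (-p["G"] * (V - p["E_rev"]) - I_shape + I_AVB) / p["C"]

def simulate(p, theta=0.0, V0=-60.0, dt=0.001, steps=5000):
    # Forward-Euler integration at a fixed bending angle theta.
    V = V0
    for _ in range(steps):
        V += dt * dV_dt(V, theta, p)
    return V

# Illustrative parameters (mV, pS, pF), loosely based on Table A-1 (VB unit):
P_VB = {"C": 5.0, "G": 19.07, "E_rev": -60.0, "G_AVB": 35.37, "V_AVB": -30.7,
        "G_stretch": 98.55, "E_stretch": 60.0, "theta0": -18.68, "dtheta": 0.1373}
```

With the bending angle held fixed, V simply relaxes to the conductance-weighted average of the reversal potentials; the oscillation in the full model arises because θ, and hence I^shape, changes as the unit bends.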
Segment bending in this model is given as a summation of an output function from each of the two neurons:

dθ/dt = σ^out_VB(V) − σ^out_DB(V) ,   (A-2)
where σ^out(V) = ω_max/[1 + exp(−(V − V_0)/δV)] with constants ω_max, δV and V_0. Note that dorsal and ventral muscles contribute to bending in opposite directions (with θ and −θ denoting ventral and dorsal bending, respectively).

Table A-1. Parameters for a self-oscillating tail unit (as in Ref. [12])

Parameter          Value        Parameter          Value        Parameter          Value
E_rev              −60 mV       V_AVB              −30.7 mV     C                  5 pF
G_VB               19.07 pS     G_DB               17.58 pS     G^AVB_VB           35.37 pS
G^AVB_DB           13.78 pS     G^stretch_VB       98.55 pS     G^stretch_DB       67.55 pS
E^stretch          60 mV        θ_0,VB             −18.68°      θ_0,DB             −19.46°
δθ_VB              0.1373°      δθ_DB              0.4186°      ω_max,VB           6987°/sec
ω_max,DB           9951°/sec    V_0,VB             22.8 mV      V_0,DB             25.0 mV
δV_VB              0.2888 mV    δV_DB              0.0826 mV
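The output sigmoid and Eq. (A-2) translate directly into code. A small sketch with our own naming; parameter tuples are (ω_max, V_0, δV) in °/sec and mV, and note that `math.exp` overflows for strongly hyperpolarized V, which a production implementation would guard against.

```python
import math

def sigma_out(V, omega_max, V0, dV):
    # sigma_out(V) = omega_max / (1 + exp(-(V - V0)/dV)):
    # a saturating output, in degrees/sec, of the membrane potential V (mV).
    return omega_max / (1.0 + math.exp(-(V - V0) / dV))

def bending_rate(V_VB, V_DB, vb_params, db_params):
    # Eq. (A-2): dtheta/dt = sigma_out_VB(V_VB) - sigma_out_DB(V_DB).
    # Positive rates drive ventral bending, negative rates dorsal bending.
    return sigma_out(V_VB, *vb_params) - sigma_out(V_DB, *db_params)
```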
Table A-2. Parameters for body units and tail-body interactions (as in Ref. [12]). All body-unit parameters that are not included here are the same as for the tail unit.

Parameter          Value        Parameter          Value        Parameter          Value
G_VB               26.09 pS     G_DB               25.76 pS     G^stretch_VB       16.77 pS
G^stretch_DB       18.24 pS     E^stretch          60 mV        θ_0,VB             −19.14°
θ_0,DB             −13.26°      δθ_VB              1.589°       δθ_DB              1.413°
Appendix B: Physical Model

The physical model consists of N rigid beams which form the boundaries between the N − 1 units. The ith beam can be described in one of two ways: either by the (x, y) coordinates of the centre of mass (CoM_i in Fig. 1) and angle φ_i, or by the (x, y) coordinates of its two end points (P^D_i and P^V_i in Fig. 1). Each formulation has its own advantages and is used where appropriate.

B.1 Spring Forces

The rigid beams are connected to each of their neighbours by two horizontal (h) springs and two diagonal (d) springs, directed along the vectors

Δ^h_{k,i} = P^k_{i+1} − P^k_i   for k = D, V
Δ^d_{m,i} = P^k_{i+1} − P^l_i   for k = D, V; l = V, D; and m = 1, 2 ,   (B-1)
for i = 1 : N − 1, where P^k_i = (x^k_i, y^k_i) are the coordinates of the ends of the ith beam. The spring forces F_(s) depend on the lengths of these vectors, Δ^j_{k,i} = |Δ^j_{k,i}|, and are collinear with them. The magnitudes of the horizontal and diagonal spring forces are the piecewise linear functions

F^h_(s)(Δ) =
  κ^h_S2 (Δ − L^h_2) + κ^h_S1 (L^h_2 − L^h_0)   if Δ > L^h_2
  κ^h_S1 (Δ − L^h_0)                            if L^h_2 > Δ > L^h_0
  κ^h_C2 (Δ − L^h_1) + κ^h_C1 (L^h_1 − L^h_0)   if Δ < L^h_1
  κ^h_C1 (Δ − L^h_0)                            otherwise ,   (B-2)

F^d_(s)(Δ) =
  κ^d_C2 (Δ − L^d_1) + κ^d_C1 (L^d_1 − L^d_0)   if Δ < L^d_1
  κ^d_C1 (Δ − L^d_0)                            if L^d_1 < Δ < L^d_0
  0                                             otherwise ,   (B-3)

where the spring (κ) and length (L) constants are given in Table B-1.

Table B-1. Parameters of the physical model. Note that the values of θ_0 and θ'_0 differ from Ref. [12] and Table A-1.

Parameter     Value                 Parameter     Value                    Parameter     Value
D             80 μm                 L^h_0         50 μm                    L^h_1         0.5 L^h_0
L^h_2         1.5 L^h_1             L^d_0         √((L^h_0)² + D²)         L^d_1         0.95 L^d_0
κ^h_S1        20 μN·m⁻¹             κ^h_S2        10 κ^h_S1                κ^h_C1        0.5 κ^h_S1
κ^h_C2        10 κ^h_C1             κ^d_C1        50 κ^h_S1                κ^d_C2        10 κ^d_C1
f_muscle      0.005 L^h_0 κ^h_C1    c_∥ = c_⊥     80 × 10⁻⁶ kg·s⁻¹         θ_0,VB        −29.68°
θ_0,DB        −8.46°                θ'_0,VB       −22.14°                  θ'_0,DB       −10.26°
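The piecewise-linear spring forces of Eqs. (B-2) and (B-3) can be written as plain functions. A sketch under our own naming, with generic stiffness and length arguments; it assumes the ordering L1 < L0 < L2 for the horizontal spring and L1 < L0 for the diagonal one.

```python
def spring_force_h(delta, kS1, kS2, kC1, kC2, L0, L1, L2):
    # Horizontal spring, Eq. (B-2): linear around the rest length L0,
    # stiffer (kS2, kC2) beyond the break lengths L2 (stretch) and L1 (compression).
    if delta > L2:
        return kS2 * (delta - L2) + kS1 * (L2 - L0)
    if delta > L0:
        return kS1 * (delta - L0)
    if delta < L1:
        return kC2 * (delta - L1) + kC1 * (L1 - L0)
    return kC1 * (delta - L0)  # L1 <= delta <= L0

def spring_force_d(delta, kC1, kC2, L0, L1):
    # Diagonal spring, Eq. (B-3): resists compression only
    # (zero force when stretched past its rest length L0).
    if delta < L1:
        return kC2 * (delta - L1) + kC1 * (L1 - L0)
    if delta < L0:
        return kC1 * (delta - L0)
    return 0.0
```

Each branch is written so that the force is continuous at the break lengths, which keeps the body model numerically well behaved.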
B.2 Muscle Forces
Muscle forces F_(m) are directed along the horizontal vectors Δ^h_{k,i} with magnitude

F_(m)k,i = f_muscle A_{k,i}   for k = D, V and i = 1 : N − 1 ,   (B-4)

where f_muscle is a constant (see Table B-1) and A_{k,i} are scalar activation functions for the dorsal and ventral muscles, determined by

(A_{D,i}, A_{V,i}) = (θ_i(t), 0)    if θ_i(t) ≥ 0
(A_{D,i}, A_{V,i}) = (0, −θ_i(t))   if θ_i(t) < 0 ,   (B-5)

where θ_i(t) = ∫_0^t (dθ_i/dt′) dt′ is the integral over the output of the neural model.
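Eqs. (B-4) and (B-5) amount to a sign switch on the accumulated bending angle. A minimal sketch (our naming; theta_i stands for the time-integral of the neural output dθ_i/dt):

```python
def muscle_activation(theta_i):
    # Eq. (B-5): the sign of the accumulated bending angle theta_i
    # selects which side's muscle is active.
    if theta_i >= 0:
        return theta_i, 0.0        # (A_D, A_V)
    return 0.0, -theta_i

def muscle_force_magnitudes(theta_i, f_muscle):
    # Eq. (B-4): F_(m)k,i = f_muscle * A_k,i, directed along the
    # horizontal vectors Delta^h_k,i.
    A_D, A_V = muscle_activation(theta_i)
    return f_muscle * A_D, f_muscle * A_V
```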
B.3 Total Point Force
With the exception of points on the outer beams, each point i is subject to forces F_{D,i} and F_{V,i}, given by differences of the spring and muscle forces from the corresponding units (i and i − 1):
F_{D,i} = (F^h_(s)D,i − F^h_(s)D,i−1) + (F^d_(s)1,i − F^d_(s)2,i−1) + (F_(m)D,i − F_(m)D,i−1)
F_{V,i} = (F^h_(s)V,i − F^h_(s)V,i−1) + (F^d_(s)2,i − F^d_(s)1,i−1) + (F_(m)V,i − F_(m)V,i−1) .   (B-6)

Since the first beam has no anterior body parts, and the last beam has no posterior body parts, all terms with i = 0 or i = N are taken as zero.
B.4 Equations of Motion
Motion of the beams is calculated from the total force acting on each of the 2N points. Since the points P^D_i and P^V_i are connected by a rigid beam, it is convenient to convert F_(t)k,i to a force and a torque acting on the beam's centre of mass. Rotation by φ_i converts the coordinate system of F_(t)k,i = (F^x_(t)k,i, F^y_(t)k,i) to a new system (F^⊥_(t)k,i, F^∥_(t)k,i) with axes perpendicular to (⊥) and parallel with (∥) the beam:

F^∥_(t)k,i = F^x_(t)k,i cos(φ_i) + F^y_(t)k,i sin(φ_i)
F^⊥_(t)k,i = F^y_(t)k,i cos(φ_i) − F^x_(t)k,i sin(φ_i) .   (B-7)

The parallel components are summed and applied to CoM_i, resulting in pure translation. The perpendicular components are separated into odd and even parts (giving rise to a torque and a force, respectively) by

F^⊥,even_i = (F^⊥_(t)D,i + F^⊥_(t)V,i) / 2
F^⊥,odd_i = (F^⊥_(t)D,i − F^⊥_(t)V,i) / 2 .   (B-8)
As in Ref. [8] we disregard inertia, but include Stokes' drag. Also following Ref. [8], we allow for different drag constants in the parallel and perpendicular directions, given by c_∥ and c_⊥ respectively. The motion of CoM_i is therefore

V^∥_(CoM),i = (1/c_∥)(F^∥_(t)D,i + F^∥_(t)V,i)
V^⊥_(CoM),i = (1/c_⊥)(2 F^⊥,even_i)
ω_(CoM),i = (1/(r c_⊥))(2 F^⊥,odd_i) ,   (B-9)

where r = 0.5 D is the radius of the worm. Finally, we convert V^∥_(CoM),i and V^⊥_(CoM),i back to (x, y) coordinates with

V^x_(CoM),i = V^∥_(CoM),i cos(φ_i) − V^⊥_(CoM),i sin(φ_i)
V^y_(CoM),i = V^⊥_(CoM),i cos(φ_i) + V^∥_(CoM),i sin(φ_i) .   (B-10)
Appendix C: Integrating the Neural and Physical Model

In the neural model, the output dθ_i(t)/dt specifies the bending angles θ_i(t) for each unit. In the integrated model, θ_i(t) are taken as the input to the muscles. Muscle outputs (or contractions) are given by unit lengths. The bending angle α_i is then estimated from the dorsal and ventral unit lengths by

α_i = 36.2 (|Δ^h_{D,i}| − |Δ^h_{V,i}|) / L^h_0 ,   (C-1)

where L^h_0 is the resting unit length. (For simplicity, we have denoted the bending angles of both the neural and integrated models by θ in the figures.)
Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra

Pei-Kuang Chao1, Hsiao-Lung Chan1,4, Tony Wu2,4, Ming-An Lin1, and Shih-Tseng Lee3,4

1 Department of Electrical Engineering, Chang Gung University
2 Department of Neurology, Chang Gung Memorial Hospital
3 Department of Neurosurgery, Chang Gung Memorial Hospital
4 Center of Medical Augmented Virtual Reality, Chang Gung Memorial Hospital
259 Wen-Hua First Road, Gui-Shan, 333, Taoyuan, Taiwan
[email protected]
Abstract. This paper proposes that bursting characteristics can be effective parameters for classifying and identifying neural activities from the subthalamic nucleus (STN) and substantia nigra (SNr). The string method was applied to quantify the bursting patterns in microelectrode recordings as numerical indexes. The inter-spike-interval (ISI) constraint was used as one of the independent variables to examine the effectiveness and consistency of the method. The results show consistent findings about bursting patterns in STN and SNr data across all ISI constraints. Neurons in STN tend to release a larger number of bursts with fewer spikes per burst; neurons in SNr produce a smaller number of bursts with more spikes per burst. According to our statistical evaluation, 50 and 80 ms are suggested as the optimal ISI constraints for classifying STN and SNr bursting patterns by the string method.

Keywords: Subthalamic nucleus, substantia nigra, inter-spike-interval, burst, microelectrode.
1 Introduction

The subthalamic nucleus (STN) is frequently the target used to study and to treat Parkinson's disease [1, 2]. Placing a microelectrode to record neural activities in deep brain nuclei provides useful information for localization during deep brain stimulation (DBS) neurosurgery. DBS has been approved by the FDA since 1998 [3]. The surgery implants a stimulator in deep brain nuclei, usually the STN, to alleviate Parkinson's symptoms such as tremor and rigidity. To search for the STN during the operation, a microelectrode probe is often used to acquire neural signals from outer areas down to the specific target. With the assistance of imaging techniques, microelectrode signals from different depths are read and recorded. An important step in determining the STN location is then to distinguish signals of the STN from its nearby areas, e.g. the substantia nigra (SNr), which lies slightly ventral and medial to the STN. Therefore, characterizing and quantifying the firing patterns of STN and
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 48–53, 2008.
© Springer-Verlag Berlin Heidelberg 2008
SNr are essential. The firing rate, defined as the number of neural spikes within a period, is the most common variable used for describing neural activities. However, STN and SNr have broad and largely overlapping ranges of firing rate [4] (although SNr has a slightly higher mean firing rate than STN). This makes it difficult to rely on firing rate to target the STN. Bursting patterns may provide a better way to separate signals from different nuclei. Bursting, defined as clusters of high-frequency spikes released by a neuron, is believed to carry important neural information: to establish long-term responses, central synapses usually require groups of action potentials (bursts) [5,6]. Exploring bursting information in neural activities has recently become fundamental in Parkinson's studies [1,2]. It has also been observed that spike trains are more regular in SNr than in STN signals [7]. However, the regularity of firing and the grouping of spikes in STN and SNr potentials have not been investigated thoroughly. This study aims to extract bursting information from STN and SNr recordings by applying a quantitative method for burst identification, the string method. The string method quantifies bursting information based on the inter-spike-interval (ISI) and spike number [8]. Although other methods for quantifying bursts exist [9,10], the string method is the one that identifies which spikes contribute to the detected bursts. In addition, because various ISIs have been used in research [8,9] to define bursts, this study also evaluates the effect of the ISI constraint on discriminating STN and SNr signals.
2 Method

The neuronal data used in this study were acquired during DBS neurosurgery at Chang Gung Memorial Hospital. With the assistance of imaging localization systems [11], trials (10 s each) of microelectrode recordings were collected at a sampling
Fig. 1. MRI images from one patient: a. In the sagittal plane – the elevation angle of the probe (yellow line) from the inter-commissural line (green line) was around 50 to 75°; b. In the frontal plane – the angle of the probe (yellow lines) from the midline (green line) was about 8 to 18° to right or left.
rate of 24,000 Hz. Based on several observations, e.g. magnetic resonance imaging (MRI), computed tomography (CT), motion/perception-related responses, and the probe location according to a stereotactic system, experienced neurologists diagnosed 18 trials as neural signals from STN and the other 23 trials as from SNr. Trials collected outside STN and SNr, or which could not be attributed clearly to either STN or SNr, were excluded. In this paper, the data are from 3 Parkinson's patients (2 females, 1 male, age = 73.3±8.3 y/o) who received DBS treatment. Due to individual differences among the patients, e.g. head size, the depth of the STN from the scalp was found to vary between 15 and 20 cm. During surgery, the elevation angle of the probe from the inter-commissural line was around 50 to 75° (Fig. 1a) and the angle between the probe and the midline was about 8 to 18° toward either right or left (Fig. 1b).

2.1 Spike Detection

Each trial of microelectrode recordings includes 2 types of signals: spikes and background signals. The background signals are interference from nearby neural areas or the environment. Because the background signals can be approximated by a Gaussian distribution, signals more than 3 standard deviations (SD) above or below the mean can be treated as non-background signals, i.e. spikes. Therefore, a threshold at the level of the mean plus 3 SD is applied in this study to detect spikes (Fig. 2).
Fig. 2. A segment of a microelectrode recording – the red horizontal line is the threshold; the green stars indicate the detected spikes; most signals around the baseline are background signals
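The thresholding step can be sketched in a few lines of standalone Python. The naming is ours, and the paper does not state how threshold crossings are collapsed into discrete spike times, so this sketch registers one spike per upward crossing of the threshold.

```python
def detect_spikes(signal, k=3.0):
    # Threshold at mean + k*SD of the whole trace (the background is
    # assumed Gaussian, so ~3 SD separates spikes from background).
    n = len(signal)
    mean = sum(signal) / n
    sd = (sum((x - mean) ** 2 for x in signal) / n) ** 0.5
    thresh = mean + k * sd
    # One spike per upward crossing of the threshold.
    spikes = [i for i in range(1, n) if signal[i] >= thresh > signal[i - 1]]
    return spikes, thresh
```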
2.2 The String Method

Every detected spike was plotted as a circle in a plot of spike sequential number versus spike occurrence time (Fig. 3). The spike sequential number starts at 1 for the first spike in a trial. Spikes that are close to each other are labeled as strings [8] and defined as bursts. Two parameters were controlled and manipulated to determine bursts: (1) the minimum number of spikes to form a burst was 5; (2) the maximum ISI of adjacent spikes in a burst was set to 20 ms, 50 ms, 80 ms, and 110 ms separately, to find an optimal condition for distinguishing STN and SNr bursting patterns.
Fig. 3. A segment of a strings plot – each blue circle is a spike; the red triangles mark the starting spikes of bursts; the black triangles mark the ending spikes of bursts
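The burst criterion described above reduces to grouping spikes by their inter-spike intervals. A minimal, dependency-free sketch (our function names, not the authors' code), together with the three dependent variables of Sec. 2.3:

```python
def find_bursts(spike_times, max_isi, min_spikes=5):
    # String method: consecutive spikes whose inter-spike intervals are all
    # <= max_isi form a string; strings with >= min_spikes spikes are bursts.
    bursts, start = [], 0
    for i in range(1, len(spike_times) + 1):
        if i == len(spike_times) or spike_times[i] - spike_times[i - 1] > max_isi:
            if i - start >= min_spikes:
                bursts.append((start, i - 1))  # first and last spike index
            start = i
    return bursts

def burst_stats(spike_times, duration, max_isi, min_spikes=5):
    # FR: firing rate; NB: number of bursts; SB: mean spikes per burst.
    bursts = find_bursts(spike_times, max_isi, min_spikes)
    FR = len(spike_times) / duration
    NB = len(bursts)
    SB = sum(e - s + 1 for s, e in bursts) / NB if NB else 0.0
    return FR, NB, SB
```

Running `burst_stats` over one trial at each ISI constraint (0.02, 0.05, 0.08, 0.11 s) reproduces the comparison carried out in the paper.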
2.3 Dependent Variables

Three dependent variables were computed: (1) Firing rate (FR) was calculated as the total spike number divided by the trial duration (10 s). (2) Number of bursts (NB) was determined by the string method as the total burst count in a trial. (3) The average number of spikes per burst (SB) was also computed for every trial.

2.4 Statistical Analysis

An independent-samples t-test was applied to test the firing rate difference between STN and SNr signals. MANOVA was performed to evaluate NB and SB separately among the different ISI constraints (α = .05).
3 Results

The signals from STN and SNr showed similar firing rates but different bursting patterns. There was no significant difference between STN and SNr in firing rate (STN: 57.0±22.1; SNr: 68.8±23.5) (p>.05). The results for NB and SB are listed in Table 1 and Table 2. In NB, SNr has significantly fewer bursts than STN when the ISI constraint is 50 or 80 ms (p<.05). In SB, SNr has significantly more spikes per burst than STN when the ISI constraint is 20, 50, or 80 ms (p<.05). Based on these findings, 2 points can be made. First, STN and SNr have different bursting patterns, although their total numbers of spikes (firing rates) are similar: compared to STN, SNr releases fewer bursts, but each burst contains more spikes. Second, setting the ISI constraint around 50~80 ms in the string method is effective for distinguishing STN and SNr signals.

Table 1. NB results in SNr and STN

ISI constraint   SNr          STN          p
20 ms            32.1±15.7    38.2±23.1    >.05
50 ms            8.9±6.8      23.9±7.7     <.05*
80 ms            5.2±5.8      10.7±7.5     <.05*
110 ms           3.1±3.4      4.9±4.4      >.05

Table 2. SB results in SNr and STN

ISI constraint   SNr            STN            p
20 ms            21.6±27.8      8.5±2.2        <.05*
50 ms            202.5±309.9    28.1±24.1      <.05*
80 ms            396.0±373.9    124.2±141.3    <.05*
110 ms           515.1±367.6    337.7±342.8    >.05
4 Discussion and Conclusion

Microelectrode recordings from STN and SNr show valuable information in their bursting patterns, which may be useful for assisting neurosurgery in the future. From the results, STN and SNr show very different bursting patterns. Neurons in STN tend to release more "small" bursts, which contain fewer spikes; neurons in SNr tend to produce "giant" bursts, which contain a large number of spikes. These bursting characteristics are quantified as NB and SB, which may assist in making decisions about localizing stimulation probes during DBS operations. Also, the simplicity of the string method can offer quick information and be efficient for real-time analysis.
Different bursting characteristics of STN and SNr are revealed across all ISI constraints (20, 50, 80, 110 ms) in both variables, although statistical significance only appears at 50 and 80 ms. Because there is no "gold standard" for the ISI between adjacent spikes in a burst, several ISI constraint settings were tested in this study. In every setting, STN has a larger NB and a smaller SB than SNr, but statistical significance appears in both dependent variables only in the 50 and 80 ms settings. Therefore, we suggest that the optimal ISI setting for identifying bursts in STN and SNr signals should be around 50 to 80 ms. In further studies, signals from other deep brain nuclei may be analyzed to enrich the applications of bursting information. Also, since the bioelectrical signals from deep brain nuclei are non-stationary, non-linear methods, e.g. complexity measures, could be applied and compared with the current results to provide more nucleus-identifying clues in the future.

Acknowledgments. The authors would like to express sincere appreciation for the grant support from the Ministry of Economic Affairs in Taiwan, under contract 95-EC-17-A-19-S1-035.
References
1. Baufreton, J., Zhu, Z.-T., Garret, M., Bioulac, B., Johnson, S.W., Taupignon, A.I.: Dopamine Receptors Set the Pattern of Activity Generated in Subthalamic Neurons. FASEB J. 19, 1771–1777 (2005)
2. Magarinos-Ascone, C.M., Figueiras-Mendez, R., Riva-Meana, C., Cordoba-Fernadez, A.: Subthalamic Neuron Activity Related to Tremor and Movement in Parkinson's Disease. Eur. J. Neurosci. 12, 2597–2607 (2000)
3. Deushl, G., Volkmann, J., Krack, P.: Deep Brain Stimulation for Movement Disorders. Mov. Disord. 17, S1 (2002)
4. Sterio, D., Zonenshayn, M., Mogilner, A.Y., Rezai, A.R., Kiprovski, K., Kelly, P.J., Beric, A.: Neurophysiological Refinement of Subthalamic Nucleus Targeting. Neurosurgery 50, 58–69 (2002)
5. Zucker, R.S.: Frequency Dependent Changes in Excitatory Synaptic Efficacy. In: Dichter, M.A. (ed.) Mechanisms of Epileptogenesis, pp. 153–157. Plenum Press, New York (1988)
6. Lisman, L.E.: Bursts as a Unit of Neural Information: Making Unreliable Synapses Reliable. TINS 20, 38–43 (1997)
7. Benazzouz, A., Breit, S., Koudsie, A., Pollak, P., Krack, P., Benabid, A.L.: Intraoperative Microrecordings of the Subthalamic Nucleus in Parkinson's Disease. Mov. Disord. 17, S145–S149 (2002)
8. Turnbull, L., Dian, E., Gross, G.: The String Method of Burst Identification in Neuronal Spike Trains. J. Neurosci. Methods 145, 23–35 (2005)
9. Mulloney, B.: A Method to Measure the Strength of Multi-unit Bursts of Action Potentials. J. Neurosci. Methods 146, 98–105 (2005)
10. Favre, J., Taha, J.M., Baumann, T., Burchiel, K.J.: Computer Analysis of the Tonic, Phasic, and Kinesthetic Activity of Pallidal Discharges in Parkinson Patients. Surg. Neurol. 51, 665–673 (1999)
11. Lee, J.-D., Huang, C.-H., Lee, S.-T.: Improving Stereotactic Surgery Using 3-D Reconstruction. IEEE Eng. Med. Biol. Mag. 21, 109–116 (2002)
Population Coding of Song Element Sequence in the Songbird Brain Nucleus HVC

Jun Nishikawa1, Masato Okada1,2, and Kazuo Okanoya1

1 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
2 Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
Abstract. Birdsong is a complex vocalization composed of various song elements organized according to sequential rules. To reveal the neural representation of song element sequence, we recorded the neural responses to all possible element-pair stimuli in the Bengalese finch brain nucleus HVC. Our results show that each neuron has broad but differential response properties to element sequences. We calculated the time course of population activity vectors and the mutual information between auditory stimuli and neural activities. The clusters of population vectors responding to second elements had a large overlap, whereas the clusters responding to first elements were clearly divided. At the same time, the confounded information also significantly increased. These results indicate that the song element sequence is encoded in a neural ensemble in HVC via population coding.
1 Introduction

Songbirds have a complex learned vocalization composed of various song elements with a typical sequential rule. In Bengalese finches, these rules follow individually distinctive finite-state syntax [1]. Songbirds have been intensively studied as a model for the syntactical properties of human language [2]. It is important to reveal the neural representation of complex song element sequences in the songbird brain. Based on the finding of sequentially selective neurons [3,4], it has been thought that the song element sequence is encoded in a chain of rigidly selective neurons [5]. Alternatively, it could be encoded in a neural ensemble of relatively broadly selective neurons in a distributed manner [6,7]. We attempted to determine which neural representation actually occurs in the songbird brain. Songbirds have a specialized brain area for generating and learning complex vocalizations, called the song system. Given the importance of auditory feedback in song learning, numerous studies have investigated auditory neural representation in the song system [3,4]. In particular, HVC is one of the major sensory-motor integration sites in the song system; its neurons selectively respond to the bird's own song (BOS) in a time-locked manner. Margoliash and Fortune found neurons that responded selectively only to typical element-pair stimuli included in the bird's own song [3,4]. This type of neuron was named the temporal combination selective neuron (TCS neuron). Since the discovery of
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 54–63, 2008.
© Springer-Verlag Berlin Heidelberg 2008
TCS neurons, it has been thought that the song element sequence is encoded in a series of different types of TCS neurons. However, TCS neurons were found in only a small portion of the recorded data; many song-selective neurons lack TCS properties. In addition, the stimuli used in these experiments were only partial presentations of the entire sequence, such as EE, EF, FE, or FF within ABCDEFGHI. This design did not test for responses to other element pairs, such as AB, GC, or any other combination. To more fully understand the neural representation of song element sequences, we must evaluate activity in response to all possible song element pairs within ABCDEFGHI. In this study, we recorded single-unit activities of HVC neurons driven by all possible song element pair stimuli in anesthetized Bengalese finches. We then used sequential response distribution analysis, population dynamics analysis, and information-theoretic analysis to show that the song element sequence is encoded within a neural ensemble of HVC neurons by population coding. These findings led us to postulate an alternative scheme for encoding the song element sequence in the songbird HVC, with a distributed neural representation rather than the chain model of rigidly selective TCS neurons.
2 Material and Methods

2.1 Animals

Twenty-three adult Bengalese finches (>180 days post-hatch) were used in this study. All experimental procedures were performed according to established animal care protocols approved by the animal care and use committee at RIKEN.

2.2 Stimuli
Undirected songs were recorded in a quiet soundproof box using a microphone and amplifier connected to a computer with a sampling rate of 44.1 kHz and 16-bit resolution. We calculated sonograms from the recorded song using sound analysis software (SASLab Pro; Avisoft, Berlin, Germany). A birdsong consists of a series of discrete song elements with silent intervals among them. Song elements were divided into distinct types by visual inspection of the spectrotemporal structure of each sonogram. The transition matrix, representing the transition probability between each song element, was then calculated. We can evaluate the syntactical structure of song in individual Bengalese finches using this transition matrix. We prepared five different types of sound stimuli: BOS, REV, OREV, element, and element pair. BOS is the forward playback of the bird’s own song, while REV is the reversed playback of the same song. OREV is a modified version of the song, in which the spectro-temporal composition of each song element is retained, but the order of the song elements has been reversed. Element stimuli are isolated playbacks of each song element. For example, if the song has nine elements, element stimuli are A, B, C, and so on. Element pair stimuli are combinations of all possible element pairs. In this case, we prepared 81 stimuli, including AA, AB, · · ·, IH, and II.
2.3 Recording Procedure
Before electrophysiological recording sessions, birds were anesthetized with 4 to 7 doses of 10% urethane (40 μl per dose) at 20-min intervals. The birds were restrained in a custom-made chamber on a stereotaxic device (David Kopf Instruments, Tujunga, CA, USA). The birds were fixed with ear-bars and a beak-holder that positioned the beak tip at an angle of 45 degrees below the horizontal plane. The head was treated with Xylocaine gel and the feathers and skin were removed. A custom-made three-point fixation device (Narishige, Tokyo, Japan) was attached to the rostral part of the skull surface with dental cement. Small holes were made in the skull just above the HVC. Finally, the dura was removed, and tungsten electrodes were set on the surface. The ear-bars were removed before making physiological recordings. The birds were located in an electromagnetically shielded sound-attenuation box while in the stereotaxic device. The electrodes were lowered into the brain using a hydraulic micropositioner (MODEL640, David Kopf Instruments), and extracellular signals from HVC were recorded. The signals from the electrodes were amplified (gain 10,000) and filtered (100 Hz–10 kHz bandpass) using an extracellular recording amplifier (ER-91, Cygnus Technology, Water Gap, PA, USA). The data were digitized at 20 kHz with 16-bit resolution using a data acquisition system (Micro1401, Cambridge Electronic Design, Cambridge, UK) and the associated software (Spike2, Cambridge Electronic Design). The data were stored on a computer disk for off-line analysis. During the neural recording session, sound stimuli were presented at a peak sound pressure of 70 dB. At first, we presented BOS, REV, OREV, and the silent stimuli. Next, each of the elements and silent stimuli were delivered. Finally, we presented element pairs and silent stimuli. Each sound stimulus was presented 20 times in a random order with an interstimulus interval of 3 to 5 s.
The computer for neural recording and that for stimulus presentation were synchronized by a trigger signal generated simultaneously with the stimuli.
2.4 Data Analysis
Analyses were performed using custom-made programs written in MATLAB (Mathworks, Natick, MA, USA). The mean spontaneous firing rate was calculated from the baseline activity registered during the silent stimulus. The response strength RS was calculated by subtracting the spontaneous rate from the firing rate R registered during stimulus presentation. R and RS were measured in each 10 ms bin, from which we calculated the averages of R and RS and the variances σ²_R and σ²_RS across the stimulus presentation period. To determine the selectivity of each neuron, we calculated the psychophysical measure d′, as previously described [8]:

d'(x_A / x_B) = \frac{2(\bar{x}_A - \bar{x}_B)}{\sigma_{x_A}^2 + \sigma_{x_B}^2}    (1)
In this equation, x_A is the response to stimulus A, and x_B is the response to stimulus B. d′(x_A/x_B) represents the response selectivity to stimulus A relative to stimulus B based on the means and variances of the responses. We considered a neuron to be selective for a stimulus when the selectivity satisfies d′ > 1.0 [9].
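As a minimal sketch of Eq. (1) (the authors' analyses used custom MATLAB programs; this Python version and the RS values are purely illustrative):

```python
import numpy as np

def d_prime(x_a, x_b):
    """Selectivity index d' between responses to stimuli A and B.

    x_a, x_b: arrays of per-bin response strengths (RS) for each stimulus.
    Implements Eq. (1): d' = 2(mean_A - mean_B) / (var_A + var_B).
    """
    x_a, x_b = np.asarray(x_a, float), np.asarray(x_b, float)
    return 2.0 * (x_a.mean() - x_b.mean()) / (x_a.var() + x_b.var())

# A neuron counts as BOS-selective when d'(RS_BOS / RS_REV) > 1.0.
rs_bos = [12.0, 15.0, 11.0, 14.0]   # hypothetical RS values (spikes/s)
rs_rev = [4.0, 6.0, 5.0, 5.0]
selective = d_prime(rs_bos, rs_rev) > 1.0   # True here (d' ≈ 5.3)
```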
Population Coding of Song Element Sequence
2.5 Population Dynamics Analysis
To analyze neural activity at the population level, we performed a population dynamics analysis [10]. With our experimental design, we were not able to combine the data from different individuals because their songs, and the elements thereof, were completely different from each other. Therefore, we present the data for one typical bird in which we could register activity from six distinct single units throughout the presentation of all stimuli. Note that the qualitative properties of the results were similar in the other birds. For each stimulus, we calculated a population activity vector, i.e., the set of instantaneous mean firing rates for each neuron in a 50 ms time window. In this case, each population activity vector had six dimensions. Since this typical bird had four song elements, we calculated 16 population activity vectors for the 16 element pair stimuli within the 50 ms time window. The time window was shifted in increments of 1 ms from -200 to 600 ms (stimulus onset = 0 ms). The data were smoothed using a Gaussian filter with a variance of 10 ms. These procedures enabled us to observe the temporal aspects of the neuronal population. Multidimensional scaling (MDS) [11] is a dimension-reduction method that rearranges data from a high-dimensional space into a lower-dimensional space while preserving as much of the information as possible. MDS was applied to the set of population activity vectors for each time window. Finally, the population response to each stimulus was represented in two-dimensional MDS space, and the clustering of these responses was analyzed.
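The MDS projection can be sketched with classical (Torgerson) scaling, which reduces to an eigendecomposition of the double-centered squared-distance matrix [11]. The population vectors below are hypothetical, and the sliding-window and Gaussian-smoothing steps are omitted:

```python
import numpy as np

def classical_mds(vectors, n_dims=2):
    """Project population activity vectors to n_dims via classical MDS.

    vectors: (n_stimuli, n_neurons) array of mean firing rates.
    Uses Euclidean distances and Torgerson double-centering.
    """
    v = np.asarray(vectors, float)
    # pairwise squared Euclidean distances between population vectors
    d2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ d2 @ j                      # double-centered Gram matrix
    w, u = np.linalg.eigh(b)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_dims]         # keep the largest n_dims
    return u[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# 16 element-pair stimuli x 6 single units (hypothetical rates, spikes/s)
rng = np.random.default_rng(0)
pop_vectors = rng.poisson(10.0, size=(16, 6)).astype(float)
coords = classical_mds(pop_vectors)            # (16, 2) MDS coordinates
```

For exactly Euclidean data, classical MDS recovers the original configuration up to rotation and translation, which is why the clusters in Fig. 3A can be read as distances between population responses.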
2.6 Information-Theoretic Analysis
To evaluate how much information is transmitted by each neuron, we calculated the mutual information between the stimulus and the neural response [12]. Mutual information was quantified as the decrease in entropy of the stimulus occurrence:

I(S;R) = H(S) - H(S|R) = -\sum_{s} p(s)\log p(s) - \left\langle -\sum_{s} p(s|r)\log p(s|r) \right\rangle_{r}    (2)
In this equation, S is the set of stimuli s, and R is the set of neural responses r, i.e., spike counts. p(s|r) is the conditional probability of stimulus s given an observed spike count r, and p(s) is the a priori probability of stimulus s. The brackets indicate an average over the response distribution p(r). To examine the time course of the information, the response was evaluated using a 50 ms sliding window. The center of the window was moved in 10 ms steps, beginning 200 ms before the stimulus onset and lasting until 600 ms after the stimulus. To test statistical significance, we estimated the mean and standard deviation of the information during the 200 ms period before the stimulus onset. If the value exceeded the mean + 3SD, we considered the information significant (P < 0.001).
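A minimal plug-in estimator of Eq. (2) might look as follows (illustrative Python, not the authors' code; stimulus labels and spike counts are hypothetical):

```python
import numpy as np
from collections import Counter

def mutual_information(stimuli, counts):
    """Plug-in estimate of I(S;R) in bits from paired (stimulus, spike count).

    Implements Eq. (2): I = H(S) - <H(S|R)>_r, where the average over r
    is weighted by the response distribution p(r).
    """
    n = len(stimuli)
    p_s = Counter(stimuli)
    h_s = -sum((c / n) * np.log2(c / n) for c in p_s.values())
    h_s_given_r = 0.0
    for r, cr in Counter(counts).items():
        p_r = cr / n
        joint = Counter(s for s, rr in zip(stimuli, counts) if rr == r)
        h_cond = -sum((c / cr) * np.log2(c / cr) for c in joint.values())
        h_s_given_r += p_r * h_cond
    return h_s - h_s_given_r

# Two equiprobable stimuli, perfectly discriminating spike counts -> 1 bit
s = ['A', 'A', 'B', 'B']
r = [0, 0, 5, 5]
# mutual_information(s, r) -> 1.0
```

In practice this estimator would be applied to the spike count in each 50 ms window, stepping the window by 10 ms as described above.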
3 Results

3.1 Selective Auditory Response to BOS
We used a spike-sorting procedure to classify the signals recorded from the Bengalese finch HVC, which yielded well-identified single units (n = 104, 23 birds). We analyzed these data using the psychophysical measure d′. In total, 86% of HVC neurons responded selectively to BOS compared to the silent condition (d′(RS_BOS/RS_Baseline) > 1.0, 89/104 cells). In addition, 63% of neurons were more responsive to BOS than to REV (d′(RS_BOS/RS_REV) > 1.0, 65/104 cells), and 28% of neurons were more responsive to BOS than to OREV (d′(RS_BOS/RS_OREV) > 1.0, 29/104 cells). These results are consistent with past studies [4]. The mean of d′(RS_BOS/RS_REV) was 1.25, and the mean of d′(RS_BOS/RS_OREV) was 0.63. These results indicate that BOS-selective neurons in HVC are highly variable, especially in terms of their sequential response properties.
Fig. 1. An example of the auditory response to song elements (A) and element pairs (B) in a single unit from HVC
Fig. 2. Song transition matrices of self-generated songs (first row) and sequential response distribution matrices of each single unit (second to seventh rows)
3.2 Responses to Song Element Pair Stimuli
To investigate the neural selectivity to song element sequences, we recorded neural responses to all possible element pair stimuli. Because the playback of these stimuli is extremely time-consuming, we could keep only 34% of the recorded single units stable throughout the entire presentation (35/104 cells, 12/23 birds). In total, 70% of the stable single units were BOS-selective (d′(RS_BOS/RS_REV) > 1.0, 27/35 cells, 12/12 birds). Thereafter, we focused on these data. A typical example of neural responses to each song element is shown in Fig. 1(A). The neuron responded to a single element A or C with single phasic activity, but it did not respond to element B. It responded to element D with double phasic activity. These results indicate that the neuron has varied response properties even during single element presentation. In addition, the neuron exhibited more complex response properties during the presentation of element pairs (Fig. 1(B)). The neuron responded more strongly to most of the element pairs when the second element was A or C, compared to single presentation of each element. However, the response was weaker when the first and second elements were the same. When the second element was B, no differences were observed between single and paired stimuli. When the second element was D, we measured single
phasic responses, and a strong response to BD. These response properties were not correlated with the element-to-element transition probabilities in the song structure. The dotted boxes indicate the sequences included in BOS. However, the neuron responded only weakly to some sequences that were included in BOS (black arrows). In contrast, the neuron responded strongly to other sequences that were not included in BOS (white arrows). Thus, the neuron had broad response properties to song element pairs beyond the structure of the self-generated song. To quantitatively evaluate sequential response properties, we calculated the response strength measure d′(RS_S/RS_Baseline) for the element pair stimuli S. The sequential response distributions were created for each neuron in two individuals with more than five well-identified single units. Song transition matrices and sequential response distributions are shown in Fig. 2. The response distributions were not correlated with the associated song transition matrices. However, each HVC neuron in the same individual had broad but different response distribution properties. This tendency was consistent among individuals. This result indicates that the song element sequence is encoded at the population level, within broadly but differentially selective HVC neurons.
3.3 Population Dynamics Analysis
To analyze the information coding of song element sequences at the population level, we calculated the time course of population activity vectors, i.e., the set of instantaneous mean firing rates for each neuron in a 50 ms time window. Snapshots of population responses to stimuli are shown in the eight panels of Fig. 3A (n = 6, bird 2 of Fig. 2). Each point in a panel represents the population vector for one stimulus in the MDS space. The ellipses in the upper four panels indicate the group of vectors whose stimuli have the same first element, while the ellipses in the lower four panels indicate the group of vectors whose stimuli have the same second element. Note that the population activity vectors in the upper four panels are identical to those in the bottom four panels; only the ellipses differ. Before the stimulus presentation ([-155 ms: -105 ms], upper and lower panels), only spontaneous activities were observed around the origin. After the first element presentation ([50 ms: 100 ms], upper panel), groups with the same first elements split. After the second element presentation ([131 ms: 181 ms], lower panel), groups with the same second elements were still largely overlapping. In the next section, we show that confounded information, which represents the relation between first and second elements, increased significantly at this time. After sufficient time ([480 ms: 530 ms], upper and lower panels), the neurons returned to spontaneous activity. These results indicate that the population responses to the first and second elements differed drastically. Subsequently, we show that this overlap derives from the information in the song element sequence.
3.4 Information-Theoretic Analysis
To determine the origin of the overlap in the population response, we calculated the time course of mutual information between the stimulus and neural activity.
Fig. 3. Responses of HVC neurons at the population level (A) and encoded information (B)
The mutual information for the first elements I(S1; R), for the second elements I(S2; R), and for the element pairs I(S1, S2; R) was calculated within each time window; the window was shifted to analyze the temporal dynamics of information coding (three upper-left graphs in Fig. 3B). Narrow lines in each graph indicate the cumulative trace of mutual information for each neuron. The thick line is the cumulative trace over all neurons in the individual. The bottom-left graph in Fig. 3B shows the probability of stimulus presentation. After the presentation of the first elements, mutual information for the first elements increased, showing a statistically significant peak (P < 0.001). After the presentation of the second elements, mutual information for the second elements significantly increased (P < 0.001). At the
same time, mutual information for element pairs also showed a significant peak (P < 0.001). Intuitively, the information for element pairs I(S1, S2; R) would consist of the information for the first elements I(S1; R) and the second elements I(S2; R). However, calculating I(S1, S2; R) − I(S1; R) − I(S2; R) in each consecutive time window yields a statistically significant peak after the presentation of element pairs (P < 0.001; fourth graph from the left in Fig. 3B). The difference C represents the conditional mutual information between the first and second elements for a given neural response, otherwise known as confounded information [13]. Therefore, confounded information represents the relationship between the first and second elements encoded in the neural responses. The I(S1; R) peak occurred at the same time that groups of population vectors with the same first elements were splitting ([50 ms: 100 ms]). The peaks for I(S2; R), I(S1, S2; R), and C occurred while groups with the same second elements were still largely overlapping ([131 ms: 181 ms]). This indicates that the sequential information causes the overlap in the population response. In the population dynamics analysis, we cannot combine the data from different birds because each bird has a different number and types of song elements. In the mutual information analysis, however, we can combine and average the data from different birds. The five graphs on the right in Fig. 3B show the time courses of I(S1; R), I(S2; R), I(S1, S2; R), C, and the stimulus presentation probability, calculated from all stable single units with BOS selectivity (n = 27, 12 birds). The combined mutual information for the first elements was very similar to that from one bird, showing a significant peak after the presentation of the first elements (P < 0.001).
Mutual information for second elements, element pairs, and confounded information also had significant peaks after the presentation of the second elements (P < 0.001). These results show that the song element sequence is encoded into a neural ensemble in HVC by population coding.
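The decomposition C = I(S1, S2; R) − I(S1; R) − I(S2; R) can be illustrated with a small plug-in estimator. In the toy example below (labels and counts are hypothetical, not recorded data), the response depends only on the conjunction of the two elements, so each marginal mutual information is zero and all of the information is confounded:

```python
import numpy as np
from collections import Counter

def mi(labels, responses):
    """Plug-in mutual information I(label; response) in bits,
    via I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    n = len(labels)
    def h(items):
        cnt = Counter(items)
        return -sum((c / n) * np.log2(c / n) for c in cnt.values())
    return h(labels) + h(responses) - h(list(zip(labels, responses)))

def confounded_information(first, second, responses):
    """C = I(S1,S2;R) - I(S1;R) - I(S2;R), cf. Reich et al. [13]."""
    pairs = list(zip(first, second))
    return mi(pairs, responses) - mi(first, responses) - mi(second, responses)

first = ['A', 'A', 'B', 'B']
second = ['A', 'B', 'A', 'B']
resp = [0, 1, 1, 0]   # driven by the element pair, not by either element alone
# confounded_information(first, second, resp) -> 1.0 bit
```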
4 Conclusion
In this study, we recorded auditory responses to all possible element pair stimuli from the Bengalese finch HVC. By determining the sequential response distributions for each neuron, we showed that each neuron in HVC has broad but differential response properties to song element sequences. The population dynamics analysis revealed that population activity vectors overlap after the presentation of element pairs. Using mutual information analysis, we demonstrated that this overlap in the population response is due to confounded information, namely, the sequential information of song elements. These results indicate that the song element sequence is encoded in the HVC microcircuit at the population level. Song element sequences are encoded in a neural ensemble of broadly and differentially selective neuronal populations, rather than by a chain-like model of differentially selective TCS neurons.
Acknowledgment This study was partially supported by the RIKEN Brain Science Institute, and by a Grant-in-Aid for young scientists (B) No. 18700303 from the Japanese Ministry of Education, Culture, Sports, Science, and Technology.
References

1. Okanoya, K.: The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax. Ann. N.Y. Acad. Sci. 1016, 724–735 (2004)
2. Doupe, A.J., Kuhl, P.K.: Birdsong and human speech: common themes and mechanisms. Annu. Rev. Neurosci. 22, 567–631 (1999)
3. Margoliash, D., Fortune, E.S.: Temporal and harmonic combination-selective neurons in the zebra finch's HVc. J. Neurosci. 12, 4309–4326 (1992)
4. Lewicki, M.S., Arthur, B.J.: Hierarchical organization of auditory temporal context sensitivity. J. Neurosci. 16, 6987–6998 (1996)
5. Drew, P.J., Abbott, L.F.: Model of song selectivity and sequence generation in area HVc of the songbird. J. Neurophysiol. 89, 2697–2706 (2003)
6. Deneve, S., Latham, P.E., Pouget, A.: Reading population codes: a neural implementation of ideal observers. Nat. Neurosci. 2, 740–745 (1999)
7. Pouget, A., Dayan, P., Zemel, R.: Information processing with population codes. Nat. Rev. Neurosci. 1, 125–132 (2000)
8. Green, D., Swets, J.: Signal Detection Theory and Psychophysics. Wiley, New York (1966)
9. Theunissen, F.E., Doupe, A.J.: Temporal and spectral sensitivity of complex auditory neurons in the nucleus HVc of male zebra finches. J. Neurosci. 18, 3786–3802 (1998)
10. Matsumoto, N., Okada, M., Sugase-Miyamoto, Y., Yamane, S., Kawano, K.: Population dynamics of face-responsive neurons in the inferior temporal cortex. Cereb. Cortex 15, 1103–1112 (2005)
11. Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328 (1966)
12. Sugase, Y., Yamane, S., Ueno, S., Kawano, K.: Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873 (1999)
13. Reich, D.S., Mechler, F., Victor, J.D.: Formal and attribute-specific information in primary visual cortex. J. Neurophysiol. 85, 305–318 (2001)
Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration Tamami Motomura, Yuki Hayashida, and Nobuki Murayama Graduate school of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. We recently developed a new method to dissociate neurons from mammalian retinae by utilizing low-Ca2+ tissue incubation and the vibrodissociation technique, without the use of enzymes. The retinal ganglion cell somata dissociated by this method showed spontaneous voltage transients (sVT) with a fast rise and slower decay. In this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The sVT varied in amplitude in a quantal manner and reversed in polarity around −80 mV in normal physiological saline. The reversal potential of the sVT shifted with the K+ equilibrium potential, indicating the involvement of a K+ conductance. Based on the model, the conductance changes responsible for producing the sVT depended little on the membrane potential below −50 mV. These results suggest the presence of isolated, inhibitory presynaptic terminals attached to the ganglion cell somata. Keywords: Neuronal computation, dissociated cells, retina, patch-clamp, neuron model.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 64–72, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Elucidating the functional role of single neurons in neural information processing is intricate because neuronal computation itself is highly nonlinear and adaptive, and depends on combinations of many parameters, e.g., the ionic conductances, intracellular signaling, their subcellular distributions, and the cell morphology. Furthermore, interactions with surrounding neurons/glia can alter these factors and thereby hinder us from examining some of them separately. This can be overcome by pharmacologically or physically isolating neurons from their circuits. One could use pharmacological agents that block synaptic signal transmission in situ, although it is hard to know whether such agents have unintended side effects. Alternatively, one can dissociate neural tissue into single neurons by means of enzymatic digestion and mechanical trituration. The dissociated single neurons often lose their fine neurites and synaptic contacts with other cells during the dissociation procedure, and are thus useful for examining the properties of ionic conductances at known membrane potentials [3]. Unfortunately, however,
several studies have demonstrated that the proteolytic enzymes employed for cell dissociation can distort the amplitude, kinetics, localization, and pharmacological properties of ionic currents, e.g., [2]. These observations led to attempts to isolate neurons by enzyme-free, mechanical means. Recently, we developed a new protocol for dissociating single neurons from specific layers of mammalian retinae without the use of any proteolytic enzymes [12], combining low-Ca2+ tissue incubation [9] with the vibrodissociation technique [15], which has been applied to slices of brains and spinal cords [1]. The somata of ganglion cells dissociated by our method showed spontaneous voltage transients (sVT) with a fast rise and slower decay in their time course [8]. To our knowledge, such sVT have never been reported in previous studies on retinal ganglion cells dissociated with or without enzyme [9]. Therefore, in this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The present results suggest the presence of inhibitory presynaptic terminals attached to the ganglion cell somata we recorded from, as demonstrated in previous studies on vibrodissociated neurons of brains and spinal cords [1]. If this is the case, the retinal neurons dissociated by our method would be advantageous for investigating the mechanisms of transmitter release in single tiny synaptic boutons, even in isolation from the axons and neurites.
2 Methods

All animal care and experimental procedures in this study were approved by the committee for animal research of Kumamoto University.

2.1 Cell Dissociation

The neural retinas were isolated from two freshly enucleated eyes of Wistar rats (P7-P25), cut into 2-4 pieces each, and briefly kept in chilled extracellular "bath" solution. This solution contained (in mM): 140 NaCl, 3.5 KCl, 1 MgCl2, 2.5 CaCl2, 10 D-glucose, 5 HEPES. The pH was adjusted to 7.3 with NaOH. A retinal piece was then placed photoreceptor-side down in a culture dish, covered with 0.4 ml of chilled, low-Ca2+ solution, and incubated for 3-5 min. This low-Ca2+ solution contained (in mM): 140 sucrose, 2.5 KCl, 70 CsOH, 20 NaOH, 1 NaH2PO4, 15 CaCl2, 20 EDTA, 11 D-glucose, 15 HEPES. The estimated free Ca2+ concentration was 100–200 nM. The pH was adjusted to 7.2 with HCl. After the incubation, a fire-blunted glass pipette, vibrating horizontally with an amplitude of 0.2-0.5 mm at 100 Hz, was applied to the flattened surface of the retina under visual control with a microscope, so that cells were dissociated from the ganglion cell layer but minimally from the inner and outer nuclear layers. After removing the remaining retinal tissue, the culture dish was filled with the bath solution and left on a vibration-isolation table for 15-40 min to allow the cells to settle. The bath solution was then replaced by a fresh aliquot supplemented with 1 mg/ml bovine serum albumin, and the dissociated cells were maintained at room temperature (20-25 °C) for 2-18 hrs prior to the electrophysiological recordings described below. The ganglion cells were identified
based on size criteria [6]. Nearly all the cells we recorded from in voltage-/current-clamp showed large-amplitude voltage-gated Na+ currents and/or action potentials (see Fig. 1A-B), verifying that they were ganglion cells [4].

2.2 Electrophysiology

Since previous studies demonstrated that the membrane conductances of retinal ganglion cells can be modulated by intracellular messengers, e.g., Zn2+ [13] and cAMP [7], all recordings presented here were performed in the perforated-patch whole-cell mode [9] to maintain cytoplasmic integrity. Patch electrodes were pulled from borosilicate glass capillaries to tip resistances of approximately 4-8 MΩ. The tips of the electrodes were filled with a recording "electrode" solution that contained (in mM): 110 K-D-gluconic acid, 15 KCl, 15 NaOH, 2.6 MgCl2, 0.34 CaCl2, 1 EGTA, 10 HEPES. The pH was adjusted to 7.2 with methanesulfonic acid. The shanks of the electrodes were filled with this solution after the addition of amphotericin B as the perforating agent (260 μg/ml, with 400 μg/ml Pluronic F-127). The recordings were made after the series resistance in the perforated-patch configuration reached a stable value (typically 20-40 MΩ, range 10-100 MΩ). In the fast current-clamp mode of the amplifier (EPC-10, Heka), the voltage monitor output was analog-filtered by the built-in Bessel filters (3-pole 10–30 kHz followed by 4-pole 2-5 kHz) and digitally sampled (5–20 kHz). The voltage drop across the series resistance was compensated by the built-in circuitry. The recording bath was grounded via an agar bridge, and the bath solution was continuously superfused over each cell recorded from at a constant flow rate (0.4 ml/min). The volume of solution in the recording chamber was kept at about 2 ml. To apply a high-K+ solution (Fig. 2B), 8 mM NaCl in the bath solution was replaced by equimolar KCl. An enzyme solution was made by supplementing the bath solution with 0.25 mg/ml papain and 2.5 mM L-cysteine.
All experiments were performed at room temperature.
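Given the solution compositions above, the K+ equilibrium potentials quoted later in the Results (about −90 mV in the normal bath, −60 mV in high K+) follow from the Nernst equation. A sketch assuming roughly 125 mM internal K+ (110 mM K-D-gluconic acid plus 15 mM KCl) and 22 °C; these assumptions and the helper name are ours, not the authors':

```python
import math

def nernst(conc_out_mM, conc_in_mM, temp_c=22.0, z=1):
    """Nernst equilibrium potential in mV for an ion of valence z."""
    rt_over_f = 8.314 * (273.15 + temp_c) / 96485.0 * 1000.0  # RT/F in mV
    return (rt_over_f / z) * math.log(conc_out_mM / conc_in_mM)

# Electrode solution: 110 mM K-gluconate + 15 mM KCl -> ~125 mM K+ inside.
e_k_normal = nernst(3.5, 125.0)   # normal bath, 3.5 mM K+  -> about -91 mV
e_k_high = nernst(11.5, 125.0)    # high-K+ bath, 11.5 mM K+ -> about -61 mV
```

Replacing 8 mM NaCl with KCl raises extracellular K+ from 3.5 to 11.5 mM, shifting EK by about +30 mV, consistent with the shift reported in Fig. 2.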
3 Results

Perforated-patch whole-cell recordings were made from the somata of ganglion cells dissociated by our recently developed protocol (see Methods), which afforded quantitative measurements of the intrinsic membrane properties with minimal distortion due to proteolysis by enzymes [2]. Conversely, since these cells had never been exposed to any enzyme, they were useful for examining the effects of the enzymes utilized for cell dissociation in previous studies. In fact, spike firing of the ganglion cells in response to constant current injection via the patch electrode (30-pA step in the positive direction) was irreversibly altered when the enzyme solution was superfused over those cells (n=3): 1) The resting potential depolarized by 5-20 mV and the spike firing diminished during the enzyme application; 2) When the enzyme was washed out of the recording chamber, the resting potential gradually hyperpolarized back toward the original level and the spike firing partially returned;
67
B
C
Fig. 1. Spontaneous voltage transients (sVT) observed in the dissociated retinal ganglion cells. A: Microphotograph of the cell recorded from. Note that the soma is larger than 15 μm in diameter. B: Membrane potential changes in response to step-wise constant current injections. Four traces are superimposed. The injected current was 10 pA in the negative direction and 10, 20, and 30 pA in the positive direction. C: Spontaneous hyperpolarizations under current-clamp. The recordings were made for 50 sec in three different episodes, with breaks of 12 sec between the first and second and 6 sec between the second and third. A constant current (2 pA in the positive direction) was injected to hold the membrane potential at around –70 mV (dashed gray line). Inset: Examples of sVT on an expanded time scale. Five events are recognized. Three of them are similar in amplitude and time course, and the other two have roughly half and a quarter of the largest amplitude.
3) After ~20 min of washing out the enzyme, the spike firing reached a steady state at which the interval between the first and second spikes in response to the current step was shorter than that before the enzyme application, by 40 ± 11 % (mean ± S.E.) (not shown, [12]). These results suggest that, in previous studies on isolated retinal ganglion cells, some of the ionic channels could have been significantly distorted during the dissociation procedure because of the use of proteolytic enzymes. Moreover, we found spontaneous voltage transients (sVT) with a fast rise and slower decay in the retinal ganglion cell somata dissociated by our method [8]. Fig. 1C shows an example of sVT recorded from the cell shown in Fig. 1A. As shown in the figure, transient hyperpolarizations appeared spontaneously under a constant current injection. Most of these hyperpolarizations were similar in amplitude and time course at a given membrane potential (−70 mV here), and in some, the peak amplitude was roughly half or a quarter (or one-eighth, in other cells) of the largest one (Inset). Such sVT appeared in 10-20 % of the cells we recorded from and could be observed in particular cells for as long as we maintained the recordings (0.5-2 hrs). When the enzyme solution was superfused over one of those cells, the sVT disappeared completely and were not seen again even after 20 min of washing out the enzyme.
Fig. 2. Reversal potential of sVT. A, B: The sVT recorded in saline containing extracellular K+ of 3.5 mM (A) and 11.5 mM (B). The basal membrane potential (indicated by arrows) was varied by injecting holding currents ranging between –8 and +8 pA in A and between –8 and +12 pA in B. C: Plots of the peak amplitude versus the basal potential. Only the events having the largest amplitude (see Fig. 1C) are taken into account. Note that the amplitudes of depolarizations are plotted as negative values, and vice versa. The filled and open circles represent the data for 3.5-mM K+ and 11.5-mM K+, respectively.
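The x-intercept estimate used in panel C of Fig. 2 amounts to a least-squares line fit. A sketch; the amplitude/potential pairs below are hypothetical, chosen merely to straddle a reversal near −76 mV, and are not the recorded data:

```python
import numpy as np

def reversal_potential(basal_mv, peak_amp_mv):
    """Estimate the sVT reversal potential as the x-intercept of a
    linear fit of peak amplitude versus basal membrane potential."""
    slope, intercept = np.polyfit(basal_mv, peak_amp_mv, 1)
    return -intercept / slope

# Hypothetical points: by the sign convention of Fig. 2C, events above
# the reversal appear with one polarity and events below with the other.
v = [-100.0, -90.0, -70.0, -60.0, -50.0]
a = [-4.8, -2.9, 1.1, 3.0, 5.2]
# reversal_potential(v, a) -> about -76 mV
```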
As shown in Fig. 1C, the sVT were all recorded as hyperpolarizations when the cell was held at approximately −70 mV. Thus, the reversal potential of the ionic current producing the sVT should be below this voltage. In Fig. 2, the reversal potential of the sVT was measured by holding the basal membrane potential at different levels under current-clamp, cf. [5]. As expected, the polarity of the sVT reversed around −80 mV when the basal membrane potential was varied from about −100 to −40 mV (Fig. 2A). Based on the ionic compositions of the bath and electrode solutions used in this recording, the equilibrium potential of K+ (EK) was estimated to be about −90 mV, close to the reversal potential of the sVT. When the EK was shifted by +30 mV, i.e., from about −90 to −60 mV, by applying the high-K+ solution (see Methods), the polarity of the sVT reversed between −54 and −39 mV of the basal membrane potential (Fig. 2B). Fig. 2C plots the peak amplitude of the sVT versus the basal membrane potential. The linear regressions on these plots (gray lines) crossed the abscissa (dashed line) at approximately –76 mV and –48 mV for 3.5-mM K+ and 11.5-mM K+, respectively, showing a shift of the reversal potential parallel to the EK shift. Similar results were obtained in two other cells. These results indicate that the ionic conductance responsible for producing the sVT is permeable to, at least, K+. In the present experiments, we made recordings from cells without neurites or with neurites no longer than about 10 μm. Therefore, those cells can be modeled as the single compartment shown in Fig. 3A. In this model, the unknown conductance responsible for producing the sVT and its reversal potential are represented by "gx" and "Ex", respectively. The membrane properties intrinsic to the cell are represented by the membrane capacitance Cm, the nonlinear conductance gm, and the apparent reversal potential Em. Here, Cm was measured with the capacitance compensation circuitry of
Fig. 3. Conductance changes during sVT. A: Single-compartment model of the isolated somata. ICm, the current through Cm; Igm, the current through gm; Igx, the current through gx. B: Voltage responses to the injected current steps. The amplitude of the current steps (Iinj) was varied from –10 to +10 pA in 5-pA increments, and the corresponding voltage changes (Vm), from the bottom (black) to the top (light gray), were recorded. C: Plots of the membrane potential versus the amplitude of the current steps. The voltage was measured at the time points indicated by the marks in B (circle, square, triangle, rhombus, and hexagon). The solid line shows the best fit to the plots with a single exponential function. D: Voltage dependency of gm calculated from the plots in C. The derivative of current with respect to the membrane potential gave the slope conductance gm, which could be approximated by a hyperbolic function, gm = Iα / (Vα – Vm), where Vm
the patch-clamp amplifier we used, and Em was measured as the resting potential under current-clamp. As shown in Fig. 3B-D, gm (a function of membrane potential) was also estimated experimentally from the voltage shifts induced by the current steps injected into the cell. Likewise, the value of Ex was estimated as shown in Fig. 2C. Since, during the sVT, the membrane potential (Vrec) was recorded while a known amplitude of current (Iinj) was injected from the electrode, the time-varying gx could be calculated from Kirchhoff's law by the equation:
gx(t) = [ Iinj − Cm dVrec(t)/dt − gm {Vrec(t) − Em} ] / [ Vrec(t) − Ex ]    (1)
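As a numerical sketch (not from the paper; units and parameter values are illustrative and must simply be mutually consistent, e.g. pA, mV, nF, nS), Eq. (1) can be evaluated pointwise from a recorded voltage trace. Note that gm is treated as a constant here for simplicity, whereas the paper estimates it as a function of voltage:

```python
import numpy as np

def g_x(t, v_rec, i_inj, c_m, g_m, e_m, e_x):
    """Time-varying conductance from Eq. (1):
    g_x(t) = (I_inj - C_m dV_rec/dt - g_m (V_rec - E_m)) / (V_rec - E_x).
    All arguments must be in mutually consistent units."""
    dv_dt = np.gradient(v_rec, t)  # numerical derivative of the voltage trace
    return (i_inj - c_m * dv_dt - g_m * (v_rec - e_m)) / (v_rec - e_x)

# Sanity check: a cell resting exactly at E_m with no injected current
# implies g_x = 0 at every time point.
t = np.linspace(0.0, 1.0, 101)
v = np.full_like(t, -65.0)
print(np.allclose(g_x(t, v, i_inj=0.0, c_m=10.0, g_m=2.0, e_m=-65.0, e_x=-90.0), 0.0))
```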
Fig. 3F shows the calculated time course of gx during the sVT shown in Fig. 3E. Although the sVT differed in amplitude, time course, and polarity when measured at different basal potentials (E), the calculated gx time courses closely resembled each other (F). When the peak amplitude of the gx change was plotted against the basal membrane potential, little voltage dependency was found, especially at potentials below −50 mV (Fig. 3G). In addition, the decay time constant of the gx change was almost constant over the potentials we recorded (Fig. 3H). The mean values of the peak amplitude and the decay time constant were in the ranges of 0.1-0.3 nS and 0.1-0.15 sec, respectively. These results were quantitatively similar in two cells, as shown by the plots with different marks in G and H, implying that similar populations of ionic channels operated to produce the sVT.
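The Fig. 2C analysis, fitting the sVT peak amplitude against the basal membrane potential and reading off the zero-crossing, can be sketched as follows (the data below are synthetic; the slope is arbitrary and the −76 mV intercept is an illustrative value taken from the text):

```python
import numpy as np

def reversal_potential(v_basal, peak_amp):
    """Estimate the reversal potential as the abscissa crossing (zero-amplitude
    voltage) of a linear fit of peak amplitude versus basal membrane potential."""
    slope, intercept = np.polyfit(v_basal, peak_amp, 1)
    return -intercept / slope

# Synthetic data mimicking the 3.5-mM K+ condition (E_rev near -76 mV)
v = np.array([-100.0, -90.0, -80.0, -70.0, -60.0, -50.0, -40.0])  # mV
amp = 0.2 * (v + 76.0)  # linear driving force, zero at -76 mV
print(round(reversal_potential(v, amp), 1))
```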
4 Discussion

In previous studies, spontaneous voltage/current transients have never been reported in acutely dissociated retinal ganglion cells. Here, we made recordings from mammalian retinal ganglion cells dissociated by the recently developed method [8], and described the characteristics of the sVT observed in those cells. The present study showed that the peak amplitudes of the sVT are distributed in a quantal manner (Fig. 1), and that the sVT may involve activation of a K+ conductance (Fig. 2) which has little or slight voltage dependency over the range we examined (Fig. 3). These results are less compatible with the idea that voltage-gated K+ conductances are activated spontaneously by intracellular messengers like Ca2+, as in dissociated retinal amacrine cells [11]. Previous studies have shown that spontaneous postsynaptic potentials or currents can be recorded in single neurons vibrodissociated from brains and spinal cords [1]. Since the dissociation protocol we used here utilized the vibrodissociation technique [8], it is reasonable to adopt the working hypothesis illustrated in Fig. 4 to infer the underlying mechanisms of the sVT we observed in the retinal ganglion cells. If this hypothesis holds, it would enable one to investigate the functional machinery controlling transmitter release in single synaptic terminals in isolation from the axons and neurites [14].
Fig. 4. A working hypothesis for the underlying mechanism of sVT. The transmitter released from an isolated presynaptic terminal (left-hand side) activates the K+ conductance on the membrane of the dissociated ganglion cell soma we recorded from (right-hand side).
In our preliminary experiments, it was found that the sVT frequency, but not the amplitude, decreased when a low-Ca2+ solution was superfused over the cell, and increased briefly upon application of caffeine. Thus, the sVT appear to be elicited, at least in part, via an intracellular Ca2+-dependent process. Although we cannot rule out a contribution of cytoplasmic Ca2+ in the recorded cells to the effects of extracellular low-Ca2+ and caffeine on the sVT, those experimental results are consistent with the present hypothesis (Fig. 4). The retinal ganglion cells in situ/vivo receive the inhibitory transmitter GABA released from presynaptic amacrine cells, and the GABAb receptor, which is generally coupled with a K+ conductance, is known to be expressed in rat ganglion cells [10]. However, the sVT were not suppressed when a GABAb receptor antagonist (2-hydroxysaclofen or SCH 50911) was applied. Further studies remain to be conducted to reveal the underlying mechanisms of the sVT in retinal ganglion cell somata.

Acknowledgments. The authors are grateful to Dr. N. Akaike for his valuable suggestions on our dissociation protocol and for lending most of the equipment used in the present experiments, and to Dr. K. Hayashi for lending the microscope. This work was partly supported by the Japan Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B), 17700398, 2005 to Y.H.
References 1. Akaike, N., Moorhouse, A.J.: Techniques: applications of the nerve-bouton preparation in neuropharmacology. Trends in Pharmacological Science 24(1), 44–47 (2003) 2. Armstrong, C.E., Roberts, W.M.: Electrical properties of frog saccular hair cells: distortion by enzymatic dissociation. Journal of Neuroscience 18(8), 2962–2973 (1998)
3. Armstrong, C.M., Gilly, W.F.: Access resistance and space clamp problems associated with whole-cell patch clamping. Methods in enzymology 207, 100–122 (1992) 4. Barres, B.A., Silverstein, B.E., Corey, D.P., Chun, L.L.: Immunological, morphological, and electrophysiological variation among retinal ganglion cells purified by panning. Neuron 1(9), 791–803 (1988) 5. Coombs, J.S., Eccles, J.C., Fatt, P.: The specific ionic conductances and the ionic movements across the motoneuronal membrane that produce the inhibitory post-synaptic potential. Journal of Physiology 130(2), 326–374 (1955) 6. Guenther, E., Schmid, S., Grantyn, R., Zrenner, E.: In vitro identification of retinal ganglion cells in culture without the need of dye labeling. Journal of Neuroscience Methods 51(2), 177–181 (1994) 7. Hayashida, Y., Ishida, A.T.: Dopamine receptor activation can reduce voltage-gated Na+ current by modulating both entry into and recovery from inactivation. Journal of Neurophysiology 92(5), 3134–3141 (2004) 8. Hayashida, Y., Motomura, T., Murayama, N.: Vibrodissociation of rat retinal ganglion cells attached with inhibitory synaptic boutons. Investigative Ophthalmology & Visual Science 47, E-Abstract 3763 (2006) 9. Hayashida, Y., Partida, G.J., Ishida, A.T.: Dissociation of retinal ganglion cells without enzymes. Journal of Neuroscience Methods 137(1), 25–35 (2004) 10. Koulen, P., Malitschek, B., Kuhn, R., Bettler, B., Wassle, H., Brandstatter, J.H.: Presynaptic and postsynaptic localization of GABA(B) receptors in neurons of the rat retina. European Journal of Neuroscience 10(4), 1446–1456 (1998) 11. Mitra, P., Slaughter, M.M.: Mechanism of generation of spontaneous miniature outward currents (SMOCs) in retinal amacrine cells. Journal of General Physiology 119(4), 355– 372 (2002) 12. Motomura, T., Hayashida, Y., Murayama, N.: Mechanical Dissociation of Retinal Neurons with Vibration. IEEJ Transactions on Electronics, Information and Systems 127(10) (in press, 2007) 13. 
Tabata, T., Ishida, A.T.: A zinc-dependent Cl- current in neuronal somata. Journal of Neuroscience 19(13), 5195–5204 (1999) 14. von Gersdorff, H., Matthews, G.: Dynamics of synaptic vesicle fusion and membrane retrieval in synaptic terminals. Nature 367(6465), 735–739 (1994) 15. Vorobjev, V.S.: Vibrodissociation of sliced mammalian nervous tissue. Journal of Neuroscience Methods 38(2-3), 145–150 (1991)
Region-Based Encoding Method Using Multi-dimensional Gaussians for Networks of Spiking Neurons
Lakshmi Narayana Panuku and C. Chandra Sekhar
Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai 600 036, India
{panuku,chandra}@cs.iitm.ernet.in
Abstract. In this paper, we address the issues in the representation of continuous valued variables by firing times of neurons in a spiking neural network used for clustering multi-variate data. The existing range-based encoding method encodes each dimension separately. This method makes use of neither the correlation among the different variables nor the knowledge of the distribution of the data. We propose a region-based encoding method that places multi-dimensional Gaussian receptive fields in the data-inhabited regions and captures the correlation among the variables. The effectiveness of the proposed encoding method in clustering complex 2-dimensional and 3-dimensional data sets is demonstrated.
1 Introduction
Artificial neural networks (ANNs) have been shown to have the ability to extract patterns from complex data [1, 2]. Based on the computational units used, these ANN models can be classified into three generations [3]. The McCulloch-Pitts neurons, considered the first generation of neurons, can only give a binary output. The computational units that output continuous values, like sigmoidal units, are considered the second generation of neurons. Biologically, the output of a sigmoidal unit can be interpreted as the firing rate of a neuron. Under the assumption that, in biological neural networks, the continuously varying mean firing rate of a neuron (rate code) contains the information about the neuron's time-varying state of excitation, sigmoidal units can model the computations in biological systems. Recently, the timing of the action potentials or spikes has been recognized as a possible means of neural information coding, rather than the average firing rate of the neurons [4, 5, 6]. It has been shown that coding with the timing of spikes allows powerful neuronal information processing [7]. These results have generated considerable interest in the third generation of time-based neurons like spiking neurons [3]. Various models of spiking neural networks (SNNs), like the leaky integrate-and-fire model, the spike response model, and the liquid state machine [6, 8], and various learning methods for these models have been reported in the literature [9, 10, 11]. The SNNs have been used in many applications such as signal

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 73–82, 2008. © Springer-Verlag Berlin Heidelberg 2008
74
L.N. Panuku and C.C. Sekhar
coincidence detection [2], isolated word recognition [8], and implementation of temporal-RBF networks [12]. In [13], a Hebbian-based learning mechanism is proposed for spiking neuron models with multi-delay connections, namely multi-delay SNNs (MDSNNs). This learning mechanism is observed to select the connections with matching delays. For clustering the data, this approach is extended in [10] by considering not only the firing or non-firing of a neuron, but also its firing time. A coding scheme to convert an analog input variable into firing times of neurons is proposed. However, this method has limitations in both clustering capacity and precision [14]. To overcome this limitation, Bohte et al. [14] proposed a population-coding-based encoding method that encodes the values of input variables using multiple overlapping 1-dimensional (1-D) Gaussian receptive fields (GRFs). This method is shown to cluster a number of data sets at low expense in terms of neurons, while enhancing clustering capacity and precision. Using this encoding method and a multi-layer MDSNN with lateral connections in the hidden layer, complex data like the interlocking cluster data can be clustered. However, this encoding method does not make use of the correlation present among the variables in multi-variate data. It uniformly places the GRFs along each dimension, leading to a large neuron count and increased computational cost. The boundaries given by the MDSNN with this encoding method are observed to be combinations of linear segments. To overcome these limitations, we propose a novel encoding method that places multi-dimensional GRFs in the data-inhabited regions and uses the correlation present in the data. We show that the proposed encoding method helps the MDSNNs in clustering a number of 2-D and 3-D nonlinearly separable data sets, while keeping a low neuron count.
Moreover, the cluster boundaries given by the MDSNN with this encoding method are observed to follow the shapes of the clusters. This paper is organized as follows: The architecture of the MDSNN and the Hebbian based learning rule for clustering are described in Section 2. The existing range-based encoding method and its limitations are discussed in Section 3. Section 4 presents the proposed region-based encoding method that uses the multi-dimensional GRFs. The performance of the proposed encoding method for clustering different complex data sets is also given in this section.
2 Multi-delay Spiking Neural Networks
The architecture of the MDSNN consists of a fully connected feedforward network of spiking neurons with connections implemented as multiple delayed synaptic terminals, as shown in Fig. 1(a). A connection from a pre-synaptic neuron i to a post-synaptic neuron j consists of a fixed number (m) of synaptic terminals. Each terminal serves as a subconnection that is associated with a different delay and weight. The delay dl of a synaptic terminal l is the difference between the firing time of the pre-synaptic neuron and the time when the post-synaptic potential (PSP) resulting from terminal l starts rising.
Region-Based Encoding Method Using Multi-dimensional Gaussians
75
The time-varying impact of a pre-synaptic spike on a post-synaptic neuron is described by a spike response function, ε(·), also referred to as the PSP. The PSP is modeled by the α-function, as in [14]. A neuron j in the network generates a spike when the value of its internal state variable x_j, the "membrane potential", crosses a threshold ϑ. The internal state variable x_j(t) is defined as follows:

x_j(t) = Σ_{i∈Γ_j} Σ_{l=1}^{m} w_ij^l ε(t − t_i − d_l),    (1)

where Γ_j is the set of neurons pre-synaptic to neuron j, w_ij^l is the weight of the l-th synaptic terminal between the neurons i and j, and t_i is the firing time of the pre-synaptic neuron i. The time at which x_j(t) crosses the threshold ϑ with a positive slope is the firing time of the neuron j, denoted by t_j.
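Equation (1) can be implemented directly. The sketch below is illustrative, with the α-function written as ε(t) = (t/τ)·exp(1 − t/τ) for t > 0, a common normalization that peaks at 1 when t = τ; the τ value is an assumption, not taken from the paper:

```python
import numpy as np

def alpha_psp(t, tau=3.0):
    """Spike response (alpha) function: eps(t) = (t/tau)*exp(1 - t/tau) for t > 0,
    normalized so that its peak value is 1 at t = tau."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, (t / tau) * np.exp(1 - t / tau), 0.0)

def internal_state(t, firing_times, delays, weights, tau=3.0):
    """x_j(t) = sum over pre-synaptic neurons i and terminals l of
    w_ij^l * eps(t - t_i - d_l), as in Eq. (1)."""
    x = 0.0
    for t_i, w_row in zip(firing_times, weights):  # pre-synaptic neurons
        for d_l, w in zip(delays, w_row):          # synaptic terminals
            x += w * alpha_psp(t - t_i - d_l, tau)
    return float(x)

# One pre-synaptic spike at t=0 through a single terminal with delay 2 ms:
# the PSP peaks (value 1.0) at t = 2 + tau = 5 ms.
print(internal_state(5.0, firing_times=[0.0], delays=[2.0], weights=[[1.0]]))
```

The neuron would then fire at the first time t where this state crosses the threshold ϑ with a positive slope.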
Fig. 1. (a) Architecture of an MDSNN. (b) Learning function L(Δt). (c) Range-based encoding of an input variable with value a into firing times, T(a).
For clustering, the weights of the terminals of the connections between the input neurons and the winning neuron, i.e., the output neuron that fires first, are modified using a time-variant of Hebbian learning. The learning rule for the weight w_ij^l of the synaptic terminal with delay d_l is as follows:

Δw_ij^l = η L(Δt_ij^l) = η [ (1 − b) exp( −(Δt_ij^l − c)² / β² ) + b ],    (2)

where Δt_ij^l denotes the time difference between the onset of the PSP at the l-th synaptic terminal, t_PSP,i^l, and the firing time of the winning neuron, t_j, i.e., Δt_ij^l = t_PSP,i^l − t_j = (t_i + d_l) − t_j. The parameter c determines the position of the peak, b determines the negative update given to a neuron for which Δt is significantly different from c, and β sets the width of the positive part of the learning function (Fig. 1(b)). The weight of a synaptic terminal is limited to the range 0 to w_max, the maximum value that a weight can take. In our experiments, the range of values for the delays d_l is set to 0–9 milliseconds with a resolution of 1 millisecond, i.e., m = 9. The parameters b and η are set to −0.2 and 0.01, while c and β are chosen empirically. In [14], an MDSNN with fixed-threshold units is considered for the task of clustering. This model, with the above-mentioned learning rule (Eqn. 2), can cluster linearly separable data. However, when applied to nonlinearly separable data like the single-ring data and the interlocking cluster data, all the data points are grouped into a single cluster. To overcome this limitation, we use a varying-threshold method [15], in which the threshold of a spiking neuron is initialized to a small positive value and is gradually increased (in steps of Δϑ) as the learning progresses, until it reaches a maximum value, ϑmax. Moreover, when a multi-layer MDSNN is trained to cluster complex data, the layers are trained using the multi-stage learning method [15], in which the nth layer is trained before starting the training of the (n+1)th layer. With these two extensions, the MDSNNs are able to cluster the single-ring data and the interlocking cluster data [15]. In our studies, we use the varying-threshold method and the multi-stage learning method. The values of Δϑ and ϑmax are determined empirically.
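A minimal sketch of the learning rule in Eq. (2); the values of c and β below are illustrative, since the paper chooses them empirically, while b = −0.2 and η = 0.01 follow the text:

```python
import numpy as np

def learning_window(dt, b=-0.2, c=2.0, beta=1.5):
    """L(dt) = (1 - b)*exp(-(dt - c)^2 / beta^2) + b  (Eq. 2).
    Peaks at 1.0 when dt = c; tends to the negative value b far from c."""
    return (1.0 - b) * np.exp(-((dt - c) ** 2) / beta ** 2) + b

def update_weight(w, dt, eta=0.01, w_max=1.0, **kwargs):
    """Hebbian update: Delta w = eta * L(dt), with the weight clipped to [0, w_max]."""
    return float(np.clip(w + eta * learning_window(dt, **kwargs), 0.0, w_max))
```

A terminal whose PSP onset nearly coincides with the winner's firing time (Δt ≈ c) is strengthened, while terminals with poorly matching delays receive the small negative update b, so connections with matching delays are gradually selected.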
3 Range-Based Encoding
When the input variables are continuous valued attributes, it is necessary to encode the value of each variable into firing times of neurons in the input layer of the MDSNN. Bohte et al. [14] proposed an encoding method that encodes the values of input variables by a population code obtained from neurons with graded and overlapping sensitivity profiles. As this method encodes each variable with 1-D GRFs uniformly placed to cover the whole range of values that the variable can take, it is called the range-based encoding method. In this method, the range of values for each input variable is determined. For the range [Imin ... Imax] of a variable, n (> 2) GRFs are used. The center of the i-th GRF is set to μi = Imin + ((2i − 3)/2)((Imax − Imin)/(n − 2)). One GRF is placed outside the range at each of the two ends. All the GRFs encoding an input variable have the same width, set to σ = (1/γ)((Imax − Imin)/(n − 2)), where γ controls the extent of overlap between the GRFs. For an input variable with value a, the activation value of the i-th GRF with center μi and width σ is given by

fi(a) = exp( −(a − μi)² / (2σ²) )    (3)

The firing time of the neuron associated with this GRF is inversely proportional to fi(a). For a highly stimulated GRF, with the value of fi(a) close to 1.0, the firing time t = 0 milliseconds is assigned. When the activation value of the
GRF is small, the firing time t is high, indicating that the neuron fires later. In our experiments, the firing time of an input neuron is chosen to be in the range 0 to 9 milliseconds. While converting the activation values of GRFs into firing times, a coding threshold is imposed on the activation value. A GRF that gives an activation value less than the coding threshold is marked as not-firing (NF), and the corresponding input neuron does not contribute to the membrane potential of the post-synaptic neuron. The range-based encoding method is illustrated in Fig. 1(c). For multi-variate data, each variable is encoded separately, effectively not using the correlation present among the variables. In this encoding method, 1-D GRFs are uniformly placed along an input dimension, without considering the distribution of the data. Hence, when the data is sparse, some GRFs, placed in regions where no data is present, are not effectively used. This results in a high neuron count and computational cost. The widths of the GRFs are derived without using any knowledge of the data distribution, except for the range of values that the variables take. Taking one GRF along an input dimension and quantizing its activation value results in the formation of intervals within the range of values of that variable, such that one or more intervals are mapped onto a particular quantization level. When 2-D data is encoded by taking an array of 1-D GRFs along each input dimension, the input space is quantized into rectangular grids such that all the input patterns falling into a particular rectangular grid have the same vector of quantization levels, and hence the same encoded time vector. Additionally, one or more rectangular grids may have the same encoded time vector. For multi-variate data, the input space is divided into hypercuboids. To demonstrate this, the single-ring data (shown in Fig.
2(a)) is encoded by placing 5 GRFs along each dimension, dividing the input space into grids as shown in Fig. 2(b). A 10-2 MDSNN, having 10 neurons in the input layer and 2 neurons in the output layer, is trained to cluster this data. The space of data points as represented by the output layer neurons is shown in Fig. 2(c). The cluster boundary is observed to be a combination of linear segments defined by the rectangular grid boundaries formed by the encoding. The shape of the boundary formed by the MDSNN is significantly different from the desired circle-shaped boundary between the two clusters in the single-ring data. Increasing the number of GRFs used for encoding each dimension may give a boundary that is a combination of smaller linear segments, at the expense of a high neuron count. However, this may not result in proper clustering of the data, as the choice of the number of GRFs is observed to be crucial in the range-based encoding method. When the range-based encoding method is used along with the varying-threshold method and the multi-stage learning method [15] to cluster complex data sets such as the double-ring data and the spiral data, it is observed that proper subclusters are not formed by the neurons in the hidden layer. As each dimension is encoded separately, spatially disjoint subsets of data points that have similar encoding along a particular dimension, as shown by the marked regions in Fig. 3, are found to be represented by a single neuron in the hidden
Fig. 2. Clustering the single-ring data encoded using the range-based encoding method: (a) The single-ring data, (b) data space quantization due to the range-based encoding, and (c) data space representation by the output neurons
Fig. 3. Improper subclusters formed when the data is encoded with the range-based encoding method for (a) the double-ring data and (b) the spiral data
layer. This binding is observed to form during the initial iterations of learning, when the firing thresholds of the neurons are low. The established binding cannot be unlearnt in the subsequent iterations, leading to improper clustering at the output layer. An encoding method that overcomes the limitations discussed above is proposed in the next section.
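The range-based encoding described in this section can be sketched as follows (an illustrative implementation; the coding threshold of 0.1 and the rounding of firing times to whole milliseconds are assumptions, as the paper does not give these details explicitly):

```python
import numpy as np

def range_based_encode(a, i_min, i_max, n=5, gamma=1.5, t_max=9, threshold=0.1):
    """Encode a scalar a into firing times of n (> 2) input neurons using
    uniformly placed, overlapping 1-D GRFs (centers and width as in Section 3).
    Activations below `threshold` are marked not-firing ('NF')."""
    i = np.arange(1, n + 1)
    mu = i_min + ((2 * i - 3) / 2.0) * ((i_max - i_min) / (n - 2))  # centers
    sigma = (1.0 / gamma) * ((i_max - i_min) / (n - 2))             # common width
    f = np.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))               # Eq. (3)
    t = np.round((1.0 - f) * t_max)  # strong activation -> early firing time
    return [int(ti) if fi >= threshold else 'NF' for fi, ti in zip(f, t)]

# A value at the center of the middle GRF fires that neuron at t = 0,
# its neighbours later, and the outermost GRFs not at all:
print(range_based_encode(0.5, 0.0, 1.0))
```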
4 Region-Based Encoding Using Multi-dimensional Gaussian Receptive Fields
Using multi-dimensional GRFs for encoding helps in capturing the correlation present in the data. One approach would be to uniformly place the multi-dimensional GRFs covering the whole range of the input data space. However, this results in an exponential increase in the number of neurons in the input layer with the dimensionality of the data. To circumvent this, we propose a region-based encoding method that places the multi-dimensional GRFs only in the data-inhabited regions, i.e., the regions where data is present. The mean vectors and the covariance matrices of these GRFs are computed from the data in the regions, thus capturing the correlation present in the data. To identify the data-inhabited regions in the input space, first k-means clustering is performed on the data to be clustered, with the value of k being larger than the number of actual clusters. On each of the regions identified using the k-means clustering method, a multi-dimensional GRF is placed by computing the mean vector and the covariance matrix from the data in that region. The response of the i-th GRF for a multi-variate input pattern a is computed as
fi(a) = exp( −(1/2) (a − μi)^T Σi^(−1) (a − μi) ),    (4)
where μi and Σi are the mean vector and the covariance matrix of the i-th GRF, respectively, and fi(a) is the activation value of that GRF. As discussed in Section 3, these activation values are translated into firing times in the range 0 to 9 milliseconds, and the non-optimally stimulated input neurons are marked as NF. By deriving the covariance for a GRF from the data, the region-based encoding method captures the correlation present in the data. The regions identified by k-means clustering and the data space quantization resulting from this encoding method, for the single-ring data used in Section 3, are shown in Fig. 4(a) and 4(b), respectively. The boundary given by the MDSNN with the region-based encoding method is shown in Fig. 4(c). This boundary is much closer to the desired circle-shaped boundary, as against the combination of linear segments observed with the range-based encoding method (see Fig. 2(c)).
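A sketch of the region-based encoding step (the k-means step is assumed to have already produced the region means and covariances; the 0.1 coding threshold is the same illustrative assumption as before):

```python
import numpy as np

def region_based_encode(a, means, covs, t_max=9, threshold=0.1):
    """Encode a multi-variate pattern a with multi-dimensional GRFs (Eq. 4):
    f_i(a) = exp(-0.5 * (a - mu_i)^T Sigma_i^{-1} (a - mu_i)).
    means/covs would come from k-means regions of the training data."""
    a = np.asarray(a, dtype=float)
    times = []
    for mu, sigma in zip(means, covs):
        d = a - np.asarray(mu, dtype=float)
        f = float(np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d))
        times.append(int(round((1.0 - f) * t_max)) if f >= threshold else 'NF')
    return times

# A pattern at the first region's mean maximally stimulates that GRF (t = 0)
# and barely stimulates a distant one (marked 'NF'):
print(region_based_encode([0.0, 0.0], means=[[0.0, 0.0], [3.0, 3.0]],
                          covs=[np.eye(2), np.eye(2)]))
```

Because each Σi is estimated from the data in its region, an elongated or tilted cluster yields an elongated or tilted receptive field, which is what lets the resulting cluster boundaries follow the shapes of the clusters.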
Fig. 4. (a) Regions identified by k-means clustering with k = 8, (b) data space quantization due to the region-based encoding and (c) data space representation by the neurons in the output layer
Next, we study the performance of the proposed region-based encoding method in clustering complex 2-D and 3-D data sets. For the double-ring data, k-means clustering is performed with k = 20 and the resulting regions are shown in Fig. 5(a). Over each of the regions, a 2-D GRF is placed to encode the data. A 20-8-2 MDSNN is trained using the multi-stage learning method discussed in Section 2. It is observed that, out of the 8 neurons in the hidden layer, 3 neurons do not win for any of the training examples, and the data is represented by the remaining 5 neurons as shown in Fig. 5(b). These 5 neurons provide the input in the second stage of learning to form the final clusters (Fig. 5(c)). The resulting cluster boundaries are seen to follow the data distribution, as shown in Fig. 5(d). Similarly, the spiral data is encoded using 40 2-D GRFs. The regions of data identified using the k-means clustering method are shown in Fig. 6(a). A 40-20-2 MDSNN is trained to cluster the spiral data. As shown in Fig. 6(b), 14 subclusters are formed in the hidden layer, which are combined in the next layer to form the final clusters as shown in Fig. 6(c). The region-based encoding method helps in proper subcluster formation at the hidden layer (Fig. 5(b) and Fig. 6(b)), as against the range-based encoding method (Fig. 3). The proposed method is also used to cluster 3-D data sets, namely the interlocking donuts data and the 3-D ring data. The interlocking donuts data is
Fig. 5. Clustering the double-ring data: (a) Regions identified by k-means clustering with k = 20, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer
Fig. 6. Clustering the spiral data: (a) Regions identified by k-means clustering with k = 40, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer
Fig. 7. Clustering of the interlocking donuts data and the 3-D ring data: (a) Regions identified by k-means clustering on the interlocking donuts data with k = 10 and (b) clusters formed at the output layer. (c) Regions identified by k-means clustering on the 3-D ring data with k = 5 and (d) clusters formed at the output layer.
encoded with 10 3-D GRFs, and a 10-2 MDSNN is trained to cluster this data. The k-means clustering results and the final clusters formed by the MDSNN are shown in Fig. 7(a) and (b), respectively. The clustering results for the 3-D ring data with the proposed encoding method are shown in Fig. 7(c) and 7(d). For comparison, the performance of the range-based encoding method and the region-based encoding method for different data sets is presented in Table 1. It is observed that the region-based encoding method outperforms the range-based encoding method for clustering complex data sets like the double-ring data and the spiral data. For the cases where both methods give the same or almost the same performance, the number of neurons used in the input layer is given in parentheses. It is observed that the region-based encoding method always maintains a low neuron count, thereby reducing the computational cost. The difference between the neuron counts for the two methods may look small for these 2-D and 3-D data sets. However, as the dimensionality of the data increases, this
Table 1. Comparison of the performance (in %) of MDSNNs using the range-based encoding method and the region-based encoding method for clustering. The numbers in parentheses give the number of neurons in the input layer. Data set
Data set                     Range-based encoding   Region-based encoding
Double-ring data             74.82                  100.00
Spiral data                  66.18                  100.00
Single-ring data             100.00 (10)            100.00 (8)
Interlocking cluster data    99.30 (24)             100.00 (6)
3-D ring data                100.00 (15)            100.00 (5)
Interlocking donuts data     97.13 (21)             100.00 (10)
difference can be significant. From these results, it is evident that the proposed encoding method scales well to higher-dimensional data clustering problems, while keeping a low neuron count. Additionally, and more importantly, the nonlinear cluster boundaries given by the region-based encoding method follow the distribution of the data or the shapes of the clusters.
5 Conclusions
In this paper, we have proposed a new encoding method using multi-dimensional GRFs for MDSNNs. We have demonstrated that the proposed encoding method effectively uses the correlation present in the data and positions the GRFs in the data-inhabited regions. We have also shown that the proposed method results in a low neuron count, as opposed to the encoding method proposed in [14] and the simple approach of placing multi-dimensional GRFs covering the whole data space. This in turn results in a low computational cost for clustering. With the encoding method proposed in [14], the cluster boundaries obtained for nonlinearly separable data are observed to be combinations of linear segments, and the MDSNN fails to cluster the double-ring data and the spiral data. We have experimentally shown that, with the proposed encoding method, the MDSNNs can cluster complex data like the double-ring data and the spiral data, while giving smooth nonlinear boundaries that follow the data distribution. In the existing range-based encoding method, when the data consists of clusters with different scales, i.e., narrow and wider clusters, GRFs with different widths are used; this technique is called multi-scale encoding. In the region-based encoding method, however, the widths of the multi-dimensional GRFs are automatically computed from the data-inhabited regions, and the widths of these GRFs can differ. In the proposed method, for clustering the 2-D and 3-D data, the value of k is decided empirically and the formation of subclusters at the hidden layer is verified visually. However, for higher-dimensional data, it is necessary to ensure the formation of subclusters automatically.
References

1. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Englewood Cliffs (1998)
2. Kumar, S.: Neural Networks: A Classroom Approach. Tata McGraw-Hill, New Delhi (2004)
3. Maass, W.: Networks of Spiking Neurons: The Third Generation of Neural Network Models. Trans. Soc. Comput. Simul. Int. 14(4), 1659–1671 (1997)
4. Bi, Q., Poo, M.: Precise Spike Timing Determines the Direction and Extent of Synaptic Modifications in Cultured Hippocampal Neurons. Neuroscience 18, 10464–10472 (1998)
5. Maass, W., Bishop, C.M.: Pulsed Neural Networks. MIT Press, London (1999)
6. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
7. Maass, W.: Fast Sigmoidal Networks via Spiking Neurons. Neural Computation 9, 279–304 (1997)
8. Verstraeten, D., Schrauwen, B., Stroobandt, D., Campenhout, J.V.: Isolated Word Recognition with the Liquid State Machine: A Case Study. Information Processing Letters 95(6), 521–528 (2005)
9. Bohte, S.M., Kok, J.N., Poutre, H.L.: Spike-Prop: Error-Backpropagation in Temporally Encoded Networks of Spiking Neurons. Neurocomputing 48, 17–37 (2002)
10. Natschlager, T., Ruf, B.: Spatial and Temporal Pattern Analysis via Spiking Neurons. Network: Comput. Neural Systems 9, 319–332 (1998)
11. Ruf, B., Schmitt, M.: Unsupervised Learning in Networks of Spiking Neurons Using Temporal Coding. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 361–366. Springer, Heidelberg (1997)
12. Hopfield, J.J.: Pattern Recognition Computation Using Action Potential Timing for Stimulus Representations. Nature 376, 33–36 (1995)
13. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A Neuronal Learning Rule for Sub-millisecond Temporal Coding. Nature 383, 76–78 (1996)
14. Bohte, S.M., Poutre, H.L., Kok, J.N.: Unsupervised Clustering with Spiking Neurons by Sparse Temporal Coding and Multilayer RBF Networks. IEEE Transactions on Neural Networks 13, 426–435 (2002)
15. Panuku, L.N., Sekhar, C.C.: Clustering of Nonlinearly Separable Data Using Spiking Neural Networks. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668. Springer, Heidelberg (2007)
Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer

Kouichi Mitsunaga¹, Yusuke Totoki², and Takami Matsuo²

¹ Control Engineering Department, Oita Institute of Technology, Oita, Japan
² Department of Architecture and Mechatronics, Oita University, 700 Dannoharu, Oita, 870-1192, Japan
Abstract. In this paper, we present three adaptive observers that use the membrane potential measurement under the assumption that some of the parameters of the HR neuron are known. Using strict positive realness and Yu's stability criterion, we can show the asymptotic stability of the error systems. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors.
1 Introduction
In traditional artificial neural networks, the neuron behavior is described only in terms of the firing rate, while most real neurons, commonly known as spiking neurons, transmit information by pulses, also called action potentials or spikes. Model studies of neuronal synchronization can be separated into those where models of the integrate-and-fire type are used and those where conductance-based spiking and bursting models are employed [1]. Bursting occurs when neuron activity alternates, on a slow time scale, between a quiescent state and fast repetitive spiking. In any study of neural network dynamics, there are two crucial issues: 1) what model describes the spiking dynamics of each neuron, and 2) how the neurons are connected [3]. Izhikevich considered the first issue and compared various models of spiking neurons. He reviewed the responses of 20 types of real (cortical) neurons to the injection of simple dc pulses, such as tonic spiking, phasic spiking, tonic bursting, and phasic bursting. Through his simulations, he suggested that if the goal is to study how the neuronal behavior depends on measurable physiological parameters, such as the maximal conductances, steady-state (in)activation functions, and time constants, then the Hodgkin-Huxley type model is the best; however, its computational cost is the highest of all models. He also pointed out that the Hindmarsh-Rose (HR) model is computationally simple and capable of producing rich firing patterns exhibited by real biological neurons. Although the HR model is a computational model of neuronal bursting using three coupled first-order differential equations [5,6], it can generate tonic spiking, phasic spiking, and so on, for different parameters in the model equations. Carroll showed in simulation that additive noise shifts the neuron model into a two-frequency region (i.e., bursting) and that the slow part of the
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 83–92, 2008.
c Springer-Verlag Berlin Heidelberg 2008
responses is robust to the added noise in the HR model [7]. The parameters in the model equations are important in determining the dynamic behaviors of the neuron [12]. From the measurement-theoretic point of view, it is important to estimate the states and parameters using measurement data, because extracellular recordings are a common practice in neurophysiology and often represent the only way to measure the electrical activity of neurons [8]. Tokuda et al. applied an adaptive observer to estimate the parameters of an HR neuron by using membrane potential data recorded from a single lateral pyloric neuron synaptically isolated from other neurons [13]. However, their observer cannot guarantee the asymptotic stability of the error system. Steur [14] pointed out that the HR equations cannot be transformed into the adaptive observer canonical form, so it is not possible to make use of the adaptive observer proposed by Marino [10]. He simplified the three-dimensional HR equations and rewrote them as a one-dimensional system with an exogenous signal using the contracting and wandering dynamics technique. His adaptive observer, based on a first-order differential equation, cannot estimate the internal states of HR neurons. We have recently presented adaptive observers with full-state measurement and with the membrane potential measurement [15]. However, the estimates of the states by the observer with output measurement are not enough to recover the immeasurable internal states. In this paper, we present three adaptive observers that use the membrane potential measurement under the assumption that some of the parameters of the HR neuron are known. Using the Kalman-Yakubovich lemma, we can show the asymptotic stability of the error systems based on the standard adaptive control theory [11]. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors.
The MATLAB simulations demonstrate the estimation performance of the proposed adaptive observers.
2 Review of Real (Cortical) Neuron Responses
There are many types of cortical neuron responses. Izhikevich reviewed 20 of the most prominent features of biological spiking neurons, considering the injection of simple dc pulses [3]. Typical responses are classified as follows [4]:

– Tonic Spiking (TS): The neuron fires a spike train as long as the input current is on. This kind of behavior can be observed in three types of cortical neurons: regular spiking excitatory neurons (RS), low-threshold spiking neurons (LTS), and fast spiking inhibitory neurons (FS).
– Phasic Spiking (PS): The neuron fires only a single spike at the onset of the input.
– Tonic Bursting: The neuron fires periodic bursts of spikes when stimulated. This behavior may be found in chattering neurons in cat neocortex.
– Phasic Bursting (PB): The neuron fires only a single burst at the onset of the input.
– Mixed Mode (Bursting Then Spiking) (MM): The neuron fires a phasic burst at the onset of stimulation and then switches to the tonic spiking mode. The intrinsically bursting excitatory neurons in mammalian neocortex may exhibit this behavior.
– Spike Frequency Adaptation (SFA): The neuron fires tonic spikes with decreasing frequency. RS neurons usually exhibit adaptation of the interspike intervals, in which these intervals increase until a steady state of periodic firing is reached, while FS neurons show no adaptation.
3 Single Model of HR Neuron

The Hindmarsh-Rose (HR) model is computationally simple and capable of producing rich firing patterns exhibited by real biological neurons.

3.1 Dynamical Equations
The single model of the HR neuron [1,5,6] is given by

ẋ = a x² − x³ − y − z + I,
ẏ = (a + α) x² − y,
ż = μ(b x + c − z),

where x represents the membrane potential, and y and z are associated with the fast and slow currents, respectively. I is an applied current, and a, α, μ, b and c are constant parameters. We rewrite the single HR neuron in a vectorized form:

(S0): ẇ = h(w) + Ξ(x, z) θ,

where

w = [x, y, z]ᵀ,
h(w) = [−(x³ + y + z), −y, 0]ᵀ,
Ξ(x, z) = [[x², 1, 0, 0, 0, 0], [0, 0, x², 0, 0, 0], [0, 0, 0, x, 1, −z]],
θ = [θ1, θ2, θ3, θ4, θ5, θ6]ᵀ = [a, I, a + α, μb, μc, μ]ᵀ.

3.2 Numerical Examples
The HR model shows a large variety of behaviors with respect to the parameter values in the differential equations [12]. Thus, we can characterize the dynamic behaviors with respect to different values of the parameters. We focus on the parameters a and I. The parameter a is an internal parameter of the single neuron and I is an external depolarizing current. For the fixed I = 0.05, the HR model shows a tonic bursting with a ∈ [1.8, 2.85] and a tonic spiking with a ≥ 2.9. On the other hand, for the fixed a = 2.8, the HR model shows a tonic bursting with I ∈ [0, 0.18] and a tonic spiking with I ∈ [0.2, 5].
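The equations in Sect. 3.1 can be cross-checked in code. The following sketch implements the HR right-hand side both directly and in the vectorized form ẇ = h(w) + Ξ(x, z)θ and verifies that the two agree; the initial state and the Euler step size are illustrative choices, not values from the paper.

```python
import numpy as np

def hr_rhs(w, a, alpha, mu, b, c, I):
    """Direct form of the single Hindmarsh-Rose model."""
    x, y, z = w
    return np.array([a*x**2 - x**3 - y - z + I,
                     (a + alpha)*x**2 - y,
                     mu*(b*x + c - z)])

def hr_rhs_vec(w, a, alpha, mu, b, c, I):
    """Vectorized form (S0): w' = h(w) + Xi(x, z) theta."""
    x, y, z = w
    h = np.array([-(x**3 + y + z), -y, 0.0])
    Xi = np.array([[x**2, 1.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, x**2, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, x, 1.0, -z]])
    theta = np.array([a, I, a + alpha, mu*b, mu*c, mu])
    return h + Xi @ theta

params = dict(a=2.8, alpha=1.6, mu=0.001, b=9.0, c=5.0, I=0.05)  # tonic bursting
w = np.array([0.1, 0.0, -0.5])
assert np.allclose(hr_rhs(w, **params), hr_rhs_vec(w, **params))

# a short Euler integration (illustrative step size)
for _ in range(10000):
    w = w + 1e-3 * hr_rhs(w, **params)
```

The agreement of the two forms confirms the bookkeeping θ = [a, I, a + α, μb, μc, μ]ᵀ used throughout the observer design.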
Fig. 1. The response of x in the tonic bursting
Fig. 2. 3-D surface of x, y, z in the tonic bursting
Fig. 3. The response of x in the tonic spiking
Fig. 4. 3-D surface of x, y, z in the tonic spiking
Fig. 5. The response of x1 in the intrinsic bursting neuron
Fig. 6. 3-D surface of x, y, z in the intrinsic bursting neuron
The parameters of the HR model in the tonic bursting (TB) are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. Figure 1 shows the response of x, and Figure 2 shows the 3-dimensional surface of x, y, z. We call this neuron the intrinsic bursting neuron (IBN).
The parameters of the HR model in the tonic spiking (TS) are given by a = 3.0, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The difference between the tonic bursting and the tonic spiking is only the value of the parameter a. Figure 3 shows the response of x, and Figure 4 shows the 3-dimensional surface of x, y, z. We call this neuron the intrinsic spiking neuron (ISN). When the external current changes from I = 0.05 to I = 0.2, the IBN shows a tonic spiking. Figure 5 shows the response of x, and Figure 6 shows the 3-dimensional surface of x, y, z.
4 Synaptically Coupled Model of HR Neuron

4.1 Dynamical Equations
Consider the following synaptically coupled two HR neurons [1]:

ẋ1 = a1 x1² − x1³ − y1 − z1 − gs (x1 − Vs1) Γ(x2),
ẏ1 = (a1 + α1) x1² − y1,  ż1 = μ1 (b1 x1 + c1 − z1),
ẋ2 = a2 x2² − x2³ − y2 − z2 − gs (x2 − Vs2) Γ(x1),
ẏ2 = (a2 + α2) x2² − y2,  ż2 = μ2 (b2 x2 + c2 − z2),

where Γ(x) is the sigmoid function given by

Γ(x) = 1 / (1 + exp(−λ(x − θs))).

4.2 Numerical Examples
Consider the IBN with a = 2.8 and the ISN with a = 10.8, whose other parameters are as follows: αi = 1.6, ci = 5, bi = 9, μi = 0.001, Vsi = 2, θs = −0.25, λ = 10. Figures 7 and 8 show the responses of the membrane potentials of the IBN and the ISN, respectively, coupled with the coupling strength gs = 0.05. Each neuron behaves as an intrinsic single neuron. As the coupling strength increases, however, the IBN shows a chaotic behavior. Figures 9 and 10 show the responses of the membrane potentials of the IBN and the ISN, respectively, coupled with the coupling strength gs = 1. Figure 11 shows the response of the membrane potentials in the coupling of two identical IBNs with the coupling strength gs = 0.05. In this case, the two IBNs synchronize as bursting neurons. Figure 12 shows the response of the membrane potentials in the coupling of two identical IBNs with the coupling strength gs = 1. The two IBNs synchronize as spiking neurons.

Fig. 7. The response of x1 of the IBN
Fig. 8. The response of x2 of the ISN
Fig. 9. The response of x1 of the IBN
Fig. 10. The response of x2 of the ISN
Fig. 11. The response of x1 of the IBN-IBN coupling with gs = 0.05
Fig. 12. The response of x1 of the IBN-IBN coupling with gs = 1
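A minimal Euler-integration sketch of the coupled model in Sect. 4.1, using the parameter values listed in the text; the initial state, step size, and duration are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

def gamma_sig(x, lam=10.0, theta_s=-0.25):
    """Sigmoid coupling function Gamma(x) = 1 / (1 + exp(-lam*(x - theta_s)))."""
    return 1.0 / (1.0 + np.exp(-lam * (x - theta_s)))

def coupled_hr_step(s, dt, a1, a2, gs, alpha=1.6, mu=0.001, b=9.0, c=5.0, Vs=2.0):
    """One Euler step of the synaptically coupled pair of HR neurons."""
    x1, y1, z1, x2, y2, z2 = s
    ds = np.array([a1*x1**2 - x1**3 - y1 - z1 - gs*(x1 - Vs)*gamma_sig(x2),
                   (a1 + alpha)*x1**2 - y1,
                   mu*(b*x1 + c - z1),
                   a2*x2**2 - x2**3 - y2 - z2 - gs*(x2 - Vs)*gamma_sig(x1),
                   (a2 + alpha)*x2**2 - y2,
                   mu*(b*x2 + c - z2)])
    return s + dt * ds

# IBN (a1 = 2.8) coupled to ISN (a2 = 10.8) with weak coupling gs = 0.05
s = np.array([0.1, 0.0, -0.5, -0.1, 0.0, -0.5])
for _ in range(20000):
    s = coupled_hr_step(s, 1e-3, a1=2.8, a2=10.8, gs=0.05)
```

Changing `gs` to 1 reproduces the strong-coupling regimes discussed above (chaotic IBN behavior, or spiking synchronization in the IBN-IBN case).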
5 Adaptive Observer with Full States
We present the parameter estimation problem to distinguish the firing patterns by using early-time dynamic behaviors. In this section, assuming that the full states are measurable, we present an adaptive observer to estimate all parameters in the single HR neuron.

5.1 Construction of Adaptive Observer
We present an adaptive observer as

(O0): dŵ/dt = W(ŵ − w) + h(w) + Ξ θ̂,

where ŵ = [x̂, ŷ, ẑ]ᵀ is an estimate of the states, θ̂ is an estimate of the unknown parameters, and W is selected as a stable matrix. Using the standard adaptive control theory [11], the parameter update law is given by

dθ̂/dt = Γ Ξᵀ P (w − ŵ),

where P is a positive definite solution of the following Lyapunov equation for a positive definite matrix Q:

Wᵀ P + P W = −Q.

5.2 Numerical Examples
We will show the simulation results for the single IBN case. The parameters in the tonic bursting are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The parameters of the adaptive observer are selected as W = −10 I3, Γ = diag{100, 50, 300}. Figure 13 shows the estimation behavior of a (solid line) and I (dotted line). The estimates â and Î converge to the true values of a and I.
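The construction in Sect. 5.1 can be sketched as a discrete-time simulation. With W = −10 I3 and Q = I3, the Lyapunov equation gives P = I3/20. The paper reports Γ = diag{100, 50, 300}; since θ here has six components, the six-dimensional Γ below is an illustrative assumption, as are the initial conditions and step size. Along trajectories, V = eᵀPe + θ̃ᵀΓ⁻¹θ̃ should be nonincreasing.

```python
import numpy as np

# Plant: single HR neuron in the vectorized form w' = h(w) + Xi(x, z) theta
a, alpha, mu, b, c, I = 2.8, 1.6, 0.001, 9.0, 5.0, 0.05
theta = np.array([a, I, a + alpha, mu*b, mu*c, mu])

def h_of(w):
    x, y, z = w
    return np.array([-(x**3 + y + z), -y, 0.0])

def Xi_of(w):
    x, _, z = w
    return np.array([[x**2, 1.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, x**2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x, 1.0, -z]])

W = -10.0 * np.eye(3)
P = np.eye(3) / 20.0                 # solves W^T P + P W = -I for this diagonal W
Gamma = np.diag([100.0, 50.0, 300.0, 300.0, 300.0, 300.0])  # illustrative gains
Gamma_inv = np.linalg.inv(Gamma)

def lyap_V(e, terr):
    """Lyapunov function V = e^T P e + theta_err^T Gamma^{-1} theta_err."""
    return e @ P @ e + terr @ Gamma_inv @ terr

dt = 1e-3
w = np.array([0.1, 0.0, -0.5])       # plant state
what = np.zeros(3)                   # observer state
thetahat = np.zeros(6)               # parameter estimates
V0 = lyap_V(w - what, theta - thetahat)
for _ in range(50000):
    Xi = Xi_of(w)
    dw = h_of(w) + Xi @ theta                          # plant (S0)
    dwhat = W @ (what - w) + h_of(w) + Xi @ thetahat   # observer (O0)
    dthetahat = Gamma @ Xi.T @ P @ (w - what)          # update law
    w = w + dt*dw
    what = what + dt*dwhat
    thetahat = thetahat + dt*dthetahat
V_end = lyap_V(w - what, theta - thetahat)
```

The decrease of V is the mechanism behind the convergence in Figure 13; full parameter convergence additionally requires sufficiently rich (e.g. bursting) trajectories.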
6 Adaptive Observer with a Partial State
We assume that the membrane potential x is available, but the other states are immeasurable. In this case, we consider the following problems:

– Estimate y and z using the available signal x;
– Estimate the parameter a or I to distinguish the firing patterns by using early-time dynamic behaviors.
6.1 Construction of Adaptive Observer
The parameters a and I are the key parameters that determine the firing pattern. The HR model can be rewritten in the following three forms [9]:

(S1): ẇ = Aw + h1(x) + b1 (x² a),   (1)
(S2): ẇ = Aw + h2(x) + b2 I,   (2)
(S3): ẇ = Aw + h3(x) + b2 (θᵀ ξ),   (3)

where

A = [[0, −1, −1], [0, −1, 0], [μb, 0, −μ]],
h1 = [−x³ + I, αx², μc]ᵀ,  b1 = [1, 1, 0]ᵀ,  b2 = [1, 0, 0]ᵀ,
h2 = [−x³ + ax², (a + α)x², μc]ᵀ,  h3 = [−x³, δx², μc]ᵀ,
θ = [a, I]ᵀ,  ξ = [x², 1]ᵀ.

In (S1) and (S2), the unknown parameters are assumed to be a and I, respectively. In (S3), we assume that the parameter δ = a + α is known, and a and I are unknown. Since the measurable signal is x, the output equation is given by

x = cw = [1 0 0] w.

We present adaptive observers that estimate the parameter for each system (Si), i = 1, 2, 3, as follows:

(O1): dŵ1/dt = Aŵ1 + h1(x) + b1 (x² â) + g(x − x̂),   (4)
(O2): dŵ2/dt = Aŵ2 + h2(x) + b2 Î + g(x − x̂),   (5)
(O3): dŵ3/dt = Aŵ3 + h3(x) + b2 (θ̂ᵀ ξ) + g(x − x̂),   (6)

where g is selected such that A − gc is stable. Since (A, b1, c) and (A, b2, c) are strictly positive real, the parameter estimation laws are given as

dâ/dt = γ1 x² (x − x̂),  dÎ/dt = γ2 (x − x̂).   (7)
Using the Kalman-Yakubovich (KY) lemma, we can show the asymptotic stability of the error system based on the standard adaptive control theory [11].

6.2 Numerical Examples
We will show the simulation results for the single IBN case. The parameters in the tonic spiking case are the same as in the previous simulation. Figures 14 and 15 show the parameters estimated by the adaptive observers (O1) and (O2), respectively. Figures 16 and 17 show the responses of y (solid line) and its estimate ŷ (dotted line) for the adaptive observer (O1) for t ≤ 500 and for t ≤ 20, respectively. Figure 18 shows the responses of z (solid line) and its estimate ẑ (dotted line) for the adaptive observer (O1). The simulation results for the other cases are omitted. The states and parameters can be asymptotically estimated.
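The partial-state observer (O1) for the form (S1) can be sketched as follows. The output-injection gain g and the adaptation gain γ1 are not given in the paper for this experiment; the values below are illustrative assumptions chosen so that A − gc is Hurwitz.

```python
import numpy as np

# HR parameters (tonic bursting); the unknown parameter is a
a, alpha, mu, b, c, I = 2.8, 1.6, 0.001, 9.0, 5.0, 0.05
A = np.array([[0.0, -1.0, -1.0],
              [0.0, -1.0,  0.0],
              [mu*b, 0.0, -mu]])
b1 = np.array([1.0, 1.0, 0.0])
cvec = np.array([1.0, 0.0, 0.0])
g = np.array([10.0, 0.0, 0.0])          # illustrative gain; A - g c must be stable
assert np.max(np.linalg.eigvals(A - np.outer(g, cvec)).real) < 0

def h1(x):
    return np.array([-x**3 + I, alpha*x**2, mu*c])

dt = 1e-3
w = np.array([0.1, 0.0, -0.5])          # plant state (x, y, z)
what = np.zeros(3)                      # observer state
ahat = 0.0                              # estimate of a
gamma1 = 2.0                            # illustrative adaptation gain
for _ in range(20000):
    x, xhat = w[0], what[0]
    dw = A @ w + h1(x) + b1 * (x**2 * a)                          # plant (S1)
    dwhat = A @ what + h1(x) + b1 * (x**2 * ahat) + g*(x - xhat)  # observer (O1)
    dahat = gamma1 * x**2 * (x - xhat)                            # update law (7)
    w = w + dt*dw
    what = what + dt*dwhat
    ahat = ahat + dt*dahat
```

Only the output x enters the observer correction and the adaptation law, in line with the partial-state setting; ŷ and ẑ are recovered as the internal components of ŵ1.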
Fig. 13. â (solid line) and Î (dotted line) in the adaptive observer (O0) with full states
Fig. 14. â (solid line) in the adaptive observer (O1) with x
Fig. 15. Î (solid line) in the adaptive observer (O2) with x
Fig. 16. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 500)
Fig. 17. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 20)
Fig. 18. z (solid line) and ẑ (dotted line) in the adaptive observer (O1) (t ≤ 20)
7 Conclusion
We presented estimators of the parameters of the HR model using the adaptive observer technique with output measurement data, namely the membrane potential. The proposed observers allow us to distinguish the firing pattern at an early time and to recover the immeasurable internal states.
References

1. Belykh, I., de Lange, E., Hasler, M.: Synchronization of Bursting Neurons: What Matters in the Network Topology. Phys. Rev. Lett. 94, 188101 (2005)
2. Izhikevich, E.M.: Simple Model of Spiking Neurons. IEEE Trans. on Neural Networks 14(6), 1569–1572 (2003)
3. Izhikevich, E.M.: Which Model to Use for Cortical Spiking Neurons? IEEE Trans. on Neural Networks 15(5), 1063–1070 (2004)
4. Watts, L.: A Tour of NeuraLOG and Spike - Tools for Simulating Networks of Spiking Neurons (1993), http://www.lloydwatts.com/SpikeBrochure.pdf
5. Hindmarsh, J.L., Rose, R.M.: A Model of the Nerve Impulse Using Two First-Order Differential Equations. Nature 296, 162–164 (1982)
6. Hindmarsh, J.L., Rose, R.M.: A Model of Neuronal Bursting Using Three Coupled First-Order Differential Equations. Proc. R. Soc. Lond. B 221, 87–102 (1984)
7. Carroll, T.L.: Chaotic Systems That Are Robust to Added Noise. CHAOS 15, 013901 (2005)
8. Meunier, N., Marion-Poll, F., Lansky, P., Rospars, J.P.: Estimation of the Individual Firing Frequencies of Two Neurons Recorded with a Single Electrode. Chem. Senses 28, 671–679 (2003)
9. Yu, H., Liu, Y.: Chaotic Synchronization Based on Stability Criterion of Linear Systems. Physics Letters A 314, 292–298 (2003)
10. Marino, R.: Adaptive Observers for Single Output Nonlinear Systems. IEEE Trans. on Automatic Control 35(9), 1054–1058 (1990)
11. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall, Englewood Cliffs (1989)
12. Arena, P., Fortuna, L., Frasca, M., Rosa, M.L.: Locally Active Hindmarsh-Rose Neurons. Chaos, Solitons and Fractals 27, 405–412 (2006)
13. Tokuda, I., Parlitz, U., Illing, L., Kennel, M., Abarbanel, H.: Parameter Estimation for Neuron Models. In: Proc. of the 7th Experimental Chaos Conference (2002), http://www.physik3.gwdg.de/~ulli/pdf/TPIKA02 pre.pdf
14. Steur, E.: Parameter Estimation in Hindmarsh-Rose Neurons (2006), http://alexandria.tue.nl/repository/books/626834.pdf
15. Fujikawa, H., Mitsunaga, K., Suemitsu, H., Matsuo, T.: Parameter Estimation of Biological Neuron Models with Bursting and Spiking. In: Proc. of SICE-ICASE International Joint Conference 2006, CD-ROM, pp. 4487–4492 (2006)
Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings

Akihisa Ichiki and Masatoshi Shiino

Department of Applied Physics, Faculty of Science, Tokyo Institute of Technology, 2-12-2 Ohokayama, Meguro-ku, Tokyo, Japan
Abstract. We derive Thouless-Anderson-Palmer (TAP) equations and order parameter equations for stochastic analog neural network models with fluctuating synaptic couplings. Such systems with a finite number of neurons originally have no energy concept, and thus they defy the use of the replica method or the cavity method, which require the energy concept. However, for some realizations of synaptic noise, the systems have an effective Hamiltonian, and the cavity method becomes applicable to derive the TAP equations.
1 Introduction
The replica method [1] for random spin systems has been successfully employed in neural network models of associative memory to obtain the order parameters and the storage capacity [2], and the cavity method [3] has been employed to derive the Thouless-Anderson-Palmer (TAP) equations [4,5]. However, these techniques require the energy concept. On the other hand, various types of neural network models which have no energy concept, such as networks with temporally fluctuating synaptic couplings, may exist. The alternative approach to the replica method for deriving the order parameter equations, called the self-consistent signal-to-noise analysis (SCSNA), is closely related to the cavity concept in the case where networks have a free energy [6,7]. An advantage of applying the SCSNA to neural networks is that the energy concept is not required to derive the order parameter equations once the TAP equations are obtained. The SCSNA, which was originally proposed for deriving a set of order parameter equations for deterministic analog neural networks, becomes applicable to stochastic networks by noting that the TAP equations define deterministic networks. Furthermore, the coefficients of the Onsager reaction terms characteristic of the TAP equations, which determine the form of the transfer functions in analog networks, are self-consistently obtained through the concept shared by the cavity method and the SCSNA. Thus the TAP equations as well as the order parameter equations are derived self-consistently by the hybrid use of the cavity method and the SCSNA in the case where the energy concept exists. However, the networks with synaptic noise, which have no energy concept, defy the use of the cavity method to derive the TAP equations. On the other hand, as in [8], the network with a specific
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 93–101, 2008. © Springer-Verlag Berlin Heidelberg 2008
involvement of synaptic noise can be analyzed by the cavity method to derive the TAP equation in the thermodynamic limit, since the energy concept appears as an effective Hamiltonian in this limit. It is natural to consider such neural network models with fluctuating synaptic couplings, since the synaptic couplings in real biological systems are updated by learning rules, and the time sequence of the synaptic couplings may be stochastic under the influence of noisy external stimuli. Thus the study of such networks is required to understand the retrieval process of realistic networks. The aim of this paper is two-fold: (i) we investigate for which realizations of synaptic noise the networks have an energy concept, so that the cavity method can be applied to derive the TAP equations; (ii) we show the TAP equations for the networks when the concept of the effective Hamiltonian appears. This paper is organized as follows: in the next section, we briefly review how the energy concept appears in a network with synaptic noise and derive the TAP equations and the order parameter equations by using the cavity method and the SCSNA [8]. Once the effective Hamiltonian is found, the replica method is also applicable to derive the order parameter equations. However, in the present paper, to make clear the relationship between the TAP equations and the order parameter equations, we do not use the replica trick. In section 3, we investigate the cases where the energy concept appears in networks with synaptic noise. We will see that the TAP equations and the order parameter equations for some models can be derived in the framework described in section 2. We will also mention that some difficulties in deriving the TAP equations arise for other involvements of synaptic noise. In the last section, we discuss the structure of the TAP equations for networks with temporally fluctuating synaptic noise and conclude this paper.
2 Brief Review on Effective Hamiltonian, TAP Equations and Order Parameter Equations
In this section, we briefly review how the cavity method becomes applicable to the network with fluctuating synaptic couplings [8]. Then we derive the TAP equations and the order parameter equations self-consistently in the framework of the SCSNA. We deal with the following stochastic analog neural network of N neurons with temporally fluctuating synaptic noise (multiplicative noise):

ẋi = −φ′(xi) + Σ_{j(≠i)} Jij(t) xj + ηi(t),   (1)
⟨ηi(t) ηj(t′)⟩ = 2D δij δ(t − t′),   (2)

where xi (i = 1, · · · , N) represents the state of the neuron at site i taking a continuous value, φ(xi) is a potential of an arbitrary form which determines the probability distribution of xi in the case without the input Σ_{j(≠i)} Jij xj, ηi is the Langevin white noise with noise intensity 2D, and Jij(t) is the synaptic coupling.
We note here that, in the case of an associative memory neural network, the synaptic coupling Jij is usually defined by the well-known Hebb learning rule. However, in the present paper, we will deal with the coupling Jij fluctuating around the Hebb rule with a white noise:

Jij(t) = J̄ij + εij(t),   (3)
⟨εij(t) εkl(t′)⟩ = (2D̃/N) δik δjl δ(t − t′),   (4)

where J̄ij is defined by the usual Hebb learning rule J̄ij ≡ (1/N) Σ_{μ=1}^{p} ξiμ ξjμ, with p = αN the number of patterns embedded in the network, ξiμ = ±1 the μth embedded pattern at neuron i, and εij(t) denotes the synaptic noise independent of ηi(t), which we assume in the present model to be a white noise with intensity 2D̃/N. Using the Ito integral, we obtain the Fokker-Planck equation corresponding to the Langevin equation (1) as

∂P(t, x)/∂t = −Σ_{i=1}^{N} (∂/∂xi) { [−φ′(xi) + Σ_{j(≠i)} J̄ij xj] − (D + D̃q̂) ∂/∂xi } P(t, x),   (5)

where q̂ ≡ (1/N) Σ_{j(≠i)} xj². Since the self-averaging property holds in the thermodynamic limit N → ∞, q̂ is identified as

q̂ = (1/N) Σ_{i=1}^{N} ⟨xi²⟩.   (6)
The order parameter q̂ is obtained self-consistently in our framework, as seen below. Supposing q̂ is given, one can easily find the equilibrium probability density for the Fokker-Planck equation (5) as

PN(x) = Z⁻¹ exp{ −βeff [ Σ_{i=1}^{N} φ(xi) − Σ_{i<j} J̄ij xi xj ] },   (7)

where Z denotes the normalization constant and

βeff⁻¹ ≡ D + D̃q̂   (8)

plays the role of the effective temperature of the network. The temperature of the system is modified as a consequence of the multiplicative noise, and it depends on the order parameter q̂. Notice here that the equilibrium distribution of the system becomes the Gibbs distribution in the thermodynamic limit N → ∞. The equilibrium solution of equation (5) for the finite N-body system indeed differs from the probability density (7). However, since (1/N) Σ_{j(≠i)} xj² − (1/N) Σ_j ⟨xj²⟩ = O(1/√N), the difference between the probability densities of the finite N-body system PN and of the system in the thermodynamic limit PN→∞ is PN→∞ − PN = O(1/√N).
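The renormalization βeff⁻¹ = D + D̃q̂ has a direct interpretation: over a time step dt, the increment of the coupling-noise term Σ_j εij(t) xj has variance 2D̃q̂ dt, i.e. it acts as extra additive noise of intensity D̃q̂ on top of D. The sketch below checks this numerically for a frozen network state; N, D̃, and the sample count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Dtil, dt = 100, 0.3, 1e-3
x = rng.normal(size=N)              # a frozen network state
qhat = np.mean(x**2)                # order parameter q^ = (1/N) sum_j x_j^2

# Wiener increments of eps_ij over dt have variance 2*Dtil*dt/N (cf. eq. (4));
# the resulting increment of the coupling-noise term is dB_i = sum_j dW_ij x_j.
M = 20000                           # number of independent samples
dW = rng.normal(0.0, np.sqrt(2*Dtil*dt/N), size=(M, N))
dB = dW @ x

emp = dB.var() / dt                 # empirical noise intensity, should be ~ 2*Dtil*qhat
assert abs(emp - 2*Dtil*qhat) < 0.1 * 2*Dtil*qhat
```

Since Var(dB_i) = (2D̃dt/N) Σ_j x_j² = 2D̃q̂ dt exactly, the empirical estimate matches 2D̃q̂ up to sampling error, which is the diffusion term D + D̃q̂ appearing in (5) and (8).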
Thus one can conclude that the equilibrium density for equation (5) converges to the probability density (7) in the thermodynamic limit N → ∞. Since we have explicitly written down the equilibrium probability density (7) in a Gibbsian form, one can define the effective Hamiltonian of the (sufficiently large) N-body system as

HN ≡ Σ_{i=1}^{N} φ(xi) − Σ_{i<j} J̄ij xi xj.   (9)
Since we have found the effective Hamiltonian and the effective temperature, one can apply the usual cavity method [3] to this system and derive the TAP equation. According to the cavity method, we divide the Hamiltonian of the N-body system (9) into that of the (N−1)-body system and the part involving the state of the ith neuron as HN = φ(xi) − hi xi + HN−1, where hi ≡ Σ_{j(≠i)} J̄ij xj is the local field at site i and the Hamiltonian of the (N−1)-body system is given as HN−1 ≡ Σ_{j(≠i)} φ(xj) − Σ_{j<k; j,k(≠i)} J̄jk xj xk. Then the joint probability density of xi and hi reads

PN(xi, hi) = Z̃⁻¹ exp{ −βeff [φ(xi) − hi xi] } PN−1(hi),

where Z̃ is the normalization constant and PN−1(hi) denotes the probability density of the local field hi in the (N−1)-body system, defined as

PN−1(hi) ≡ Z_{N−1}⁻¹ ∫ [ Π_{j(≠i)} dxj ] δ( hi − Σ_{j(≠i)} J̄ij xj ) exp[−βeff HN−1],

where Z_{N−1} denotes the normalization constant. Since the local field is given as the summation of a sufficiently large number of random variables whose cross-correlations are expected to be O(1/√N), one can expect that PN−1(hi) becomes a Gaussian density in the thermodynamic limit N → ∞ according to the central limit theorem:

PN−1(hi) = (1/√(2πσ²)) exp[ −(hi − ⟨hi⟩N−1)² / (2σ²) ],

where ⟨·⟩N−1 represents the thermal average with respect to the (N−1)-body probability density PN−1(x) and σ² is the variance of PN−1(hi), which is evaluated later self-consistently in the framework of the SCSNA. Then taking the average of xi with respect to the marginal probability PN(xi, hi) straightforwardly yields

⟨xi⟩ = F(⟨hi⟩N−1),   (10)

where F is a transfer function defined as

F(y) ≡ ∫dx x exp{ −βeff [φ(x) − yx − (βeff σ²/2) x²] } / ∫dx exp{ −βeff [φ(x) − yx − (βeff σ²/2) x²] }.   (11)
Similarly, ⟨hi⟩N−1 is obtained as ⟨hi⟩N−1 = hi − βeff σ² xi. Thus we have the pre-TAP equation

xi = F( Σ_{j(≠i)} J̄ij xj − ΓOns xi ),   (12)

where ΓOns ≡ βeff σ². Since the concrete form of the transfer function F depends on the effective temperature βeff and the variance of the local field σ², it is necessary to obtain βeff and σ² to have the TAP equation [7,9]. Then we can apply the SCSNA to equation (12) to determine βeff and σ² self-consistently. We assume that only one condensed pattern {ξi1} is retrieved. Using the overlap order parameter mμ ≡ (1/N) Σ_{i=1}^{N} ξiμ xi and the concept of the SCSNA [7,9], we have
hi = ξi1 m1 + ξiμ mμ + ziμ + ΓSCSNA xi,   (13)

where Σ_{ν≥2} ξiν mν = ξiμ mμ + ziμ + γxi, ΓSCSNA ≡ γ − α, and ziμ is a Gaussian random variable with zero mean. We will evaluate the overlap mμ self-consistently and obtain ziμ and γ. Substituting equation (13) into the pre-TAP equation (12) reads xi = F(ξi1 m1 + ξiμ mμ + ziμ + (ΓSCSNA − ΓOns)xi), and comparing this equation with equation (10) yields [7] ΓSCSNA = ΓOns, since ⟨hi⟩N−1 is considered to be a Gaussian random variable which should not contain the Onsager reaction term. Noting that mμ = O(1/√N) for μ ≥ 2, one can obtain the overlaps for the noncondensed patterns as

mμ = [1/(N(1 − U))] Σ_{j=1}^{N} ξjμ F(ξj1 m1 + zjμ),   (14)
U ≡ (1/N) Σ_{j=1}^{N} F′(ξj1 m1 + zjμ),   (15)
where F′ denotes the derivative of the transfer function F, and the order expansion of F with respect to 1/√N has been applied to xi = F(ξi1 m1 + ziμ + ξiμ mμ). Using equation (14) and the definitions of ziμ and γ, one finds

γ = α/(1 − U),
ziμ = [1/(N(1 − U))] Σ_{ν(≠1,μ)} Σ_{j(≠i)} ξiν ξjν F(ξj1 m1 + zjν).

Thus the variance of ziμ is evaluated as

σz² = [α/(1 − U)²] ⟨F(ξm1 + z)²⟩_{ξ,z},   (16)

where ⟨·⟩_{ξ,z} represents the average over the random variable ξ = ±1 and the Gaussian variable z, and the self-averaging property has been used. Similarly, one obtains the set of order parameter equations as

m1 = ⟨ξ F(ξm1 + z)⟩_{ξ,z},   (17)
U = ⟨F′(ξm1 + z)⟩_{ξ,z},   (18)
ΓOns = ΓSCSNA = αU/(1 − U).   (19)

For the present model, it is required to evaluate the order parameter q̂ to determine the form of the transfer function F. Since q̂ is related to the susceptibility and, by the definition (11) of F, the order parameter U corresponds to the susceptibility as U = βeff(⟨x²⟩ − ⟨x⟩²), one finds

q̂ = U/βeff + [(1 − U)²/α] σz².   (20)

The set of equations (16), (17), (18), (19), (20) takes a closed form, and one can determine the form of F self-consistently, as well as the set of order parameters. Therefore, substituting into the pre-TAP equation (12) the solutions βeff and ΓOns that are self-consistently obtained within this framework yields the TAP equation.
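As a sanity check on the transfer function (11), the following sketch evaluates F(y) by direct numerical quadrature and compares it against the closed form available for a quadratic potential φ(x) = x²/2, for which the weight in (11) is Gaussian and F(y) = y/(1 − βeff σ²) (valid for βeff σ² < 1). The values of βeff and σ² are illustrative.

```python
import numpy as np

beta_eff, sigma2 = 1.0, 0.5          # illustrative values, beta_eff*sigma2 < 1

def F(y, phi):
    """Transfer function (11) by quadrature on a uniform grid."""
    x = np.linspace(-20.0, 20.0, 400001)
    w = np.exp(-beta_eff * (phi(x) - y*x - 0.5*beta_eff*sigma2*x**2))
    return np.sum(x * w) / np.sum(w)   # grid spacing cancels in the ratio

phi_quad = lambda x: 0.5 * x**2
y = 0.3
assert abs(F(y, phi_quad) - y / (1.0 - beta_eff*sigma2)) < 1e-4
assert abs(F(0.0, phi_quad)) < 1e-8
```

In a full self-consistent solution, this quadrature would sit inside a fixed-point iteration over (m1, U, σz², q̂, βeff) defined by equations (16)-(20), with the Gaussian averages ⟨·⟩_{ξ,z} evaluated by a further quadrature over z.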
3 Models with Effective Hamiltonian and Effective Temperature
In this section, we examine the applicability of the concept shown in the previous section to network models with various types of involvement of synaptic noise. As shown in the previous section, the network has an effective Hamiltonian when the equilibrium density becomes a Gibbsian one in the thermodynamic limit. Since the forms of the equilibrium probability densities are estimated by investigating the forms of the Fokker-Planck equations corresponding to the original models, we will examine the concrete forms of the drift and diffusion coefficients of the Fokker-Planck equations in the thermodynamic limit. Once the Fokker-Planck equation for each model is obtained, the form of the equilibrium distribution is estimated, and hence we can know whether the network has an energy concept in the thermodynamic limit. We deal with the following models: the state of the ith neuron xi obeys the dynamics (1) and (2), and the synaptic couplings Jij(t) are given in Table 1:

Table 1.
Table 1.

case   | J_ij(t)                | noise
(i)    | J̄_ij + ε_ij(t)         | ⟨ε_ij(t)ε_kl(t′)⟩ = 2δ_ik δ_jl δ(t − t′)/β̃N
(ii)   | J̄_ij + ε_i(t)ε_j(t)    | ⟨ε_i(t)ε_j(t′)⟩ = √(2/β̃N) δ_ij δ^{1/2}(t − t′)
(iii)  | J̄_ij + ε_i(t)ε̃_j(t)   | ⟨ε_i(t)ε_j(t′)⟩ = √(2/β̃N)^γ δ_ij δ^{1/2}(t − t′); ⟨ε̃_i(t)ε̃_j(t′)⟩ = √(2/β̂N)^{1−γ} δ_ij δ^{1/2}(t − t′); ⟨ε_i(t)ε̃_j(t′)⟩ = 0
(iv)   | J̄_ij(1 + ε_ij(t))      | ⟨ε_ij(t)ε_kl(t′)⟩ = 2δ_ik δ_jl δ(t − t′)/β̃
(v)    | J̄_ij(1 + ε_j(t))       | ⟨ε_i(t)ε_j(t′)⟩ = 2δ_ij δ(t − t′)/β̃
(vi)   | J̄_ij(1 + ε_i(t)ε_j(t)) | ⟨ε_i(t)ε_j(t′)⟩ = √(2/β̃) δ_ij δ^{1/2}(t − t′)
(vii)  | J̄_ij(1 + ε_i(t)ε̃_j(t))| ⟨ε_i(t)ε_j(t′)⟩ = √2 δ_ij δ^{1/2}(t − t′)/(β̃N)^γ; ⟨ε̃_i(t)ε̃_j(t′)⟩ = √2 δ_ij δ^{1/2}(t − t′)/(β̂N)^{1/2−γ}; ⟨ε_i(t)ε̃_j(t′)⟩ = 0
(viii) | J̄_ij + ε_j(t)          | ⟨ε_i(t)ε_j(t′)⟩ = 2δ_ij δ(t − t′)/β̃N
(ix)   | J̄_ij + ε_i(t)          | ⟨ε_i(t)ε_j(t′)⟩ = 2δ_ij δ(t − t′)/β̃N
(x)    | J̄_ij(1 + ε_i(t))       | ⟨ε_i(t)ε_j(t′)⟩ = 2δ_ij δ(t − t′)/β̃
(xi)   | J̄_ij + ε(t)            | ⟨ε(t)ε(t′)⟩ = 2δ(t − t′)/β̃N
(xii)  | J̄_ij(1 + ε(t))         | ⟨ε(t)ε(t′)⟩ = 2δ(t − t′)/β̃
Note that, in Table 1, ⟨·⟩ denotes the average with respect to the noise, β̃ and β̂ are related to the noise intensities of ε_i and ε̃_i respectively, and γ is an arbitrary real number. The case (i) has been treated in the previous section. Using the Itô integral, we can straightforwardly find the diffusion coefficient D^(2)_ij for each model, as shown in Table 2. In Table 2, the order parameters q̂ and M are defined as q̂ ≜ Σ_i x_i²/N and M ≜ Σ_i x_i/√N, and the local field is defined as h_i ≜ Σ_{j(≠i)} J̄_ij x_j. The drift coefficients D^(1)_i are the same for all cases: D^(1)_i = −φ′(x_i) + Σ_{j(≠i)} J̄_ij x_j. Thus the energy concept, or the effective Hamiltonian, exists for the models (i)-(vii). In these models, the effective temperature 1/β_eff = D^(2)_ii also exists.
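The entries of Table 2 can be checked against a direct simulation. The sketch below is a minimal, illustrative Euler-Maruyama integration for case (i), assuming a Langevin dynamics dx_i = [−φ′(x_i) + Σ_j J_ij(t)x_j]dt plus thermal noise, with φ(x) = x²/2 and Gaussian couplings J̄_ij as placeholders for the actual potential and Hebbian couplings; all numerical values are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dt, n_steps = 100, 1e-3, 2000
beta, beta_t = 2.0, 4.0                  # thermal / synaptic noise intensities
Jbar = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))  # placeholder couplings
np.fill_diagonal(Jbar, 0.0)
x = rng.normal(0.0, 0.1, N)

for _ in range(n_steps):
    # case (i): J_ij = Jbar_ij + eps_ij, noise intensity 2/(beta_t*N) per unit time,
    # so the per-step standard deviation of eps_ij is sqrt(2/(beta_t*N*dt))
    eps = rng.normal(0.0, np.sqrt(2.0 / (beta_t * N * dt)), (N, N))
    drift = -x + (Jbar + eps) @ x        # phi(x) = x^2/2 assumed
    x = x + drift * dt + np.sqrt(2.0 * dt / beta) * rng.normal(size=N)

q_hat = np.mean(x ** 2)                  # order parameter q^ = sum_i x_i^2 / N
T_eff = 1.0 / beta + q_hat / beta_t      # effective temperature, case (i) of Table 2
```

The measured q̂ feeds the diagonal diffusion coefficient (1/β + q̂/β̃)δ_ij of case (i), illustrating how the synaptic noise renormalises the temperature.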
A. Ichiki and M. Shiino
Therefore the TAP equations and the order parameter equations for these models can be obtained in the framework described in the previous section. However, in the case (viii), the diffusion coefficient D^(2)_ij contains non-zero off-diagonal elements and hence the equilibrium distribution does not take the form of the Gibbs distribution. The same difficulty occurs in the cases (xi) and (xii). In the case (ix), the effective Hamiltonian and the effective temperature exist; however, evaluating the quantity M is required to determine the effective temperature, and since the self-averaging property does not hold for M, the effective temperature cannot be found in the framework described above. In the case (x), the local field h_i is itself a thermally fluctuating random variable, and hence the dynamics of the local fields h_i are also required to estimate the effective temperature; note that the equilibrium distribution for this model may not become the Gibbs distribution, since the dynamics of the local fields must also be considered. In the case (xi), the diffusion coefficient has off-diagonal elements and depends on M, which cannot be evaluated by the self-averaging property. In the case (xii), the diffusion coefficient also has off-diagonal elements and depends on the temporally fluctuating local fields h_i. Deriving the TAP equations and the order parameter equations for the models (viii)-(xii) is therefore left as future work.

Table 2.
case   | D^(2)_ij               | energy concept
(i)    | (1/β + q̂/β̃)δ_ij       | exists
(ii)   | (1/β + q̂/β̃)δ_ij       | exists
(iii)  | (1/β + q̂/β̃β̂)δ_ij     | exists
(iv)   | (1/β + αq̂/β̃)δ_ij      | exists
(v)    | (1/β + αq̂/β̃)δ_ij      | exists
(vi)   | (1/β + αq̂/β̃)δ_ij      | exists
(vii)  | (1/β + αq̂/β̃β̂)δ_ij    | exists
(viii) | δ_ij/β + q̂/β̃          | does not exist
(ix)   | (1/β + M²/β̃)δ_ij       | exists
(x)    | (1/β + h_i²/β̃)δ_ij     | does not exist
(xi)   | δ_ij/β + M²/β̃          | does not exist
(xii)  | δ_ij/β + h_i h_j/β̃     | does not exist
4 Conclusion
In this paper we have derived the TAP equations and the order parameter equations for the neural network models with synaptic noise. In the models presented
in this paper, the concept of energy, or the Hamiltonian, does not exist in the original finite N-body system. As we have seen, the effective Hamiltonian and the effective temperature exist in the thermodynamic limit, and the TAP equations together with the order parameter equations are then derived by use of the cavity method and the SCSNA, as shown in Section 2, for the cases (i)-(vii). For the cases (viii)-(xii), however, the difficulties mentioned in the previous section arise; deriving the TAP equations and the order parameter equations for these cases remains future work.

Acknowledgments. One of the authors (A. I.) was supported by the 21st Century COE Program at Tokyo Tech "Nanometer-Scale Quantum Physics" by the Ministry of Education, Culture, Sports, Science and Technology.
References

1. Sherrington, D., Kirkpatrick, S.: Solvable Model of a Spin-Glass. Phys. Rev. Lett. 35, 1792-1796 (1975)
2. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks. Phys. Rev. Lett. 55, 1530-1533 (1985)
3. Mézard, M., Parisi, G., Virasoro, M.A.: Spin Glass Theory and Beyond. World Scientific, Singapore (1987)
4. Thouless, D.J., Anderson, P.W., Palmer, R.G.: Solution of 'solvable model of a spin glass'. Philos. Mag. 35, 593-601 (1977)
5. Morita, T., Horiguchi, T.: Exactly solvable model of a spin glass. Solid State Commun. 19, 833-835 (1976)
6. Shiino, M., Fukai, T.: Self-consistent signal-to-noise analysis and its application to analogue neural networks with asymmetric connections. J. Phys. A: Math. Gen. 25, L375-L381 (1992)
7. Shiino, M., Yamana, M.: Statistical mechanics of stochastic neural networks: Relationship between the self-consistent signal-to-noise analysis, Thouless-Anderson-Palmer equation, and replica symmetric calculation approaches. Phys. Rev. E 69, 011904-1-13 (2004)
8. Ichiki, A., Shiino, M.: Thouless-Anderson-Palmer equation for analog neural network with temporally fluctuating white synaptic noise. J. Phys. A: Math. Theor. 40, 9201-9211 (2007)
9. Shiino, M., Fukai, T.: Self-consistent signal-to-noise analysis of the statistical behavior of analog neural networks and enhancement of the storage capacity. Phys. Rev. E 48, 867-897 (1993)
Spike-Timing Dependent Plasticity in Recurrently Connected Networks with Fixed External Inputs

Matthieu Gilson 1,2,3, David B. Grayden 1,2,3, J. Leo van Hemmen 4, Doreen A. Thomas 1,3, and Anthony N. Burkitt 1,2,3

1 Department of Electrical and Electronic Engineering, The University of Melbourne
2 The Bionic Ear Institute, Melbourne, Australia
3 NICTA, The Victorian Research Lab, The University of Melbourne
4 Physik Department, Technische Universität München, Germany
[email protected]
Abstract. This paper investigates spike-timing dependent plasticity (STDP) for recurrently connected weights in a network with fixed external inputs (homogeneous Poisson pulse trains). We use a dynamical system to model the network activity and predict its asymptotic evolution, which turns out to qualitatively depend on the learning parameters and the correlation structure of the inputs. Our predictions are supported by numerical simulations of Poisson neuron networks in general cases as well as for certain cases when using Integrate-And-Fire (IF) neurons. Keywords: Spiking neurons, neurodynamics, learning, STDP, dynamical system, recurrent network.
1 Introduction
The hypothesis that changes in the efficacies of neuronal connections depend upon the correlations in timing of pre- and postsynaptic action potentials (or spikes) has received considerable experimental support [1]. Such a learning mechanism is related to the notion of Hebbian learning [6] and can implement functional organisation in neural networks. Understanding the underlying mechanisms of learning in the brain is crucial both at the physiological level and for applications (e.g., neural prostheses, robotics). Spike-timing dependent plasticity (STDP, [3]) is a fruitful candidate for this mechanism, which has been analysed in previous studies for feed-forward network architectures of Poisson neurons [7,8]. This theory was recently extended to the case of recurrently connected architectures, and applied to a fully connected network with no external input [2]. This paper pursues the development of this framework [2] based on the linear Poisson neuron model and additive STDP (as described in Sec. 2). The network activity is characterised by the firing rates of and the correlations between the neurons. The evolution of the network activity (or neural dynamics) is described by a dynamical system (as discussed in Sec. 3). It is analysed in terms of
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 102-111, 2008. © Springer-Verlag Berlin Heidelberg 2008
Spike-Timing Dependent Plasticity in Recurrently Connected Networks
equilibria (or fixed points) and stability in order to predict the asymptotic network behaviour when stimulated by steady external inputs (Sec. 4).
2 Models

2.1 Linear Poisson Neuron
In this study, we use the linear Poisson neuron [7]. The neuronal output firing mechanism is approximated by an inhomogeneous Poisson process generated by the Poisson parameter ρ(t) (it determines the probability of emitting an action potential, considered as an instantaneous event); ρ(t) evolves according to the sum of all the synaptic contributions (cf. Fig. 1.A)

ρ(t) = ν₀ + Σ_k J_k(t) Σ_n ε(t − t_{k,n} − d_k).   (1)
As in [7], the total synaptic influx is the temporal and spatial (over the synapses) summation of the post-synaptic potentials (PSPs), modelled by the synaptic response kernel ε weighted by the synaptic efficacies J_k (or weights; note that we only consider positive weights here). ε models the current injected into the postsynaptic neuron by one single spike (indexed by n) at time t_{k,n} at the k-th synaptic connection, after the fixed delay d_k. Finally, ν₀ is the spontaneous firing rate caused by background activity (identical for all the neurons).

2.2 Additive STDP
Gerstner et al. [3] first proposed a framework to study additive STDP, where the change of synaptic weight J_k induced by a single pair of pre- and post-synaptic pulses at times (t_k, t_out) is

ΔJ_k ∝ w^in + w^out + W(t_k − t_out).   (2)
J_k is the weight of the k-th synapse. w^in and w^out are the rate-based learning parameters (one contribution per pulse at the pulse time); W is the STDP learning window (cf. Fig. 1.B), chosen so that a synaptic weight is potentiated for a given pulse pair if the presynaptic input precedes the postsynaptic spike (i.e. "takes part" in the output firing), and is depressed otherwise [3]. Usually, such changes are scaled by a learning rate η.

2.3 Link with the Physiology
In our abstract model, the neurons are excited by external inputs, which convey “neural information” that is assumed to be encoded in their firing rate and correlation structures. The neurons also have a spontaneous activity that can be understood as a consequence of a background activity. There are thus two sources of stochasticity, which can be interpreted as follows: one is related to the external inputs (pulse trains described by their firing rate and correlation
M. Gilson et al.
Fig. 1. A: Linear Poisson neuron model. The output pulse train S(t) (top) is generated using the inhomogeneous Poisson parameter ρ(t) (middle). Each pre-synaptic pulse at the k-th synapse (bottom) induces a variation of ρ determined by the post-synaptic response kernel ε, the synaptic weight J_k(t) and the delay d_k. B: STDP window function W (in arbitrary units). Each side of the real axis is determined by a decaying exponential: u → c_x exp(−|u|/τ_x), with the index x = D for the depression part (u > 0) and x = P for the potentiation part (u < 0); cf. [2, Sec. 5] for details and the parameter values.
structures) and it can be linked to the variability observed in physiological data; the other is due to the additional impact of the background activity upon the generation of the spike output of each neuron (intrinsic stochasticity modelled by the Poisson process).
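The window of Fig. 1.B and the update rule of Eq. 2 can be sketched as follows; the amplitudes c_P, c_D and time constants τ_P, τ_D below are illustrative placeholders (the actual values are given in [2, Sec. 5]), with c_D < 0 so that u > 0 yields depression.

```python
import numpy as np

def stdp_window(u, c_P=1.0, c_D=-0.5, tau_P=17e-3, tau_D=34e-3):
    """STDP window W of Fig. 1.B: a decaying exponential on each side of
    the axis; u = t_pre - t_post, u < 0 -> potentiation, u > 0 -> depression."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 0.0,
                    c_P * np.exp(-np.abs(u) / tau_P),
                    c_D * np.exp(-np.abs(u) / tau_D))

def delta_J(u, w_in=0.0, w_out=0.0, eta=1e-4):
    """Additive STDP update of Eq. 2 for a single pre/post spike pair,
    scaled by the learning rate eta."""
    return eta * (w_in + w_out + stdp_window(u))
```

With these conventions, a presynaptic spike 5 ms before the postsynaptic one (u = −5 ms) potentiates the synapse, while the reverse ordering depresses it.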
3 Theoretical Analysis

3.1 Characterisation of the Neural Activity
We consider a network of N Poisson neurons (referred to as internal neurons) stimulated with M Poisson pulse trains (or external inputs), as shown in Fig. 2. The activity of the neural network can be described using the firing rates and the pairwise correlations, plus the weights. In the terminology of the theory of dynamical systems, these variables characterise the "state" of the network activity at each time t, and their evolution is referred to as the neural dynamics. Similar to [7,2], we consider the time-averaged firing rates ν_i(t) (for the internal neuron indexed by i; T is a given time period)

ν_i(t) ≜ (1/T) ∫_{t−T}^{t} ⟨S_i(t′)⟩ dt′,   (3)

where S_i(t) is the spike-time series of the i-th neuron and the brackets ⟨...⟩ denote the ensemble average. Likewise for the correlation coefficients (time-averaged correlation function convoluted with the STDP window function W): D^W_ik(t) between the i-th internal neuron and the k-th external input,

D^W_ik(t) ≜ (1/T) ∫_{t−T}^{t} ∫_{−∞}^{+∞} W(u) ⟨S_i(t′) Ŝ_k(t′ + u)⟩ dt′ du,   (4)

and Q^W_ij(t) between the i-th and j-th internal neurons (with S_i and S_j) [2, Sec. 3].
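For a single realisation (so that the ensemble average is dropped), Eqs. 3 and 4 reduce to counting spikes and summing W over pairwise lags, since the pulse trains are sums of Dirac deltas. A minimal sketch, with our own function names:

```python
import numpy as np

def firing_rate(spike_times, t, T):
    """Time-averaged firing rate nu_i(t) of Eq. 3 for one spike train."""
    s = np.asarray(spike_times, dtype=float)
    return np.sum((s > t - T) & (s <= t)) / T

def corr_coeff_DW(spikes_i, spikes_k, t, T, W):
    """D^W_ik(t) of Eq. 4: the double integral collapses to a sum of
    W(u) over all lags u between spikes of S_i falling in (t-T, t]
    and spikes of the external train S^_k."""
    si = np.asarray(spikes_i, dtype=float)
    sk = np.asarray(spikes_k, dtype=float)
    si = si[(si > t - T) & (si <= t)]
    lags = sk[None, :] - si[:, None]     # all pairwise lags t_k^n - t_i^m
    return np.sum(W(lags)) / T
```

In practice these single-trial estimates are noisy; the theory works with their expected values under the Poisson model.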
Fig. 2. Presentation of the network and the notation. The internal neurons (within the network) are indexed by i ∈ [1..N], and their output pulse trains are denoted S_i(t) (each can be understood as a sum of Dirac functions at the spiking times [2, Sec. 2]). Likewise for the external input pulse trains Ŝ_k(t) (k ∈ [1..M]). The time-averaged firing rates are denoted by ν_i(t), cf. Eq. 3; the correlation coefficients within the network by Q_ij(t) (resp. D_ik(t) between a neuron in the network and an external input, and Q̂_kl(t) between two external inputs), cf. Eq. 4. The weight of the connection from the k-th external input onto the i-th internal neuron is denoted K_ik(t) (resp. J_ij(t) from the j-th internal neuron onto the i-th internal neuron).
The stimulation parameters (which represent the "information" carried by the external inputs) are determined by the time-averaged input spiking rates ν̂_k(t) (defined similarly to ν_i(t)) and their correlation coefficients Q̂^W_kl(t) (defined similarly to D^W_ik(t) and Q^W_ij(t)) [2, Sec. 3]. In this paper, we only consider stimulation parameters that are constant in time.

3.2 Learning Equations
Learning equations can be derived from Eq. 2 as in Kempter et al. [7]. This requires the assumption that the internal pulse trains are statistically independent (which can be considered valid when the number of recurrent connections is large enough) and a small learning rate η. This leads to the matrix equation Eq. 10 for the weights between internal neurons (resp. Eq. 9 for the input weights K).

3.3 Activation Dynamics
In order to study the evolution of the weights described by Eqs. 9 and 10, we need to evaluate the time-averaged firing rates of the neurons (the vector ν(t)) and their time-averaged correlation coefficients (the matrices D^W(t) and Q^W(t)). Similar to Kempter et al. [7] and Burkitt et al. [2], we approximate the instantaneous firing rate ⟨S_i(t)⟩ of the i-th Poisson neuron by its expected inhomogeneous Poisson parameter ρ(t) (cf. Eq. 1), and we neglect the impact of the short-time dynamics (the synaptic response kernel ε and the synaptic delays d̂_ik and d_ij) by using time-averaged variables (over a "long" period T). We require that the
learning occurs slowly compared to the activation mechanisms (cf. the neuron and synapse models), so that T is large compared to the time scale of these mechanisms but small compared to the inverse of the learning rate, η⁻¹. This leads to the consistency matrix equations of the firing rates (Eq. 6) and of the correlation coefficients (Eqs. 7 and 8). See Burkitt et al. [2, Sec. 3] for details of the derivation.

Note that the consistency equations of the correlation coefficients as defined in [2, Sec. 3 and 4] have been reformulated to express the usual covariance, using the assumption that the correlations are quasi-constant in time [5] (this implies that Q̂^{VT} [2, Sec. 3] is actually equal to Q̂^W). Eqs. 7 and 8 express the impact of the connectivity (through the term [I − J]⁻¹K) on the internal firing rates and the cross covariances in terms of the input covariance Ĉ^W,

Ĉ^W ≜ Q̂^W − W̄ ν̂ ν̂^T,   (5)

where W̄ ≜ ∫ W(u) du evaluates the balance between the potentiation and the depression of our STDP rule. These equations are obtained by combining the equations in [2] with the firing-rate consistency equation Eq. 6.

3.4 Network Dynamical System
Putting everything together, the network dynamics is described by

ν = [I − J]⁻¹ (ν₀ E + K ν̂),   (6)
D^W − W̄ ν ν̂^T = [I − J]⁻¹ K (Q̂^W − W̄ ν̂ ν̂^T),   (7)
Q^W − W̄ ν ν^T = [I − J]⁻¹ K (Q̂^W − W̄ ν̂ ν̂^T) K^T [I − J]⁻¹ᵀ,   (8)
dK/dt = Φ_K (w^in E ν̂^T + w^out ν Ê^T + D^W),   (9)
dJ/dt = Φ_J (w^in E ν^T + w^out ν E^T + Q^W),   (10)

where E is the unit vector of N elements (likewise for Ê with M elements); Φ_J is a projector on the space of N × N matrices to which J belongs [2, Sec. 3]. The effect of Φ_J is to nullify the coefficients corresponding to missing connections in the network, viz. all the diagonal terms, because the self-connection of a neuron onto itself is forbidden. More generally, such a projection operator can account for any network connectivity [2, Sec. 3]. Note that time has been rescaled in order to remove η from these equations, and for simplicity of notation the dependence on time t will be omitted in the rest of this paper. The matrix [I − J(t)] is assumed invertible at all times (the contrary would imply diverging firing rates [2, Sec. 4]).
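A minimal numerical sketch of one Euler step of this system (Eqs. 5, 6, 8 and 10, with fixed input weights K and full recurrent connectivity, so that Φ_J only zeroes the diagonal). The function name and array conventions (ν̂ as an M×1 column, Q̂^W as M×M) are our own:

```python
import numpy as np

def euler_step(J, K, nu_hat, Qhat_W, nu0, w_in, w_out, Wbar, dt):
    """One Euler step of dJ/dt (Eq. 10), with nu from Eq. 6 and
    Q^W from Eqs. 5 and 8; Phi_J zeroes the diagonal (no self-connection)."""
    N = J.shape[0]
    E = np.ones((N, 1))
    inv = np.linalg.inv(np.eye(N) - J)
    nu = inv @ (nu0 * E + K @ nu_hat)                         # Eq. 6
    C_hat = Qhat_W - Wbar * (nu_hat @ nu_hat.T)               # Eq. 5
    Q_W = Wbar * (nu @ nu.T) + inv @ K @ C_hat @ K.T @ inv.T  # Eq. 8
    dJ = w_in * (E @ nu.T) + w_out * (nu @ E.T) + Q_W         # Eq. 10
    np.fill_diagonal(dJ, 0.0)                                 # projector Phi_J
    return J + dt * dJ, nu
```

Iterating this step (with weight bounds enforced, as in the simulations of Sec. 4) reproduces the homeostatic behaviour of the mean weight while individual weights diverge.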
4 Recurrent Network with Fixed Input Weights
We now examine the case of a recurrently connected network with fixed input weights K and learning on the recurrent weights J. Only Eqs. 6, 8 and 10 remain,
which simplifies the analysis. In the case of full recurrent connectivity, Φ_J in Eq. 10 only nullifies the diagonal terms of the square matrix in its argument.

4.1 Analytical Predictions
Homeostatic equilibrium. Similarly to [7,2], we derive the scalar equations for the mean firing rate ν_av ≜ N⁻¹ Σ_i ν_i and the mean weight J_av ≜ (N(N−1))⁻¹ Σ_{i≠j} J_ij. This consists of neglecting the inhomogeneities of the firing rates, of the weights, and of the connectivity over the network, and we obtain

ν_av = (ν₀ + (K ν̂)_av) / (1 − (N − 1) J_av),
J̇_av = (w^in + w^out) ν_av + (W̄ + C̃) ν_av²,   (11)

where (K ν̂)_av denotes the mean of the matrix product K ν̂ and C̃ is defined as

C̃ ≜ (K Ĉ^W K^T)_av / [ν₀ + (K ν̂)_av]².   (12)

Eqs. 11 are thus the same equations as for the case with no input [2, Sec. 5], with ν₀ and W̄ replaced by, resp., ν₀ + (K ν̂)_av and W̄ + C̃. This qualitatively implies the same dynamical behaviour: provided that W̄ + C̃ < 0, the network exhibits a homeostatic equilibrium (when it is realisable; it requires in particular μ > 0), and the means of the firing rates and of the weights converge towards

ν*_av = μ ≜ −(w^in + w^out) / (W̄ + C̃),
J*_av = (μ − ν₀ − (K ν̂)_av) / ((N − 1) μ).   (13)
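As a worked example of Eq. 13, the equilibrium values can be computed directly; all numerical values below are illustrative choices of our own (not taken from the paper), picked so that W̄ + C̃ < 0 and the equilibrium is realisable:

```python
# Homeostatic equilibrium of Eq. 13 (mean-field level).
w_in, w_out = 0.05, 0.05          # rate-based learning parameters
Wbar, C_tilde = -0.004, -0.001    # stability requires Wbar + C_tilde < 0
nu0, K_nu_av = 5.0, 10.0          # spontaneous rate, mean external drive (Hz)
N = 30

mu = -(w_in + w_out) / (Wbar + C_tilde)            # equilibrium mean rate
nu_av_star = mu
J_av_star = (mu - nu0 - K_nu_av) / ((N - 1) * mu)  # equilibrium mean weight

# consistency check with the first line of Eq. 11:
# (nu0 + K_nu_av) / (1 - (N-1)*J_av_star) equals mu
```

Note that realisability (J*_av > 0) requires μ > ν₀ + (K ν̂)_av, i.e. the learning parameters must place the equilibrium rate above the total feed-forward drive.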
If the input correlations are time-invariant functions, and if the correlations are positive (i.e. the inputs are more likely to fire in a synchronous way) and homogeneous among the correlated input pool, it follows that C̃ is of the same sign as W̄. Therefore, the condition for stability reverts to W̄ < 0 as in [7,2].

Structural dynamics. For the particular case of uncorrelated inputs (or, more generally, when K Ĉ^W K^T = 0), the fixed-point structure of the dynamical system is qualitatively the same as for the case of no external input [2, Sec. 5]: a homogeneous distribution of the firing rates and a continuous manifold of fixed points for the internal weights. In the case of "spatial" inhomogeneities over the input correlations, the network dynamics shows a different evolution. To illustrate this and to compare it with the case of feed-forward connections, we consider the network configuration described in Fig. 3 and inspired by [7], where one input pool has correlated sources while the other pool has uncorrelated sources. In general, the equations that
Fig. 3. Architecture of the simulated network. The inputs are divided into two subpools, each feeding half of the internal network with means K1 and K2 for the input weights from each subpool. Similarly to the recurrent weights J and the firing rates νˆ and ν, the inhomogeneities are neglected within each subpool and they are assumed to be all identical to their mean. The weights and delays are initially set with 10% random variation around a mean.
determine the fixed points have no exact solution. Yet, we can reduce the dimensionality by neglecting the variance within each subpool and make approximations to evaluate the asymptotic distribution of the firing rates, which turns out to be bimodal.

4.2 Simulation Protocol and Results
We simulated a network of Poisson neurons as described in Fig. 3 with random initial recurrent weights (uniformly distributed in a given interval, as are all the synaptic delays). An issue with such simulations is maintaining positive internal weights during the homeostatic convergence, because the weights individually diverge. Thus, not all equilibria are realisable, depending on the initial distributions of K and J and on the weight bounds [7,2]. See [2, Sec. 5] for details of the simulation parameters. An interesting first case to test the analytical predictions consists of two pools of uncorrelated inputs, each feeding half of the network with distinct weights (K₁ν̂₁ ≠ K₂ν̂₂ and Ĉ^W = 0). Each half of the internal network thus initially has distinct firing rates and, as predicted, the outcome is a convergence of these firing rates towards a uniform value (similar to the case of no external input [2, Sec. 5]). In the case of a fully connected network (for both the K and the J, according to Fig. 3) stimulated by one correlated input pool (short-time correlation inspired by [4], so that Ĉ^W ≠ 0) and one uncorrelated pool, both the internal firing rates and the internal weights also exhibit a homeostatic equilibrium. As shown in Fig. 4, the means over the network (thick solid lines) converge towards the predicted equilibrium values (dashed lines). Furthermore, the individual firing rates tend to stabilise and their distribution remains bimodal: the subpool #1 excited by correlated inputs eventually fires at a lower rate when W̄ < 0. The recurrent weights individually diverge similarly to [2, Sec. 5] and reorganise so that the outgoing weights from the subpool #1 (see the means over each weight subpool
Fig. 4. Evolution of the firing rates (left) and of the recurrent weights (right) for N = 30 fully connected Poisson neurons (cf. Fig. 3) with short-time correlated inputs. The outcome is a quasi-bimodal distribution of the firing rates (the grey bundle; the mean is the thick solid line) around the predicted homeostatic equilibrium (dashed line). The subgroup #1 that receives correlated inputs is more excited initially but fires at a lower rate at the end of the simulation (cf. the two thin black solid lines, which represent the mean over each subpool). The internal weights individually diverge, while their mean (thick line) converges towards the predicted equilibrium value (dashed line). They reorganise themselves so that the weights outgoing from the subpool #2 (which receives uncorrelated inputs) become silent, while the ones from #1 are strengthened. Note that the homeostatic equilibrium is preserved even when some weights saturate.
Fig. 5. Evolution of the firing rates for a partially connected network of N = 75 neurons. Both the K and the J have 40% probability of connection, with the same setup as the network in Fig. 4 (to preserve the total input synaptic strength). The mean firing rate (very thick line) still converges towards the predicted equilibrium value (dashed line) and the two subgroups (grey bundles, each mean represented by a thin black solid line) become separated similarly to the case of full connectivity. The internal weights (not shown) exhibit similar dynamics to the case of full connectivity.
J11 and J21 in Fig. 4) are strengthened while the other ones almost become silent (see J12 and J22 ). In other words, the subpool that receives correlated inputs takes the upper hand in the recurrent architecture.
Fig. 6. Simulation of networks of IF neurons with partial random connectivity of 50%. The network qualitatively exhibits the expected behaviour in the case of uncorrelated inputs (left) and inputs with one correlated pool and one uncorrelated pool (right).
In the case of partial connectivity for both the K and the J, the behaviour of the individual firing rates (cf. Fig. 5) still follows the predictions, but they are more dispersed, their convergence is slower, and the bimodal distribution is not always observed as clearly as in the case of full connectivity (in Fig. 5 the means of the two internal neuron subpools nevertheless remain clearly separated). The homeostatic equilibrium of the internal weights also holds, and they individually diverge. The partial connectivity needs to be rich enough for the predictions on the mean variables to be sufficiently accurate. First results with IF neurons indicate that the analytical predictions remain valid even when the activation mechanisms are more complex (here with a connectivity of 50%, cf. Fig. 6).
5 Discussion and Future Work
The analytical results presented here are preliminary, and further investigation is needed to gain a better understanding of the interplay between the input correlation structure and STDP. Nevertheless, our results illustrate two points: STDP induces stable activity in recurrent architectures similar to that in feed-forward ones (homeostatic regulation of the network activity under the condition W̄ < 0); and the qualitative structure of the internal firing rates is mainly determined by the input correlation structure. Namely, a "poor" correlation structure (uncorrelated or delta-correlated inputs, so that Ĉ^W = 0) induces a homogenisation of the firing activity. Finally, partial connectivity impacts upon the structure of the internal firing rates, but such networks still exhibit behaviour similar to fully connected ones. Preliminary results involving more complex patterns of K Q̂^W K^T suggest a more complex interplay between the input correlation structure and the equilibrium distribution of the network firing rates and weights. Such cases are under investigation and may constitute more "interesting" dynamic behaviour of the network from a cognitive modelling point of view, namely through the
relationship between the attractors of the network activity and the input structure. The case of learning for both the input connections and the recurrent ones will also form part of a future study. Comparison with IF neurons suggests that the impact of the neuron activation mechanisms on the weight dynamics may not be significant; this can be linked to the separation of time scales between them in the case of slow learning. Note that with our approximations the IF neurons are assumed to be in a "linear input-output regime" (no bursting, for instance).
Acknowledgments The authors thank Iven Mareels, Chris Trengove, Sean Byrnes and Hamish Meffin for useful discussions that introduced significant improvements. MG is funded by two scholarships from The University of Melbourne and NICTA. ANB and DBG acknowledge funding from the Australian Research Council (ARC Discovery Projects #DP0453205 and #DP0664271) and The Bionic Ear Institute.
References

1. Bi, G.Q., Poo, M.M.: Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience 24, 139-166 (2001)
2. Burkitt, A.N., Gilson, M., van Hemmen, J.L.: Spike-timing-dependent plasticity for neurons with recurrent connections. Biological Cybernetics 96, 533-546 (2007)
3. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A neuronal learning rule for sub-millisecond temporal coding. Nature 383, 76-78 (1996)
4. Gütig, R., Aharonov, R., Rotter, S., Sompolinsky, H.: Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. Journal of Neuroscience 23, 3697-3714 (2003)
5. Hawkes, A.G.: Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B 33, 438-443 (1971)
6. Hebb, D.O.: The organization of behavior: a neuropsychological theory. Wiley, Chichester (1949)
7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Hebbian learning and spiking neurons. Physical Review E 59, 4498-4514 (1999)
8. van Rossum, M.C.W., Bi, G.Q., Turrigiano, G.G.: Stable Hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience 20, 8812-8821 (2000)
A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer's Disease Based on EEG

Justin Dauwels, François Vialatte, and Andrzej Cichocki

RIKEN Brain Science Institute, Saitama, Japan
[email protected], {fvialatte,cia}@brain.riken.jp
Abstract. It has repeatedly been reported in the medical literature that the EEG signals of Alzheimer's disease (AD) patients are less synchronous than those of age-matched control patients. This phenomenon, however, does not at present allow reliable prediction of AD at an early stage, so-called mild cognitive impairment (MCI), due to the large variability among patients. In recent years, many novel techniques to quantify EEG synchrony have been developed; some of them are believed to be more sensitive to abnormalities in EEG synchrony than traditional measures such as the cross-correlation coefficient. In this paper, a wide variety of synchrony measures is investigated in the context of AD detection, including the cross-correlation coefficient, the mean-square and phase coherence functions, Granger causality, the recently proposed corr-entropy coefficient and two novel extensions, phase synchrony indices derived from the Hilbert transform and time-frequency maps, information-theoretic divergence measures in the time domain and time-frequency domain, state space based measures (in particular, non-linear interdependence measures and the S-estimator), and lastly, the recently proposed stochastic-event synchrony measures. For the data set at hand, only two synchrony measures are able to convincingly distinguish MCI patients from age-matched control patients (p < 0.005): Granger causality (in particular, the full-frequency directed transfer function) and stochastic event synchrony (in particular, the fraction of non-coincident activity). Combining those two measures with additional features may eventually yield a reliable diagnostic tool for MCI and AD.
1 Introduction
Many studies have shown that the EEG signals of AD patients are generally less coherent than those of age-matched control patients (see [1] for an in-depth review). It is noteworthy, however, that this effect is not always easily detectable: there tends to be a large variability among AD patients. This is especially the case for patients in the pre-symptomatic phase, commonly referred to as Mild Cognitive Impairment (MCI), during which neuronal degeneration occurs prior to the appearance of clinical symptoms. On the other hand, it is crucial to predict AD at
A Comparative Study of Synchrony Measures for the Early Detection of AD
an early stage: medications that aim at delaying the effects of AD (and hence intend to improve the quality of life of AD patients) are most effective if applied in the pre-symptomatic phase. In recent years, a large variety of measures has been proposed to quantify EEG synchrony (we refer to [2]-[5] for recent reviews on EEG synchrony measures); some of those measures are believed to be more sensitive to perturbations in EEG synchrony than classical indices such as the cross-correlation coefficient or the coherence function. In this paper, we systematically investigate the state of the art of measuring EEG synchrony, with special focus on the detection of AD in its early stages. (A related study has been presented in [6,7] in the context of epilepsy.) We consider various synchrony measures stemming from a wide spectrum of disciplines, such as physics, information theory, statistics, and signal processing. Our aim is to investigate which measures are the most suitable for detecting the effect of synchrony perturbations in MCI and AD patients; we also wish to better understand which aspects of synchrony are captured by the different measures, and how the measures are related to each other. This paper is structured as follows. In Section 2 we review the synchrony measures considered in this paper. In Section 3 those measures are applied to EEG data, in particular for the purpose of detecting MCI; we describe the EEG data set, elaborate on various implementation issues, and present our results. At the end of the paper, we briefly relate our results to earlier work and speculate about the neurophysiological interpretation of our results.
2 Synchrony Measures
We briefly review the various families of synchrony measures investigated in this paper: the cross-correlation coefficient and its analogues in the frequency and time-frequency domains, Granger causality, phase synchrony, state space based synchrony, information-theoretic interdependence measures, and lastly, stochastic event synchrony measures, which we developed in recent work.
2.1 Cross-Correlation Coefficient
The cross-correlation coefficient r is perhaps the most well-known measure of (linear) interdependence between two signals x and y. If x and y are not linearly correlated, r is close to zero; on the other hand, if both signals are identical, then r = 1 [8].
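As a concrete illustration, r can be computed in a few lines. The following sketch (Python/NumPy, not part of the original paper) uses hypothetical test signals:

```python
import numpy as np

def cross_correlation_coefficient(x, y):
    """Zero-lag Pearson correlation coefficient r between two signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

t = np.linspace(0.0, 1.0, 500)
x = np.sin(2 * np.pi * 10 * t)
r_same = cross_correlation_coefficient(x, x)   # identical signals: 1.0 (up to rounding)
r_orth = cross_correlation_coefficient(x, np.cos(2 * np.pi * 10 * t))  # close to 0
```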
2.2 Coherence
The coherence function quantifies linear correlations in the frequency domain. One distinguishes the magnitude squared coherence function c(f) and the phase coherence function φ(f) [8].
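The magnitude squared coherence can be estimated by averaging cross- and auto-spectra over windowed segments (Welch-style). The following sketch is an illustration with hypothetical data, not the paper's implementation:

```python
import numpy as np

def magnitude_squared_coherence(x, y, fs, nperseg=256):
    """Welch-style estimate of c(f) = |Sxy|^2 / (Sxx * Syy)."""
    n = (min(len(x), len(y)) // nperseg) * nperseg
    xs = np.reshape(np.asarray(x, dtype=float)[:n], (-1, nperseg))
    ys = np.reshape(np.asarray(y, dtype=float)[:n], (-1, nperseg))
    win = np.hanning(nperseg)
    X = np.fft.rfft(xs * win, axis=1)
    Y = np.fft.rfft(ys * win, axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)      # averaged cross-spectrum
    Sxx = np.mean(np.abs(X) ** 2, axis=0)      # averaged auto-spectra
    Syy = np.mean(np.abs(Y) ** 2, axis=0)
    f = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return f, np.abs(Sxy) ** 2 / (Sxx * Syy)

# Two channels sharing a 10 Hz rhythm, plus independent noise (hypothetical data).
rng = np.random.default_rng(0)
fs, T = 200.0, 2048
t = np.arange(T) / fs
common = np.sin(2 * np.pi * 10 * t)
x = common + 0.5 * rng.standard_normal(T)
y = common + 0.5 * rng.standard_normal(T)
f, c = magnitude_squared_coherence(x, y, fs)
# The coherence peaks near the shared 10 Hz component.
```

An equivalent, more carefully windowed routine is available off the shelf as scipy.signal.coherence.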
J. Dauwels, F. Vialatte, and A. Cichocki

2.3 Corr-Entropy Coefficient
The corr-entropy coefficient rE is a recently proposed [9] non-linear extension of the correlation coefficient r; it is close to zero if x and y are independent (which is a stronger property than being uncorrelated).
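The following sketch shows one common formulation of a centered, kernel-based correlation coefficient in the spirit of [9]; the Gaussian kernel and its width sigma are illustrative choices, and this is not necessarily the exact estimator of [9]:

```python
import numpy as np

def correntropy_coefficient(x, y, sigma=1.0):
    """Centered correntropy coefficient with a Gaussian kernel (one common
    formulation; the kernel width sigma is a free parameter)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = lambda d: np.exp(-d ** 2 / (2.0 * sigma ** 2))

    def centered(u, v):
        # E[k(u - v)] over paired samples, minus its value under independence
        # (estimated from all sample pairs).
        return np.mean(k(u - v)) - np.mean(k(u[:, None] - v[None, :]))

    return centered(x, y) / np.sqrt(centered(x, x) * centered(y, y))

rng = np.random.default_rng(2)
x = rng.standard_normal(800)
y_ind = rng.standard_normal(800)
r_same = correntropy_coefficient(x, x)       # identical signals: 1.0
r_ind = correntropy_coefficient(x, y_ind)    # independent signals: near 0
```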
2.4 Coh-Entropy and Wav-Entropy Coefficient
One can define a non-linear magnitude squared coherence function, which we will refer to as the "coh-entropy" coefficient cE(f); it is an extension of the corr-entropy coefficient to the frequency domain. The corr-entropy coefficient rE can also be extended to the time-frequency domain, by replacing the signals x and y in the definition of rE by their time-frequency ("wavelet") transforms. In this paper, we use the complex Morlet wavelet, which is known to be well suited for EEG signals [10]. The resulting measure is called the "wav-entropy" coefficient wE(f). (To our knowledge, both cE(f) and wE(f) are novel.)
2.5 Granger Causality
Granger causality¹ refers to a family of synchrony measures that are derived from linear stochastic models of time series; like the above linear interdependence measures, they quantify to which extent different signals are linearly interdependent. Whereas the above linear interdependence measures are bivariate, i.e., they can only be applied to pairs of signals, Granger causality measures are multivariate: they can be applied to multiple signals simultaneously. Suppose that we are given n signals X1(k), X2(k), ..., Xn(k), each stemming from a different channel. We consider the multivariate autoregressive (MVAR) model:

X(k) = \sum_{j=1}^{p} A(j)\, X(k-j) + E(k),    (1)
where X(k) = (X1(k), X2(k), ..., Xn(k))^T, p is the model order, the model coefficients A(j) are n × n matrices, and E(k) is a zero-mean Gaussian random vector of size n. In words: each signal Xi(k) is assumed to depend linearly on its own p past values and the p past values of the other signals Xj(k). The deviation between X(k) and this linear dependence is modeled by the noise component E(k). Model (1) can also be cast in the form:

E(k) = \sum_{j=0}^{p} \tilde{A}(j)\, X(k-j),    (2)
¹ The Granger causality measures we consider here are implemented in the BioSig library, available from http://biosig.sourceforge.net/
where Ã(0) = I (the identity matrix) and Ã(j) = −A(j) for j > 0. One can transform (2) into the frequency domain (by applying the z-transform and substituting z = e^{−2πif Δt}, where 1/Δt is the sampling rate):
X(f) = \tilde{A}^{-1}(f)\, E(f) = H(f)\, E(f).    (3)
The power spectrum matrix of the signal X(k) is determined as

S(f) = X(f)\, X^{*}(f) = H(f)\, V\, H^{*}(f),    (4)
where V stands for the covariance matrix of E(k). The Granger causality measures are defined in terms of the coefficients of the matrices A, H, and S. Due to space limitations, only a short description of these methods is provided here; additional information can be found in the existing literature (e.g., [4]). From these coefficients, two symmetric measures can be defined:

– Granger coherence |Kij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f.
– Partial coherence (PC) |Cij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f when the influence (i.e., linear dependence) of the other signals is statistically removed.

The following asymmetric ("directed") Granger causality measures capture causal relations:

– The directed transfer function (DTF) γ²ij(f) quantifies the fraction of inflow to channel i stemming from channel j.
– The full frequency directed transfer function (ffDTF)

  F_{ij}^2(f) = \frac{|H_{ij}(f)|^2}{\sum_f \sum_{j=1}^{m} |H_{ij}(f)|^2} \in [0, 1],    (5)

  is a variation of γ²ij(f) with a global normalization in frequency.
– The partial directed coherence (PDC) |Pij(f)| ∈ [0, 1] represents the fraction of outflow from channel j to channel i.
– The direct directed transfer function (dDTF) χ²ij(f) = F²ij(f) C²ij(f) is non-zero if the connection between channels i and j is causal (non-zero F²ij(f)) and direct (non-zero C²ij(f)).
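The MVAR machinery behind these measures can be sketched in a few lines: fit the coefficients A(j) by least squares, form H(f) = Ã(f)^{-1}, and row-normalize |Hij(f)|² to obtain the DTF. This is a minimal illustration on a hypothetical two-channel system (the BioSig library mentioned above provides full implementations):

```python
import numpy as np

def fit_mvar(X, p):
    """Least-squares estimate of A(1..p) in X(k) = sum_j A(j) X(k-j) + E(k).

    X has shape (n_channels, n_samples)."""
    n, T = X.shape
    lags = np.vstack([X[:, p - j:T - j] for j in range(1, p + 1)])  # (n*p, T-p)
    A = X[:, p:] @ np.linalg.pinv(lags)                              # (n, n*p)
    return [A[:, i * n:(i + 1) * n] for i in range(p)]

def dtf(A_list, f, dt):
    """Directed transfer function gamma_ij^2(f): row-normalized |H_ij(f)|^2."""
    n = A_list[0].shape[0]
    Af = np.eye(n, dtype=complex)
    for j, A in enumerate(A_list, start=1):
        Af = Af - A * np.exp(-2j * np.pi * f * j * dt)   # A~(f), with A~(0) = I
    H = np.linalg.inv(Af)                                 # transfer matrix H(f)
    P = np.abs(H) ** 2
    return P / P.sum(axis=1, keepdims=True)

# Toy system (hypothetical): channel 0 drives channel 1, but not vice versa.
rng = np.random.default_rng(1)
T = 4000
X = np.zeros((2, T))
e = rng.standard_normal((2, T))
for k in range(1, T):
    X[0, k] = 0.5 * X[0, k - 1] + e[0, k]
    X[1, k] = 0.7 * X[0, k - 1] + 0.2 * X[1, k - 1] + e[1, k]
A_hat = fit_mvar(X, p=1)
g = dtf(A_hat, f=10.0, dt=1.0 / 200.0)
# g[1, 0] (inflow to channel 1 from channel 0) is large; g[0, 1] is near zero.
```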
2.6 Phase Synchrony
Phase synchrony refers to the interdependence between the instantaneous phases φx and φy of two signals x and y; the instantaneous phases may be strongly synchronized even when the amplitudes of x and y are statistically independent. The instantaneous phase φx of a signal x may be extracted as [11]:
\phi_x^{H}(k) = \arg\big[x(k) + i\,\tilde{x}(k)\big],    (6)
where x̃ is the Hilbert transform of x. Alternatively, one can derive the instantaneous phase from the time-frequency transform X(k, f) of x:
\phi_x^{W}(k, f) = \arg[X(k, f)].    (7)
The phase φW_x(k, f) depends on the center frequency f of the applied wavelet. By appropriately scaling the wavelet, the instantaneous phase may be computed in the frequency range of interest. The phase synchrony index γ for two instantaneous phases φx and φy is defined as [11]:

\gamma = \big|\langle e^{i(n\phi_x - m\phi_y)} \rangle\big| \in [0, 1],    (8)

where n and m are integers (usually n = 1 = m) and ⟨·⟩ denotes the time average. We will use the notation γH and γW to indicate whether the instantaneous phases are computed by the Hilbert transform or the time-frequency transform, respectively. In this paper, we will consider two additional phase synchrony indices, i.e., the evolution map approach (EMA) and the instantaneous period approach (IPA) [12]. Due to space constraints, we will not describe those measures here; instead, we refer the reader to [12]²; additional information about phase synchrony can be found in [6].
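For n = m = 1, the index γ (often called the phase-locking value) can be estimated from the analytic signal. A NumPy-only sketch with hypothetical signals (the FFT-based analytic-signal construction below plays the role of scipy.signal.hilbert):

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal x + i*hilbert(x) via the one-sided FFT construction."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def phase_locking_value(x, y):
    """gamma = |<exp(i(phi_x - phi_y))>| for n = m = 1."""
    phx = np.angle(analytic_signal(x))
    phy = np.angle(analytic_signal(y))
    return float(np.abs(np.mean(np.exp(1j * (phx - phy)))))

t = np.arange(2000) / 200.0
x = np.sin(2 * np.pi * 10 * t)
y = np.sin(2 * np.pi * 10 * t + 0.8)     # constant phase lag: gamma near 1
rng = np.random.default_rng(3)
z = np.sin(2 * np.pi * 10 * t
           + 2 * np.pi * np.cumsum(rng.standard_normal(2000)) * 0.05)
# x vs. y: locked phases; x vs. z: drifting phase, so gamma is much smaller.
```

Note that γ is insensitive to the amplitudes: only the phase difference enters the average.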
2.7 State Space Based Synchrony
State space based synchrony (or "generalized synchronization") evaluates synchrony by analyzing the interdependence between the signals in a reconstructed state space (see, e.g., [7]). The central hypothesis behind this approach is that the signals at hand are generated by some (unknown) deterministic, potentially high-dimensional, non-linear dynamical system. In order to reconstruct such a system from a signal x, one considers delay vectors X(k) = (x(k), x(k − τ), ..., x(k − (m − 1)τ))^T, where m is the embedding dimension and τ denotes the time lag. If τ and m are appropriately chosen, and the signals are indeed generated by a deterministic dynamical system (to a good approximation), the delay vectors lie on a smooth manifold ("mapping") in R^m, apart from small stochastic fluctuations. The S-estimator [13], here denoted by Sest, is a state space based measure obtained by applying principal component analysis (PCA) to delay vectors³. We also considered three measures of non-linear interdependence, S^k, H^k, and N^k (see [6] for details⁴).
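A rough sketch of the idea behind the S-estimator: delay-embed both signals, stack the coordinates, and turn the normalized eigenvalue spectrum of their correlation matrix into a synchronization index (larger when one component dominates, near 0 for independent signals). This follows the general recipe of [13] but is not the authors' implementation; m, τ, and the toy signals are arbitrary choices:

```python
import numpy as np

def delay_embed(x, m, tau):
    """Coordinates of the delay vectors (x(k), x(k-tau), ..., x(k-(m-1)tau)) as rows."""
    N = len(x) - (m - 1) * tau
    return np.array([x[(m - 1 - i) * tau:(m - 1 - i) * tau + N] for i in range(m)])

def s_estimator(x, y, m=5, tau=1):
    """Eigenvalue-entropy synchronization index between two embedded signals."""
    Z = np.vstack([delay_embed(x, m, tau), delay_embed(y, m, tau)])  # (2m, N)
    lam = np.linalg.eigvalsh(np.corrcoef(Z))   # PCA of the stacked embedding
    lam = np.clip(lam, 1e-12, None)            # guard against tiny negatives
    lam = lam / lam.sum()
    return float(1.0 + np.sum(lam * np.log(lam)) / np.log(len(lam)))

rng = np.random.default_rng(4)
x = rng.standard_normal(3000)
y_ind = rng.standard_normal(3000)
s_same = s_estimator(x, x)        # identical signals: clearly above 0
s_ind = s_estimator(x, y_ind)     # independent noise: near 0
```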
2.8 Information-Theoretic Measures
Several interdependence measures have been proposed that have their roots in information theory. Mutual information I is perhaps the most well-known information-theoretic interdependence measure; it quantifies the amount of information the random variable Y contains about the random variable X (and vice versa); it is always positive, and it vanishes when X and Y are statistically independent. Recently, a sophisticated and effective technique to compute mutual information between time series was proposed [14]; we will use that method in this paper⁵. The method of [14] computes mutual information in the time domain; alternatively, this quantity may also be determined in the time-frequency domain (denoted by IW), more specifically, from normalized spectrograms [15,16] (see also [17,18]). We will also consider several information-theoretic measures that quantify the dissimilarity (or "distance") between two random variables (or signals). In contrast to the previously mentioned measures, those divergence measures vanish if the random variables (or signals) are identical; moreover, they are not necessarily symmetric, and therefore they cannot be considered distance measures in the strict sense. Divergences may be computed in the time domain and the time-frequency domain; in this paper, we will only compute the divergence measures in the time-frequency domain, since the computation in the time domain is far more involved. We consider the Kullback-Leibler divergence K, the Rényi divergence Dα, the Jensen-Shannon divergence J, and the Jensen-Rényi divergence Jα. Due to space constraints, we will not review those divergence measures here; we refer the interested reader to [15,16].

² Program code is available at www.agnld.uni-potsdam.de/%7Emros/dircnew.m
³ We used the S Toolbox downloadable from http://aperest.epfl.ch/docs/software.htm
⁴ Software is available from http://www.vis.caltech.edu/~rodri/software.htm
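For intuition, mutual information can be crudely estimated by binning the joint distribution; note that this is only an illustration with hypothetical data — the paper uses the far more accurate nearest-neighbour estimator of [14]:

```python
import numpy as np

def binned_mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in nats (crude; biased upward)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(5)
x = rng.standard_normal(5000)
y_dep = x + 0.3 * rng.standard_normal(5000)   # strongly dependent
y_ind = rng.standard_normal(5000)             # independent
mi_dep = binned_mutual_information(x, y_dep)  # large
mi_ind = binned_mutual_information(x, y_ind)  # small positive bias term
```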
2.9 Stochastic Event Synchrony (SES)
Stochastic event synchrony, an interdependence measure we developed in earlier work [19], describes the similarity between the time-frequency transforms of two signals x and y. As a first step, the time-frequency transform of each signal is approximated as a sum of (half-ellipsoid) basis functions, referred to as "bumps" (see Fig. 1 and [20]). The resulting bump models, representing the most prominent oscillatory activity, are then aligned (see Fig. 2): bumps in one time-frequency map may not be present in the other map ("non-coincident bumps"); other bumps are present in both maps ("coincident bumps"), but appear at slightly different positions on the maps. The black lines in Fig. 2 connect the centers of coincident bumps, and hence visualize the offset in position between pairs of coincident bumps. Stochastic event synchrony consists of five parameters that quantify the alignment of two bump models:

– ρ: fraction of non-coincident bumps,
– δt and δf: average time and frequency offset, respectively, between coincident bumps,
– st and sf: variance of the time and frequency offset, respectively, between coincident bumps.

The alignment of the two bump models (cf. Fig. 2 (right)) is obtained by iterative max-product message passing on a graphical model; the five SES parameters are determined from the resulting alignment by maximum a posteriori (MAP) estimation [19]. The parameters ρ and st are the most relevant for the present study, since they quantify the synchrony between bump models (and hence, the original time-frequency maps); low ρ and st imply that the two time-frequency maps at hand are well synchronized.

⁵ The program code (in C) is available at www.klab.caltech.edu/~kraskov/MILCA/

Fig. 1. Bump modeling

Fig. 2. Coincident and non-coincident activity ("bumps"); (left) bump models of two EEG signals; (right) coincident bumps (ρ = 27%); the black lines connect the centers of coincident bumps
3 Detection of EEG Synchrony Abnormalities in MCI Patients
In the following section, we describe the EEG data we analyzed. In Section 3.2 we address certain technical issues related to the synchrony measures, and in Section 3.3 we present and discuss our results.
3.1 EEG Data
The EEG data⁶ analyzed here have been analyzed in previous studies concerning early detection of AD [21]–[25]. They consist of rest eyes-closed EEG data recorded from 21 sites on the scalp based on the 10–20 system. The sampling frequency was 200 Hz, and the signals were band-pass filtered between 4 and 30 Hz. The subjects comprised two study groups. The first consisted of a group of 25 patients who had complained of memory problems. These subjects were then diagnosed as suffering from MCI and subsequently developed mild AD. The criterion for inclusion in the MCI group was a mini mental state exam (MMSE) score above 24; the average score in the MCI group was 26 (SD of 1.8). The other group was a control set consisting of 56 age-matched, healthy subjects who had no memory or other cognitive impairments. The average MMSE of this control group was 28.5 (SD of 1.6). The ages of the two groups were 71.9 ± 10.2 and 71.7 ± 8.3, respectively. Pre-selection was conducted to ensure that the data were of a high quality, as determined by the presence of at least 20 s of artifact-free data. Based on this requirement, the number of subjects in the two groups described above was reduced to 22 MCI patients and 38 control subjects.

⁶ We are grateful to Prof. T. Musha for providing us the EEG data.

3.2 Methods
In order to reduce the computational complexity, we aggregated the EEG signals into 5 zones (see Fig. 3); we computed the synchrony measures (except the S-estimator) from the averages of each zone. For all those measures except SES, we used the arithmetic average; in the case of SES, the bump models obtained from the 21 electrodes were clustered into 5 zones by means of the aggregation algorithm described in [20]. We evaluated the S-estimator between each pair of zones by applying PCA to the state space embedded EEG signals of both zones. We divided the EEG signals into segments of equal length L, and computed the synchrony measures by averaging over those segments. Since spontaneous EEG is usually highly non-stationary, and most synchrony measures are, strictly speaking, only applicable to stationary signals, the length L should be sufficiently small; on the other hand, in order to obtain reliable measures of synchrony, the length should be chosen sufficiently large. Consequently, it is not a priori clear how to choose the length L, and therefore we decided to test several values, i.e., L = 1 s, 5 s, and 20 s. In the case of Granger causality measures, one needs to specify the model order p. Similarly, for mutual information (in the time domain) and the state space based measures, the embedding dimension m and the time lag τ need to be chosen; the phase synchrony indices IPA and EMA involve a time delay τ. Since it is not obvious which parameter values yield the best performance for detecting AD, we have tried a range of parameter settings, i.e., p = 1, 2, ..., 10, and m = 1, 2, ..., 10; the time delay was in each case set to τ = 1/30 s, which is the period of the fastest oscillations in the EEG signals at hand.
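The segment-and-average procedure described above is generic: split both channels into non-overlapping windows of length L and average a per-segment measure. A minimal sketch with hypothetical signals (the per-segment measure here is a simple correlation placeholder, not the paper's full pipeline):

```python
import numpy as np

def averaged_measure(x, y, fs, seg_seconds, measure):
    """Average a bivariate synchrony measure over non-overlapping segments."""
    seg = int(round(seg_seconds * fs))
    n_seg = min(len(x), len(y)) // seg
    vals = [measure(x[i * seg:(i + 1) * seg], y[i * seg:(i + 1) * seg])
            for i in range(n_seg)]
    return float(np.mean(vals)), n_seg

def pearson_r(a, b):
    """Placeholder per-segment measure; any measure from Section 2 could be used."""
    return float(np.corrcoef(a, b)[0, 1])

fs = 200.0                                  # sampling rate of the data set (Hz)
t = np.arange(int(20 * fs)) / fs            # 20 s of data, as in Section 3.1
x = np.sin(2 * np.pi * 10 * t)              # hypothetical zone averages
y = np.sin(2 * np.pi * 10 * t + 0.5)
results = {L: averaged_measure(x, y, fs, L, pearson_r) for L in (1.0, 5.0, 20.0)}
```

For stationary signals the three window lengths agree; for non-stationary EEG, the trade-off described above makes the choice of L non-trivial.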
3.3 Results and Discussion
Our main results are summarized in Table 1, which shows the sensitivity of the synchrony measures for detecting MCI. Due to space constraints, the table only shows results for global synchrony, i.e., the synchrony measures were averaged over all pairs of zones. (Results for local synchrony and individual frequency bands will be presented in a longer report, including a detailed description of the influence of various parameters such as the model order and embedding dimension on the sensitivity.) The p-values, obtained by the Mann-Whitney test, strictly speaking need to be Bonferroni corrected: since we consider many different measures simultaneously, it is likely that a few of those measures have small p-values merely due to stochastic fluctuations (and not due to a systematic difference between MCI and control patients). In the most conservative Bonferroni post-correction, the p-values are divided by the number of synchrony measures. From the table, it can be seen that only a few measures evince significant differences in EEG synchrony between MCI and control patients: full-frequency DTF and ρ are the most sensitive (for the data set at hand); their p-values remain significant (pcorr < 0.05) after Bonferroni correction. In other words, the effect of MCI and AD on EEG synchrony can be detected, as was reported earlier in the literature; we will expand on this issue in the following section.

Fig. 3. The 21 electrodes used for EEG recording, distributed according to the 10–20 international placement system [8]. The clustering into 5 zones is indicated by the colors and dashed lines (1 = frontal, 2 = left temporal, 3 = central, 4 = right temporal and 5 = occipital).

In order to gain more insight into the relations between the different measures, we calculated the correlations between them (see Fig. 5; red and blue indicate strong correlation and anti-correlation, respectively). From this figure, it becomes strikingly clear that the majority of measures are strongly correlated (or anti-correlated) with each other; in other words, the measures can easily be classified into different families. In addition, many measures are strongly (anti-)correlated with the classical cross-correlation coefficient r, the most basic measure; as a result, they do not provide much additional information regarding EEG synchrony.
Measures that are only weakly correlated with the cross-correlation coefficient include the phase synchrony indices, Granger causality measures, and stochastic-event synchrony measures; interestingly, those three families of synchrony measures are mutually uncorrelated, and as a consequence, they each seem to capture a specific kind of interdependence.
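The statistical procedure used above — a two-sided Mann-Whitney test per measure, followed by a conservative Bonferroni correction — can be sketched as follows. This sketch uses the normal approximation without tie correction (a simplification) and hypothetical group values; scipy.stats.mannwhitneyu is the standard off-the-shelf tool:

```python
import math
import numpy as np

def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U test, normal approximation, no tie correction."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    allv = np.concatenate([a, b])
    ranks = np.empty(len(allv))
    ranks[np.argsort(allv)] = np.arange(1, len(allv) + 1)
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    # Two-sided p-value from the standard normal survival function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def bonferroni(p, n_measures):
    """Most conservative correction: multiply p by the number of measures tested."""
    return min(1.0, p * n_measures)

mci = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52]   # hypothetical per-subject values
ctr = [0.20, 0.22, 0.25, 0.27, 0.30, 0.33]
p = mann_whitney_p(mci, ctr)
p_corr = bonferroni(p, 27)   # 27 here is a placeholder for the number of measures
```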
Table 1. Sensitivity of synchrony measures for early prediction of AD (p-values for the Mann-Whitney test; * and ** indicate p < 0.05 and p < 0.005, respectively)

Measure                 p-value      References
Cross-correlation       0.028*       [8]
Coherence               0.060        [8]
Phase Coherence         0.72         [8]
Corr-entropy            0.27         [9]
Wave-entropy            0.012*       [9]
Granger coherence       0.15         [4]
Partial Coherence       0.16         [4]
PDC                     0.60         [4]
DTF                     0.34         [4]
ffDTF                   0.0012**     [4]
dDTF                    0.030*       [4]
Kullback-Leibler        0.072        [15]
Rényi                   0.076        [15]
Jensen-Shannon          0.084        [15]
Jensen-Rényi            0.12         [15]
IW                      0.060        [15]
I                       0.080        [14]
N^k                     0.032*       [6]
S^k                     0.29         [6]
H^k                     0.090        [6]
S-estimator             0.33         [13]
Hilbert Phase           0.15         [6]
Wavelet Phase           0.082        [6]
Evolution Map           0.072        [12]
Instantaneous Period    0.020*       [12]
st                      0.92         [19]
ρ                       0.00029**    [19]
In Fig. 4, we combine the two most sensitive synchrony measures (for the data set at hand), i.e., full-frequency DTF and ρ. In this figure, the MCI patients are fairly well distinguishable from the control patients. The separation is, however, not sufficiently strong to yield reliable early prediction of AD. For this purpose, the two features need to be combined with complementary features, for example derived from the slowing effect of AD on EEG, or perhaps from different modalities such as PET, MRI, DTI, or biochemical indicators. On the other hand, we remind the reader that in the data set at hand, patients did not carry out any specific task; moreover, the recordings were short (only 20 s). It is plausible that the sensitivity of EEG synchrony could be further improved by increasing the length of the recordings and by recording the EEG before, during, and after patients carry out specific tasks, e.g., working memory tasks.
Fig. 4. ρ vs. ffDTF (scatter of ρ against F²ij for MCI and control (CTR) subjects)
Fig. 5. Correlation between the synchrony measures
4 Conclusions
In previous studies, brain dynamics in AD and MCI patients were mainly investigated using coherence (cf. Section 2.2) or state space based measures of synchrony (cf. Section 2.7). During working memory tasks, coherence shows significant effects in AD and MCI groups [26] [27]; in resting condition, however, coherence does not show such differences at low frequencies (below 30 Hz), neither between AD patients and controls [28] nor between MCI patients and controls [27]. These results are consistent with our observations. In the gamma range, coherence seems to decrease with AD [29]; we did not investigate this frequency range, however, since the EEG signals analyzed here were band-pass filtered between 4 and 30 Hz. Synchronization likelihood, a state space based synchronization measure similar to the non-linear interdependence measures S^k, H^k, and N^k (cf. Section 2.7), is believed to be more sensitive than coherence for detecting changes in AD patients [28]. Using state space based synchrony methods, significant differences were found between AD patients and controls in rest conditions [28] [30] [32] [33]. State space based synchrony failed to retrieve significant differences between MCI patients and control subjects at a global level [32] [33], but significant effects were observed locally: fronto-parietal electrode synchronization likelihood progressively decreased through the MCI and mild AD groups [30]. We report here a lower p-value for the state space based synchrony measure N^k (p = 0.032) than for coherence (p = 0.06); those low p-values, however, would not be statistically significant after Bonferroni correction.
By means of Global Field Synchronization, a phase synchrony measure similar to the ones we considered in this paper, Koenig et al. [31] observed a general decrease of synchronization in correlation with cognitive decline and AD. In our study, we analyzed five different phase synchrony measures: Hilbert and wavelet based phase synchrony, phase coherence, the evolution map approach (EMA), and the instantaneous period approach (IPA). The p-value of the latter is low (p = 0.020), in agreement with the results of [31], but it would be non-significant after Bonferroni correction. The strongest observed effect is a significantly higher degree of local asynchronous activity (ρ) in MCI patients, more specifically, a high number of non-coincident, asynchronous oscillatory events (p = 0.00029). Interestingly, we did not observe a significant effect on the timing jitter st of the coincident events (p = 0.92). In other words, our results seem to indicate that there is significantly more non-coincident background activity, while the coincident activity remains well synchronized. On the one hand, this observation is in agreement with previous studies that report a general decrease of neural synchrony in MCI and AD patients; on the other hand, it goes beyond previous results, since it yields a more subtle description of EEG synchrony in MCI and AD patients: it suggests that the loss of coherence is mostly due to an increase of (local) non-coincident background activity, whereas the locked (coincident) activity remains equally well synchronized. In future work, we will verify this conjecture by means of other data sets.
References

1. Jeong, J.: EEG Dynamics in Patients with Alzheimer's Disease. Clinical Neurophysiology 115, 1490–1505 (2004)
2. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear Multivariate Analysis of Neurophysiological Signals. Progress in Neurobiology 77, 1–37 (2005)
3. Breakspear, M.: Dynamic Connectivity in Neural Systems: Theoretical and Empirical Considerations. Neuroinformatics 2(2) (2004)
4. Kamiński, M., Liang, H.: Causal Influence: Advances in Neurosignal Analysis. Critical Review in Biomedical Engineering 33(4), 347–430 (2005)
5. Stam, C.J.: Nonlinear Dynamical Analysis of EEG and MEG: Review of an Emerging Field. Clinical Neurophysiology 116, 2266–2301 (2005)
6. Quiroga, R.Q., Kraskov, A., Kreuz, T., Grassberger, P.: Performance of Different Synchronization Measures in Real Data: A Case Study on EEG Signals. Physical Review E 65 (2002)
7. Sakkalis, V., Giurcăneanu, C.D., Xanthopoulos, P., Zervakis, M., Tsiaras, V.: Assessment of Linear and Non-Linear EEG Synchronization Measures for Evaluating Mild Epileptic Signal Patterns. In: Proc. of ITAB 2006, Ioannina-Epirus, Greece, October 26–28 (2006)
8. Nunez, P., Srinivasan, R.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, Oxford (2006)
9. Xu, J.-W., Bakardjian, H., Cichocki, A., Principe, J.C.: EEG Synchronization Measure: a Reproducing Kernel Hilbert Space Approach. Submitted to IEEE Transactions on Biomedical Engineering Letters (September 2006)
10. Herrmann, C.S., Grigutsch, M., Busch, N.A.: EEG Oscillations and Wavelet Analysis. In: Handy, T. (ed.) Event-Related Potentials: a Methods Handbook, pp. 229–259. MIT Press, Cambridge (2005)
11. Lachaux, J.-P., Rodriguez, E., Martinerie, J., Varela, F.J.: Measuring Phase Synchrony in Brain Signals. Human Brain Mapping 8, 194–208 (1999)
12. Rosenblum, M.G., Cimponeriu, L., Bezerianos, A., Patzak, A., Mrowka, R.: Identification of Coupling Direction: Application to Cardiorespiratory Interaction. Physical Review E 65, 041909 (2002)
13. Carmeli, C., Knyazeva, M.G., Innocenti, G.M., De Feo, O.: Assessment of EEG Synchronization Based on State-Space Analysis. Neuroimage 25, 339–354 (2005)
14. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating Mutual Information. Phys. Rev. E 69(6), 66138 (2004)
15. Aviyente, S.: A Measure of Mutual Information on the Time-Frequency Plane. In: Proc. of ICASSP 2005, Philadelphia, PA, USA, March 18–23, vol. 4, pp. 481–484 (2005)
16. Aviyente, S.: Information-Theoretic Signal Processing on the Time-Frequency Plane and Applications. In: Proc. of EUSIPCO 2005, Antalya, Turkey, September 4–8 (2005)
17. Quiroga, R.Q., Rosso, O., Basar, E.: Wavelet-Entropy: A Measure of Order in Evoked Potentials. Electr. Clin. Neurophysiol. (Suppl.) 49, 298–302 (1999)
18. Blanco, S., Quiroga, R.Q., Rosso, O., Kochen, S.: Time-Frequency Analysis of EEG Series. Physical Review E 51, 2624 (1995)
19. Dauwels, J., Vialatte, F., Cichocki, A.: A Novel Measure for Synchrony and Its Application to Neural Signals. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawai'i, April 15–20 (2007)
20. Vialatte, F., Martin, C., Dubois, R., Haddad, J., Quenet, B., Gervais, R., Dreyfus, G.: A Machine Learning Approach to the Analysis of Time-Frequency Maps, and Its Application to Neural Dynamics. Neural Networks 20, 194–209 (2007)
21. Chapman, R., et al.: Brain Event-Related Potentials: Diagnosing Early-Stage Alzheimer's Disease. Neurobiol. Aging 28, 194–201 (2007)
22. Cichocki, A., et al.: EEG Filtering Based on Blind Source Separation (BSS) for Early Detection of Alzheimer's Disease. Clin. Neurophys. 116, 729–737 (2005)
23. Hogan, M., et al.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49 (2003)
24. Musha, T., et al.: A New EEG Method for Estimating Cortical Neuronal Impairment that is Sensitive to Early Stage Alzheimer's Disease. Clin. Neurophys. 113, 1052–1058 (2002)
25. Vialatte, F., et al.: Blind Source Separation and Sparse Bump Modelling of Time-Frequency Representation of EEG Signals: New Tools for Early Detection of Alzheimer's Disease. In: IEEE Workshop on Machine Learning for Signal Processing, pp. 27–32 (2005)
26. Hogan, M.J., Swanwick, G.R., Kaiser, J., Rowan, M., Lawlor, B.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49(2), 147–163 (2003)
27. Jiang, Z.Y.: Study on EEG Power and Coherence in Patients with Mild Cognitive Impairment During Working Memory Task. J. Zhejiang Univ. Sci. B 6(12), 1213–1219 (2005)
28. Stam, C.J., van Cappellen van Walsum, A.M., Pijnenburg, Y.A., Berendse, H.W., de Munck, J.C., Scheltens, P., van Dijk, B.W.: Generalized Synchronization of MEG Recordings in Alzheimer's Disease: Evidence for Involvement of the Gamma Band. J. Clin. Neurophysiol. 19(6), 562–574 (2002)
29. Herrmann, C.S., Demiralp, T.: Human EEG Gamma Oscillations in Neuropsychiatric Disorders. Clinical Neurophysiology 116, 2719–2733 (2005)
30. Babiloni, C., Ferri, R., Binetti, G., Cassarino, A., Forno, G.D., Ercolani, M., Ferreri, F., Frisoni, G.B., Lanuzza, B., Miniussi, C., Nobili, F., Rodriguez, G., Rundo, F., Stam, C.J., Musha, T., Vecchio, F., Rossini, P.M.: Fronto-Parietal Coupling of Brain Rhythms in Mild Cognitive Impairment: A Multicentric EEG Study. Brain Res. Bull. 69(1), 63–73 (2006)
31. Koenig, T., Prichep, L., Dierks, T., Hubl, D., Wahlund, L.O., John, E.R., Jelic, V.: Decreased EEG Synchronization in Alzheimer's Disease and Mild Cognitive Impairment. Neurobiol. Aging 26(2), 165–171 (2005)
32. Pijnenburg, Y.A., van der Made, Y., van Cappellen van Walsum, A.M., Knol, D.L., Scheltens, P., Stam, C.J.: EEG Synchronization Likelihood in Mild Cognitive Impairment and Alzheimer's Disease During a Working Memory Task. Clin. Neurophysiol. 115(6), 1332–1339 (2004)
33. Yagyu, T., Wackermann, J., Shigeta, M., Jelic, V., Kinoshita, T., Kochi, K., Julin, P., Almkvist, O., Wahlund, L.O., Kondakor, I., Lehmann, D.: Global Dimensional Complexity of Multichannel EEG in Mild Alzheimer's Disease and Age-Matched Cohorts. Dement. Geriatr. Cogn. Disord. 8(6), 343–347 (1997)
Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials

Hong-Ren Su 1,2, Michelle Liou 2,*, Philip E. Cheng 2, John A.D. Aston 2, and Shang-Hong Lai 1

1 Dept. of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
[email protected]
Abstract. In this study, we introduce the use of orthogonal causal Laguerre polynomials for analyzing data collected in event-related functional magnetic resonance imaging (fMRI) experiments. This particular family of polynomials has been widely used in the system identification literature and recommended for modeling impulse functions in BOLD-based fMRI experiments. In empirical studies, we applied Laguerre polynomials to analyze data collected in an event-related fMRI study conducted by Scott et al. (2001). The experimental study investigated neural mechanisms of visual attention in a change-detection task. By specifying a few meaningful Laguerre polynomials in the design matrix of a random effect model, we clearly found brain regions associated with trial onset and visual search. The results are consistent with the original findings in Scott et al. (2001). In addition, we found brain regions related to the mask presence in the parahippocampal gyrus, superior frontal gyrus and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus and precuneus.

Keywords: Reproducibility analysis, Event-related fMRI.
1 Introduction We previously proposed a methodology for assessing reproducibility evidence in fMRI studies using an on-and-off paradigm without necessarily conducting replicated experiments, and suggested interpreting SPMs in conjunction with reproducibility evidence (Liou et al., 2003; 2006). Empirical studies have shown that the method is robust to the specification of hemodynamic response functions (HRFs). Recently, BOLD-based event-related fMRI experiments have been widely used as an advanced alternative to the on-and-off design for studies on human brain functions. In eventrelated fMRI experiments, the duration of stimulus presentation is generally longer and there are no obvious contrasts between the experimental and control conditions to be used in data analyses. In order to detect possible brain activations during stimulus presentation and task performance, there have been a variety of event-related HRFs proposed in the literature. In this study, we introduce the use of orthogonal causal * Corresponding author. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 126–134, 2008. © Springer-Verlag Berlin Heidelberg 2008
Reproducibility Analysis of Event-Related fMRI Experiments
Laguerre polynomials for modeling response functions. This particular family of polynomials has been widely used in the system identification literature and was recommended for modeling impulse functions in fMRI experiments (Saha et al., 2004). In the empirical study, we applied Laguerre polynomials to analyze data in the study by Scott et al. (2001). The dataset was published by the US fMRI Data Center and is available for public access. The original experiment involved 10 human subjects and investigated brain functions associated with a change-detection task. In the experimental task, subjects looked attentively at two versions of the same picture presented in alternation, separated by a brief mask interval. Behavioral responses were also recorded: subjects pressed a button when they detected a change between the pictures. In our reproducibility analysis, a few meaningful Laguerre polynomials matching the experimental design were inserted into a random effect model, and reproducibility analyses were conducted based on the selected polynomials. In the analyses, we successfully located brain regions associated with the visual change-detection task similar to those found in Scott et al. (2001). Additionally, we found other interesting brain regions that were not included in the previous study.
2 Method In this section, we will briefly describe the method for investigating the reproducibility evidence in fMRI experiments, and outline the family of Laguerre polynomials including those used in our empirical study. 2.1 Reproducibility Analysis In the SPM generalized linear model, the fMRI responses in the ith run can be expressed as
y_i = X_i \beta_i + e_i ,    (1)
where y_i is the vector of image intensities after pre-whitening, X_i is the design matrix, and \beta_i is the vector containing the unknown regression parameters. In the random effect model, the regression parameters \beta_i are additionally assumed to be drawn from a multivariate Gaussian distribution with common mean \mu and covariance \Omega. The empirical Bayes estimate of \beta_i in the random effect model shrinks all estimates toward the mean \mu, with greater shrinkage for noisier runs. In fMRI studies, the true status of each voxel is unknown, but it can be estimated using the t-values (i.e., standardized \beta estimates) within individual runs derived from the random effect model along with the maximum likelihood estimation method. By specifying a mixed multinomial model, the receiver operating characteristic (ROC) curve can be estimated using the maximum likelihood estimation method and the t-values of all image voxels. The curve is simply a bivariate plot of sensitivity versus the false alarm rate. The threshold (or operating point) on the ROC curve for classifying voxels into the active/inactive status is found by maximizing the kappa value. We follow the
H.-R. Su et al.
same definition in Liou et al. (2006) to categorize voxels according to reproducibility, that is, a voxel is strongly reproducible if its active status remains the same in at least 90% of the runs, moderately reproducible in 70–90% of the runs, weakly reproducible in 50–70% of the runs, and otherwise not reproducible. The brain activation maps are constructed on the basis of strongly reproducible voxels, but include voxels that are moderately reproducible and spatially proximal to those strongly reproducible voxels.

2.2 Laguerre Polynomials

The Laguerre polynomials can be used for detecting experimental responses. This family of polynomials can be specified as follows:

h(t) = \sum_{i=1}^{L} f_i g_i^a(t) ,    (2)
where h(t) gives the design coefficients to be input into X_i in (1); L is the order of the Laguerre expansion; f_i is the coefficient of the i-th basis function; and g_i^a(t) is the inverse Z-transform of the i-th Laguerre polynomial,

g_i^a(t) = Z^{-1}\!\left[ \frac{z^{-1}}{1 - a z^{-1}} \left( \frac{z^{-1} - a}{1 - a z^{-1}} \right)^{i-1} \right] = Z^{-1}[\tilde{g}_i(z)] ,    (3)

where a is a time constant. As an illustration, Fig. 1 gives the response coefficients corresponding to L = 2 and L = 3.
Fig. 1. The boxcar functions of experimental conditions in the Scott et al. study are depicted in (a), and the Laguerre polynomials h(t) with L = 2 and L = 3 are depicted in (b)
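To make Eqs. (2) and (3) concrete, the sketch below computes the discrete Laguerre basis functions g_i^a(t) as impulse responses of the cascaded first-order and all-pass filters appearing in Eq. (3), and combines them into h(t) as in Eq. (2). This is a minimal illustration rather than the authors' code: the normalization follows the printed equation (some references add a sqrt(1 − a^2) factor), and the coefficients f_i and the time constant a in the usage example are hypothetical.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form IIR filter: y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

def laguerre_basis(i, a, T):
    """Impulse response g_i^a(t) of the i-th Laguerre filter in Eq. (3)."""
    impulse = np.zeros(T)
    impulse[0] = 1.0
    # First-order stage z^{-1} / (1 - a z^{-1}) ...
    y = iir_filter([0.0, 1.0], [1.0, -a], impulse)
    # ... cascaded with (i - 1) all-pass stages (z^{-1} - a) / (1 - a z^{-1}).
    for _ in range(i - 1):
        y = iir_filter([-a, 1.0], [1.0, -a], y)
    return y

def response_function(f, a, T):
    """h(t) = sum_{i=1}^{L} f_i g_i^a(t), Eq. (2), with L = len(f)."""
    return sum(fi * laguerre_basis(i, a, T) for i, fi in enumerate(f, start=1))
```

For example, response_function([1.0, 0.5], 0.5, 64) yields an L = 2 response sampled over 64 time bins; basis functions of different order i are mutually orthogonal, which is the property exploited in Section 3.2.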
3 Event-Related fMRI Experiments We here introduce the experimental design behind the fMRI dataset used in the empirical study, and select the design matrix suitable for the study.
3.1 Experimental Design In our empirical study, the dataset contains functional MR images of 10 subjects who went through 10–12 experimental runs, with 10 stimulus trials in each run. Experimental runs involved the change-detection task, in which the two images within a pair differed in the presence/absence, the position, or the color of a single object. The two images were presented alternately for 40 sec. In the first 30 sec, each image was presented for 300 msec followed by a 100 msec mask; the mask was removed in the last 10 sec. Subjects pressed a button when they detected a change between the pair of images. The experimental images and stimulus durations are shown in Fig. 2.
Fig. 2. The experimental images and stimulus duration in the Scott et al. study
3.2 Experimental Design Matrix We used the Laguerre polynomials in Fig. 1 for specifying the design matrix in (1) instead of the theoretical HRFs. According to the original experimental design, there are two contrasts of interest in the Scott et al. study. The first is the response after the task onset within the 40-sec trial, and the second is the difference between stimulus presentations with and without the mask, that is, responses during the image presentation with the mask (0–30 sec) and without the mask (30–40 sec) in Fig. 2. The boxcar functions in Fig. 1 can also be specified in the design matrix, as was suggested in the study by Liou et al. (2006) for the on-and-off design. However, the two boxcar functions are not orthogonal to each other and carry redundant information on the experimental effects. In the event-related fMRI experiment, the duration of stimulus presentation is always longer than that in the on-and-off design, and the theoretical HRFs vanish during the stimulus presentation, yet there might be brain regions continuously responding to the stimulus. The Laguerre polynomials are orthogonal and offer possibilities for examining all kinds of experimental effects. We
might also consider the Laguerre polynomials in Fig. 1 as smoothed versions of the boxcar functions.
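The redundancy between the two boxcar regressors can be checked numerically. The sketch below is illustrative only: it assumes a 1-s sampling grid over a single 40-s trial, and the regressor definitions are our paraphrase of the two contrasts, not the authors' exact design matrix.

```python
import numpy as np

t = np.arange(40)                        # one 40-s trial on an assumed 1-s grid
onset = np.ones(40)                      # contrast 1: response after task onset
mask = np.where(t < 30, 1.0, 0.0)        # contrast 2: presentation with the mask

# The boxcars are far from orthogonal: their normalized inner product
# (cosine similarity) is large, so they carry redundant information.
cos_angle = np.dot(onset, mask) / (np.linalg.norm(onset) * np.linalg.norm(mask))
print(round(cos_angle, 3))               # → 0.866
```

An orthogonal basis such as the Laguerre polynomials avoids this redundancy by construction.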
4 Results In the Scott et al. study, a response-contingent event-related analysis technique was used in the data analyses, and the original results showed brain regions associated with different processing components in the visual change-detection task. For instance, the lingual gyrus, cuneus, precentral gyrus, and medial frontal gyrus showed activations associated with the task onset, and the pattern of activation in the dorsal and ventral visual pathways was temporally associated with the duration of visual search. Finally, parietal and frontal regions showed systematic deactivations during task performance. In the reproducibility analysis with Laguerre polynomials, we found similar activation regions associated with the task onset, visual search and deactivations. In addition, we found activation regions in the parahippocampal gyrus, superior frontal gyrus, supramarginal gyrus and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus and precuneus, and these responses are reproducible across all subjects; this finding is consistent with our previous data analyses of fMRI studies involving object recognition and word/pseudoword reading (Liou et al., 2006). Table 1 lists a few activation regions in the change-detection task for the 10 subjects. Table 1. The activation regions in the change-detection task. The plus sign indicates a positive response and the minus sign a negative response.
[Table 1 layout: rows list the regions (lingual gyrus, precuneus, cuneus, posterior cingulate, medial frontal gyrus, parahippocampal gyrus, superior frontal gyrus, supramarginal gyrus) and columns list Subjects 1–10; each cell records a positive (+), negative (−), or combined (+/−) response. The original cell boundaries were lost in extraction.]
In the table, four subjects show activations in the superior frontal gyrus and supramarginal gyrus in the change-detection task. These two regions have been reported in fMRI studies on language processing (e.g., studies on word and pseudoword reading). The four subjects, on average, had longer reaction times in the change-detection task, that is, they delayed pressing the button until the image presentation without the mask (30–40 sec). Fig. 3 shows the brain activation regions for Subjects 5 and 7 in the Scott et al. study. Subject 5 showed activations in the superior frontal gyrus and supramarginal gyrus and had the longest reaction time of all subjects in the experiment. On the other hand, Subject 7 had a relatively short reaction time and showed no activations in the two regions.
Fig. 3. Brain activation regions for Subjects 5 and 7 in the Scott et al. study
Subject 5
Fig. 3. (continued)
Subject 7
Fig. 3. (continued)
5 Discussion The reproducibility evidence suggests that the 10 subjects consistently show a pattern of increased/decreased responses in the lingual gyrus, cuneus, and precuneus. Similar observations were also found in our empirical studies on other datasets published by the fMRIDC. In the fMRI literature, the precuneus, posterior cingulate and medial prefrontal cortex are known to constitute the default network in the resting state and show decreased activities in a variety of cognitive tasks. The physiological mechanisms behind the decreased responses are still under investigation. However, discussions of the network have focused on the decreased activities; we suggest considering both positive and negative responses when interpreting the default network. With the method of reproducibility analysis, we can clearly separate brain regions that show consistent responses across subjects from those whose patterns are inconsistent across subjects (see the results in Table 1). Higher mental functions are individual, and their localization in specific brain regions can be established only probabilistically. The higher mental functions are connected with speech, that is, external or internal speech organizing personal behavior. Subjects differ from each other as a result of using different speech designs when making decisions in performing experimental tasks. Change of functional localization is an additional characteristic of a subject's psychological traits. The proposed methodology would assist researchers in identifying those brain regions that are specific to individual speech designs and those that are consistent across subjects. Acknowledgments. The authors are indebted to the fMRIDC at Dartmouth College for providing the datasets analyzed in this study. This research was supported by grant NSC 94-2413-H-001-001 from the National Science Council (Taiwan).
References
1. Liou, M., Su, H.-R., Lee, J.-D., Aston, J.A.D., Tsai, A.C., Cheng, P.E.: A method for generating reproducible evidence in fMRI studies. NeuroImage 29, 383–395 (2006)
2. Huettel, S.A., Guzeldere, G., McCarthy, G.: Dissociating neural mechanisms of visual attention in change detection using functional MRI. Journal of Cognitive Neuroscience 13(7), 1006–1018 (2001)
3. Liou, M., Su, H.-R., Lee, J.-D., Cheng, P.E., Huang, C.-C., Tsai, C.-H.: Bridging Functional MR Images and Scientific Inference: Reproducibility Maps. Journal of Cognitive Neuroscience 15(7), 935–945 (2003)
4. Saha, S., Long, C.J., Brown, E., Aminoff, E., Bar, M., Solo, V.: Hemodynamic transfer function estimation with Laguerre polynomials and confidence intervals construction from functional magnetic resonance imaging (fMRI) data. IEEE ICASSP 3, 109–112 (2004)
5. Andrews, G.E., Askey, R., Roy, R.: Laguerre Polynomials. §6.2 in Special Functions, pp. 282–293. Cambridge University Press, Cambridge (1999)
6. Arfken, G.: Laguerre Functions. §13.2 in Mathematical Methods for Physicists, 3rd edn., pp. 721–731. Academic Press, Orlando, FL (1985)
The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence

Murat Saglam1, Kaoru Matsunaga2, Yuki Hayashida1, Nobuki Murayama1, and Ryoji Nakanishi2

1 Graduate School of Science and Technology, Kumamoto University, Japan
2 Department of Neurology, Kumamoto Kinoh Hospital, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. Recent studies have proposed a new paradigm of repetitive transcranial magnetic stimulation (rTMS), "theta burst stimulation" (TBS), whose application to the primary motor cortex (M1) or sensory cortex (S1) can influence cortical excitability in humans. In particular, it has been shown that TBS can induce long-lasting effects with stimulation durations shorter than those of conventional rTMS. However, in those studies, the effects of TBS over M1 or S1 were assessed only by means of motor- and/or somatosensory-evoked potentials. Here we asked how the coherence between electromyographic (EMG) and electroencephalographic (EEG) signals during isometric contraction of the first dorsal interosseous muscle is modified by TBS. The coherence magnitude, localized to the C3 scalp site and the 13–30 Hz band, significantly decreased 30–60 minutes after TBS over M1, but not after TBS over S1, and recovered to the original level within 90–120 minutes. These findings indicate that TBS over M1 can suppress cortico-muscular synchronization. Keywords: Theta Burst Transcranial Magnetic Stimulation, Coherence, Electroencephalogram, Electromyogram, Motor Cortex.
1 Introduction Previous studies have demonstrated dense functional and anatomical projections within the motor cortex, building a global network that realizes communication between the brain and peripheral muscles via the motor pathway [1, 2]. The quality of this communication is thought to depend strongly on the efficacy of synaptic transmission between cortical units. In the past few decades, repetitive transcranial magnetic stimulation (rTMS) has been considered a promising method to modify cortical circuitry by inducing the phenomena of long-term potentiation (LTP) and depression (LTD) of synaptic connections in human subjects [3]. Furthermore, a recently developed rTMS paradigm, called "theta burst stimulation" (TBS), requires fewer stimulation pulses and offers even longer aftereffects than conventional rTMS protocols do [4]. Previously, the efficacy of TBS has M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 135–141, 2008. © Springer-Verlag Berlin Heidelberg 2008
been assessed by means of signal transmission from cortex to muscle or from muscle to cortex, by measuring the motor-evoked potential (MEP) or the somatosensory-evoked potential (SEP), respectively. It was shown that TBS applied over the sensory cortex (S1) as well as the primary motor cortex (M1) could modify the amplitude of the SEP (recorded from the S1 scalp site) for tens of minutes after the TBS [5]. On the other hand, the amplitude of the MEP was not modified by TBS applied over S1, while it was significantly decreased by TBS applied over M1 [4, 5]. In the present study, we examined the effects of TBS applied over either M1 or S1 on the functional coupling between cortex and muscle by measuring the coherence between electroencephalographic (EEG) and electromyographic (EMG) signals during voluntary isometric contraction of the first dorsal interosseous (FDI) muscle.
2 Methods 2.1 Subjects Of approximately twenty recruited participants, seven showed significant coherence, and only those subjects participated in the TBS experiments. Experiments on M1 and S1 were performed on different days, and subjects did not report any side effects during or after the experiments. 2.2 Determination of M1 and S1 Location The optimal location of the stimulating coil was determined by searching for the largest MEP response (from the contralateral FDI muscle) elicited by single-pulse TMS while
Fig. 1. EEG-EMG signals recorded before and after the application of TBS, as depicted in the experiment time line. Subjects were asked to contract four times in each recording set. Stimulation location and intensity were determined after the pre30 session. The pre0 recording was done to confirm that the search procedure did not produce any conditioning. The TBS paradigm is illustrated in the upper-right inset.
moving the TMS coil in 1 cm steps over the presumed position of M1. Stimulation was applied with a High Power Magstim 200 machine and a figure-of-8 coil with a mean loop diameter of 70 mm (Magstim Co., Whitland, Dyfed, UK). The coil was placed tangentially to the scalp with the handle pointing backwards and laterally at a 45° angle away from the midline. Based on previous reports, S1 was assumed to be 2 cm posterior to the M1 site [5]. 2.3 Theta Burst Stimulation A continuous TBS (cTBS) paradigm of 600 pulses was applied to the M1 or S1 location. cTBS consists of 50 Hz triplets of pulses repeating every 0.2 s (5 Hz) for 40 s [4]. The intensity of each pulse was set to 80% of the active motor threshold (AMT), defined as the minimum stimulation intensity that could evoke an MEP of no less than 200 μV during slight tonic contraction. 2.4 EEG and EMG Recording EEG signals were recorded according to the international 10-20 scalp electrode placement method (19 electrodes) with earlobe reference. The EMG signal, during isometric hand contraction at 15% of the maximum level, was recorded from the FDI muscle of the right hand with reference to the metacarpal bone of the index finger. EEG and EMG signals were recorded with a 1000 Hz sampling frequency and passbands of 0.5–200 Hz and 5–300 Hz, respectively. Each recording set consisted of four one-minute recordings with 30-s rest intervals. To assess the TBS effect over time, a set was performed 30 minutes before (pre30), just before (pre0), and 0, 30, 60, 90 and 120 minutes after the delivery of TBS. Stimulation location and intensity were determined between the pre30 and pre0 recordings (Fig. 1). 2.5 Data Analysis The coherence function is the squared magnitude of the cross-spectrum of a signal pair divided by the product of their power spectra. Therefore, cross- and power spectra between the EMG and the 19 EEG channels were calculated.
A fast Fourier transform with an epoch size of 1024 samples, giving a frequency resolution of 0.98 Hz, was used to convert the signals into the frequency domain. The current source density (CSD) reference method was utilized in order to obtain spatially sharpened EEG signals [6]. The coherence between EEG and EMG signals was obtained using the expression below:
0 \le \kappa^2_{xy}(f) = \frac{|S_{xy}(f)|^2}{S_{xx}(f)\, S_{yy}(f)} \le 1 ,    (1)
where S_{xy}(f) represents the cross-spectral density function, and S_{xx}(f) and S_{yy}(f) stand for the auto-spectral densities of the signals x and y, respectively. Since coherence is a normalized measure of the correlation between signal pairs, \kappa^2_{xy}(f) = 1 represents a perfect linear dependence and \kappa^2_{xy}(f) = 0 indicates a lack of linear dependence within those signal pairs. Coherence values \kappa^2_{xy}(f) > 0 are regarded as statistically significant only if they lie above the 99% confidence limit, which is estimated by:
CL(\alpha\%) := 1 - \left(1 - \frac{\alpha}{100}\right)^{1/(n-1)} ,    (2)

where n is the number of epochs used for the cross- and power-spectra calculations.
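The analysis of Eqs. (1) and (2) can be sketched as below. This is a minimal non-overlapping-epoch estimator written for illustration; it omits the CSD re-referencing and any windowing, and uses the epoch size (1024 samples) and sampling rate (1000 Hz) stated above.

```python
import numpy as np

def coherence(x, y, nfft=1024, fs=1000.0):
    """Magnitude-squared coherence, Eq. (1), from epoch-averaged spectra."""
    n = len(x) // nfft                                 # number of epochs
    X = np.fft.rfft(x[:n * nfft].reshape(n, nfft), axis=1)
    Y = np.fft.rfft(y[:n * nfft].reshape(n, nfft), axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)              # cross-spectrum
    Sxx = np.mean(np.abs(X) ** 2, axis=0)              # power spectra
    Syy = np.mean(np.abs(Y) ** 2, axis=0)
    kappa2 = np.abs(Sxy) ** 2 / (Sxx * Syy)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)          # ~0.98 Hz resolution
    return freqs, kappa2, n

def confidence_limit(n, alpha=99.0):
    """Confidence limit for zero coherence, Eq. (2)."""
    return 1.0 - (1.0 - alpha / 100.0) ** (1.0 / (n - 1))
```

With four one-minute recordings at 1000 Hz there are roughly 234 epochs, and confidence_limit(234) ≈ 0.0196, consistent with the ≈0.02 significance level quoted in the Results.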
3 Results First, we confirmed that the EEG-EMG coherence values for all subjects (n = 7) lie above the 99% significance level (coh ≈ 0.02) and within the beta (13–30 Hz) frequency

Table 1. Mean ± standard error of the mean (SEM) of beta-band EEG (C3)-EMG coherence values and peak frequencies for the TBS-over-M1 and TBS-over-S1 experiments (n = 7)
                 Pre30        Pre0         0 (min)      30 (min)     60 (min)     90 (min)     120 (min)
Coherence    M1  0.061±0.008  0.051±0.014  0.053±0.014  0.030±0.007  0.031±0.009  0.057±0.015  0.067±0.016
magnitude    S1  0.059±0.021  0.068±0.026  0.058±0.027  0.061±0.023  0.052±0.014  0.034±0.006  0.040±0.016
Peak freq.   M1  20.50±1.92   21.80±1.98   22.48±2.04   23.73±1.84   18.57±1.20   20.75±1.62   18.23±1.82
(Hz)         S1  21.17±1.79   21.80±1.97   23.08±1.82   23.10±2.15   18.22±2.13   21.17±1.08   19.52±3.21
Fig. 2. Coherence spectra between EEG and EMG signals in the pre30 session. A, Coherence spectra between the 19 EEG channels and the EMG (FDI) for all subjects (n=7) are superimposed and arranged topographically according to the approximate locations of the electrodes on the scalp. Each electrode is labeled with respect to its location (Fp: frontal pole, F: frontal, T: temporal, C: central, P: parietal, O: occipital). B, Expanded view of the coherence spectra between the EEG (C3 scalp site) and the EMG (FDI) for all subjects (n=7). Each line style corresponds to a different subject's coherence spectrum. Only coherence values above the 99% significance level (indicated by the solid horizontal line) are highlighted.
band at the C3 scalp site. Maximum coherence levels were observed at the C3 (n=6) and F3 (n=1) scalp sites, whereas no significant coherence was observed at other locations (Fig. 2). These results are in good agreement with previous studies on EEG-EMG coherence during isometric contraction [7, 8]. Table 1 shows the average absolute coherence values and peak frequencies in all trials before and after the application of TBS. Fig. 3 shows the normalized EEG (C3)-EMG coherence values averaged over all subjects. Coherence values obtained before the TBS were taken as control and set to 100%. When TBS was applied over M1, the average beta-band coherence was suppressed to 56.2% after 30 minutes and 54.5% after 60 minutes, with statistical significance (p < 0.05, Bonferroni multiple comparison test), and recovered to the original control level at 90 and 120 minutes. In contrast, there was no statistically significant coherence change over time when TBS was delivered to S1 (coherence values of 103.8% after 30 minutes and 99.8% after 60 minutes).
Fig. 3. EEG (C3)-EMG coherence in the beta band (n=7). Pre-stimulation coherence values were taken as the control level (100%). Coherence was suppressed (* p < 0.05, Bonferroni multiple comparison test) 30 and 60 minutes after TBS-over-M1 and recovered back to the control level. Error bars indicate the standard error of the mean (±SEM).
4 Discussion The present results show that TBS over the human motor cortex can change cortico-muscular functional coupling. TBS-over-M1 inhibited the EEG (C3)-EMG (FDI) coherence for about 60 minutes; this inhibition has a time course similar to that of the MEP
amplitude suppression after TBS-over-M1 [4, 5]. The parallel responses of cortico-muscular coherence and MEP to TBS may suggest that TBS-driven LTP/LTD of the synaptic connections involved in the circuits for MEP generation (i.e., I-wave circuits) not only shapes the MEP but also plays an important role in rhythmic activities in the motor system. However, it is difficult to assess the exact LTP/LTD scenario after TBS in conscious human experiments. On the other hand, we know that the effects of cTBS depend on pre- and postsynaptic N-methyl-D-aspartate receptors (NMDARs), which are highly associated with LTP/LTD [10, 11]. Blocking NMDARs with pharmacological agents (e.g., memantine) eliminates the inhibitory effects of cTBS [11]. Thus we may conclude that NMDAR-related LTP/LTD of synaptic connections within the motor circuit is involved in the cortico-muscular coherence phenomenon and can be temporarily modified by TBS. Among conventional rTMS studies, there are few reports investigating the effects of rTMS on cortico-muscular coherence. Similar changes (suppression) in cortico-muscular coherence were observed using longer rTMS conditioning (15 min of 0.9 Hz rTMS vs. 40 s of TBS-over-M1), although stimulation was applied over the premotor cortex in that study [9]. The scarcity of evidence in the literature prevents a complete comparison between rTMS-over-M1 and TBS-over-M1. However, it is evident that TBS has a prolonged suppressive effect (a 15-min effect with 0.9 Hz rTMS vs. a 60-min effect with TBS). The present data therefore suggest that TBS could be a more effective technique than conventional rTMS in terms of cortico-muscular coherence. Since rTMS has therapeutic potential in clinical use [12, 13], a shorter yet stronger stimulation technique should be considered. On the other hand, TBS-over-S1 produced no significant coherence change, although the stimulation site is just 2 cm posterior to M1. The lack of a suppressive effect after TBS-over-S1 suggests that there is no conditioning of M1 due to current spread from S1 to M1, so that TBS can be regarded as a focal stimulation tool. The coherence was suppressed in the TBS-over-M1 but not the TBS-over-S1 condition. This result agrees with the fact that the MEP amplitude is suppressed by TBS-over-M1 but not by TBS-over-S1. The present findings could therefore indicate that a similar mechanism underlies MEP generation and the coupling between cortical and muscular sites.
References
1. Brown, P., Marsden, J.F.: Cortical network resonance and motor activity in humans. Neuroscientist 7(6), 518–527 (2001)
2. Chouinard, P.A., Paus, T.: The primary motor and premotor areas of the human cerebral cortex. Neuroscientist 12(2), 143–152 (2006)
3. Filipovic, S.R., Siebner, H.R., Rowe, J.B., Cordivari, C., Gerschlager, W., Rothwell, J.C., Frackowiak, R.S., Bhatia, K.P.: Modulation of cortical activity by repetitive transcranial magnetic stimulation (rTMS): a review of functional imaging studies and the potential use in dystonia. Adv. Neurol. 94, 45–52 (2004)
4. Huang, Y.Z., Edwards, M.J., Rounis, E., Bhatia, K.P., Rothwell, J.C.: Theta burst stimulation of the human motor cortex. Neuron 45, 201–206 (2005)
5. Ishikawa, S., Matsunaga, K., Nakanishi, R., Kawahira, K., Murayama, N., Tsuji, S., Huang, Y.Z., Rothwell, J.C.: Effect of theta burst stimulation over the human sensorimotor cortex on motor and somatosensory evoked potentials. Clinical Neurophysiology (in press, 2007)
6. Mima, T., Hallett, M.: Electroencephalographic analysis of cortico-muscular coherence: reference effect, volume conduction and generator mechanism. Clinical Neurophysiology 110, 1892–1899 (1999)
7. Murayama, N., Lin, Y.Y., Salenius, S., Hari, R.: Oscillatory interaction between human motor cortex and trunk muscles during isometric contraction. Neuroimage 14, 1206–1213 (2001)
8. Conway, B.A., Halliday, D.M., Farmer, S.F., Shahani, U., Maas, P., Weir, A.I., Rosenberg, J.R.: Synchronization between motor cortex and spinal motoneuronal pool during the performance of a maintained motor task in man. J. Physiol. 489(3), 917–924 (1995)
9. Chen, W.H., Mima, T., Siebner, H.R., Oga, T., Hara, H., Satow, T., Begum, T., Shibasaki, H.: Low-frequency rTMS over lateral premotor cortex induces lasting changes in regional activation and functional coupling of cortical motor areas. Clinical Neurophysiology 114(9), 1628–1637 (2003)
10. Huang, Y.Z., Chen, R.S., Rothwell, J.C., Wen, H.Y.: The after-effect of human theta burst stimulation is NMDA receptor dependent. Clin. Neurophysiol. 118(5), 1028–1032 (2007)
11. MacDermott, A.B., Mayer, M.L., Westbrook, G.L., Smith, S.J., Barker, J.L.: NMDA-receptor activation increases cytoplasmic calcium concentration in cultured spinal cord neurones. Nature 321(6069), 519–522 (1986)
12. Lefaucheur, J.P.: Repetitive transcranial magnetic stimulation (rTMS): insights into the treatment of Parkinson's disease by cortical stimulation. Neurophysiol. Clin. 36(3), 125–133 (2006)
13. Joo, E.Y., Han, S.J., Chung, S.H., Cho, J.W., Seo, D.W., Hong, S.B.: Antiepileptic effects of low-frequency repetitive transcranial magnetic stimulation by different stimulation durations and locations. Clin. Neurophysiol. 118(3), 702–708 (2007)
Interactions between Spike-Timing-Dependent Plasticity and Phase Response Curve Lead to Wireless Clustering

Hideyuki Câteau1, Katsunori Kitano2, and Tomoki Fukai1

1 RIKEN BSI, Hirosawa, Wako, 351-0198 Saitama, Japan
2 Department of Human and Computer Intelligence, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan
Abstract. A phase response curve characterizes the signal transduction between neurons in a minimal manner, whereas spike-timing-dependent plasticity (STDP) characterizes the way networks are rewired in an activity-dependent manner. The present paper demonstrates that these two key spike-related properties work synergistically to carve functionally useful circuits in the brain. STDP acting on a population of neurons that prefer asynchrony turns out to convert the initial asynchronous firing into clustered firing with synchrony within each cluster. The neurons get synchronized within a cluster despite their preference for asynchrony because STDP selectively disrupts intra-cluster connections, a phenomenon we call wireless clustering.
1 Introduction
The synchrony tendency of coupled oscillators or neurons is predicted by the phase response curve (PRC) of a neuron, which describes the amount of advance or delay caused by a synaptic input arriving at a specific phase of the interspike interval [1,2]. It is intriguing to ask how this useful theory, based on fixed coupling strengths between neurons, generalizes to cases where the synaptic strength varies, as observed in the real brain. A number of experiments [3] have established that synaptic strength changes depending on pre- and postsynaptic spike times, and the theoretical implications of such spike-timing-dependent plasticity (STDP) have been extensively studied [4,5,6,7,8,9]. Since the PRC and STDP both refer to the timings of spikes, a natural question is how these two properties of a neuronal network interact with each other to carve a functional network in the brain. To answer this question, we first use a neuron model whose PRC can be systematically controlled [10], unlike the simpler leaky integrate-and-fire (LIF) model. The model neurons favor either asynchronous (Model A) or synchronous (Model B) firing depending on the values of the model parameters. Our simulations show that STDP acting on a network of Model A neurons converts asynchronously firing neurons into three or more cyclically activated clusters of neurons. Model A neurons can synchronize within a cluster despite their preference for asynchrony because, as we see later, STDP selectively disrupts M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 142–150, 2008. © Springer-Verlag Berlin Heidelberg 2008
intra-cluster connections, nullifying the asynchrony preference. If STDP works on a network of Model B neurons, however, the neurons simply get synchronized globally, analogous to what was observed in [11], and nothing interesting happens. Next we show that the self-organized cyclic activity also appears under biologically realistic settings using a Hodgkin-Huxley-type neuron model, suggesting the generality of the concept. In the self-organization, the PRC specifies the timing preference and influences the way STDP works. Importantly, STDP in turn influences the way the PRC is read out. Before the STDP learning begins, the initial slope of an effective PRC (defined later) determines the stability of the global synchrony. After the STDP learning forms the cyclic activity consisting of n clusters, the slope of the effective PRC at θ = 2π(1 − 1/n) determines its stability. Thus, the two key features of spiking neurons, PRC and STDP, work synergistically to organize functional networks in the brain. Previous studies have shown [12,13,14] that STDP helps a pacemaker neuron entrain an innervated neuron(s), which was called "frequency synchrony", meaning that neurons start firing at the same frequency but with different phases, as opposed to the "phase synchrony" studied here. Building upon the firm theoretical analysis of frequency synchrony [1,12,13,14], it is now important to ask when and how frequency synchrony specializes to phase synchrony, because phase synchrony can effectively enlarge otherwise tiny EPSCs to a size capable of reliably eliciting spikes in innervated neurons.
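The dependence of synchrony on the PRC can be illustrated with two identical pulse-coupled phase oscillators: whenever one fires, the other's phase is shifted by the PRC. The sinusoidal PRCs below are hypothetical stand-ins for the asynchrony-favoring (Model A) and synchrony-favoring (Model B) cases, not the actual PRCs of the Izhikevich model used later.

```python
import numpy as np

def return_map(phi, prc, eps=0.05):
    """Phase lag of oscillator B at A's firing times, advanced by one full
    cycle of two identical pulse-coupled oscillators with unit period."""
    phi = (phi + eps * prc(phi)) % 1.0                  # A fires and kicks B
    psi = ((1.0 - phi) + eps * prc(1.0 - phi)) % 1.0    # B fires and kicks A
    return (1.0 - psi) % 1.0                            # B's phase at A's next spike

def final_lag(prc, phi0=0.2, n_cycles=500):
    phi = phi0
    for _ in range(n_cycles):
        phi = return_map(phi, prc)
    return phi

# The PRC slope near zero lag decides the stable state:
sync_prc = lambda p: -np.sin(2 * np.pi * p)     # synchrony-favoring (Model-B-like)
async_prc = lambda p: np.sin(2 * np.pi * p)     # asynchrony-favoring (Model-A-like)
```

Iterating from an initial lag of 0.2, final_lag(sync_prc) converges to zero lag (in-phase firing), while final_lag(async_prc) settles at half a period (anti-phase), mirroring the synchrony preferences of Models B and A.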
2 Self-organization of Izhikevich Neurons
We consider a population of neurons firing quasi-periodically. We use the spiking neuron model proposed by Izhikevich [10]. Depending on the values of its four parameters, a, b, c and d, this model can produce many different voltage trajectories similar to those found in real neurons. Fifty Model A neurons that favor asynchrony (a = 0.02, b = 0.2, c = −50, d = 1.26) are connected in an all-to-all manner with uniform synaptic strength and with a range of synaptic delays of 2 ± 0.2 ms. The neurons fire quasi-periodically, driven by supra-threshold stochastic inputs, I = I_0 + σζ(t) with I_0 = 30 mV/ms and σ = 1.5 mV/ms^{1/2}, where ζ(t) is Gaussian white noise. With no synaptic plasticity at work, the initially uniform distribution of firing phases (Fig. 1a) remains asynchronous because the neurons favor asynchrony. However, the standard additive STDP rule with hard bounds (0 < w < 1), defined [4] by

\[
\Delta w = \begin{cases} A_+ \exp(-\Delta t/\tau_+) & \text{for } \Delta t \ge 0, \\ -A_- \exp(-|\Delta t|/\tau_-) & \text{for } \Delta t < 0, \end{cases} \tag{1}
\]

with A_+ = 0.05, A_− = 0.0525 and τ_+ = τ_− = 20 ms, converts the asynchronous firing into clusters of synchronous firing (Fig. 1b).
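A minimal sketch of the additive rule of Eq. (1), using the parameter values given in the text; the single-pair update and the clipping to the hard bounds follow the standard additive scheme of [4], while the function names are ours:

```python
import math

# Parameters from the text: A+ = 0.05, A- = 0.0525, tau+ = tau- = 20 ms.
A_PLUS, A_MINUS = 0.05, 0.0525
TAU_PLUS = TAU_MINUS = 20.0  # ms

def stdp_dw(dt):
    """Weight change of Eq. (1) for a timing difference dt (ms).
    Following the text, dt = t_post_spike - t_EPSP_by_pre, not the raw
    spike-time difference."""
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    return -A_MINUS * math.exp(-abs(dt) / TAU_MINUS)

def update_weight(w, dt):
    """Apply Eq. (1) and clip to the hard bounds 0 <= w <= 1."""
    return min(1.0, max(0.0, w + stdp_dw(dt)))
```

Because dt is measured to the somatic EPSP, which lags the presynaptic spike by a few ms, two exactly synchronous neurons see a small negative dt in both directions, so both mutual weights depress; this is the mechanism behind the wireless clustering discussed below.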
H. Câteau, K. Kitano, and T. Fukai
Fig. 1. Clustering of Model A neurons by STDP. Fifty Model A neurons fire in complete asynchrony before the STDP learning starts. (a) Voltage trajectories of all the neurons, drawn in different colors, are overlaid. (b) The neurons start firing in three different clusters with intra-cluster synchrony due to STDP. (c) A gray-scale representation of the connection strength between neurons, with black being the strongest. (d) A raster plot corresponding to (b). Neurons are aligned according to spike times. (e) The connection matrix when the higher-order rule of STDP is applied: a synaptic change is discounted by a factor 1 − exp(−(t_post spike 2 − t_EPSP by pre)/τ_remove) when a triplet of events, t_post spike 1 < t_EPSP by pre < t_post spike 2, occurs. (f) A schematic drawing showing the network topology corresponding to (d) and (e). (g), (h) Positive feedback mechanism leading to the wireless clustering. In (g), the vertical lines indicate spike times of two neurons; the small wedges indicate EPSPs elicited by spikes. In (h), distributed firing phases of neurons are represented by the filled circles in different colors.
The network topology underlying this clustered synchrony is depicted by a matrix of synaptic strengths (Fig. 1c), where neurons are indexed according to firing times after the learning (t > 9.89 s). The three divisions apparent in Fig. 1c correspond to three synchronously firing clusters of neurons (Fig. 1d).
Interestingly, STDP appears to have removed the intra-cluster connections almost completely (Fig. 1c). Such clustering without connections is observed commonly under various simulation conditions, and we will call it wireless clustering. It turns out that the wireless clustering happened because we reflected the experimental procedures of STDP faithfully. The standard STDP rule implies potentiation for a positive timing difference, Δt = t_post − t_pre > 0, and depression for a negative timing difference, Δt < 0. Many have argued that the asymmetry of this rule produces one-way coupling (see e.g. [4]). Such arguments would be valid if Δt represented the time difference between post- and presynaptic spikes. However, most experimental studies [3] actually define Δt as the time difference between a postsynaptic spike and the onset or peak of the somatic excitatory postsynaptic potential (EPSP) induced by a presynaptic spike: Δt = t_post spike − t_EPSP by pre. Hence the above argument does not apply. A somatic EPSP should lag behind a presynaptic spike by a few msec. Therefore, if two neurons fire in exact synchrony (Fig. 1g), Δt is negative [12] in both directions, which weakens the connections bidirectionally. Now, how does this mechanism convert initial asynchronous firing into clustered synchronous firing (Fig. 1b)? Initial asynchronous firing (Fig. 1a) is represented as firing phases evenly spread around the circle (Fig. 1h, left). The firing remains asynchronous without STDP. However, with the phases of many neurons squeezed onto the circle, any single neuron must have neighboring neurons that unwillingly fire synchronously with it (Fig. 1h). Among these neurons, the above-mentioned mechanism weakens the connections bidirectionally. As their synaptic connections weaken, their mutual repulsion also weakens, which further synchronizes their firing. This positive feedback mechanism develops wireless clusters (Fig. 1h).
Although this mechanism qualitatively explains how the clustering happens, the quantitative question of how many clusters are formed requires further consideration. We will later see that a stability analysis tells us the possible number of clusters. In contrast to the vanishing intra-cluster connections, the inter-cluster connections survive and can be unidirectional (Fig. 1d), which defines a cyclic network topology such as that shown in Fig. 1f, upper. Let us ask how we can change this 3-cycle topology. We find that one of the recently observed higher-order rules of STDP [15,16] increases the number of clusters (Fig. 1e). The higher-order rule shown in [16] implies a gross reduction in the LTD effect, because LTP overrides the immediately preceding LTD, while LTD only partly cancels the immediately preceding LTP. The weakened LTD effect is likely to increase the total number of potentiated synapses, which is consistent with the increased proportion of black areas in Fig. 1e compared to Fig. 1c. In contrast to such cluster-wise synchrony observed with Model A neurons, Model B neurons that favor synchrony (a = 0.02, b = 0.2, c = −50, d = 40) self-organize into the globally synchronous state with or without STDP (Fig. 2). Due to the global synchrony, mutual synaptic connections are largely lost, and each neuron ends up being driven individually by the external input, with little sense of acting as a population. The global synchrony gives too strong
Fig. 2. Global synchrony observed with Model B neurons that favor synchrony. A raster plot (a) and connection matrix (b) of fifty Model B neurons. The neurons were aligned with the connection-based method: neurons are defined to belong to the same cluster whenever their mutual connections are small enough.
an impact and also has minimal coding capacity because all the neurons behave identically; it appears to bear more similarity to pathological activity, such as seizures, than to meaningful information processing. By contrast, the clustered synchrony arising in the network of Model A neurons appears functionally useful. Generally in the brain, the unitary EPSP amplitude (∼0.5 mV) is much smaller than the voltage rise needed to elicit firing (∼15 mV). Therefore, single-neuron activity alone cannot cause other neurons to respond, and it is difficult to regard single-neuron activity as a carrier of information transferred back and forth in the brain. In contrast, the self-organized assembly of tens of Model A neurons (Fig. 1d) looks like an ideal candidate for a carrier of information in the brain, because its impact on other neurons is strong enough to elicit responses. Additionally, a cluster can reliably code timing information. The PRC, Z(2πt/T), representing the amount of advance or delay of the next firing time in response to an input at time t in the firing interval [0, T], has mostly been used to decide whether a coupled pair of neurons or oscillators tends to synchronize or desynchronize under the assumption that the connection strengths between the neurons are equal and unchanged. Specifically, suppose that a pair of neurons are mutually connected and a spike of one neuron introduces a current with waveform EPSC(t) in the innervated neuron after a transmission delay of τ_d. The effective PRC, defined as

\[
\Gamma_-(\theta) = \frac{1}{T}\int_0^T Z(2\pi t'/T)\, \mathrm{EPSC}\!\left(t' - \tau_d - \frac{\theta T}{2\pi}\right) dt',
\]

is known to decide their synchrony tendency. If the slope of Γ_−(θ) at θ = 0 is positive (negative), the two neurons are desynchronized (synchronized). This synchrony condition carries over to a population of neurons coupled in an all-to-all or random manner, as long as the connection strengths remain unchanged.
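The effective PRC can be evaluated numerically by averaging the PRC against the shifted EPSC waveform over one period. The sketch below uses an illustrative sinusoidal Z and an alpha-function EPSC (assumptions of this example, not the actual PRC or synaptic kinetics of the models above) and treats the shifted time argument as periodic:

```python
import numpy as np

def effective_prc(Z, epsc, T, tau_d, n=1000):
    """Evaluate Gamma_-(theta) = (1/T) int_0^T Z(2*pi*t'/T)
    EPSC(t' - tau_d - theta*T/(2*pi)) dt' on a grid of theta values,
    wrapping the EPSC argument onto one period T."""
    t = np.linspace(0.0, T, n, endpoint=False)
    thetas = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    gamma = np.empty(n)
    for k, th in enumerate(thetas):
        arg = (t - tau_d - th * T / (2 * np.pi)) % T  # wrap onto one period
        gamma[k] = np.mean(Z(2 * np.pi * t / T) * epsc(arg))
    return thetas, gamma

# Illustrative choices (assumed, for demonstration only).
Z = lambda th: np.sin(th)                           # toy PRC
epsc = lambda s: (s / 2.0) * np.exp(1.0 - s / 2.0)  # alpha function, peak at 2 ms
thetas, gamma = effective_prc(Z, epsc, T=30.0, tau_d=2.0)
```

The sign of the slope of `gamma` at theta = 0 then predicts desynchronization (positive) or synchronization (negative) for the pair.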
Theoretically calculated Γ_−(θ)s for Models A and B (Fig. 3a,b) explain why the all-to-all network of Model A (B) neurons exhibits global asynchrony (synchrony). Note that both Model A and Model B neurons belong to type II [19], so that both model neurons favor synchrony if they are delta-coupled with no synaptic delay. After STDP is switched on, the network of Model A neurons self-organizes into the 3-cycle circuit (Fig. 1d), with the successive phase difference of the clustered activity being Δ_suc θ = 2π/3. Stability analysis shows that the slope
Fig. 3. Effective PRCs and schema of the triad mechanism. The effective PRCs of Models A and B were calculated with the adjoint method [19] and are shown in (a) and (b). The slope at θ = 0 is positive for (a) but negative for (b), although this is hardly recognizable at this resolution. The slope at θ = 2π/3 is negative for (a) but positive for (b). The dashed lines represent θ = 2π/3 and θ = 2π/4.
of Γ_−(θ), not at the origin but at θ = 2π − Δ_suc θ, now determines the stability of the 3-cycle activity: the 3-cycle activity is stable if Γ_−′(2π − Δ_suc θ) < 0. Fig. 3a tells us that the 3-cycle activity shown in Fig. 1c is stable. The stable cyclic activity is achieved through the following synergetic process: (1) the PRC determines the preferred network activity (e.g. asynchronous or synchronous), (2) the network activity determines how STDP works, and STDP modifies the network structure (e.g. from all-to-all to cyclic), and (3) the network structure determines how the PRC is read out (e.g. at θ = 0 or θ = 2π − Δ_suc θ), closing the loop. Generally, we can show that the n-cycle activity whose successive phase difference equals Δ_suc θ = 2π/n is stable if Γ_−′(2π − Δ_suc θ) < 0. PRCs of biologically plausible neuron models or real neurons [20] tend to have a negative slope in a later phase of the firing interval and converge to zero at θ = 2π, because the membrane potential starts the regenerative depolarization and becomes insensitive to any synaptic input. The corresponding effective PRCs inherit this negative slope in the later phase, which tends to stabilize the n-cycle activity for some n.
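The n-cycle stability test stated here can be applied directly to a sampled effective PRC. The helper below (our naming; demonstrated on a synthetic Γ_−(θ) = sin θ rather than a computed one) checks the sign of the slope at θ = 2π(1 − 1/n) for each candidate cycle length:

```python
import numpy as np

def stable_cycle_lengths(thetas, gamma, n_max=6):
    """Return cycle lengths n for which the slope of Gamma_- at
    theta = 2*pi - 2*pi/n is negative (the stability condition in the text)."""
    dgamma = np.gradient(gamma, thetas)
    stable = []
    for n in range(2, n_max + 1):
        theta_star = 2 * np.pi * (1.0 - 1.0 / n)
        k = np.argmin(np.abs(thetas - theta_star))  # nearest sampled phase
        if dgamma[k] < 0:
            stable.append(n)
    return stable

thetas = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)
stable = stable_cycle_lengths(thetas, np.sin(thetas))  # synthetic example
```

For the synthetic sin θ, the slope cos(2π − 2π/n) is negative for n = 2 and 3 but positive for n = 5 and 6, so only the short cycles pass the test.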
3 Self-organization of Hodgkin-Huxley Type Neurons
Next we see that the self-organized cyclic activity with wireless clustering is also observed in a biologically realistic setting. Our simulations, as described in [18], with 200 excitatory and 50 inhibitory neurons modeled with the Hodgkin-Huxley (HH) formalism exhibit the 3-cycle activity with wireless clustering (Fig. 4a,b). The setup here is biologically realistic in that (1) HH-type neurons are used, (2) a physiologically known percentage of inhibitory neurons with non-plastic synapses is included, and (3) the neurons fire with high irregularity due to large noise in the background input, unlike the well-regulated firing shown in Fig. 1c. Interestingly, the effective PRC (Fig. 4c) of the HH-type neuron shares important features with that of Model A: a positive initial slope implying the preference for asynchrony, and a negative later slope stabilizing the 3-cycle activity.
Fig. 4. Conductance-based model also develops the wireless clustering. (a) A raster plot of 200 HH-type excitatory neurons showing 3-cycle activity. (b) The corresponding connection matrix showing the wirelessness. (c) Effective PRC or Γ− (θ) of the conductance-based model.
A general technical difficulty with HH simulations is their massive computational demand, due to the complexity of the system. That difficulty has hindered theoretical analysis and has left such studies largely experimental. In particular, we had previously tried in vain to understand why we never observed 4-cycles or longer. However, the analytic argument developed here with the simplified model gives clear insight into the biologically plausible but complex system. Comparison of Fig. 3a and Fig. 4c reveals that the negative slope of Γ_−(θ) of the HH model is located further to the left than that of Model A, indicating less stability of long cycles in the HH simulations. Keeping in mind the larger amount of noise in the HH simulations, it is now understood that 4-cycles and longer can easily be destabilized there. Thus, our analysis developed with the simplified system serves as a useful tool for understanding biologically realistic but complex systems. There is, however, an interesting difference between the Model A and HH simulations. Although intra-cluster wirelessness is a fairly good first approximation in the HH model simulations (Fig. 4b), it is not as exact as in the Model A simulations (Fig. 1c,e). Interestingly, eliminating the residual intra-cluster connections destroys the cyclic activity, suggesting a supportive role for the tiny residual intra-cluster connections.
4 Discussion
In a previous simulation study [17] using the LIF model, cyclic activity was observed to propagate only at the theoretical speed limit: it takes only τ_d to go from one cluster to the next, which requires zero membrane integration time. To understand
why this was the case, we first recall that the effective PRC needs a negative slope at 2π − Δ_suc θ to stabilize the cyclic activity. However, the slope of the PRC of an LIF model,

\[
Z(\theta) = c \exp\!\left(\frac{T}{\tau_m}\frac{\theta}{2\pi}\right),
\]

is always positive except at the end point, where Z(2π − 0) = c exp(T/τ_m) and Z(2π + 0) = c, implying Z′(2π) = −∞. This infinitely sharp negative slope of the PRC at θ = 2π is rounded and displaced to 2π − 2πτ_d/T in Γ_−(θ) (see its definition). Since this is the only place where Γ_−(θ) has a negative slope, the cyclic activity is stable only if Δ_suc θ = 2πτ_d/T, implying propagation at the theoretical speed limit. We have demonstrated an intimate interplay between PRC and STDP using the Izhikevich neuron model as well as the HH-type model. The present study complements previous studies using the phase oscillator [11,14], where its mathematical tractability was exploited to analytically investigate the stability of global phase/frequency synchrony. The self-organization, or unsupervised learning, by STDP studied here complements the supervised learning studied in [22]. The propagation of synchronous firing and the temporal evolution of synaptic strength under STDP are known to be analyzable semi-analytically with the Fokker-Planck equation [5,6,8,9,21]. It is an interesting future direction to see how the Fokker-Planck equation can be used to understand the interplay between PRC and STDP.
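The shape of the LIF PRC and its lone negative slope can be checked numerically. A short sketch, where c, T and τ_m are arbitrary illustrative values:

```python
import numpy as np

# LIF phase response curve: Z(theta) = c * exp((T/tau_m) * theta/(2*pi))
# on [0, 2*pi); constants are illustrative.
c, T, tau_m = 1.0, 30.0, 20.0
theta = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)
Z = c * np.exp((T / tau_m) * theta / (2 * np.pi))

slopes = np.diff(Z)   # strictly positive inside the interval
jump = c - Z[-1]      # Z(2*pi + 0) - Z(2*pi - 0) = c - c*exp(T/tau_m) < 0
```

The only negative slope is the discontinuous drop at θ = 2π; after smoothing by the EPSC and shifting by the delay, this is why the cyclic activity is stable only at Δ_suc θ = 2πτ_d/T.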
Acknowledgement. The present authors thank Dr. T. Takewaka at RIKEN BSI for offering the code to calculate the PRC.
References
1. Kuramoto, Y.: Chemical Oscillations, Waves, and Turbulence. Springer, Berlin (1984)
2. Ermentrout, G., Kopell, N.: SIAM J. Math. Anal. 15, 215 (1984)
3. Markram, H., et al.: Science 275, 213 (1997); Bell, C.C., et al.: Nature 387, 278 (1997); Magee, J.C., Johnston, D.: Science 275, 209 (1997); Bi, G.-Q., Poo, M.-M.: J. Neurosci. 18, 10464 (1998); Feldman, D.E.: Neuron 27, 45 (2000); Nishiyama, M., et al.: Nature 408, 584 (2000)
4. Song, S., et al.: Nat. Neurosci. 3, 919 (2000)
5. van Rossum, M.C., Turrigiano, G.G., Nelson, S.B.: J. Neurosci. 22, 1956 (2000)
6. Rubin, J., et al.: Phys. Rev. Lett. 86, 364 (2001)
7. Abbott, L.F., Nelson, S.B.: Nat. Neurosci. 3, 1178 (2000)
8. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
9. Câteau, H., Fukai, T.: Neural Comput. 15, 597 (2003)
10. Izhikevich, E.M.: IEEE Trans. Neural Netw. 15, 1063 (2004)
11. Karbowski, J.J., Ermentrout, G.B.: Phys. Rev. E 65, 031902 (2002)
12. Nowotny, T., et al.: J. Neurosci. 23, 9776 (2003)
13. Zhigulin, V.P., et al.: Phys. Rev. E 67, 021901 (2003)
14. Masuda, N., Kori, H.: J. Comput. Neurosci. 22, 327 (2007)
15. Froemke, R.C., Dan, Y.: Nature 416, 433 (2002)
16. Wang, H.-X., et al.: Nat. Neurosci. 8, 187 (2005)
17. Levy, N., et al.: Neural Netw. 14, 815 (2001)
18. Kitano, K., Câteau, H., Fukai, T.: Neuroreport 13, 795 (2002)
19. Ermentrout, G.B.: Neural Comput. 8, 979 (1996)
20. Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1673 (1993); Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1661 (1993); Oprisan, S.A., Prinz, A.A., Canavier, C.C.: Biophys. J. 87, 2283 (2004); Netoff, T.I., et al.: J. Neurophysiol. 93, 1197 (2005); Lengyel, M., et al.: Nat. Neurosci. 8, 1667 (2005); Galan, R.F., Ermentrout, G.B., Urban, N.N.: Phys. Rev. Lett. 94, 158101 (2005); Preyer, A.J., Butera, R.J.: Phys. Rev. Lett. 95, 13810 (2005); Goldberg, J.A., Deister, C.A., Wilson, C.J.: J. Neurophysiol. 97, 208 (2007); Tateno, T., Robinson, H.P.: Biophys. J. 92, 683 (2007); Mancilla, J.G., et al.: J. Neurosci. 27, 2058 (2007); Tsubo, Y., et al.: Eur. J. Neurosci. 25, 3429 (2007)
21. Câteau, H., Reyes, A.D.: Phys. Rev. Lett. 96, 058101 (2006), and references therein
22. Lengyel, M., et al.: Nat. Neurosci. 8, 1677 (2005)
A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells

Yoko Yamaguchi(1), Colin Molter(1), Wu Zhihua(1,2), Harshavardhan A. Agashe(1), and Hiroaki Wagatsuma(1)

1 Lab for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, Japan
2 Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
[email protected]
Abstract. This paper proposes a computational model of the formation of the spatio-temporal properties of the entorhinal neurons recently known as "grid cells". The model consists of module structures for local path integration, for multiple sensory integration, and for theta phase coding of grid fields. Theta phase precession naturally encodes the spatial information in theta phase. The proposed module structures are in good agreement with head direction cells and grid cells in the entorhinal cortex. The functional role of theta phase coding in the entorhinal cortex for cognitive map formation in the hippocampus is discussed.

Keywords: Cognitive map, hippocampus, temporal coding, theta rhythm, grid cell.
1 Introduction

In rodents, it is well known that a hippocampal neuron increases its firing rate at some specific position in an environment [1]. These neurons are called place cells and are considered to provide the neural representation of a cognitive map. Recently it was found that entorhinal neurons, which give major inputs to the hippocampus, fire at positions distributed in a triangular-grid-like pattern in the environment [2]. They are called "grid cells", and their spatial firing preference is termed a "grid field". Interestingly, temporal coding of spatial information, the "theta phase precession" initially found in hippocampal place cells, was also observed in grid cells in the superficial layer of the entorhinal cortex [3], as shown in Figs. 1–3. A sequence of neural firing is locked to the theta rhythm (4–12 Hz) of the local field potential (LFP) during spatial exploration. As a step toward understanding cognitive map formation in the rat hippocampus, the mechanism that forms the grid field, and also the mechanism of phase precession formation in grid cells, must be clarified. Here we propose a model of neural computation to create grid cells based on known properties of entorhinal neurons, including "head direction cells", which fire
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 151–159, 2008. © Springer-Verlag Berlin Heidelberg 2008
when the animal's head points in some specific direction in the environment. We demonstrate that theta phase precession in the entorhinal cortex naturally emerges as a consequence of the grid cell formation mechanism.
Fig. 1. Theta phase precession observed in rat hippocampal place cells. When the rat traverses a place field, the spike timing of the place cell gradually advances relative to the local field potential (LFP) theta rhythm. In a running sequence through place fields A-B-C, a spike sequence in the order A-B-C emerges in each theta cycle. The spike sequence repeatedly encoded in theta phase is considered to lead to robust on-line memory formation of the running experience through asymmetric synaptic plasticity in the hippocampus.
Fig. 2. Network structure of the hippocampal formation (DG, CA3, CA1, the entorhinal cortex (EC) deeper layer and EC superficial layer) and cortical areas giving multimodal input. Theta phase precession was initially found in the hippocampus, and was also found in the EC superficial layer. The EC superficial layer can therefore be considered an origin of theta phase precession.
Fig. 3. Top) A grid field in an entorhinal grid cell (left) and a place field (right) in a hippocampal place cell. Bottom) Theta phase precession observed in the EC grid cell and in the hippocampal place cell, relative to EC LFP theta, as a function of time.
2 Model

The firing rate of the i-th grid cell at a location (x, y) in a given environment increases under the condition given by the relation:

\[
\begin{aligned}
x &= \alpha_i + nA_i \cos\phi_i + mA_i \cos(\phi_i + \pi/3),\\
y &= \beta_i + nA_i \sin\phi_i + mA_i \sin(\phi_i + \pi/3),
\end{aligned}
\qquad \text{with } n, m = \text{integer} \pm r, \tag{1}
\]

where φ_i, A_i and (α_i, β_i) respectively denote one of the angles characterizing the grid orientation, the distance between nearby vertices, and the spatial phase of the grid field in the environment. The parameter r, which is less than 1.0, gives the relative size of the field with high firing rate.
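Eq. (1) can be checked for a given position by inverting the two-vector lattice expansion. A sketch; the function name and the symmetric reading "within r of an integer" are our assumptions, consistent with the condition −r < S < r used later in the text:

```python
import numpy as np

def in_grid_field(x, y, phi, A, alpha, beta, r=0.2):
    """Test Eq. (1): solve (x - alpha, y - beta) for the lattice coordinates
    (n, m) along the basis directions phi and phi + pi/3, then check that both
    lie within r of an integer."""
    B = A * np.array([[np.cos(phi), np.cos(phi + np.pi / 3)],
                      [np.sin(phi), np.sin(phi + np.pi / 3)]])
    n, m = np.linalg.solve(B, np.array([x - alpha, y - beta]))
    frac = lambda u: abs(u - round(u))
    return bool(frac(n) < r and frac(m) < r)
```

With phi = 0, A = 1 and zero spatial phase, the lattice vertex (x, y) = (1, 0) falls inside a field, while the midpoint (0.5, 0) does not.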
Fig. 4. Illustration of a hypercolumn structure for grid field computation in the hypothesized entorhinal cortex. The bottom layer consists of local path integration modules with a hexagonal direction system. The middle layer associates the output of local path integration with visual cues in a given environment. The top layer consists of triplets of grid cells whose grid fields have a common orientation, a common spatial scale and complementary spatial phases. Phase precession is generated at the grid cell in each grid field.
The computational goal in creating a grid field is to find the region with n, m = integer ± r. We hypothesize that the deeper layer of the entorhinal cortex works as a local path integration system using head direction and running velocity. The local path integration results in a variable with slow gradual change that forms a grid field. This change can cause the gradual phase shift of theta phase precession, in accordance with the phenomenological model of theta phase precession by Yamaguchi et al. [4]. The schematic structure of the hypothesized entorhinal cortex and multimodal sensory system is illustrated in Fig. 4. The entorhinal layer includes head direction cells in the deeper layer and grid cells in the superficial layer. The cells with theta phase precession can be considered to be stellate cells. The set of modules along the vertical direction forms a kind of functional column with a direction preference. These columns form a hypercolumnar structure covering a set of directions. The mechanisms of the individual modules are explained below.

2.1 A Module of Local Path Integrator
The local path integration module consists of six units. During the animal's locomotion with a given head direction and velocity, each unit integrates the running distance in its direction with an angle-dependent coefficient. The units have preferred vector directions distributed at π/3 intervals, as shown in Fig. 5. The computation of animal displacement along given directions in this module is illustrated in Fig. 6. The maximum integration length of the distance in each direction is assumed to be common within a module, corresponding to the distance between nearby vertices of the subsequently formed grid field. This computation gives (n, m) in Eq. (1). These systems are distributed in the deeper layer of the entorhinal cortex, in agreement with observations of head direction cells. Different modules have different vector
directions and form a hypercolumn covering the entire set of running directions, or equivalently the entire orientation range of the resultant grid field. The entorhinal cortex is considered to include multiple hypercolumns with different spatial scales. They are considered to work in parallel, possibly giving stability in a global space by compensating for the accumulation of local errors.
Fig. 5. Left) A module of the local path integrator with hexagonal direction vectors and a common vector size. Right) Activity coefficient of each vector unit. A set of these vector units, giving a displacement distance measure, computes the animal's motion for a given head direction.
Fig. 6. Illustration of the computation of local path integration in a module. Animal locomotion with a given head direction is computed by a pair of vector units among the six vectors to give a position measure.
2.2 Grid Field Formation with Visual Cues
The computational results of local path integration are projected to the next module in the superficial layer of the entorhinal cortex, which receives multiple sensory inputs in a given environment. The association of path integration and visual cues determines the relative location of the path integration measure (α_i, β_i) in Eq. (1) in the module. Further interaction within a set of three cells, as shown in Fig. 7, can give robustness to the parameter (α_i, β_i). A possible interaction among these cells is mutual inhibition giving a complementary distribution of the three grid fields.
2.3 Theta Phase Precession in the Grid Field
The input of the parameters (n, m) and (α_i, β_i) to a cell in the next module, at the top of the column, can cause theta phase precession. It is obtained by the fundamental mechanism of theta phase precession proposed by Yamaguchi et al. [4][5]. The mechanism requires the presence of a gradual increase in the natural frequency of a cell with oscillatory activity. Here we consider the top module to consist of stellate cells with intrinsic theta oscillation. The gradual increase in natural frequency is expected to emerge from the input of path integration at each vertex of a grid field.
Fig. 7. A triplet of grid fields with the same local path integration system and different spatial phases can generate a mostly uniform spatial representation where a single grid cell fires at every location. The uniformity can help robust assignment of environmental spatial phases with the help of environmental sensory cues. The association is processed in the middle part of each column in the entorhinal cortex. Projection of each cell's output to the module at the top generates a grid field with theta phase precession, as explained in the text.
3 Mathematical Formulation

A simple mathematical formulation of the above model is given phenomenologically below. The locomotion of the animal is represented by the current displacement (R, φ_c) computed from the head direction φ_H and the running velocity. An elementary vector at a column of the local path integration system has a vector angle φ_i and length A. The output of the i-th vector system I is given by
\[
I(\phi_i) = \begin{cases} 1 & \text{if } |\phi_i - \phi_H| < \pi/2 \text{ and } -r < S(\phi_i) < r,\\ 0 & \text{otherwise,} \end{cases} \tag{2}
\]

with S(φ_i) = R cos(φ_i − φ_c) (S taken modulo A),
where r and A respectively represent the field radius and the distance between neighboring grid vertices. The output of path integration module Di to the middle layer is given by
\[
D_i = \prod_k I(\phi_i + k\pi/3). \tag{3}
\]
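Eqs. (2)-(3) can be sketched directly. Two points are assumptions of this illustration: S is wrapped onto the centered interval [-A/2, A/2), and the product in Eq. (3) is taken over the adjacent pair k in {0, 1}, matching the two lattice coordinates (n, m) of Eq. (1), since the extracted text does not specify the range of k:

```python
import math

def S(phi_i, R, phi_c, A):
    """Displacement measure of Eq. (2): projection of the displacement
    (R, phi_c) onto direction phi_i, taken modulo A and centered on zero."""
    s = R * math.cos(phi_i - phi_c)
    return (s + A / 2.0) % A - A / 2.0

def I_unit(phi_i, phi_h, R, phi_c, A, r):
    """Eq. (2): the unit is active when its preferred direction lies within
    pi/2 of the head direction and the wrapped displacement is in (-r, r)."""
    facing = abs((phi_i - phi_h + math.pi) % (2 * math.pi) - math.pi) < math.pi / 2
    return 1 if facing and -r < S(phi_i, R, phi_c, A) < r else 0

def D_module(phi_i, phi_h, R, phi_c, A, r, ks=(0, 1)):
    """Eq. (3): product of I over the directions phi_i + k*pi/3 (k range assumed)."""
    prod = 1
    for k in ks:
        prod *= I_unit(phi_i + k * math.pi / 3, phi_h, R, phi_c, A, r)
    return prod
```

For a small displacement (R = 0.05, with A = 1 and r = 0.2) both paired units sit near a lattice vertex and D = 1; at R = 0.5 the wrapped displacement leaves (-r, r) and D = 0.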
Through association with visual cues, the spatial phase of the grid is determined (details are not shown here). The terms of Eqs. (2)-(3), passed from the middle layer to the top layer, give on-off regulation and also a parameter with gradual increase within a grid field. The dynamics of the membrane potential G_i of the cell at the top layer is given by

\[
\frac{d}{dt} G_i = f(G_i, t) + a D_i S(\phi_i) + I_{\mathrm{theta}}, \tag{4}
\]
where f is a function describing time-dependent ionic currents and a is a constant. The last term, I_theta, denotes a sinusoidal current representing the theta oscillation of inhibitory neurons. For proper dynamics of f, the second term on the right-hand side activates the grid cell oscillation and gradually increases its natural frequency. According to our former results using a phenomenological model [5], the theta current term leads to phase locking of grid cells with a gradual phase shift. This realizes a cell with a grid field and theta phase precession. One can test Eq. (4) by applying several types of equations, including simple reduced models and biophysical models of hippocampal or entorhinal cells. An example of computer experiments is given in the following section.
4 Computer Simulation of Theta Phase Precession

The mechanism of theta phase precession was phenomenologically proposed by Yamaguchi et al. [4][5] as a coupling of two oscillations. One is the LFP theta oscillation with the constant frequency of the theta rhythm. The other is a sustained oscillation with a gradual increase in natural frequency. In the presence of LFP theta, the sustained oscillation exhibits a gradual phase shift through quasi-steady states of phase locking. A simulation using a hippocampal pyramidal cell model [6] is shown in Fig. 8. It is clearly seen that LFP theta instantaneously captures the oscillation with gradually increasing natural frequency into a quasi-stable phase in each theta cycle, giving a gradual phase shift. This phase shift is robust against perturbations, as a consequence of phase locking in nonlinear oscillations. A simulation with a model of an entorhinal stellate cell [7] was also carried out, and we obtained similar phase precession with the stellate cell model. One important property of stellate cells is the presence of subthreshold oscillations, while synchronization of this oscillation can be reduced to the simple behavior of the phase model. Thus, the mechanism of the phenomenological model [5] is found to provide a comprehensive description of the phase locking of complex biophysical neuron models.
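The two-oscillator mechanism can be illustrated with a bare phase model: a unit whose natural frequency ramps up slowly, coupled to LFP theta at a fixed frequency. The coupling strength, ramp rate and frequencies below are illustrative assumptions; this is a caricature of the phenomenological mechanism of [4][5], not the biophysical model of Fig. 8.

```python
import math

omega0 = 2 * math.pi * 8.0   # LFP theta at 8 Hz
K = 40.0                     # coupling strength (illustrative)
dt = 1e-4                    # time step, s
psi = 0.0                    # phase of the unit relative to LFP theta
trace = []
for step in range(int(2.0 / dt)):          # 2 s of simulated time
    t = step * dt
    omega = 2 * math.pi * (8.0 + 0.5 * t)  # natural frequency ramps 8 -> 9 Hz
    psi += dt * (omega - omega0 - K * math.sin(psi))
    trace.append(psi)
```

The locked solution sin(psi) = (omega - omega0)/K tracks the ramp, so the relative phase advances gradually and robustly, which is the quasi-steady phase shift described above.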
Fig. 8. Computer experiment of theta phase precession using a hippocampal pyramidal neuron model [6]. (a) Bottom: input current with gradual increase. Top: resultant sustained oscillation of the membrane potential with gradual increase in natural frequency. (b) In the presence of LFP theta (middle), the neuronal activity exhibits theta phase precession.
5 Discussion and Conclusion

We elucidated a computational model of grid cells in the entorhinal cortex to investigate how temporal coding works for spatial representation in the brain. A computational model of grid field formation was proposed based on local path integration. This assumption was found to give theta phase precession within the grid field. This computational mechanism does not require the assumption of learning over repeated trials in a novel environment, but enables instantaneous spatial representation. Furthermore, this model is in good agreement with experimental observations of head direction cells and grid cells. The networks proposed in the model predict local interaction networks in the entorhinal cortex and also head direction systems distributed across many areas. Although the computation of place cells based on grid cells is beyond the scope of this paper, the emergence of theta phase precession in the entorhinal cortex can be used for place cell formation and also for instantaneous memory formation in the hippocampus [8]. These computational model studies, with a space-time structure for environmental space representation, highlight temporal coding over distributed areas for real-time processing of spatial information in an ever-changing environment.
A Computational Model of Formation of Grid Field and Theta Phase Precession
References
1. O'Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford (1978)
2. Fyhn, M., Molden, S., Witter, M., Moser, E.I., Moser, M.B.: Spatial representation in the entorhinal cortex. Science 305, 1258–1264 (2004)
3. Hafting, T., Fyhn, M., Moser, M.B., Moser, E.I.: Phase precession and phase locking in entorhinal grid cells. Program No. 68.8, Neuroscience Meeting Planner. Society for Neuroscience, Atlanta, GA, Online (2006)
4. Yamaguchi, Y., Sato, N., Wagatsuma, H., Wu, Z., Molter, C., Aota, Y.: A unified view of theta-phase coding in the entorhinal-hippocampal system. Current Opinion in Neurobiology 17, 197–204 (2007)
5. Yamaguchi, Y., McNaughton, B.L.: Nonlinear dynamics generating theta phase precession in hippocampal closed circuit and generation of episodic memory. In: Usui, S., Omori, T. (eds.) The Fifth International Conference on Neural Information Processing (ICONIP 1998) and The 1998 Annual Conference of the Japanese Neural Network Society (JNNS 1998), Kitakyushu, Japan, vol. 2, pp. 781–784. IOS Press, Amsterdam (1998)
6. Pinsky, P.F., Rinzel, J.: Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. Journal of Computational Neuroscience 1, 39–60 (1994)
7. Fransén, E., Alonso, A.A., Dickson, C.T., Magistretti, J., Hasselmo, M.E.: Ionic mechanisms in the generation of subthreshold oscillations and action potential clustering in entorhinal layer II stellate neurons. Hippocampus 14(3), 368–384 (2004)
8. Molter, C., Yamaguchi, Y.: Theta phase precession for spatial representation and memory formation. In: The 1st International Conference on Cognitive Neurodynamics (ICCN 2007), Shanghai, 2-09-0002 (2007)
Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor

David Colliaux1,2, Yoko Yamaguchi1, Colin Molter1, and Hiroaki Wagatsuma1

1 Lab for Dynamics of Emergent Intelligence, RIKEN BSI, Wako, Saitama, Japan
2 Ecole Polytechnique (CREA), 75005 Paris, France
[email protected]
Abstract. A phenomenological model is developed in which complex dynamics are the correlate of spatio-temporal memories. If the resting state is not a classical fixed point attractor but a Milnor attractor, multiple oscillations appear in the dynamics of a coupled system. This model can be helpful for describing brain activity in terms of well-classified dynamics and for implementing human-like real-time computation.
1 Introduction
Neuronal collective activities of the brain are widely characterized by oscillations in humans and animals [1][2]. Among various frequency bands, distant synchronization in theta rhythms (4-8 Hz oscillations defined in human EEG) has recently been shown to relate to working memory, a short-term memory for central execution, in human scalp EEG [3][4] and in neural firing in monkeys [5][6]. For long-term memory, information coding is mediated by synaptic plasticity, whereas short-term memory is stored in neural activities [7]. Recent neuroscience studies have reported various types of persistent activity of single neurons and neuronal populations as possible mechanisms of working memory. Among these, bistable states of the membrane potential, up- and down-states, and flip-flop transitions between them were measured in a number of cortical and subcortical neurons. The up-state, characterized by frequent firing, shows stability for seconds or more due to network interactions [8]. However, little is known about whether flip-flop transitions and distant synchronization work together, or about what kinds of processing a flip-flop oscillation network enables. An associative memory network with flip-flop changes was proposed for working memory under the classical rate-coding view [9], but further consideration of dynamical linking based on firing oscillations, such as the theta-rhythm synchronization referred to above, is likely essential for elucidating multiple attractor systems. In addition, Milnor extended the concept of attractors to invariant sets with Lyapunov instability, which has been of interest in physical, chemical, and biological systems; it might allow high freedom in spontaneous switching among semi-stable states [12]. In this paper, we propose a model of oscillation associative memory with flip-flop changes for working memory. We found that
the Milnor attractor condition is satisfied in the resting state of the model. We will first study how the Milnor attractor appears and will then show possible behaviors of coupled units in the Milnor attractor condition.
2 A Network Model

2.1 Structure
In order to realize up- and down-states where the up-state is associated with oscillation, two phenomenological models are joined. Traditionally, associative memory networks are described by state variables {Si} representing the membrane potentials [9]. Oscillation is assumed to appear in the up-state as an internal process, represented by a variable φi for the ith unit. The oscillation dynamics is simply given by a phase model with a resting state and periodic motion [10,11]; cos(φi) stands for an oscillation current in the dynamics of the membrane potential.

2.2 Mathematical Formulation of the Model
The flip-flop oscillations network of N units is described by the set of state variables {Si, φi} ∈ ℝ^N × [0, 2π[^N (i ∈ [1, N]). The dynamics of Si and φi is given by the following equations:

$$\frac{dS_i}{dt} = -S_i + \sum_j W_{ij} R(S_j) + \sigma(\cos\phi_i - \cos\phi_0) + I_{\pm}\,, \qquad \frac{d\phi_i}{dt} = \omega + (\beta - \rho S_i)\sin\phi_i \qquad (1)$$

with R(x) = ½(tanh(10(x − 0.5)) + 1), φ0 = arcsin(−ω/β), and cos(φ0) < 0. R is the spike density of the units, and the input I± will be taken as positive (I+) or negative (I−) pulses (50 time steps), so that we can focus on the persistent activity of units after a phasic input. ω and β are respectively the frequency and the stabilization coefficient of the internal oscillation. ρ and σ represent the mutual feedback between the internal oscillation and the membrane potential. Wij are the connection weights describing the strength of coupling between units i and j. φ0 is known to be a stable fixed point of the equation for φ, and 0 to be a fixed point of the S equation.
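For concreteness, Eq. (1) for a single isolated unit (N = 1, no recurrent W term) can be integrated with a simple Euler scheme. The sketch below uses the paper's ω = 1 and β = 1.2; the choices σ = 0.96 (near the critical value discussed in Sec. 3), the constant input, the step size, and the duration are our own. At the resting state (0, φ0) both derivatives vanish, while a constant input I = 1 leaves no fixed point and S oscillates, as in Sec. 3.2.

```python
import math

omega, beta, rho, sigma = 1.0, 1.2, 1.0, 0.96   # sigma near mu_c (our choice)
phi0 = math.pi - math.asin(-omega / beta)       # branch with cos(phi0) < 0

def derivs(S, phi, I):
    """Right-hand side of Eq. (1) for one isolated unit (no W term)."""
    dS = -S + sigma * (math.cos(phi) - math.cos(phi0)) + I
    dphi = omega + (beta - rho * S) * math.sin(phi)
    return dS, dphi

# at the resting state M0 = (0, phi0) both derivatives are (numerically) zero
dS0, dphi0 = derivs(0.0, phi0, 0.0)

# a constant input I = 1.0 leaves no stable fixed point, so S oscillates
S, phi, dt = 0.0, phi0, 1e-3
trace = []
for n in range(200000):                  # integrate up to t = 200
    dS, dphi = derivs(S, phi, 1.0)
    S, phi = S + dS * dt, phi + dphi * dt
    if n >= 100000:                      # discard the transient
        trace.append(S)
```

The oscillation amplitude recorded in `trace` corresponds to the S-minimum/S-maximum curves plotted against I in Fig. 2.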
3 An Isolated Unit

3.1 Resting State
The resting state is the stable equilibrium when I = 0 for a single unit. We assume ω < β so that M0 = (0, φ0) is a fixed point of the system. To study the linear stability of this fixed point, we write the stability matrix around M0:

$$DF|_{M_0} = \begin{pmatrix} -1 & -\sigma \sin\phi_0 \\ -\rho \sin\phi_0 & \beta \cos\phi_0 \end{pmatrix} \qquad (2)$$
The signs of the eigenvalues of DF|M0, and thus the stability of M0, depend only on μ = ρσ. With our choice of ω = 1 and β = 1.2, μc ≈ 0.96. If μ < μc, M0 is a stable fixed point and there is another fixed point M1 = (S1, φ1), with φ1 < φ0, which is unstable. If μ > μc, M0 is unstable and M1 is stable, with φ1 > φ0. The fixed points exchange stability as the bifurcation parameter μ increases (transcritical bifurcation). The simplified system in the eigenvector coordinates (X1, X2) of the matrix DF|M0 gives a clear illustration of the bifurcation:

$$\frac{dx_1}{dt} = a x_1^2 + \lambda_1 x_1\,, \qquad \frac{dx_2}{dt} = \lambda_2 x_2 \qquad (3)$$
Here a = 0 is equivalent to μ = μc, and in this condition there is a positive-measure basin of attraction although some directions are unstable. The resting state M0 is then not a classical fixed point attractor, because it does not attract all trajectories from an open neighborhood, but it is still an attractor under Milnor's extended definition. The phase plane (S, φ) in Fig. 1 shows that for μ close to the critical value, the nullclines cross twice while staying close to each other in between. That narrow channel makes the configuration indistinguishable from a Milnor attractor in computer experiments.
Fig. 1. Top: phase space (S, φ) with vector field and nullclines of the system. The dashed domain in B shows that M0 has a positive-measure basin of attraction when μ = μc. Bottom: fixed points with their stable and unstable directions for the equivalent simplified system. A: μ < μc. B: μ = μc. C: μ > μc.
Since we have shown that μ is the crucial parameter for the stability of the resting state, we can now set ρ = 1 and study the dynamics as a function of σ, with a close look near the critical regime (σ = μc).
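The critical value μc can be recovered directly from the stability matrix (2): since the trace −1 + β cos(φ0) is negative, M0 loses stability exactly where the determinant changes sign. A short check (a sketch of ours, not the paper's code), with ω = 1 and β = 1.2:

```python
import math

omega, beta = 1.0, 1.2
phi0 = math.pi - math.asin(-omega / beta)   # branch with cos(phi0) < 0

def det_DF(mu):
    """Determinant of DF at M0; only the product mu = rho*sigma enters.
    Since trace = -1 + beta*cos(phi0) < 0, M0 is stable iff det > 0."""
    s = math.sin(phi0)
    return (-1.0) * beta * math.cos(phi0) - mu * s * s

# closed form for the sign change: mu_c = -beta*cos(phi0) / sin(phi0)**2
mu_c = -beta * math.cos(phi0) / math.sin(phi0) ** 2
# mu_c evaluates to about 0.955, matching the paper's mu_c ~ 0.96
```

The sign change of `det_DF` at μc is the transcritical exchange of stability between M0 and M1 described above.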
3.2 Constant Input Can Give Oscillations
Under constant input there are two possible dynamics: a fixed point and a limit cycle. If

$$\left| \frac{\omega}{\beta - S} \right| < 1 \qquad (4)$$

there is a stable fixed point (S1, φ1) with φ1 a solution of

$$\omega + \bigl(\beta - \sigma(\cos\phi_1 - \cos\phi_0) - I\bigr)\sin\phi_1 = 0\,, \qquad S_1 = \sigma(\cos\phi_1 - \cos\phi_0) + I \qquad (5)$$
If condition (4) is not satisfied, the φ equation in (1) gives rise to oscillatory dynamics. Identifying S with its temporal average S̄, dφ/dt = ω + Γ sin(φ) with Γ = β − S̄ is periodic with period $\int_0^{2\pi} \frac{d\phi}{\omega + (\beta - \bar S)\sin\phi}$. This approximation gives an oscillation at frequency $\bar\omega = \sqrt{\omega^2 - (\beta - \bar S)^2}$, in good qualitative agreement with the computer experiments of Fig. 2.
Fig. 2. For each value of the constant current I, the maximum and minimum values of S1 are plotted. The dominant frequency of S1 obtained by FFT is compared to the theoretical value when S is identified with its temporal average.
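The averaged-phase approximation can be checked numerically: for dφ/dt = ω + Γ sin(φ) with |Γ| < ω, the period integral evaluates in closed form to 2π/√(ω² − Γ²). A sketch with assumed values ω = 1, Γ = 0.5:

```python
import math

omega, Gamma = 1.0, 0.5   # example values with |Gamma| < omega (our choice)

# closed form of the period integral over one cycle of phi
T_theory = 2 * math.pi / math.sqrt(omega ** 2 - Gamma ** 2)

# numerical period: integrate dphi/dt = omega + Gamma*sin(phi) until phi
# has advanced by a full turn of 2*pi
phi, t, dt = 0.0, 0.0, 1e-5
while phi < 2 * math.pi:
    phi += (omega + Gamma * math.sin(phi)) * dt
    t += dt
# t and T_theory agree closely (both about 7.26 here)
```

This is the comparison plotted in Fig. 2 as "Frequency" versus "Frequency (theoretical)".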
If we inject an oscillatory input into the system, S oscillates at the same frequency provided the input frequency is low. For higher frequencies, S cannot follow the input and shows complex oscillatory dynamics with multiple frequencies.
4 Two Coupled Units
For two coupled units, flip-flop of oscillations is observed under various conditions. We will analyze the case μ = 0 and flip-flop properties under various connection weights, assuming symmetrical connections (W12 = W21 = W).

4.1 Influence of the Feedback Loop
In Eq. (1), ρ and σ implement a feedback loop representing the mutual influence of φ and S for each unit.

The Case μ = 0. In the case σ = 0 or ρ = 0, φ remains constant at φ = φ0: the system is then a classical recurrent network. Such a model was used to provide an associative memory network storing patterns in fixed point attractors [9]. For small coupling strength, the resting state is a fixed point. For strong coupling strength, two more fixed points appear: one unstable, corresponding to a threshold, and one stable, providing memory storage. After a transient positive input I+ above threshold, the coupled system will be in the up-state; a transient negative input I− can bring it back to the resting state. For a small perturbation (σ ≪ 1 and ρ = 1), the active state becomes a small up-state oscillation, but the associative memory properties (storage, completion) are preserved.

Growing Oscillations. The up-state oscillation in the membrane potential dynamics triggered by an I+ pulse to unit 1 grows as σ increases, and saturates to an up-state fixed point for strong feedback. Interestingly, for a range of feedback strengths near μc, S returns transiently near the Milnor attractor resting state. Projection of the trajectories of the 4-dimensional system onto a 2-dimensional plane section P illustrates these complex dynamics (Fig. 3). A cycle would intersect this plane in two points. For each σ value, we consider S1 at these intersection points. For a range between 0.91 and 1.05 with our choice of parameters, there are many more than two intersection points M*, suggesting chaotic dynamics.

4.2 Influence of the Coupling Strength
The dynamics of two coupled units after a transient input can be a fixed point attractor, as in the resting state (I = 0), or a down-state or up-state oscillation, depending on the coupling strength. Near the critical value of the feedback loop, in addition to these, more complex dynamics occur at intermediate coupling strength.
Fig. 3. A: Influence of the feedback loop. Bifurcation diagram according to σ (top); S1 coordinates of the intersection points of the trajectory with a plane section P according to σ (bottom). B: Influence of the coupling strength. S1 maximum and minimum values and average phase difference (φ1 − φ2) according to W (top); S1 coordinates of the intersection points of the trajectory with a plane section P according to W (bottom).
Down-state Oscillation. For small coupling strength, the system periodically visits the resting state for a long time and goes briefly to the up-state. The frequency of this oscillation increases with coupling strength. The two units are in anti-phase (when Si takes its maximum value, Sj takes its minimum value), Fig. 4 (bottom). Up-state Oscillation. For strong coupling strength, a transient input to unit 1 leads to an up-state oscillation, Fig. 4 (top). The two units are perfectly in phase at W = 0.75, and the phase difference stays small for stronger coupling. Chaotic Dynamics. For intermediate coupling strength, an intermediate cycle is observed, and more complex dynamics occur in a small range (0.58 < W < 0.78
Fig. 4. Si temporal evolution, (S1, S2) phase plane, and (Si, φi) cylinder space. Top: up-state oscillation for strong coupling. Middle: multiple-frequency oscillation for intermediate coupling. Bottom: down-state oscillation for weak coupling.
with our parameters) before full synchronization, characterized by φ1 − φ2 = 0. The trajectory can have many intersection points with P, and S* in Fig. 3 shows multiple roads to chaos through period doubling.
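As a baseline for the behaviors above, the μ = 0 flip-flop of Sec. 4.1 is easy to reproduce: with σ = 0, two symmetrically coupled units form a classical bistable recurrent network that a transient I+ pulse switches into the up-state and a transient I− pulse switches back down. A minimal sketch (the pulse timings, W = 1.0, and the Euler step are our own choices):

```python
import math

def R(x):
    """Spike density, as in Eq. (1)."""
    return 0.5 * (math.tanh(10.0 * (x - 0.5)) + 1.0)

def simulate(pulses, W=1.0, T=100.0, dt=1e-3):
    """Two units with sigma = 0 (mu = 0): dS_i/dt = -S_i + W R(S_j) + I_i.
    pulses is a list of (t_on, t_off, I1, I2) transient inputs."""
    S1 = S2 = 0.0
    for n in range(int(T / dt)):
        t = n * dt
        I1 = sum(a for (t0, t1, a, _) in pulses if t0 <= t < t1)
        I2 = sum(b for (t0, t1, _, b) in pulses if t0 <= t < t1)
        dS1 = -S1 + W * R(S2) + I1
        dS2 = -S2 + W * R(S1) + I2
        S1, S2 = S1 + dS1 * dt, S2 + dS2 * dt
    return S1, S2

# a positive pulse to unit 1 switches the pair into the up-state...
up = simulate([(10.0, 15.0, 1.0, 0.0)])
# ...and a later negative pulse to both units brings it back to rest
rest = simulate([(10.0, 15.0, 1.0, 0.0), (50.0, 55.0, -1.0, -1.0)])
```

Note how stimulating unit 1 alone completes the pattern: unit 2 is pulled into the up-state through the coupling, which is the storage/completion property referred to in Sec. 4.1.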
5 Application to Slow Selection of a Memorized Pattern
5.1 A Small Network
The network is a set N of five units, consisting of a subset N1 of three units A, B, and C and a subset N2 of two units D and E. Across the whole set N, units have symmetrical all-to-all weak connections (WN = 0.01), and within each subset units have symmetrical all-to-all strong connections (WNi = 0.1 · M), with M a global parameter slowly varying in time between 1 and 10. These subsets could represent two objects stored in the weight matrix.

5.2 Memory Retrieval and Response Selection
We consider a transient structured input into the network. For constant M, partial or complete stimulation of a subset Ni can elicit retrieval and completion of the subset in an up-state, as a classical auto-associative memory network would.
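The connectivity just described can be written down explicitly. A sketch of the 5×5 weight matrix follows; the zero diagonal (no self-connection) is our assumption, everything else uses the values given in the text:

```python
def build_weights(M):
    """Weights for N = {A, B, C, D, E}: weak all-to-all W_N = 0.01 across
    the network, strong within-subset W_Ni = 0.1 * M for N1 = {A, B, C}
    and N2 = {D, E}."""
    units = ["A", "B", "C", "D", "E"]
    subsets = [{"A", "B", "C"}, {"D", "E"}]
    W = [[0.0] * 5 for _ in range(5)]
    for i, u in enumerate(units):
        for j, v in enumerate(units):
            if i == j:
                continue  # no self-connection (assumed)
            same = any(u in sub and v in sub for sub in subsets)
            W[i][j] = 0.1 * M if same else 0.01
    return W

W = build_weights(M=5)   # within-subset weights 0.5, across-subset 0.01
```

Because M multiplies only the within-subset weights, slowly increasing it deepens the two stored patterns relative to the weak background coupling, which is what drives the slow selection studied below.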
Fig. 5. Slow activation of a robust synchronous up-state in N1 during slow increase of M
In the Milnor attractor condition, more complex retrieval can be achieved when M is slowly increased. As an illustration, we consider transient stimulation of units A and B from N1 and unit E from N2 (Fig. 5). N2 units show anti-phase
oscillations with increasing frequency. N1 units first show synchronous down-state oscillations with long stays near the Milnor attractor, and gradually go toward sustained up-state oscillations. In this example, the selection of N1 in the up-state is very slow, and synchrony between units plays an important role.
6 Conclusion
We demonstrated that, in cylinder space, a Milnor attractor appears at a critical condition through forward and reverse saddle-node bifurcations. Near the critical condition, the pair of saddle and node constructs a pseudo-attractor, which can serve for the observation of Milnor attractor-like properties in computer experiments. The semi-stability of the Milnor attractor in this model seems to be associated with the variety of oscillations and with chaotic dynamics through period-doubling routes. We demonstrated that an oscillations network provides a variety of working memory encodings in dynamical states in the presence of a Milnor attractor, and we compared applications of oscillatory dynamics with classical auto-associative memory models. The importance of Milnor attractors was proposed in the analysis of coupled map lattices in high dimension [11] and for chaotic itinerancy in the brain [13]. The functional significance of flip-flop oscillations networks with the above dynamical complexity is of interest for further analysis of integrative brain dynamics.
References
1. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The brainweb: phase synchronization and large-scale integration. Nature Reviews Neuroscience (2001)
2. Buzsaki, G., Draguhn, A.: Neuronal oscillations in cortical networks. Science (2004)
3. Onton, J., Delorme, A., Makeig, S.: Frontal midline EEG dynamics during working memory. NeuroImage (2005)
4. Mizuhara, H., Yamaguchi, Y.: Human cortical circuits for central executive function emerge by theta phase synchronization. NeuroImage (2004)
5. Rainer, G., Lee, H., Simpson, G.V., Logothetis, N.K.: Working-memory related theta (4-7 Hz) frequency oscillations observed in monkey extrastriate visual cortex. Neurocomputing (2004)
6. Tsujimoto, T., Shimazu, H., Isomura, Y., Sasaki, K.: Prefrontal theta oscillations associated with hand movements triggered by warning and imperative stimuli in the monkey. Neuroscience Letters (2003)
7. Goldman-Rakic, P.S.: Cellular basis of working memory. Neuron (1995)
8. McCormick, D.A.: Neuronal networks: flip-flops in the brain. Current Biology (2005)
9. Durstewitz, D., Seamans, J.K., Sejnowski, T.J.: Neurocomputational models of working memory. Nature Neuroscience (2000)
10. Yamaguchi, Y.: A theory of hippocampal memory based on theta phase precession. Biological Cybernetics (2003)
11. Kaneko, K.: Dominance of Milnor attractors in globally coupled dynamical systems with more than 7±2 degrees of freedom (retitled from 'Magic number 7±2 in globally coupled dynamical systems'). Physical Review Letters (2002)
12. Fujii, H., Tsuda, I.: Interneurons: their cognitive roles - a perspective from dynamical systems view. Development and Learning (2005)
13. Tsuda, I.: Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioural and Brain Sciences (2001)
Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors

Hiroshi Fujii1,2, Kazuyuki Aihara2,3, and Ichiro Tsuda4,5

1 Department of Information and Communication Sciences, Kyoto Sangyo University, Kyoto 603-8555, Japan
[email protected]
2 Institute of Industrial Science, the University of Tokyo, Tokyo 153-8505, Japan
[email protected]
3 ERATO, Japan Science and Technology Agency, Tokyo 151-0065, Japan
4 Research Institute for Electronic Science, Hokkaido University, Sapporo 060-0812, Japan
[email protected]
5 Center of Excellence COE in Mathematics, Department of Mathematics, Hokkaido University, Sapporo 060-0810, Japan
Abstract. A new hypothesis on a possible role of corticopetal acetylcholine (ACh) is provided from a dynamical systems standpoint. Corticopetal ACh helps to transiently organize global (inter- and intra-cortical) quasi-attractors via gamma-range synchrony when this is behaviorally needed, as in top-down attention and expectation.
1 Introduction

1.1 Corticopetal Acetylcholine

Acetylcholine (ACh) was the first substance identified as a neurotransmitter, by Otto Loewi [19]. Although it is increasingly recognized that ACh plays a critical role not only in arousal and sleep but also in higher cognitive functions such as attention and conscious flow, the question of how ACh works in those cognitive processes remains a mystery [11]. The corticopetal ACh, originating in the nucleus basalis of Meynert (NBM), a part of the basal forebrain (BF), is the primary source of cortical ACh, and the major target of BF projections is the cortex [21]. Behavioral studies, as well as studies using immunotoxins, provide consistent evidence of the role of ACh in top-down attention. A blockage of NBM ACh, whether disease-related or drug-induced, causes a severe loss of attentional functions: selective attention, sustained attention, and divided attention, together with shifts of attention. ACh also concerns conscious flow (Perry & Perry [24]). Continual death of cholinergic neurons in NBM causes Lewy body dementia (LBD), one of the most salient symptoms of which is complex visual hallucination (CVH) [1].1
1 Perry and Perry [24] noted those hallucinatory LBD patients who see: "integrated images of people or animals which appear real at the time", "insects on walls", or "bicycles outside the fourth storey window". Images are generally vivid and colored, and continue for a few minutes (neither seconds nor hours). It is to be noted that "many of those experiences are enhanced with eyes closed and relieved by visual input", and "nicotinic antagonists, such as mecamylamine, are not reported to induce hallucinations".
1.2 Attentions, Cortical State Transitions, and the Cholinergic Control System from NBM

Top-down flows of signals accompanying attention, expectation, and so on may cause state transitions in the "downstream" cortices. Fries et al. [7] reported an increase of synchrony in the high-gamma range in accordance with selective attention (see also Jones [15] and Buschman et al. [2]). Metherate et al. [22] stimulated NBM in in vivo preparations of auditory cortex. Of particular interest in their observations is that NBM ACh produced a change in subthreshold membrane potential fluctuations from large-amplitude, slow (1-5 Hz) oscillations to low-amplitude, fast (20-40 Hz, i.e., gamma) oscillations. A shift of the spike discharge pattern from phasic to tonic was also observed.2 They pointed out that, in view of the widespread projections of NBM neurons, larger intercortical networks could also be modified. Together with the data of Fries et al., this suggests that NBM cholinergic projections may induce a state transition in neocortical neurons, as a shift of frequency and a change of discharge pattern. This may be consistent with the observation made by Kay et al. [16]. During perceptual processing in the olfactory-limbic axis, a cascade of brain events at the successive stages of tasks such as "expectation" and/or "attention" was observed. 'Local' transitions of the olfactory structures, indicated by modulations of EEG signals such as gamma amplitude, periodicity, and coherence, were reported to exist. Kay et al. also observed that the local dynamics transiently falls into attractor-like states. Such 'local' transitions of states are generally postulated to be triggered by 'top-down' glutamatergic spike volleys from "upper stream" organizations. However, such brain events of state transitions with a change in synchrony could be a result of the collaboration of descending glutamatergic spike volleys and ascending ACh afferents from NBM (see also [26]).3
2 Neural Correlate of Conscious Percepts and the Role of the Corticopetal ACh

2.1 Neural Correlates of Conscious Percepts and Transient Synchrony

The corticopetal ACh pathway might be the critical control system triggering various kinds of attention, receiving convergent inputs from other sensory and association areas, as Sarter et al. [25], [26] argued. In order to discuss the role of the
2 NBM contains cholinergic and non-cholinergic neurons; GABAergic neurons are at least twice as numerous as cholinergic neurons [15]. The Metherate et al. observations (above) may be the result of the collective functioning of both the cholinergic and GABAergic projections. Wenk [31] argued another possibility: that the NBM ACh projections on the reticular thalamic nucleus might cause the cortical state change.
3 Triangular attentional pathway: the above arguments may better be complemented by the triangular amplification circuitry, a pathway consisting of parietal cortex → prefrontal cortex → NBM → sensory cortex [10]. This may constitute the cholinergic control system from NBM, i.e., the top-down attentional pathway for cholinergic modulations of responses in cortical sensory areas.
corticopetal ACh related to attention, we first address the question: what is the neural correlate of conscious percepts? Recent experiments by Kenet et al. [17] show the possibility that in a background (or spontaneous) state where no external stimuli exist, the visual cortex fluctuates between specific internal states. This might mean that cortical circuitry has a number of pre-existing and intrinsic internal states which represent features, and that the cortex fluctuates between such multiple intrinsic states even when no external inputs exist (see also Treisman et al. [27]).

Attention and Dynamical Binding Through Synchrony. In order that perception of an object makes sense, its features, or fragmentary subassemblies, must be bound into a unity. How is the "binding" of those local fragmentary dynamics into a global dynamics accomplished? A widely accepted view is that top-down signals such as expectation and attention may play the role of such an integration, mediated by (possibly gamma) synchronization among the concerned assemblies representing stimuli or events. We postulate that such a process is a basis of global intra- and inter-cortical conjunctions of brain activities (see Varela et al. [30], Womelsdorf et al. [32]; see also Dehaene and Changeux [5]). The neural correlate of conscious percepts is a globally integrated state of related brain networks mediated by synchrony over gamma and other frequency bands. In mathematical terms, such transient processes of global synchrony between selected groups of neurons may be described as a transitory state of approaching a global attractor. We note that such a "transitory" state may be conceptualized as an attractor ruin (Tsuda [28], Fujii et al. [8]), which has orbits that approach it and stay nearby for a while, but may simultaneously possess orbits repelling from it. The proper Milnor attractor and its perturbed structures can be a specific representation of attractor ruins [8].
However, since the concept of attractor ruins may include a wider class of non-classical attractors than the Milnor attractor, we use the term "attractor ruins" in this paper to include possible but unknown classes of ruins.

2.2 Role of the Corticopetal ACh: A Working Hypothesis

How can top-down attention, expectation, etc. contribute to conscious perception with the aid of ACh? Granting the arguments in the preceding section, this question can be translated into: "How do the corticopetal ACh projections work toward the emergence of globally organized attractor ruins via transient synchrony?" We summarize our tentative proposition as a working hypothesis in the following.

Working Hypothesis: The role of the corticopetal ACh, accompanying top-down contextual signals such as attention, is to mediate the dynamic organization of quasi-attractors, which are required in conscious perception or in executing actions. ACh "integrates" multiple "floating subnetworks" into "a transient synchrony group" in the gamma frequency range. Such a transiently emerging synchrony group can be regarded as an attractor ruin in the dynamical systems-theoretic sense.
3 Do the Existing Experimental Data Support the Working Hypothesis?

3.1 Introductory Remarks: Transient Synchronization by Virtue of Pre- and Post-synaptic ACh Modulations

ACh may have both pre-synaptic and post-synaptic effects on individual neurons in the cortex. First, top-down glutamatergic spike volleys flow into cortical layers, which might convey contextual information on stimuli. The corticopetal ACh arrives concomitantly with the glutamatergic volleys. If ACh release modulates synaptic connectivity between cortical neurons, even in an effective sense, by virtue of "pre-synaptic modulations", a metamorphosis of the attractor landscape4 should be inevitable. Post-synaptic influences of the corticopetal ACh on individual neurons, either inhibitory or excitatory, might deeply affect their firing behavior and might induce a state transition with a collective gamma oscillation. The three effects together might trigger a specific group of networks to oscillate at gamma frequency with phase synchrony. These are all speculative stories based on experimental evidence; we need at least to examine the present status of the experimental data concerning the cortical ACh influence on individual neurons. This may also be the place to comment on the specificity of the corticopetal ACh projections on the cortex, which may be a point of argument: it is reported that the cholinergic afferents form specialized synaptic connections with post-synaptic targets, rather than releasing ACh non-specifically (Turrini et al. [29]).
ACh have two groups of receptors, one is the muscarinic receptors, mAChRs with 5 subtypes, and the other is nicotinic receptors, nAChRs with 17 subtypes. The nAChR is a relatively simple cationic (Na+ and Ca2+) channel, the opening of which leads to a rapid depolarization followed by desensitization. Most of mAChRs activation exhibits the slower onset and longer lasting G-protein coupled second messenger generation. Here the primary interest is in mAChRs. 5 The functions of mAChRs are reported to be two-fold: one is pre-synaptic, and the other is postsynaptic modulations. 4
4 "Attractor landscape" is usually used for potential systems. Here, we use it to mean the landscape of "basins" (absorbing regions) of classical and non-classical attractors such as attractor ruins.
5 The nicotinic receptors, nAChRs, may work as a disinhibition system for layer 2/3 pyramidal neurons [4], the exact function of which is not yet known.
Post-synaptic Modulations. The results of traditional studies may be divided into two opposing sets of data. The majority view is that mAChRs function as an excitatory transmitter system for post-synaptic neurons (see, e.g., McCormick [20]), while a minority of data claim inhibitory functioning. The latter, however, has been considered to be a consequence of ACh excitation of interneurons, which in turn may inhibit post-synaptic pyramidal neurons (PYR). Recently, Gulledge et al. [11], [12] stated that transient mAChR activation generates strong and direct transient inhibition of neocortical PYR. The underlying ionic process is the induction of calcium release from internal stores and the subsequent activation of small-conductance (SK-type) calcium-activated potassium channels. The authors claim that the traditional data do not describe the actions of transient mAChR activation, as is likely to happen during synaptic release of ACh, for the following reasons.
1. In vivo ACh concentration: previous studies typically used high concentrations of muscarinic agonists (1–100 mM). Extracellular concentrations of ACh in the cortex are at least one order of magnitude lower than those required to depolarize PYR in vitro.
2. Phasic (transient) application vs. bath application: most data depended on experiments with bath applications, which may correspond to prolonged, tonic mAChR stimulation. The ACh release accompanying attention, etc., would better correspond to a transient puff application, as in the authors' experiment.
The specificity of ACh afferents on post-synaptic targets was already noted [29].

Pre-synaptic Modulations. Experimental work on pre-synaptic modulations is mostly based on ACh bath applications, and the modulation data were measured in terms of local field potentials (LFP). Most results claimed pathway specificity of the modulations. Typically, it was concluded that muscarinic modulation can strongly suppress intracortical (IC) synaptic activity while exerting less suppression on, or actually enhancing, thalamocortical (TC) inputs [14]. Gil et al. [9] reported that 5 μM muscarine decreases transmission in both the IC and TC pathways, and that these were pre-synaptic effects, since membrane potential and input resistance were unchanged. Recently, Kuczewski et al. [18] studied the same problem and obtained results different from the previous ones: low ACh (less than 100 μM) produces facilitation, and as the ACh concentration increases, the result is depression. This holds for both the IC and TC pathways, i.e., for layer 2/3 and layer 4.

3.3 Possible Scenarios

The lack of consistent experimental data makes our job complicated. The situation might be compared to playing a jigsaw puzzle with many pieces missing and some pieces mingled in from other jigsaw pictures. What we can do at this moment is to propose possible alternative scenarios for the role of the corticopetal ACh.
The following is the list of prerequisites and evidence on which our arguments should be based.

1. The two modulations may occur simultaneously inside the six layers of the cortex. The firing characteristics of individual neurons and the strength of synaptic connections may change dynamically through either post-synaptic or pre-synaptic modulations. Virtually no models, to our knowledge, have been proposed that take the net effects of the two modulations into account.
2. The interaction of ACh with top-down glutamatergic spike volleys should be considered. The majority of neurons alter their responses on combined, concomitant exposure to both acetylcholine and glutamate (Perry & Perry [24]).
3. As a post-synaptic influence, ACh release may change the firing regime of neurons and induce gamma oscillation [6], [23], [31].
As to pre-synaptic modulations, the details of the synaptic processes appear to be largely unknown. The significance of further experimental studies cannot be overemphasized. Now let us try to draw a sketch of possible scenarios for the role of corticopetal ACh. Here we may put three cornerstones for the models:

1. Who triggers the gamma oscillation?
2. Who modulates the effective connectivity, and how?
3. What is the mechanism of phase synchrony, and what is its role?
Scenario I. The basic idea is that in the default, low-ACh state, globally organized attractors virtually do not exist, and activity may take the form of floating fragmentary dynamics. ACh release may then help to strengthen the synaptic connections pre-synaptically. One of the roles of the post-synaptic modulation is to start up the gamma oscillation. (Here the influence of GABAergic projections from NBM might play a role.) Another, important role will be stated later.

Scenario II. The effective modulation of synaptic connectivity might be carried, rather than by the pre-synaptic modulation, by the phase synchrony of the gamma oscillation itself, which is triggered by the post-synaptic modulation. Such a mechanism for the change of synaptic connectivity, and the resulting binding of fragmentary groups of neurons, was proposed by Womelsdorf et al. [32]. They claimed that the mutual influence among neuronal groups depends on the phase relation between rhythmic activities within the groups; phase relations supporting interactions among the groups preceded those interactions by a few milliseconds, consistent with a mechanistic role. See also Buzsaki [3]. In Scenario II, the role of the post-synaptic modulation is to start up the gamma oscillation and to reset its phase, as Gulledge and Stuart [11] suggested. The transient hyper-polarization may play the role of a referee starting up the oscillation among the related groups in unison. Scenario I likewise makes the post-synaptic modulation carry the two roles of starting up the gamma oscillation and
resetting its phase, while to the pre-synaptic modulation a bigger role is assigned: the realization of attractors by virtue of synaptic-strength modulation.
4 Concluding Discussions

The critical role of corticopetal ACh in cognitive functions, together with its relation to disease-related symptoms such as complex visual hallucinations in DLB and its apparent involvement in neocortical state changes, has motivated us to study the functional role(s) of corticopetal ACh from a dynamical-systems standpoint. Cognitive functions are phenomena carried by brain dynamics, and we hope that understanding cognitive dynamics in the language of dynamical systems will open new theoretical horizons. It is of some help to consider the conceptual difference between the two “forces” that flow into the six layers of the neocortex. Glutamatergic spike volleys could be viewed, as an event in a dynamical system, as an external force, which may kick the orbit to another orbit, and sometimes out of the “basin” of the present attractor beyond its border – the separatrix. In contrast, ACh projections – though transient – could be regarded as a slow parameter working as a bifurcation parameter that modifies the landscape itself. What we are looking at in the preceding arguments is that the two phenomena happen concomitantly in the six layers of the cortex. Hasselmo and McGaughy [13] summarized the ACh role in memory as “high acetylcholine sets circuit dynamics for attention and encoding; low acetylcholine sets dynamics for consolidation”, based on experimental data on selective pre-synaptic depression and facilitation. However, in view of the potential role of attention in local bindings or global integrations, we may pose an alternative (but not necessarily exclusive) scenario in which ACh functions by temporarily modifying the quasi-attractor landscape, in collaboration with glutamatergic spike volleys. Rather, we speculate that the process of memorization itself would be realized through such a dynamic formation of attractor ruins, for which mAChRs may play a role.
Acknowledgements. The first author (HF) was supported by a Grant-in-Aid for Scientific Research (C), No. 19500259, from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. The second author (KA) was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas, No. 17022012, from the same Ministry. The third author (IT) was partially supported by Grants-in-Aid for Scientific Research on Priority Areas, No. 18019002 and No. 18047001, a Grant-in-Aid for Scientific Research (B), No. 18340021, a Grant-in-Aid for Exploratory Research, No. 17650056, a Grant-in-Aid for Scientific Research (C), No. 16500188, and the 21st Century COE Program “Mathematics of Nonlinear Structures via Singularities.”
References

1. Behrendt, R.-P., Young, C.: Hallucinations in schizophrenia, sensory impairment, and brain disease: A unifying model. Behav. Brain Sci. 27, 771–787 (2004)
2. Buschman, T.J., Miller, E.K.: Top-down Versus Bottom-up Control of Attention in the Prefrontal and Posterior Parietal Cortices. Science 315, 1860–1862 (2007)
3. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, Oxford (2006)
4. Christophe, E., Roebuck, A., Staiger, J.F., Lavery, D.J., Charpak, S., Audinat, E.: Two Types of Nicotinic Receptors Mediate an Excitation of Neocortical Layer I Interneurons. J. Neurophysiol. 88, 1318–1327 (2002)
5. Dehaene, S., Changeux, J.-P.: Ongoing Spontaneous Activity Controls Access to Consciousness: A Neuronal Model for Inattentional Blindness. PLoS Biology 3, 910–927 (2005)
6. Detari, L.: Tonic and phasic influence of basal forebrain unit activity on the cortical EEG. Behav. Brain Res. 115, 159–170 (2000)
7. Fries, P., Reynolds, J.H., Rorie, A.E., Desimone, R.: Modulation of Oscillatory Neuronal Synchronization by Selective Visual Attention. Science 291, 1560–1563 (2001)
8. Fujii, H., Aihara, K., Tsuda, I.: Functional Relevance of ‘Excitatory’ GABA Actions in Cortical Interneurons: A Dynamical Systems Approach. J. Integrative Neurosci. 3, 183–205 (2004)
9. Gil, Z., Connors, B.W., Amitai, Y.: Differential Regulation of Neocortical Synapses by Neuromodulators and Activity. Neuron 19, 679–686 (1997)
10. Golmayo, L., Nunez, A., Zaborsky, L.: Electrophysiological Evidence for the Existence of a Posterior Cortical-Prefrontal-Basal Forebrain Circuitry in Modulating Sensory Responses in Visual and Somatosensory Rat Cortical Areas. Neuroscience 119, 597–609 (2003)
11. Gulledge, A.T., Stuart, G.J.: Cholinergic Inhibition of Neocortical Pyramidal Neurons. J. Neurosci. 25, 10308–10320 (2005)
12. Gulledge, A.T., Park, S.B., Kawaguchi, Y., Stuart, G.J.: Heterogeneity of phasic cholinergic signaling in neocortical neurons. J. Neurophysiol. 97, 2215–2229 (2007)
13. Hasselmo, M.E., McGaughy, J.: High acetylcholine sets circuit dynamics for attention and encoding; Low acetylcholine sets dynamics for consolidation. Prog. Brain Res. 145, 207–231 (2004)
14. Hsieh, C.Y., Cruikshank, S.J., Metherate, R.: Differential modulation of auditory thalamocortical and intracortical synaptic transmission by cholinergic agonist. Brain Res. 880, 51–64 (2000)
15. Jones, B.E., Muhlethaler, M.: Cholinergic and GABAergic neurons of the basal forebrain: role in cortical activation. In: Lydic, R., Baghdoyan, H.A. (eds.) Handbook of Behavioral State Control, pp. 213–233. CRC Press, London (1999)
16. Kay, L.M., Lancaster, L.R., Freeman, W.J.: Reafference and attractors in the olfactory system during odor recognition. Int. J. Neural Systems 4, 489–495 (1996)
17. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., Arieli, A.: Nerve cell activity when eyes are shut reveals internal views of the world. Nature 425, 954–956 (2003)
18. Kuczewski, N., Aztiria, E., Gautam, D., Wess, J., Domenici, L.: Acetylcholine modulates cortical synaptic transmission via different muscarinic receptors, as studied with receptor knockout mice. J. Physiol. 566.3, 907–919 (2005)
19. Loewi, O.: Über humorale Übertragbarkeit der Herznervenwirkung. Pflügers Archiv für die Gesamte Physiologie 189, 239–242 (1921)
20. McCormick, D.A., Prince, D.A.: Mechanisms of action of acetylcholine in the guinea-pig cerebral cortex in vitro. J. Physiol. 375, 169–194 (1986)
21. Mesulam, M.M., Mufson, E.J., Levey, A.I., Wainer, B.H.: Cholinergic innervation of cortex by the basal forebrain: cytochemistry and cortical connections of the septal area, diagonal band nuclei, nucleus basalis (substantia innominata), and hypothalamus in the rhesus monkey. J. Comp. Neurol. 214, 170–197 (1983)
22. Metherate, R., Cox, C.L., Ashe, J.H.: Cellular Bases of Neocortical Activation: Modulation of Neural Oscillations by the Nucleus Basalis and Endogenous Acetylcholine. J. Neurosci. 12, 4701–4711 (1992)
23. Niebur, E., Hsiao, S.S., Johnson, K.O.: Synchrony: a neuronal mechanism for attentional selection? Curr. Opin. Neurobiol. 12, 190–194 (2002)
24. Perry, E.K., Perry, R.H.: Acetylcholine and Hallucinations: Disease-Related Compared to Drug-Induced Alterations in Human Consciousness. Brain Cognit. 28, 240–258 (1995)
25. Sarter, M., Gehring, W.J., Kozak, R.: More attention should be paid: The neurobiology of attentional effort. Brain Res. Rev. 51, 145–160 (2006)
26. Sarter, M., Parikh, V.: Choline Transporters, Cholinergic Transmission and Cognition. Nature Reviews Neurosci. 6, 48–56 (2005)
27. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognit. Psychol. 12, 97–136 (1980)
28. Tsuda, I.: Chaotic Itinerancy as a Dynamical Basis of Hermeneutics of Brain and Mind. World Futures 32, 167–185 (1991)
29. Turrini, P., Casu, M.A., Wong, T.P., De Koninck, Y., Ribeiro-da-Silva, A., Cuello, A.C.: Cholinergic nerve terminals establish classical synapses in the rat cerebral cortex: synaptic pattern and age-related atrophy. Neuroscience 105, 277–285 (2001)
30. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase synchronization and large-scale integration. Nature Rev. Neurosci. 2, 229–239 (2001)
31. Wenk, G.L.: The Nucleus Basalis Magnocellularis Cholinergic System: One Hundred Years of Progress. Neurobiol. Learn. Mem. 67, 85–95 (1997)
32. Womelsdorf, T., Schoffelen, J.M., Oostenveld, R., Singer, W., Desimone, R., Engel, A.K., Fries, P.: Modulation of neuronal interactions through neuronal synchronization. Science 316, 1578–1579 (2007)
Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model

Yongtao Li and Shigetoshi Nara

Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan
[email protected]
Abstract. Chaotic dynamics introduced in a recurrent neural network model is applied to controlling a tracker that tracks a moving target in two-dimensional space, a task set as an ill-posed problem. The motion increments of the tracker are determined by a group of motion functions calculated in real time from the firing states of the neurons in the network. Several groups of cyclic memory attractors corresponding to simple motions of the tracker in two-dimensional space are embedded. Chaotic dynamics enables the tracker to perform various motions, and adaptive real-time switching of a control parameter causes chaotic itinerancy that enables the tracker to track a moving target successfully. The performance of tracking is evaluated by the success rate over 100 trials. Simulation results show that chaotic dynamics is useful for tracking a moving target. To understand this further, the structure of the chaotic dynamics is investigated from a dynamical-systems viewpoint.

Keywords: Chaotic dynamics, tracking, moving target, neural network.
1 Introduction

Biological systems have become a worldwide research focus because of their excellent functions, not only in information processing but also in well-regulated functioning and control, which work quite adaptively in various environments. However, despite many efforts, our understanding of the mechanisms of biological systems, including brains, remains poor, because the enormous complexity originating from their dynamics is very difficult to understand and describe using conventional methodologies based on reductionism, that is, decomposing a system into parts or elements. Conventional reductionism more or less falls into two difficulties: one is “combinatorial explosion” and the other is “divergence of algorithmic complexity”. These difficulties are not yet solved. On the other hand, a dynamical viewpoint on the mechanisms seems a plausible approach. In particular, chaotic dynamics experimentally observed in biological systems including brains [1,2] has suggested that chaotic dynamics would play important roles in the complex functioning and control of such systems. From this viewpoint, many dynamical models have been constructed to approach these mechanisms by means of large-scale simulation or heuristic methods. Artificial neural networks in which chaotic dynamics can be introduced have been attracting great interest, and the relation between chaos and function has been discussed

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 179–188, 2008. © Springer-Verlag Berlin Heidelberg 2008
Y. Li and S. Nara
[9,10,11,12]. Among those works, Nara and Davis found that chaotic dynamics can occur in a recurrent neural network model (RNNM) consisting of binary neurons [3], and they investigated the functional aspects of chaos by applying it to a memory search task with an ill-posed context [7]. To show the potential of chaos in control, chaotic dynamics was also applied to solving two-dimensional mazes, set as ill-posed problems [8]. Two important points were proposed there. One is a simple coding method translating neural states into motion increments; the other is a simple control algorithm that switches a system parameter adaptively to produce constrained chaos. The conclusions show that constrained chaotic behaviour can give better performance in solving a two-dimensional maze than random walk. In this paper, we develop this idea and apply chaotic dynamics to tracking a moving target, which is set as another ill-posed problem. Let us state the model of tracking a moving target. A tracker is assumed to move in two-dimensional space with discrete time steps and to track a target moving along a certain trajectory by employing chaotic dynamics. The state pattern of the network is transformed into the tracker's motion by the coding of motion functions, which will be given in a later section. In addition, several limit-cycle attractors, regarded as prototypical simple motions, are embedded in the network. By the coding of motion functions, each cycle corresponds to a monotonic motion in two-dimensional space: if the state pattern converges into a prototypical attractor, the tracker moves in a monotonic direction. Introducing chaotic dynamics into the network generates non-periodic state patterns, which are transformed into chaotic motion of the tracker by the motion functions.
Adaptive switching of a system parameter, based on a simple evaluation, between chaotic dynamics and attractor dynamics in the network results in complex motions of the tracker in various environments. On this basis, a simple control algorithm is proposed for tracking a moving target. In actual simulation, the present method using chaotic dynamics gives novel performance. To understand the mechanism of this performance, the structure of the chaotic dynamics is investigated from statistical data.
2 Memory Attractors and Motion Functions

Our study works with a fully interconnected recurrent neural network consisting of N binary neurons. Its updating rule is defined by

$S_i(t+1) = \mathrm{sgn}\Big( \sum_{j \in G_i(r)} W_{ij} S_j(t) \Big)$,  (1)

where $\mathrm{sgn}(u) = +1$ for $u \ge 0$ and $-1$ for $u < 0$, and

• $S_i(t) = \pm 1$ ($i = 1 \sim N$): the firing state of the neuron specified by index i at time t;
• $W_{ij}$: the connection weight from neuron $S_j$ to neuron $S_i$ ($W_{ii}$ is taken to be 0);
• r: the fan-in number of neuron $S_i$, named the connectivity ($0 < r < N$);
• $G_i(r)$: a spatial configuration set of connectivity r.
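As an illustration, the update rule (1) can be sketched directly. This is a minimal sketch: the toy network size, the random weights, and the random fan-in sets are our own assumptions (the paper's weights come from Eq. (2) below).

```python
import numpy as np

def update(S, W, conn_mask):
    """One synchronous update of the binary network, Eq. (1).

    S         : (N,) vector of +1/-1 firing states
    W         : (N, N) connection weights (W[i, i] = 0)
    conn_mask : (N, N) boolean; conn_mask[i, j] is True when neuron j
                belongs to G_i(r), the fan-in set of neuron i
    """
    u = (W * conn_mask) @ S          # local fields restricted to G_i(r)
    return np.where(u >= 0, 1, -1)   # sgn(u), with sgn(0) = +1

# toy example with N = 8 and r = 5 (the paper uses N = 400)
rng = np.random.default_rng(0)
N, r = 8, 5
W = rng.normal(size=(N, N))
np.fill_diagonal(W, 0.0)
conn_mask = np.zeros((N, N), dtype=bool)
for i in range(N):                   # a random fan-in set G_i(r) per neuron
    others = [j for j in range(N) if j != i]
    conn_mask[i, rng.choice(others, size=r, replace=False)] = True

S = rng.choice([-1, 1], size=N)
S_next = update(S, W, conn_mask)     # next state pattern, entries in {-1, +1}
```

Decreasing r by masking fan-ins, rather than rescaling W, matches the role the connectivity plays later as the switching parameter.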
Tracking a Moving Target Using Chaotic Dynamics
181
At a certain time t, the state of the neurons in the network can be represented as an N-dimensional state vector S(t), called the state pattern. The time development of the state pattern S(t) depends on the connection weight matrix $\{W_{ij}\}$ and the connectivity r. In our study, the $W_{ij}$ are determined in the case of full connectivity $r = N - 1$ by a kind of orthogonalized learning method [7] and taken as

$W_{ij} = \sum_{\mu=1}^{L} \sum_{\lambda=1}^{K} (\xi_\mu^{\lambda+1})_i \cdot (\xi_\mu^{\lambda\dagger})_j$,  (2)

where $\{\xi_\mu^\lambda \mid \lambda = 1 \ldots K,\ \mu = 1 \ldots L\}$ is the attractor pattern set (with $\xi_\mu^{K+1} \equiv \xi_\mu^1$, since each cycle is closed), K is the number of memory patterns included in a cycle, and L is the number of memory cycles. $\xi_\mu^{\lambda\dagger}$ is the conjugate vector of $\xi_\mu^\lambda$, which satisfies $\xi_{\mu'}^{\lambda'} \cdot \xi_\mu^{\lambda\dagger} = \delta_{\mu\mu'} \cdot \delta_{\lambda\lambda'}$, where $\delta$ is Kronecker's delta. This method was confirmed to be effective in avoiding spurious attractors [3,4,5,6,7,8]. Biological data show that neurons in the brain cause various motions of the muscles in the body with quite large redundancy; accordingly, the network of N neurons is used to realize two-dimensional motion control of a tracker. We confirmed that chaotic dynamics introduced in the network does not depend sensitively on the number of neurons [7]. In our actual computer simulations, N = 400. Suppose that the tracker moves from the position $(p_x(t), p_y(t))$ to $(p_x(t+1), p_y(t+1))$ with a set of motion increments $(f_x(t), f_y(t))$. The state pattern S(t) at time t is a 400-dimensional vector, and we transform it into two-dimensional motion increments by the coding of motion functions $(f_x(S(t)), f_y(S(t)))$. In two-dimensional space, the actual motion of the tracker is given by

$\begin{pmatrix} p_x(t+1) \\ p_y(t+1) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \begin{pmatrix} f_x(S(t)) \\ f_y(S(t)) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \frac{4}{N} \begin{pmatrix} A \cdot C \\ B \cdot D \end{pmatrix}$,  (3)

where A, B, C, D are four independent N/4-dimensional sub-space vectors of the state pattern S(t). Since the inner product of two sub-space vectors is normalized by 4/N, the motion functions range from -1 to +1. In our actual simulations, two-dimensional space is digitized with a resolution of 0.02, due to the binary neuron states ±1 and N = 400. Now let us consider the construction of memory attractors corresponding to prototypical simple motions. We take 24 attractor patterns consisting of (L = 4 cycles) × (K = 6 patterns per cycle). Each cycle corresponds to one prototypical simple motion: the tracker moves toward (+1, +1), (-1, +1), (-1, -1), or (+1, -1) in two-dimensional space.
Each attractor pattern consists of four random sub-space vectors A, B, C, and D, where C = A or -A and D = B or -B; thus only A and B are independent random patterns. By the law of large numbers, the memory patterns are almost orthogonal to each other; furthermore, in determining $\{W_{ij}\}$, the orthogonalized learning method was employed, so the memory patterns are orthogonalized to each other. The correspondence between memory attractors and prototypical simple motions is as follows:

$(f_x(\xi_1^\lambda), f_y(\xi_1^\lambda)) = (+1, +1)$,  $(f_x(\xi_2^\lambda), f_y(\xi_2^\lambda)) = (-1, +1)$,
$(f_x(\xi_3^\lambda), f_y(\xi_3^\lambda)) = (-1, -1)$,  $(f_x(\xi_4^\lambda), f_y(\xi_4^\lambda)) = (+1, -1)$.
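The pattern construction and the motion-function coding of Eq. (3) can be sketched as follows. The random seed and the ordering of the sign pairs over the cycles are illustrative assumptions; the sign structure C = ±A, D = ±B is from the text.

```python
import numpy as np

N = 400                      # network size used in the paper
q = N // 4                   # length of the sub-vectors A, B, C, D
rng = np.random.default_rng(1)

def motion(S):
    """Motion increments (f_x, f_y) of Eq. (3) for a state pattern S."""
    A, B, C, D = S[:q], S[q:2*q], S[2*q:3*q], S[3*q:]
    return (4 / N) * (A @ C), (4 / N) * (B @ D)

# one attractor pattern per (cycle mu, step lambda); the sign pair
# (sx, sy) fixes the cycle's motion direction as in the table above
directions = [(+1, +1), (-1, +1), (-1, -1), (+1, -1)]
L, K = 4, 6                  # 4 cycles, 6 patterns per cycle
patterns = np.empty((L, K, N))
for mu, (sx, sy) in enumerate(directions):
    for lam in range(K):
        A = rng.choice([-1, 1], size=q)
        B = rng.choice([-1, 1], size=q)
        patterns[mu, lam] = np.concatenate([A, B, sx * A, sy * B])

fx, fy = motion(patterns[2, 0])
# equals the cycle's direction (-1, -1) up to float rounding,
# since A . (sx A) = sx * q and (4/N) * q = 1
```

Because A · (±A) = ±N/4 exactly, every pattern of cycle μ maps onto the full-step motion of that cycle, as required by the correspondence table.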
3 Introducing Chaotic Dynamics in the RNNM

Now let us state the effects of the connectivity r. In the case of full connectivity, $r = N - 1$, the network can function as a conventional associative memory. If the state pattern S(t) is one of the memory patterns $\xi_\mu^\lambda$, or near one, the output sequence $S(t + kK)$ ($k = 1, 2, 3, \ldots$) will converge to the memory pattern $\xi_\mu^\lambda$. In other words, for each memory pattern there is a set of state patterns, called the memory basin $B_\mu^\lambda$: if S(t) is in $B_\mu^\lambda$, then the output sequence $S(t + kK)$ converges to $\xi_\mu^\lambda$. It is quite difficult to estimate basin volumes accurately because of the enormous amount of calculation over all state patterns in the N-dimensional state space; therefore, a statistical method is applied to estimate the approximate basin volumes. First, a sufficiently large number of state patterns are sampled in the state space. Second, each sample is taken as an initial pattern and updated with full connectivity. Third, statistics are taken of which memory attractor $\lim_{k\to\infty} S(kK)$ each sample converges into. The distribution of these statistics over the whole sample set is regarded as the approximate basin volume of each memory attractor (see Fig. 1). The basin volumes show that almost all initial state patterns converge into one of the memory attractors and that there are seldom spurious attractors.
Fig. 1. Basin volume: The horizontal axis represents the memory pattern number (1–24). Basin 25 corresponds to samples that converged into cyclic outputs with a period of six steps but not any one memory attractor. Basin 26 corresponds to samples excluded from any other case (1–25). The vertical axis represents the ratio between the corresponding samples and the whole samples.
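The three-step statistical estimate can be sketched generically as below. For brevity, the demonstration network is a toy Hebbian fixed-point memory rather than the paper's orthogonalized cyclic memories, and the sample count and step budget are assumptions.

```python
import numpy as np

def estimate_basins(step, attractors, n_samples, n_steps, N, rng):
    """Monte-Carlo basin-volume estimate: draw random initial states,
    iterate the network, and record which stored pattern (if any) the
    orbit settles on.  Returns one ratio per attractor plus a final
    'other' bin for orbits matching none of them."""
    counts = np.zeros(len(attractors) + 1)
    for _ in range(n_samples):
        S = rng.choice([-1, 1], size=N)
        for _ in range(n_steps):
            S = step(S)
        for k, xi in enumerate(attractors):
            if np.array_equal(S, xi) or np.array_equal(S, -xi):
                counts[k] += 1
                break
        else:
            counts[-1] += 1          # spurious / unconverged orbit
    return counts / n_samples

# toy demonstration: two random patterns stored by a Hebbian rule
rng = np.random.default_rng(2)
N = 32
xi = rng.choice([-1, 1], size=(2, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)
step = lambda S: np.where(W @ S >= 0, 1, -1)
ratios = estimate_basins(step, list(xi), n_samples=200, n_steps=30, N=N, rng=rng)
# ratios sums to 1; the last entry collects orbits outside all basins
```

For the paper's cyclic memories, the attractor test would compare `S(t + K)` against `S(t)` over one period instead of testing for a fixed point.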
Next, we continue to decrease the connectivity r. When r is large enough, close to N, the memory attractors are stable. As r becomes smaller and smaller, more and more state patterns gradually fail to converge into a memory pattern even though the network is updated for a long time; that is, the attractors become unstable. Finally, when r becomes quite small, the state pattern becomes a non-periodic output; that is, non-periodic dynamics occurs in the state space. In our previous papers, we confirmed that this non-periodic dynamics in the network is chaotic wandering. In order to investigate its dynamical structure, we calculated basin visiting measures, which suggest that the trajectory can pass through the
whole N-dimensional state space; that is, the cyclic memory attractors become attractor ruins due to the quite small connectivity [3,4,5,6,7].
4 Motion Control and Tracking Algorithm

When the connectivity r is sufficiently large, a random initial pattern converges into one of the four limit-cycle attractors as time evolves. By the coding transformation of the motion functions, the corresponding motion of the tracker in two-dimensional space becomes monotonic (see Fig. 2). On the other hand, when the connectivity r is quite small, chaotic dynamics occurs in the state space, and correspondingly the tracker moves chaotically (see Fig. 3). If the updating of the state pattern in the chaotic regime is replaced by a random 400-bit pattern generator, the tracker shows a random walk (see Fig. 4). Obviously, chaotic motion is different from random walk and has a certain dynamical structure.
Fig. 2. Monotonic motion (r = 399; 500 steps)

Fig. 3. Chaotic walk (r = 30; 500 steps)

Fig. 4. Random walk (500 steps)
Therefore, as the network evolves, monotonic motion and chaotic motion can be switched by switching the connectivity r. Based on this idea, we propose a simple algorithm to track a moving target, shown in Fig. 5. First, the tracker is assumed to be tracking a target moving along a certain trajectory in two-dimensional space, and the tracker can obtain rough directional information $D_1(t)$ about the moving target, called the global target direction. At a certain time t, the present position of the tracker is the point $(p_x(t), p_y(t))$. Taking this point as the origin, two-dimensional space is divided into four quadrants; if the target is moving in the n-th quadrant, $D_1(t) = n$ ($n = 1, 2, 3, 4$). Next, we suppose that the tracker also knows another piece of directional information, $D_2(t) = m$ ($m = 1, 2, 3, 4$), called the global motion direction: it means that the tracker has moved toward the m-th quadrant from time t-1 to t, that is, in the previous step. The global target direction $D_1(t)$ and the global motion direction $D_2(t)$ are time-dependent variables. If this information is fed back to the network in real time, the connectivity also becomes a time-dependent variable r(t), determined by $D_1(t)$ and $D_2(t)$. In Fig. 5, $R_L$ is a sufficiently large connectivity and $R_S$ is a quite small connectivity that leads to chaotic dynamics in the neural network. Adaptive switching of the connectivity is the core idea of the algorithm. When the connectivity r(t) has been determined by comparing the two directions $D_1(t-1)$ and $D_2(t-1)$, the motion increments of the tracker are calculated from the state pattern of the network updated with r(t). The new motion
Fig. 5. Control algorithm of tracking a moving target: By judging whether global target direction D1 (t) coincides with global motion direction D2 (t) or not, adaptive switching of connectivity r between RS and RL results in chaotic dynamics or attractor’s dynamics in state space. Correspondingly, the tracker is adaptively tracking a moving target in two-dimensional space.
causes the next $D_1(t)$ and $D_2(t)$ and produces the next connectivity r(t+1). By repeating this process, the connectivity r(t) switches adaptively between $R_L$ and $R_S$, and the tracker alternately performs monotonic motion and chaotic motion in two-dimensional space.
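The control loop of Fig. 5 can be sketched as follows. The concrete values of R_L and R_S, the tie-breaking on quadrant boundaries, the initial heading, and the placeholder `update`/`motion` callables are assumptions for illustration.

```python
import numpy as np

RL, RS = 399, 30            # large / small connectivity (illustrative values)

def quadrant(dx, dy):
    """Quadrant number 1..4 of a displacement vector (boundaries assumed)."""
    if dx >= 0 and dy >= 0: return 1
    if dx <  0 and dy >= 0: return 2
    if dx <  0 and dy <  0: return 3
    return 4

def track(update, motion, S, target_path, steps):
    """Adaptive-switching control loop of Fig. 5.

    update(S, r)   -> next state pattern under connectivity r
    motion(S)      -> motion increments (f_x, f_y)
    target_path(t) -> target position (x, y) at step t
    """
    pos = np.zeros(2)
    prev_move = np.array([1.0, 1.0])          # arbitrary initial heading
    trace = [pos.copy()]
    for t in range(steps):
        tx, ty = target_path(t)
        D1 = quadrant(tx - pos[0], ty - pos[1])   # global target direction
        D2 = quadrant(*prev_move)                 # global motion direction
        r = RL if D1 == D2 else RS                # attractor dynamics / chaos
        S = update(S, r)
        move = np.array(motion(S))
        pos = pos + move
        prev_move = move
        trace.append(pos.copy())
    return np.array(trace)

# trivial demo with a stand-in "network" whose motion is constant
trace = track(lambda S, r: S, lambda S: (0.1, 0.1), None,
              lambda t: (5.0, 5.0), 10)
```

When the previous motion already heads into the target's quadrant, the large connectivity keeps the monotonic attractor motion; otherwise the small connectivity injects chaotic search, exactly the switching rule described above.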
5 Simulation Results

In order to confirm that this control algorithm is useful for tracking a moving target, the target's motion must first be specified. We have taken nine kinds of trajectories along which the target moves, shown in Fig. 6: one circular trajectory and eight linear trajectories. Suppose that the initial position of the tracker is the origin (0, 0) of two-dimensional space and that the distance L between the initial positions of the tracker and the target is a constant. Thus, at the beginning of tracking, the tracker is at the center of the circular trajectory, and each of the eight linear trajectories is tangential to the circular trajectory at a certain angle α, defined with respect to the x axis. The tangential angles are α = nπ/4 (n = 1, 2, …, 8), so we number the eight linear trajectories LT_n and the circular trajectory LT_0.
Object
Target
0 -5 -10
Capture
-15 -20 -20
Fig. 6. Trajectories of moving target: Arrow represents the moving direction of the target. Solid point means the position at time t=0.
-15
-10
-5
0
5
10
15
20
Fig. 7. An example of tracking a target that is moving along a circular trajectory with the simple algorithm. the tracker captured the moving target at the intersection point.
Next, let us consider the velocity of the target. In the computer simulation, the tracker moves one step per discrete time step; at the same time, the target also moves one step of a certain length $S_L$, which represents the velocity of the target. The motion increments of the tracker range from -1 to 1, so the step length $S_L$ is taken from 0.01 to 1 at intervals of 0.01, giving 100 different velocities. Because velocity is a relative quantity, $S_L = 0.01$ is a slow and $S_L = 1$ a fast target velocity relative to the tracker. Now let us look at a simulation of tracking a moving target using the algorithm proposed above, shown in Fig. 7. When a target moves along a circular trajectory at a certain velocity, the tracker captures the target at a certain point of the circular trajectory, which counts as a successful capture on a circular trajectory.
6 Performance Evaluation
To show the performance of tracking a moving target, we have evaluated the success rate of tracking a target moving along each of the nine trajectories, over 100 initial state patterns. A trial in which the tracker approaches the target within a certain tolerance during 600 steps is regarded as successful, and the fraction of successful trials is called the success rate. However, even for the same target trajectory, the performance depends not only on the connectivity r but also on the target velocity, i.e., the step length $S_L$. Therefore, the success rate is evaluated for each pair of parameters: a connectivity r ($1 \le r \le 60$) and a target velocity $S_L$ ($0.01 \le S_L \le 1.0$); with 100 velocities at intervals of 0.01 and 60 connectivities, there are 60 × 100 parameter pairs. We have evaluated the success rate of tracking a moving target along the different trajectories; two examples are shown in Fig. 8(a) and (b). Comparing them, we see that tracking a target on the circular trajectory gives better performance than on the linear trajectory, although for some linear trajectories quite excellent performance was also observed. Moreover, the success rate depends strongly on the connectivity r and the target velocity $S_L$ even for the same target trajectory. In order to observe the performance more clearly, we have taken the data
Fig. 8. Success rate of tracking a moving target along (a) a circular trajectory and (b) a linear trajectory. The positive orientation obeys the right-hand rule. The vertical axis represents the success rate, and the two axes in the horizontal plane represent the connectivity r and the target velocity SL, respectively.
Y. Li and S. Nara

(a) r = 16: downward tendency    (b) r = 51: upward tendency
Fig. 9. Success rates drawn from Fig.8(a): we take the data for a certain connectivity and show them in a two-dimensional diagram. The horizontal axis represents the target velocity from 0.01 to 1.0, and the vertical axis represents the success rate.
of certain connectivities from Fig.8(a) and plotted them in two-dimensional coordinates, as shown in Fig.9. Comparing these figures, we can see a novel behavior: as the target velocity becomes faster, the success rate shows an upward tendency for some connectivities, such as r = 51. In other words, when the chaotic dynamics is not too strong, it seems useful for tracking a faster target.
7 Discussion

In order to show the relation between the above cases and chaotic dynamics, we have investigated, from a dynamical viewpoint, the dynamical structure of the chaotic dynamics. For each connectivity from 1 to 60, the network performs chaotic wandering for a long time from a random initial state pattern. During this history, we have taken statistics of the continuously staying time in a certain basin [8] and evaluated the distribution p(l, μ), which is defined by

p(l, μ) = #{ l | S(t) ∈ β_μ in τ ≤ t ≤ τ + l and S(τ − 1) ∉ β_μ and S(τ + l + 1) ∉ β_μ, μ ∈ [1, L] }   (4)

β_μ = ∪_{λ=1}^{K} B_{λμ}   (5)

T = Σ_l l · p(l, μ)   (6)
where l is the length of the continuously staying time, in steps, in each attractor basin, and p(l, μ) represents the distribution of stays of l consecutive steps in attractor basin μ within T steps. In our actual simulation, T = 10^5. For the different connectivities r = 15 and r = 50, the distributions p(l, μ) are shown in Fig.10(a) and Fig.10(b). In these figures, different basins are marked with different symbols. From the results, we can see that the continuously staying time l becomes longer and longer with the increase of the connectivity r. Referring to the novel behaviors discussed in the previous section, let us consider the reason.
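The staying-time statistics of eqs.(4)-(6) amount to run-length counting on the sequence of basin labels visited during chaotic wandering. A minimal sketch, assuming the basin label μ(t) has already been determined for each step:

```python
from collections import Counter

def staying_time_distribution(basin_seq):
    """Sketch of eqs.(4)-(6): given a sequence of basin labels mu(t)
    visited during chaotic wandering, count, for each basin mu, how
    often the orbit stays in mu for exactly l consecutive steps."""
    p = Counter()                 # p[(l, mu)] = number of runs of length l in basin mu
    run_basin, run_len = basin_seq[0], 1
    for mu in basin_seq[1:]:
        if mu == run_basin:
            run_len += 1
        else:
            p[(run_len, run_basin)] += 1
            run_basin, run_len = mu, 1
    p[(run_len, run_basin)] += 1  # close the final run
    return p

# Summing l * p(l, mu) over all runs and basins recovers the sequence length,
# in the spirit of eq.(6).
```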
Tracking a Moving Target Using Chaotic Dynamics

(a) r = 15: shorter    (b) r = 50: longer
Fig. 10. Log plot of the frequency distribution of the continuously staying time l: the horizontal axis represents the continuously staying time l, in steps, in a certain basin μ during long-time chaotic wandering, and the vertical axis represents the accumulated number p(l, μ) of stays of the same length l in a certain basin μ. The continuously staying time l becomes long with the increase of the connectivity r.
First, in the case of slower target velocity, a decreasing success rate with the increase of connectivity r is observed for both circular and linear target trajectories. This shows that chaotic dynamics that stays localized in a certain basin for too long is not good for tracking a slower target. Second, in the case of faster target velocity, it seems useful for tracking when the chaotic dynamics is not too strong. Computer simulations show that, when the target moves quickly, the action of the tracker is always chaotic so as to track the target. From past experiments, we know that the motion increments of chaotic motion are very short. Therefore, shorter motion increments combined with faster target velocity result in bad tracking performance. However, when the continuously staying time l in a certain basin becomes longer, the tracker can move in a certain direction for l steps. This is useful for the tracker to track the faster target. Therefore, when the connectivity becomes somewhat large (r = 50 or so), the success rate rises with the increase of target velocity, as in the case shown in Fig.9. As an issue for future study, the functional aspect of chaotic dynamics still has context dependence.
8 Summary We proposed a simple method for tracking a moving target using chaotic dynamics in a recurrent neural network model. Although chaotic dynamics cannot always solve all complex problems with better performance, better results were often observed when using chaotic dynamics to solve certain ill-posed problems, such as tracking a moving target and solving mazes [8]. From the results of the computer simulation, we can state the following points. • A simple method for tracking a moving target was proposed. • Chaotic dynamics is quite efficient for tracking a target that is moving along a circular trajectory.
• The performance of tracking a moving target along a linear trajectory is not as good as that along a circular trajectory; however, for some linear trajectories, excellent performance was observed. • The length of the continuously staying time becomes long with the increase of the synaptic connectivity r that drives the chaotic dynamics in the network. • A longer continuously staying time in a certain basin seems useful for tracking a faster target.
References 1. Babloyantz, A., Destexhe, A.: Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. USA. 83, 3513–3517 (1986) 2. Skarda, C.A., Freeman, W.J.: How brains make chaos in order to make sense of the world. Behav. Brain. Sci. 10, 161–195 (1987) 3. Nara, S., Davis, P.: Chaotic wandering and search in a cycle memory neural network. Prog. Theor. Phys. 88, 845–855 (1992) 4. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Memory search using complex dynamics in a recurrent neural network model. Neural Networks 6, 963–973 (1993) 5. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Chaotic memory dynamics in a recurrent neural network with cycle memories embedded by pseudo-inverse method. Int. J. Bifurcation and Chaos Appl. Sci. Eng. 5, 1205–1212 (1995) 6. Nara, S., Davis, P.: Learning feature constraints in a chaotic neural memory. Phys. Rev. E 55, 826–830 (1997) 7. Nara, S.: Can potentially useful dynamics to solve complex problems emerge from constrained chaos and/or chaotic itinerancy? Chaos. 13(3), 1110–1121 (2003) 8. Suemitsu, Y., Nara, S.: A solution for two-dimensional mazes with use of chaotic dynamics in a recurrent neural network model. Neural Comput. 16(9), 1943–1957 (2004) 9. Tsuda, I.: Chaotic itinerancy as a dynamical basis of Hermeneutics in brain and mind. World Futures 32, 167–184 (1991) 10. Tsuda, I.: Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behav Brain Sci. 24(5), 793–847 (2001) 11. Kaneko, K., Tsuda, I.: Chaotic Itinerancy. Chaos 13(3), 926–936 (2003) 12. Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Phys. Lett. A 114, 333–340 (1990)
A Generalised Entropy Based Associative Model Masahiro Nakagawa Nagaoka University of Technology, Kamitomioka 1603-1, Nagaoka, Niigata 940-2188, Japan
[email protected]
Abstract. In this paper, a generalised entropy based associative memory model is proposed and applied to memory retrieval with analogue embedded vectors instead of binary ones, in order to compare it with the conventional autoassociative model with a quadratic Lyapunov functional. In the present approach, the updating dynamics is constructed on the basis of an entropy minimization strategy, which may be reduced asymptotically to the autocorrelation dynamics as a special case. From numerical results, it is found that the proposed approach realizes a larger memory capacity, even for analogue memory retrieval, in comparison with the autocorrelation model based on dynamics such as the associatron, owing to the higher-order correlation involved in the proposed dynamics. Keywords: Entropy, Associative Memory, Analogue Memory Retrieval.
1 Introduction During the past quarter century, numerous autoassociative models have been extensively investigated on the basis of autocorrelation dynamics. Since the proposals of the retrieval models by Anderson [1], Kohonen [2], and Nakano [3], works related to such autoassociation models of neurons inter-connected through an autocorrelation matrix were theoretically analyzed by Amari [4], Amit et al. [5] and Gardner [6]. So far it has been well appreciated that the storage capacity of the autocorrelation model, i.e. the number of stored pattern vectors L that can be completely associated relative to the number of neurons N, which is called the relative storage capacity or loading rate and denoted α = L/N, is estimated as α ≈ 0.14 at most for the autocorrelation learning model with the signum activation function (sgn(x) for short) [7,8]. In contrast to the abovementioned models with monotonous activation functions, neuro-dynamics with a nonmonotonous mapping was more recently proposed by Morita [9], Yanai and Amari [10], and Shiino and Fukai [11]. They reported that a nonmonotonous mapping in the neurodynamics possesses a remarkable advantage in storage capacity, α ≈ 0.27, superior to the conventional association models with monotonous mappings, e.g. the signum or sigmoidal function. In the present paper, we shall propose a novel approach based on an entropy defined in terms of the overlaps, which are the inner products between the
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 189–198, 2008. © Springer-Verlag Berlin Heidelberg 2008
state vector and the analogue embedded vectors instead of the previously investigated binary ones [1-16,25].
2 Theory Let us consider an associative model with the embedded analogue vectors e_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L), where N and L are the number of neurons and the number of embedded vectors, respectively. The states of the neural network are characterized in terms of the output vector s_i (1 ≤ i ≤ N) and the internal states u_i (1 ≤ i ≤ N), which are related to each other by

s_i = f(u_i)  (1 ≤ i ≤ N),   (1)
where f(•) is the activation function of the neuron. Then we introduce the following entropy I, which is related to the overlaps as

I = −(1/2) Σ_{r=1}^{L} (m^{(r)})² log (m^{(r)})²,   (2)

where the overlaps m^{(r)} (r = 1, 2, ..., L) are defined by

m^{(r)} = Σ_{i=1}^{N} e_i^{†(r)} s_i;   (3)
here the covariant vectors e_i^{†(r)} are defined in terms of the following orthogonality relation,

Σ_{i=1}^{N} e_i^{†(r)} e_i^{(s)} = δ_{rs}  (1 ≤ r, s ≤ L),   (4a)

e_i^{†(r)} = Σ_{r'=1}^{L} a_{rr'} e_i^{(r')},   (4b)

a_{rr'} = (c^{−1})_{rr'},  c_{rr'} = Σ_{i=1}^{N} e_i^{(r)} e_i^{(r')}.   (4c)
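The covariant (dual) vectors of eqs.(4a)-(4c) can be computed directly from the inverse of the L × L correlation matrix. A minimal numerical sketch (the symbol c for the correlation matrix follows the reconstruction above; sizes are illustrative):

```python
import numpy as np

def covariant_vectors(e):
    """Sketch of eqs.(4a)-(4c): given embedded vectors e of shape (L, N),
    build dual vectors e_dag satisfying sum_i e_dag[r, i] * e[s, i] = delta_rs,
    via the inverse of the L x L correlation matrix
    c[r, r'] = sum_i e[r, i] * e[r', i]."""
    c = e @ e.T               # correlation matrix, shape (L, L)
    a = np.linalg.inv(c)      # a = c^{-1}, eq.(4c)
    return a @ e              # e_dag[r] = sum_{r'} a[r, r'] e[r'], eq.(4b)

rng = np.random.default_rng(0)
e = rng.uniform(-1.0, 1.0, size=(5, 40))   # L = 5 analogue vectors, N = 40
e_dag = covariant_vectors(e)
# e_dag @ e.T is (numerically) the 5 x 5 identity, i.e. eq.(4a)
```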
The entropy defined by eq.(2) is minimized under the following conditions:

m^{(r)} = δ_{rs}  (1 ≤ r, s ≤ L),   (5a)

and

Σ_{r=1}^{L} (m^{(r)})² = 1.   (5b)
That is, regarding (m^{(r)})² (1 ≤ r ≤ L) as a probability distribution in eq.(2), a target pattern may be retrieved by minimizing the entropy I with respect to m^{(r)}, or the state vector s_i, so that eqs.(5a) and (5b) are satisfied. Therefore the entropy function may be considered as the functional to be minimized during the retrieval process of the auto-association model, instead of the conventional quadratic energy functional E, i.e.
E = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} w_{ij} s†_i s_j,   (6a)

where s†_i is the covariant vector defined by

s†_i = Σ_{r=1}^{L} Σ_{j=1}^{N} e_i^{†(r)} e_j^{†(r)} s_j,   (6b)
and the connection matrix w_{ij} is defined in terms of

w_{ij} = Σ_{r=1}^{L} e_i^{(r)} e_j^{†(r)}.   (6c)
According to the steepest descent approach in the discrete time model, the updating rule of the internal states u_i (1 ≤ i ≤ N) may be defined by

u_i(t+1) = −η ∂I/∂s†_i  (1 ≤ i ≤ N),   (7)
where η (> 0) is a coefficient. Substituting eqs.(2) and (3) into eq.(7) and noting the following relation, with the aid of eq.(6b),

m^{(r)} = Σ_{i=1}^{N} e_i^{†(r)} s_i = Σ_{i=1}^{N} e_i^{(r)} s†_i,   (8)
one may readily derive the following relation:

u_i(t+1) = −η ∂I/∂s†_i = (η/2) ∂/∂s†_i [ Σ_{r=1}^{L} (m^{(r)})² log (m^{(r)})² ]
         = η Σ_{r=1}^{L} e_i^{(r)} ( Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ) [ 1 + log ( Σ_{j=1}^{N} e_j^{†(r)} s_j(t) )² ]
         = η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log (m^{(r)}(t))² ].   (9)
Generalizing the above dynamics somewhat in order to combine the quadratic approach (α → 0) and the present entropy one (α → 1), we propose the following dynamic rule, in a somewhat ad-hoc manner, for the internal states:

u_i(t+1) = η Σ_{r=1}^{L} e_i^{(r)} ( Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ) [ 1 + log ( 1 − α + α ( Σ_{j=1}^{N} e_j^{†(r)} s_j(t) )² ) ]
         = η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log ( 1 − α + α (m^{(r)}(t))² ) ].   (10)
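The generalised rule of eq.(10) is a single matrix expression once the overlaps are computed. A minimal numerical sketch (array names and sizes are illustrative; the dual vectors are built as in eqs.(4b)-(4c)):

```python
import numpy as np

def entropy_update(s, e, e_dag, alpha, eta=1.0):
    """Sketch of the generalised update rule, eq.(10): `e` holds the
    embedded vectors (L, N) and `e_dag` the covariant vectors (L, N).
    For alpha -> 0 the rule reduces to the autocorrelation dynamics,
    eq.(11); for alpha = 1 it is the entropy based dynamics, eq.(9)."""
    m = e_dag @ s                                          # overlaps, eq.(3)
    return eta * (e.T @ (m * (1.0 + np.log(1.0 - alpha + alpha * m**2))))

rng = np.random.default_rng(1)
L, N = 3, 20
e = rng.uniform(-1.0, 1.0, size=(L, N))
e_dag = np.linalg.inv(e @ e.T) @ e      # duals, eqs.(4b)-(4c)
s = rng.uniform(-1.0, 1.0, size=N)
```

At alpha = 0 the logarithm vanishes and the update equals η Σ_j w_ij s_j with w = Σ_r e^{(r)} e^{†(r)}, which is exactly eq.(11).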
In practice, in the limit α → 0, the above dynamics reduces to the autocorrelation dynamics:

u_i(t+1) = lim_{α→0} η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log ( 1 − α + α (m^{(r)}(t))² ) ]
         = η Σ_{r=1}^{L} e_i^{(r)} Σ_{j=1}^{N} e_j^{†(r)} s_j(t) = η Σ_{j=1}^{N} w_{ij} s_j(t).   (11)
On the other hand, eq.(10) reduces to eq.(9) in the case α → 1. Therefore one may control the dynamics between the autocorrelation approach (α → 0) and the entropy based approach (α → 1).
3 Numerical Results The embedded vectors are set to analogue random vectors as follows:

e_i^{(r)} = z_i^{(r)}  (1 ≤ i ≤ N, 1 ≤ r ≤ L),   (12)
where z_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L) are zero-mean pseudo-random numbers between −1 and +1. For simplicity, the activation function f in eq.(1) is assumed to be a piecewise linear function, instead of the signum form previously used for binary embedded vectors [25], and is set to

s_i = f(u_i) = [ (1 + sgn(1 − |u_i|)) / 2 ] u_i + sgn(u_i) [ (1 − sgn(1 − |u_i|)) / 2 ],   (13)

where sgn(•) denotes the signum function defined by

sgn(x) = −1 (x < 0),  0 (x = 0),  +1 (x > 0).   (14)
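The piecewise linear activation of eqs.(13)-(14) simply passes the internal state through unchanged for |u| ≤ 1 and saturates to sgn(u) outside that interval. A direct transcription as a sketch:

```python
import numpy as np

def f_piecewise(u):
    """Sketch of eqs.(13)-(14): identity for |u| <= 1, clipped to sgn(u)
    outside, so analogue states stay inside [-1, +1]."""
    return ((1 + np.sign(1 - np.abs(u))) / 2) * u \
         + np.sign(u) * ((1 - np.sign(1 - np.abs(u))) / 2)
```

The expression is equivalent to clipping u to the interval [−1, 1].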
The initial vector s_i(0) (1 ≤ i ≤ N) is set to

s_i(0) = −e_i^{(s)}  (1 ≤ i ≤ H_d),
s_i(0) = +e_i^{(s)}  (H_d + 1 ≤ i ≤ N),   (15)

where e_i^{(s)} is the target pattern to be retrieved and H_d is the Hamming distance between the initial vector s_i(0) and the target vector e_i^{(s)}. The retrieval is successful if

m^{(s)}(t) = Σ_{i=1}^{N} e_i^{†(s)} s_i(t)   (16)

results in ±1 for t ≫ 1, in which case the system may be in a steady state such that

s_i(t+1) = s_i(t),   (17a)
u_i(t+1) = u_i(t).   (17b)
To see the retrieval ability of the present model, the success rate Sr is defined as the rate of success over 1000 trials with different embedded vector sets e_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L). To control the dynamics from the autocorrelation dynamics just after the initial state (t ~ 1) to the entropy based dynamics (t ~ Tmax), the parameter α in eq.(10) was simply controlled by

α = (t / Tmax) α_max  (0 ≤ t ≤ Tmax),   (18)
where Tmax and α_max are the maximum values of the number of updating iterations according to eq.(10) and of α, respectively. Choosing N = 200, η = 1, Tmax = 25, L/N = 0.5 and α_max = 1, we first present an example of the dynamics of the overlaps in Figs.1(a) and (b) (entropy based approach). Therein the cross symbols (×) and the open circles (o) represent the success of retrieval, in which eqs.(5a) and (5b) are satisfied, and the entropy defined by eq.(2), respectively, for a retrieval process. In addition, the time dependence of the parameter α/α_max defined by eq.(18) is depicted as dots (·). In Fig. 1 it is confirmed that, after a transient state, the complete association corresponding to eqs.(5a) and (5b) can be achieved. Next, the dependence of the success rate Sr on the loading rate α = L/N is depicted in Figs.2(a) and (b) for H_d/N = 0.3 and N = 100, for the entropy approach and the associatron, respectively. From these results, one may confirm the larger memory capacity of the presently proposed model defined by eq.(10) in
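The linear control schedule of eq.(18) is a one-line function; the defaults below follow the parameter choice quoted in the text (Tmax = 25, α_max = 1):

```python
def alpha_schedule(t, t_max=25, alpha_max=1.0):
    """Sketch of eq.(18): alpha is ramped linearly from 0 at t = 0
    (pure autocorrelation dynamics) to alpha_max at t = t_max
    (pure entropy based dynamics)."""
    return (t / t_max) * alpha_max
```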
(a) H_d / N = 0.1
(b) H_d / N = 0.3

Fig. 1. The time dependence of the overlaps m^{(r)} of the present entropy based model defined by eq.(10)
(a) Entropy based model defined by eq.(10) (MemCap = 0.9999, H_d/N = 0.3)
(b) Conventional associatron model defined by eq.(11) (MemCap = 0.0134, H_d/N = 0.3)

Fig. 2. The dependence of the success rate on the loading rate α = L/N for the present entropy based model defined by eq.(10) and the conventional associatron defined by eq.(11). Here the Hamming distance is set to H_d/N = 0.3.
comparison with the conventional autoassociation model defined by eq.(11). In practice, it is found that the present approach may achieve a high memory capacity, beyond the conventional autocorrelation strategy, even for analogue embedded vectors as well as for the previously considered binary case [15,16,25].
4 Concluding Remarks In the present paper, we have proposed an entropy based association model in place of the conventional autocorrelation dynamics. From numerical results, it was found that a large memory capacity may be achieved on the basis of the entropy approach. This advantage in the association properties of the present model is considered to result from the fact that the present dynamics for updating the internal state, eq.(10), ensures that the entropy, eq.(2), is minimized under the conditions, eqs.(5a) and (5b), which correspond to the successful retrieval of a target pattern. In other words, the higher-order correlations in the proposed dynamics, eq.(10), which were ignored in the conventional approaches [1-11], were found to play an important role in improving the memory capacity, or the retrieval ability. To conclude this work, we show in Fig.3 the dependence of the storage capacity, which is defined as the area covered by the success rate curves such as those shown in Fig.2, on the Hamming distance, for the analogue embedded vectors (Ana) as well as the previous binary ones (Bin). In addition, OL and CL denote the orthogonal learning model and the autocorrelation learning model, respectively. Therein one may again see the great advantage of the present model, based on the entropy functional to be minimized, beyond the conventional quadratic form [12,13], even for the analogue embedded vectors. In fact one may realize a considerably larger storage capacity in the present model in comparison with the associatron over H_d/N ≤ 0.5. The memory retrievals for the associatron based on the quadratic
Fig. 3. The dependence of the storage capacity on the Hamming distance. Here symbols a, m and n are for the entropy based approach with eq. (10) as well as the orthogonal learning (OL) and the autocorrelation learning (CL) [16,17], in which Ana and Bin imply the analogue embedded vectors and the binary ones, respectively. In addition we presented the associatron in symbols s with the orthogonal learning [13], and the associatron in symbols t with orthogonal learning under the condition wii = 0 [12], respectively.
Lyapunov functionals to be minimized become troublesome near H_d/N = 0.5, as seen in Fig.3, since the directional cosine between the initial vector and a target pattern eventually vanishes there. Remarkably, even in such a case, the present model attains a remarkably large memory capacity because of the higher-order correlations involved in eq.(10), as expected from Figs. 1 and 2, for the analogue vectors as well as the binary ones previously investigated [15,16,25]. As a future problem, it seems worthwhile to introduce chaotic dynamics into the present model by means of a periodic activation function, such as a sinusoidal one, as a nonmonotonic activation function [14]. The entropy based approach [15] with chaos dynamics [14] is now in progress and will be reported elsewhere, together with the synergetic models [17-24], in the near future.
References 1. Anderson, J.A.: A Simple Neural Network Generating Interactive Memory. Mathematical Biosciences 14, 197–220 (1972) 2. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers C-21, 353–359 (1972) 3. Nakano, K.: Associatron - a Model of Associative Memory. IEEE Trans. SMC-2, 381–388 (1972) 4. Amari, S.: Neural Theory of Association and Concept Formation. Biological Cybernetics 26, 175–185 (1977) 5. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Storing Infinite Numbers of Patterns in a Spin-glass Model of Neural Networks. Physical Review Letters 55, 1530–1533 (1985) 6. Gardner, E.: Structure of Metastable States in the Hopfield Model. Journal of Physics A19, L1047–L1052 (1986) 7. Kohonen, T., Ruohonen, M.M.: Representation of Associated Pairs by Matrix Operators. IEEE Transactions C-22, 701–702 (1973) 8. Amari, S., Maginu, K.: Statistical Neurodynamics of Associative Memory. Neural Networks 1, 63–73 (1988) 9. Morita, M.: Associative Memory with Nonmonotone Dynamics. Neural Networks 6, 115–126 (1993) 10. Yanai, H.-F., Amari, S.: Auto-associative Memory with Two-stage Dynamics of Nonmonotonic Neurons. IEEE Transactions on Neural Networks 7, 803–815 (1996) 11. Shiino, M., Fukai, T.: Self-consistent Signal-to-noise Analysis of the Statistical Behaviour of Analogue Neural Networks and Enhancement of the Storage Capacity. Phys. Rev. E48, 867 (1993) 12. Kanter, I., Sompolinsky, H.: Associative Recall of Memory without Errors. Phys. Rev. A 35, 380–392 (1987) 13. Personnaz, L., Guyon, I., Dreyfus, D.: Information Storage and Retrieval in Spin-Glass like Neural Networks. J. Phys. (Paris) Lett. 46, L-359 (1985) 14. Nakagawa, M.: Chaos and Fractals in Engineering, p. 944. World Scientific Inc., Singapore (1999) 15. Nakagawa, M.: Autoassociation Model based on Entropy Functionals. In: Proc. of NOLTA 2006, pp. 627–630 (2006) 16. Nakagawa, M.: Entropy based Associative Model. IEICE Trans. Fundamentals EA-89(4), 895–901 (2006)
17. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System I. Biological Cybernetics 60, 17–22 (1988) 18. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System II. Biological Cybernetics 60, 107–109 (1988) 19. Fuchs, A., Haken, H.: Dynamic Patterns in Complex Systems. In: Kelso, J.A.S., Mandell, A.J., Shlesinger, M.F. (eds.), World Scientific, Singapore (1988) 20. Haken, H.: Synergetic Computers and Cognition. Springer, Heidelberg (1991) 21. Nakagawa, M.: A study of Association Model based on Synergetics. In: Proceedings of International Joint Conference on Neural Networks 1993 NAGOYA, JAPAN, pp. 2367– 2370 (1993) 22. Nakagawa, M.: A Synergetic Neural Network. IEICE Fundamentals E78-A, 412–423 (1995) 23. Nakagawa, M.: A Synergetic Neural Network with Crosscorrelation Dynamics. IEICE Fundamentals E80-A, 881–893 (1997) 24. Nakagawa, M.: A Circularly Connected Synergetic Neural Networks. IEICE Fundamentals E83-A, 881–893 (2000) 25. Nakagawa, M.: Entropy based Associative Model. In: Proceedings of ICONIP 2006, pp. 397–406. Springer, Heidelberg (2006)
The Detection of an Approaching Sound Source Using Pulsed Neural Network Kaname Iwasa1, Takeshi Fujisumi1 , Mauricio Kugler1 , Susumu Kuroyanagi1, Akira Iwata1 , Mikio Danno2 , and Masahiro Miyaji3 1
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, 466-8555, Japan
[email protected] 2 Toyota InfoTechnology Center, Co., Ltd, 6-6-20 Akasaka, Minato-ku, Tokyo, 107-0052, Japan 3 Toyota Motor Corporation, 1 Toyota-cho, Toyota, Aichi, 471-8572, Japan
Abstract. Current automobiles’ safety systems based on video cameras and movement sensors fail when objects are out of the line of sight. This paper proposes a system based on pulsed neural networks able to detect if a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine its relative movement. Moreover, the combined level difference information of all frequency channels permits to identify the type of the sound source. Experimental results show that, for three different vehicles sounds, the relative movement and the sound source type could be successfully identified.
1
Introduction
Driving safety is one of the major concerns of the automotive industry nowadays. Video cameras and movement sensors are used in order to improve the driver’s perception of the environment surrounding the automobile [1][2]. These methods present good performance when detecting objects (e.g., cars, bicycles, and people) which are in line of sight of the sensor, but fail in case of obstruction or dead angles. Moreover, the use of multiple cameras or sensors for handling dead angles increases the size and cost of the safety system. The human being, in contrast, is able to perceive people and vehicles around itself by the information provided by the auditory system [3]. If this ability could be reproduced by artificial devices, complementary safety systems for automobiles would emerge. Cause of diffraction, sound waves can contour objects and be detected even when the source is not in direct line of sight. A possible approach for processing temporal data is the use of Pulsed Neuron (PN) models [4]. This type of neuron deals with input signals on the form of pulse trains, using an internal membrane potential as a reference for generating pulses on its output. PN models can directly deal with temporal data and can be efficiently implemented in hardware, due to its simple structure. Furthermore, M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 199–208, 2008. c Springer-Verlag Berlin Heidelberg 2008
200
K. Iwasa et al.
high processing speeds can be achieved, as PN model based methods are usually highly parallelizable. A sound localization system based on pulsed neural networks has already being proposed in [5] and a sound source identification system, with a corresponding implementation on FPGA, was introduced in [6]. This paper focuses specifically on the relative moving direction of a sound emitting object, and proposes a method to detect if a sound source is approaching or moving away from it using a microphone. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine its relative movement. Moreover, the proposed method also identifies the type of the sound source by the use of PN model based competitive learning pulsed neural network for processing the spectral information.
2
Pulsed Neuron Model
When processing time series data (e.g., sound), it is important to consider the time relation and to have computationally inexpensive calculation procedures to enable real-time processing. For these reasons, a PN model is used in this research.
Input Pulses
A Local Membrane Potential p1(t)
IN 1(t) The Inner Potential of the Neuron IN 2 (t)
w1 w2 p2(t) wk pk(t)
IN k (t)
I(t)
θ
Output Pulses o(t)
w n pn(t)
IN n (t)
Fig. 1. Pulsed neuron model
Figure 1 shows the structure of the PN model. When an input pulse IN_k(t) reaches the k-th synapse, the local membrane potential p_k(t) is increased by the value of the weight w_k. The local membrane potentials decay exponentially with a time constant τ_k across time. The neuron's output o(t) is given by

o(t) = H(I(t) − θ)   (1)

I(t) = Σ_{k=1}^{n} p_k(t)   (2)

p_k(t) = w_k IN_k(t) + p_k(t−1) e^{−t/τ}   (3)
where n is the total number of inputs, I(t) is the inner potential, θ is the threshold and H(·) is the unit step function. The PN model also has a refractory period tndti , during which the neuron is unable to fire, independently of the membrane potential.
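Eqs.(1)-(3) together with the refractory period can be sketched as a small simulation class. The concrete parameter values (time constant, threshold, refractory length) below are illustrative assumptions, not values from the paper:

```python
import math

class PulsedNeuron:
    """Sketch of the PN model, eqs.(1)-(3): local membrane potentials
    decay exponentially, the neuron fires when the summed inner potential
    crosses the threshold, and during the refractory period no output
    pulse is emitted regardless of the membrane potential."""
    def __init__(self, weights, tau=10.0, theta=1.0, refractory=3):
        self.w = list(weights)
        self.p = [0.0] * len(self.w)     # local membrane potentials p_k(t)
        self.tau = tau
        self.theta = theta
        self.refractory = refractory
        self.cooldown = 0                # steps left in refractory period

    def step(self, pulses):
        decay = math.exp(-1.0 / self.tau)
        for k, x in enumerate(pulses):   # eq.(3), one discrete time step
            self.p[k] = self.w[k] * x + self.p[k] * decay
        inner = sum(self.p)              # eq.(2)
        if self.cooldown > 0:
            self.cooldown -= 1
            return 0
        if inner >= self.theta:          # eq.(1), H(I - theta)
            self.cooldown = self.refractory
            return 1
        return 0
```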
3 The Proposed System
The basic structure of the proposed system is shown in Fig.2. The system consists of three main blocks, the frequency-pulse converter, the level difference extractor and the sound source classifier, of which the last two are based on PN models. The relative movement (approaching or moving away) of the sound source is determined from the sound level variation. The system compares the signal level x(t) from a microphone with the level at a previous time, x(t − Δt). If x(t) > x(t − Δt), the sound source is getting closer to the microphone; if x(t) < x(t − Δt), it is moving away. After the level difference has been extracted, the outputs of the level difference extractors contain the spectral pattern of the input sound, which is then used for recognizing the type of the source.

3.1 Filtering and Frequency-Pulse Converter
Initially, the input signal must be pre-processed and converted to a train of pulses. A bank of 4th order band-pass filters decomposes the signal into 13 frequency channels equally spaced on a logarithmic scale from 500 Hz to 2 kHz. Each frequency channel is modified by the non-linear function shown in Eq.(4), and the resulting signal's envelope is extracted by a 400 Hz low-pass filter. Finally,
Fig. 2. The structure of the recognition system
each output signal is independently converted to a pulse train, whose rate is proportional to the amplitude of the signal.

F(t) = x(t)^{1/3}  (x(t) ≥ 0),
F(t) = (1/4) x(t)^{1/3}  (x(t) < 0).   (4)

3.2 Level Difference Extractor
Each pulse train generated by the frequency-pulse converter is inputted into a Level Difference Extractor (LDE) independently. The LDE, shown in Fig. 3, is composed of two parts, the Lateral Superior Olive (LSO) model and the Level Mapping Two (LM2) model [7]. In the LSO and LM2 models, each neuron works as in Eq.(3). The LSO is responsible for the level difference extraction itself, while the LM2 extracts the envelope of the complex firing pattern. Each pulse train corresponding to a frequency channel is inputted into an LSO model. The potential of the i-th LSO neuron of the f-th channel, I^{LSO}_{i,f}(t), is calculated as follows:

I^{LSO}_{i,f}(t) = p^N_{i,f}(t) + p^B_{i,f}(t)   (5)

p^N_{i,f}(t) = w^N_{i,f} x_f(t) + p^N_{i,f}(t−1) e^{−t/τ_LSO}   (6)

p^B_{i,f}(t) = w^B_{i,f} x_f(t−Δt) + p^B_{i,f}(t−1) e^{−t/τ_LSO}   (7)
N where τLSO is the time constant of the LSO neuron and the weights wi,f and B wi,f are defined as: ⎧ ⎧ 0.0 i=0 0.0 i=0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨1.0 ⎨ i>0 1.0 i<0 N B wi,f = wi,f = (8) i i γ γ ⎪ ⎪ −10 −b < i < 0 −10 0
where α and γ are parameters for the adjustment of the weights, K is the index of the last neuron of each side of the LSO (totalizing 2K + 1 neurons, including the central neuron), and b is the index of the last inner neuron of each side of the LSO. The inner neurons have current input weights smaller than delayed input weights. They are used to make the feature of the input level difference clear when the input level difference is small. The larger the signal becomes, the more neurons fire on the LSO model. The LM2 stage then generates a clearer output, extracting the envelope of the firing pattern generated by the LSO. The potentials in the LM2 are calculated as follows:

I^{LM2}_{l,f}(t) = p^D_{l,f}(t) + p^S_{l,f}(t)   (9)

p^D_{l,f}(t) = m_{i,f}(t) + p^D_{l,f}(t−1) e^{−t/τ_LM2}   (10)

p^S_{l,f}(t) = −m_{i,f}(t) + p^S_{l,f}(t−1) e^{−t/τ_LM2}   (11)
where τ_LM2 is the time constant of the LM2 neuron and m_{i,f}(t) is the output of the i-th LSO neuron in the f-th frequency channel.
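The leaky-integrator update of the LSO potentials (Eqs. 5-7) can be sketched as follows; the LM2 potentials of Eqs. (9)-(11) follow the same pattern with a sign-flipped second input. This is a minimal sketch: the time step, weights, delay, and input below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def lso_potential(x_f, w_n, w_b, dt_delay, tau_lso, step=1.0):
    """Compute I^LSO_{i,f}(t) for one LSO neuron over time (Eqs. 5-7).

    x_f      : 1-D array, input pulse train of frequency channel f
    w_n, w_b : current-input and delayed-input weights of this neuron
    dt_delay : delay (in samples) applied to the delayed input
    tau_lso  : time constant; each step the potentials decay by exp(-step/tau)
    """
    decay = np.exp(-step / tau_lso)
    p_n = p_b = 0.0
    out = np.zeros(len(x_f))
    for t in range(len(x_f)):
        x_delayed = x_f[t - dt_delay] if t >= dt_delay else 0.0
        p_n = w_n * x_f[t] + p_n * decay       # Eq. (6): current input branch
        p_b = w_b * x_delayed + p_b * decay    # Eq. (7): delayed input branch
        out[t] = p_n + p_b                     # Eq. (5): total inner potential
    return out
```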
Fig. 3. Level difference extractor
3.3 Sound Source Classifier
The sound source classifier is based on the Competitive Learning Network using Pulsed Neurons (CONP) proposed in [5]. The basic structure of CONP is shown in Fig. 4. The CONP is constructed from PN models. In the learning process of CONP, the neuron with the weights most similar to the input (the winner neuron) should be chosen for learning in order to obtain a topological relation between inputs and outputs. However, when two or more neurons fire, it is difficult to decide which one is the winner, as their outputs are only pulses and not real values. To handle this, CONP has extra external units called control neurons. Based on the output of the Competitive Learning (CL) neurons, the control neurons' outputs increase or decrease the inner potential of all CL neurons, keeping the number of firing neurons equal to one. Controlling the inner potential is equivalent to controlling the threshold. Two types of control neurons are used in this work. The No-Firing Detection (NFD) neuron fires when no CL neuron fires, increasing their inner potential. Complementarily, the Multi-Firing Detection (MFD) neuron fires when two or more CL neurons fire at the same time, decreasing their inner potential [5]. The CL neurons are also controlled by another potential, named the input potential p_in(t), and a gate threshold θ_gate. The input potential is calculated as the sum of the inputs (with unitary weights), representing the rate of the input pulse train. When p_in(t) < θ_gate, the CL neurons are not updated by the control neurons and become unable to fire, as the potential of the input train is too small to be responsible for an output firing. Furthermore, the input potential of each CL neuron is decreased over time by a factor β, allowing the inner potential to follow rapid changes and improving its adjustment.
Fig. 4. Competitive Learning Network using Pulsed Neurons (CONP)
Considering all the described adjustments to the inner potential of the CONP neurons, the output equation (3) of each CL neuron becomes:

o(t) = H( Σ_{k=1}^{n} p_k(t) − θ + p_nfd(t) − p_mfd(t) − β · p_in(t) )        (12)

where p_nfd(t) and p_mfd(t) correspond, respectively, to the potentials generated by the NFD and MFD neurons' outputs, p_in(t) is the input potential, and β (0 ≤ β ≤ 1) is a parameter.
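As a minimal sketch of Eq. (12), with the Heaviside step H implemented explicitly (all values below are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def cl_output(p, theta, p_nfd, p_mfd, p_in, beta):
    """Output of one CL neuron (Eq. 12): fires (1) when the summed inner
    potential, raised by the NFD potential, lowered by the MFD potential and
    by the decayed input potential, reaches the threshold theta."""
    h_arg = np.sum(p) - theta + p_nfd - p_mfd - beta * p_in
    return 1 if h_arg >= 0 else 0  # Heaviside step H(.)
```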
4 Experimental Results
Three different sound sources were used in the experiments: "police car", "ambulance" and "scooter". The first two correspond to the alarm sounds of the vehicles, while the last corresponds to the engine sound of a scooter. All the signals were recorded from a static sound source. The moving sound source signals were generated by computer, with the sound intensity at each instant of time calculated as:

S(t) = 20 S_b log_10( d(t) / d_b )        (13)

where S_b is the sound intensity at the center position, and d_b and d(t) are, respectively, the distance between the sensor and the sound source at the center position and the distance at time t. All signals have a duration of 4.0 s, and the sound source is normal to the sensor at 2.0 s, as shown in Fig. 5.
Fig. 5. Sound source movement in the experiments (the microphone is at distance d_b = 1 m from the source trajectory; the source moves from 0.0 s to 4.0 s with intensity S(t), passing the center position, with intensity S_b, at 2.0 s)

Table 1. Parameters of each module used in the experiments

Input Sound
  Sampling frequency                      48 kHz
  Quantization bits                       16 bits
  Number of frequency channels            13 channels
  Delay time Δt                           0.4 s
Level Difference Extractor
  Number of total LSO neurons 2K + 1      51 units
  Number of inner LSO neurons 2b + 1      11 units
  Number of output neurons L              48 units
  Threshold θ_LSO / θ_LM2                 0.001 / 0.001
  Time constant τ_LSO / τ_LM2             0.1 s / 35.0 μs
  Parameter α/β                           60 / 60
4.1 Level Difference Information Extraction

The level difference information was extracted as described in Section 3.2. The parameters used for signal acquisition, preprocessing and level difference extraction are shown in Table 1. Figure 6 shows the output of the LDE model for the "police car" signal in four distinct intervals of time. The x-axis corresponds to the index of the neurons in the LM2, representing the level difference information, and the y-axis corresponds to the frequency channels. The gray-level intensity represents the rate of the output pulse train. The firing pattern differs significantly between intervals of time, especially when comparing the graphs of opposite relative movements. Although the LM2 could not fully extract the envelope from the firing pattern of the signals corresponding to a sound source moving away from the sensor, the result is clear enough to distinguish it from an approaching sound source signal. Figure 7 shows the firing patterns of each kind of sound for the approaching (interval of 0.0 ∼ 2.0 s) and moving-away (2.0 ∼ 4.0 s) cases. As different frequency components present different firing information, it is possible to classify the sound source, as described in the next section.
Fig. 6. Level Difference Extractor output of the “police car” dataset
Fig. 7. Comparing the output of level difference information for each dataset
Table 2. Parameters of CONP used in the experiments

Competitive Learning Neurons
  Number of inputs of CL neurons          637 units
  Number of CL neurons                    30 units
  Threshold θ                             1.0×10^−4
  Gating threshold θ_gate                 150.0
  Rate for input pulse frequency β        0.0629
  Time constant τ_p                       0.1 s
  Refractory period t_ndti                10 ms
  Learning coefficient α                  2.0×10^−8
  Learning iterations                     1000
Control Neurons (NFD/MFD)
  Time constant τ_NFD / τ_MFD             0.5 ms / 1.0 ms
  Threshold θ_NFD / θ_MFD                 −1.0×10^−3 / 2.0
  Connection weight to each CL neuron     16.0 / −16.0

Table 3. Results of sound recognition (A = approaching, M = moving away)

                            Recognition Rate [%]
                  police car     ambulance      scooter
Input Sound        A      M      A      M      A      M
police       A   70.6    6.8    2.4    7.3   12.9    0.0
             M    6.8   88.3    0.0    4.9    0.0    0.0
ambulance    A    1.1    4.2   82.8    9.9    2.0    0.0
             M    3.8    0.2    7.3   86.3    0.0    2.4
scooter      A    0.0    0.0    5.7    0.0   94.3    0.0
             M    0.0    1.9    0.3    5.4    0.0   92.4

4.2 Sound Source Classification
The firing information patterns provided by all the level difference extractors are recognized by the CONP model described in Section 3.3. The CONP model was trained according to the parameters shown in Table 2. Table 3 shows the accuracy of the CONP model for each dataset. The recognition rate is defined as the ratio between the number of neuron firings corresponding to the correct vehicle and relative movement and the total number of firings. The correct sound source and relative direction could be recognized with an average accuracy of 85.8%. The "scooter" dataset presents a better recognition rate than the "police car" and "ambulance" datasets. The reason is that the sound signal of the "scooter" dataset is constant over time, in contrast to the alarm sounds of the other two vehicles, which actually correspond to two different, alternating sounds. Thus, the CONP model can be trained more efficiently with the "scooter" data than with the others, which would require more data to obtain a comparable accuracy.
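The reported 85.8% average accuracy can be reproduced from the diagonal of the confusion matrix in Table 3:

```python
import numpy as np

# Rows/columns ordered: police A, police M, ambulance A, ambulance M,
# scooter A, scooter M (values in %, taken from Table 3).
confusion = np.array([
    [70.6,  6.8,  2.4,  7.3, 12.9,  0.0],
    [ 6.8, 88.3,  0.0,  4.9,  0.0,  0.0],
    [ 1.1,  4.2, 82.8,  9.9,  2.0,  0.0],
    [ 3.8,  0.2,  7.3, 86.3,  0.0,  2.4],
    [ 0.0,  0.0,  5.7,  0.0, 94.3,  0.0],
    [ 0.0,  1.9,  0.3,  5.4,  0.0, 92.4],
])
average_accuracy = confusion.diagonal().mean()  # averages the correct cases
```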
5 Conclusions
This paper proposes a system for detecting the approach of a sound source and classifying it using pulsed neural networks. The system extracts the level difference information from pulse trains corresponding to several frequency bands. The firing pattern is then classified by a CONP model, which identifies the type and recognizes the relative movement of the sound source. The experimental results confirmed that the PN-model-based level difference extractor can successfully detect the relative movement (approaching or moving away) of a sound source. Using the firing pattern provided by the LDE, the sound source type and relative movement could be correctly classified with an average accuracy of 85.8%. Future work includes the detection of the sound source position (its distance from the sensor) and the combination of the proposed system with a sound localization method. The hardware implementation of the proposed system using an FPGA device is also in progress.
Acknowledgment. This research is supported in part by a grant from the Hori Information Science Promotion Foundation.
References
1. Surendra, G., Osama, M., Robert, F.K.M., Nikolaos, P.P.: Detection and Classification of Vehicles. IEEE Trans. ITS 3(1), 37–47 (2002)
2. Chieh-Chi, W., Thorpe, C., Thrun, S.: Online simultaneous localization and mapping with detection and tracking of moving objects: theory and results from a ground vehicle in crowded urban areas. In: Proceedings of ICRA 2003, pp. 842–849 (2003)
3. Pickles, J.O.: An Introduction to the Physiology of Hearing. Academic Press, London (1988)
4. Maass, W., Bishop, C.M.: Pulsed Neural Networks. MIT Press, Cambridge (1998)
5. Kuroyanagi, S., Iwata, A.: A Competitive Learning Pulsed Neural Network for Temporal Signals. In: Proceedings of ICONIP 2002, pp. 348–352 (2002)
6. Iwasa, K., Kuroyanagi, S., Iwata, A.: A Sound Localization and Recognition System using Pulsed Neural Networks on FPGA. In: Proceedings of International Joint Conference on Neural Networks 2007 (to appear, August 2007)
7. Kuroyanagi, S., Iwata, A.: Auditory Pulse Neural Network Model to Extract the Inter-Aural Time and Level Difference for Sound Localization. IEICE Trans. Information and Systems E77-D(4), 466–474 (1994)
8. Kuroyanagi, S., Iwata, A.: Auditory Pulse Neural Network Model for Sound Localization -Mapping of the ITD and ILD-. IEICE J78-D2(2), 267–276 (1996) (in Japanese)
Sensitivity and Uniformity in Detecting Motion Artifacts

Wen-Chuang Chou 1, Michelle Liou 1, and Hong-Ren Su 1,2

1 Institute of Statistical Science, Academia Sinica, 128, Academia Rd. Sec. 2, Taipei 115, Taiwan
2 Department of Computer Science, National Tsing Hwa University, 101, Sec. 2, Kuang-Fu Rd., Hsinchu 300, Taiwan
{wcchou,mliou,stevensu}@stat.sinica.edu.tw
Abstract. Removing artifacts due to head motion is a preprocessing procedure necessary for any fMRI analysis. fMRI tool boxes provide standard algorithms for correcting motion artifacts. However, those tool boxes fail to indicate the extent to which the correction has been successful. Without knowing the motion contamination, especially after correction, the subsequent analysis using averaged fMRI data across subjects could be misleading. In this study, we propose seven summary indices for measuring motion artifacts. The indices can be applied after motion correction by the image registration algorithms. In the simulation studies, we analyzed a real fMRI data set using a statistical method and estimated the brain activation maps. The real image data were then randomly shifted or rotated to simulate different degrees of head motion. The data contaminated by random motion were then corrected using the SPM image coregistration algorithms. The indices of motion contamination were computed using the corrected images. The corrected images were then analyzed again using the same statistical method. The consistency between the brain activation maps based on real data and those based on simulated data was used as a standard to evaluate the usefulness of the proposed seven indices. The results show that some indices are informative with regard to the degree of motion contamination in preprocessed fMRI data.
1 Introduction

Implementing image registration (or motion correction) in fMRI tool boxes has become a routine procedure before statistical analyses. A major purpose of image registration is to distinguish between the change of signal intensity due to head motion and that due to brain activity. There are situations in which fMRI images are seriously contaminated by head motion and cannot be completely recovered by image registration methods. It would therefore be informative to have indices indicating motion contamination in the preprocessed data. Without knowing the motion contamination, the subsequent analysis using averaged fMRI data across subjects could be misleading. Rather than focusing on motion correction algorithms [1], [2], [3], we propose seven indices for detecting motion contamination in preprocessed fMRI data. The indices can be applied after motion correction by the image registration algorithms. In the simulation studies, we analyzed a real fMRI data set using a statistical method and estimated the brain activation maps. The image data were then randomly shifted or rotated to simulate different degrees of head motion. The data contaminated by random motion were corrected using the SPM image coregistration algorithms. The motion indices were computed using the corrected images. The corrected images were then analyzed again using the same statistical method. The consistency between the activation maps of real data and those of simulated data was used as a standard to evaluate the usefulness of the proposed indices. In the next section, the seven indices are introduced in detail. In Section 3, the performance of the indices is evaluated by comparing their uniformity and sensitivity in detecting motion artifacts in real fMRI data. We finally discuss the use of the different indices.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 209–218, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Measuring Motion Artifacts

The indices we propose provide summary measures of motion contamination, especially in preprocessed fMRI data, without requiring knowledge of the true motion artifacts. In real applications, it is nearly impossible to define the true errors caused by head movement or brain warping. In this study, we focus on finding adequate indices based on the notion that similarity between image volumes adjacent in the time domain is a good indicator of any dislocation due to head movement in an fMRI time series. The ensuing indices are therefore completely determined by the innate nature of the image signals. In this section, we present the seven indices and compare them according to their sensitivity to small motion artifacts and their uniformity in measuring different degrees of contamination.

2.1 Ratio Image Uniformity

We use the ratio image uniformity adopted in AIR 3.0 [4] as the first index I_riu. The ratio image is created by computing the ratio of the intensities of two image volumes on a voxel-by-voxel basis; the uniformity of this ratio is then represented by its standard deviation σ. Greater uniformity implies greater similarity between the volumes and gives a smaller standard deviation. In other words, we would expect small motion contamination in fMRI data if the ratio has a small standard deviation across image volumes. Here we denote the intensity of the focal image by M, the intensity of the reference image by N, and the voxel coordinate by r. The first index I_riu can be simply expressed as

I_riu = σ( M(r) / N(r) ).        (1)
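Eq. (1) can be sketched directly on two volumes; the small epsilon guarding against division by zero is an implementation detail assumed here, not part of the paper:

```python
import numpy as np

def ratio_image_uniformity(m_vol, n_vol, eps=1e-12):
    """I_riu (Eq. 1): standard deviation of the voxel-wise ratio image."""
    ratio = m_vol / (n_vol + eps)
    return ratio.std()
```

A volume that is a scaled copy of the reference has a perfectly uniform ratio image, so its I_riu is near zero.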
2.2 Scaled Least-Squared Difference

The second index I_sls, referred to as the scaled least-squared difference, describes the global intensity difference using a modified least-squares approach [5]. In this index, the global intensity of one volume is rescaled to that of the other volume. Given the total number of voxels v, this normalized index can be written as

I_sls = (1/v) Σ_r ( M(r) − (M̄/N̄) N(r) )² / ( σ²_M(r) + σ²_N(r) ),        (2)

with M̄ and N̄ being the mean intensities of the two volumes, respectively.
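A sketch of the scaled least-squared difference (Eq. 2), with the intensity variances computed over each whole volume (an assumption about the exact normalization):

```python
import numpy as np

def scaled_least_squared(m_vol, n_vol):
    """I_sls (Eq. 2): mean squared difference after rescaling N's global
    intensity to M's, normalized by the sum of the two intensity variances."""
    m = m_vol.ravel().astype(float)
    n = n_vol.ravel().astype(float)
    diff = m - (m.mean() / n.mean()) * n  # rescale N to M's global intensity
    return (diff ** 2).mean() / (m.var() + n.var())
```

By construction the index is zero for any globally rescaled copy of the same volume.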
2.3 Correlation Coefficient

The information on how two images correlate with each other is given by their correlation coefficient, here denoted as the third index I_cc [6]. A high correlation indicates that the image time series is in a stable state. Let Cov(M, N) be the covariance between M and N; the correlation coefficient is then defined as

I_cc = Cov( M(r), N(r) ) / ( σ_M(r) σ_N(r) ).        (3)
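Eq. (3) is the Pearson correlation of the two flattened volumes, a one-liner with NumPy:

```python
import numpy as np

def correlation_index(m_vol, n_vol):
    """I_cc (Eq. 3): Pearson correlation between two image volumes."""
    return np.corrcoef(m_vol.ravel(), n_vol.ravel())[0, 1]
```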
2.4 Joint Entropy The voxel similarity based on information theory has been broadly applied in image registration because of their accuracy and robustness [1], [7], [8]. The commonlyused joint entropy, as adopted here, measures the dispersion of a distribution computed from a joint intensity histogram. The joint distribution pM,N can be estimated by each entry in the joint histogram of two image volumes divided by the total number of voxels. The entropy is low when the images are so similar that two anatomical structures overlap each other and their joint distribution shows certain concentrated clusters. The joint entropy of two images M and N can be calculated by the following equation:
I je = − ∑ p M ,N (i , j ) log p M ,N (i , j )
(4)
i, j
where i and j indicate the intensity value of M and N, respectively. 2.5 Relative Entropy The relative entropy, also called Kullback–Leibler divergence, measures the distance between two probability distributions. Given the probability of the focal image PM and that of the reference image PN , we can compute the relative entropy using the following formula:
I_re = (1/2) [ Σ_j P_M(j) log( P_M(j) / P_N(j) ) + Σ_j P_N(j) log( P_N(j) / P_M(j) ) ],        (5)

where j denotes the j-th intensity value.

2.6 Weighted Kappa

The weighted kappa index I_kw proposed by Cohen [9] is a measure of agreement between two ratings of the same image. The calculation procedure consists of the following steps:

Step 1: The observed weighted proportion of agreement is given by
P_o = Σ_{i,j} w(i, j) p_{M,N}(i, j),        (6)

where the summation is over all possible intensity values, and the chance-expected proportion of agreement by

P_c = Σ_{i,j} w(i, j) p_M(i) p_N(j).        (7)

In the equations above, the weights are computed by

w(i, j) = 1 − |i − j| / (G − 1),        (8)

where G is the total number of distinct intensity values. Equation (8) assigns higher weights to intensity values closer to each other.

Step 2: The weighted kappa value can be calculated by

I_kw = (P_o − P_c) / (1 − P_c).        (9)
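The three histogram-based indices of Sections 2.4-2.6 can all be sketched from one joint intensity histogram. The bin count and the epsilon guard are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def joint_hist(m_vol, n_vol, bins=16):
    """Joint intensity histogram of two volumes, normalized to p_{M,N}."""
    h, _, _ = np.histogram2d(m_vol.ravel(), n_vol.ravel(), bins=bins)
    return h / h.sum()

def joint_entropy(p_mn):
    """I_je (Eq. 4); zero-probability cells are skipped (0 * log 0 = 0)."""
    p = p_mn[p_mn > 0]
    return -np.sum(p * np.log(p))

def relative_entropy(p_mn, eps=1e-12):
    """I_re (Eq. 5): symmetrized KL divergence between the two marginals."""
    p_m = p_mn.sum(axis=1) + eps
    p_n = p_mn.sum(axis=0) + eps
    return 0.5 * (np.sum(p_m * np.log(p_m / p_n)) +
                  np.sum(p_n * np.log(p_n / p_m)))

def weighted_kappa(p_mn):
    """I_kw (Eq. 9) with the linear weights of Eq. (8)."""
    g = p_mn.shape[0]
    i, j = np.indices((g, g))
    w = 1.0 - np.abs(i - j) / (g - 1)        # Eq. (8)
    p_m = p_mn.sum(axis=1)
    p_n = p_mn.sum(axis=0)
    po = np.sum(w * p_mn)                    # Eq. (6)
    pc = np.sum(w * np.outer(p_m, p_n))      # Eq. (7)
    return (po - pc) / (1.0 - pc)            # Eq. (9)
```

Two identical volumes concentrate the joint histogram on the diagonal, giving a kappa of 1 and a relative entropy of 0; unrelated volumes spread the histogram out, raising the joint entropy.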
Cicchetti and Fleiss [10] gave the sample standard error of a kappa estimate, and the sample estimate in (9) can be tested for significance in applications.

2.7 Mean Distance to the Principal Component

The final index is defined as the average distance to the principal component. Here we use principal component analysis [11] to find the principal axis in the joint histogram of two image volumes. The distance of each voxel to the principal axis is the perpendicular distance obtained from its orthogonal projection onto the axis. The average of these distances can be used as an indicator of the degree of image contamination.
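A sketch of the last index: fit the first principal axis of the paired voxel intensities and average the perpendicular distances to it. Computing it on the raw (M, N) pairs rather than on histogram cells is an assumption about the exact construction:

```python
import numpy as np

def mean_distance_to_pc(m_vol, n_vol):
    """Average perpendicular distance of the (M, N) intensity pairs to the
    first principal axis of their scatter."""
    pts = np.column_stack([m_vol.ravel(), n_vol.ravel()]).astype(float)
    centered = pts - pts.mean(axis=0)
    # First principal axis = leading right-singular vector of the data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    # Perpendicular distance = norm of the component orthogonal to the axis.
    proj = centered @ axis
    residual = centered - np.outer(proj, axis)
    return np.linalg.norm(residual, axis=1).mean()
```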
3 Experimental Results

In the empirical study, we used a real fMRI dataset collected in the Mechelli et al. study [12], which is part of the general collection of the US fMRI Data Center. The data of the third subject in the Mechelli et al. study, with 360 image volumes, were selected for our empirical study. Based on the preprocessed images provided in the

Table 1. The means and standard deviations of motion indices for the different simulated datasets

Datasets                                      1        2        3        4        5        6
Degree of rotation
  - X axis                                    15       2        2        15       2        15
  - Y axis                                    2        15       2        15       15       2
  - Z axis                                    2        2        15       2        15       15

Motion Indices
Ratio Image Uniformity         mean           47.95    42.92    19.25    67.40    50.56    54.47
                               variance       2288.2   1374.8   128.64   2190.8   1350.0   1870.3
Least-squared                  mean           0.193    0.143    0.155    0.244    0.219    0.252
                               variance       0.0157   0.0085   0.0081   0.0112   0.0071   0.0126
Correlation Coefficient        mean           0.903    0.927    0.922    0.877    0.889    0.873
                               variance       0.0040   0.0021   0.0020   0.0028   0.0018   0.0032
Joint Entropy                  mean           7.565    7.457    7.493    7.638    7.601    7.623
                               variance       0.109    0.113    0.098    0.066    0.045    0.064
Relative Entropy               mean           0.0921   0.0756   0.0745   0.0805   0.0897   0.0731
                               variance       0.0154   0.0125   0.0120   0.0136   0.0205   0.0119
Weighted Kappa                 mean           0.773    0.802    0.789    0.741    0.747    0.706
                               variance       0.0066   0.0084   0.0045   0.0044   0.0033   0.0045
Mean Distance to the           mean           8.501    7.417    8.126    10.121   9.863    10.449
Principal Component            variance       11.586   8.333    8.571    8.113    5.710    8.518
Fig. 1. Reproducibility maps for the original data and the simulated datasets No. 1–6. The activation maps for the empirical and simulated datasets were estimated using the method of reproducibility analysis for a few slices in the Mechelli et al. study. The increased and decreased responses for Subject 3 are indicated by the red and green colors, respectively. The superior frontal gyrus and supramarginal gyrus are shown in the slices located in the upper yellow block, and the noise mainly appears in the images located in the lower block.
dataset, we conducted the reproducibility analysis to estimate the brain activation maps [13]. The activation maps are used as the standard to evaluate the performance of the proposed motion indices. In the simulation study, we rotated the preprocessed
Fig. 2. Plots of motion indices for the preprocessed images provided by the US fMRI Data Center and the corrected fMRI images in the datasets No. 2 and 6. The contaminated datasets were corrected for motion by maximizing mutual information based on the Powell algorithm. In the six plots, the gray line refers to the motion indices of the preprocessed images; the red line dotted with solid triangles and yellow line with empty reversed triangles respectively refer to the indices of corrected images in the datasets No. 2 and 6. The vertical and horizontal axes represent the index values and image volumes along the time scale, respectively.
data along the X-, Y- and Z-directions randomly within the ranges shown in Table 1 to simulate different degrees of motion contamination. The contaminated images were then corrected using the maximization of normalized mutual information based on the Powell algorithm [7], [8]. The motion indices were also applied to the corrected images. Finally, we conducted the reproducibility analysis again to estimate the brain activation maps based on the corrected images. Table 1 gives the means and standard deviations of the motion indices for the different simulated datasets. Datasets 1, 2 and 3 are less contaminated by motion than datasets 4, 5 and 6. A sensitive index should therefore give larger values (or smaller correlation and kappa values) for datasets 4, 5 and 6. In general, the proposed motion indices more or less show the contrast between modest and serious motion contamination. The mean distance to the principal component tends to give a greater contrast than the other indices. Because the Mechelli et al. study investigated word and pseudo-word reading, the activation regions found by the reproducibility analysis correspond to the superior frontal gyrus and supramarginal gyrus. The activation maps based on corrected images are shown in Fig. 1. Among the three datasets with modest contamination, dataset No. 2 after coregistration using the Powell algorithm gives almost the same activation maps as the original results. The mean distance to the principal component also outperforms the other motion indices by giving the smallest value to dataset No. 2. In terms of uniformity, the scaled least-squared, correlation coefficient, joint entropy, weighted kappa, and mean distance to the principal component indices yield reasonable results. These five indices are also sensitive to motion contamination. Freire et al. reported similarly unfavorable results for the ratio image uniformity index [1].
The change in the marginal intensity probabilities between adjacent image volumes is not sensitive to serious head motion. Therefore the relative entropy index fails to detect motion contamination in the simulated datasets. In Fig. 2, we plot the motion indices (except for the relative entropy index) across image volumes based on the original preprocessed images and on the corrected images for the simulated datasets No. 2 and No. 6. The sharp peaks in the plots for the original fMRI data suggest dislocations (head motion) between two adjacent image volumes. The plots suggest that the scaled least-squared, correlation coefficient, and mean distance to the principal component indices serve as good criteria, judging by the comparison between the ensuing reproducibility maps of the original data and those of the simulated data. However, the plots for joint entropy and weighted kappa suggest that these two indices do not differentiate the degrees of contamination in datasets No. 2 and No. 6. The algorithm used for correcting motion contamination was based on maximizing the normalized mutual information, defined as (H_M + H_N) / H_{M,N}. If the maximized normalized mutual information and the individual marginal entropies of the corrected images in datasets No. 2 and No. 6 are similar, we always obtain similar joint entropies.
4 Discussion

In this study, we have proposed seven indices for detecting motion contamination in fMRI data. Based on the empirical results, the scaled least-squared, correlation coefficient, joint entropy, weighted kappa and mean distance to the principal component indices perform uniformly in measuring dislocations between two image volumes. In applications, plotting the motion indices across image volumes before and after image registration would be informative for detecting possible effects of motion contamination on the subsequent statistical analyses. Among the five indices, the mean distance to the principal component is the best choice for assessing the usefulness of preprocessed fMRI data. Although we have drawn some conclusions in this study, much quantitative work remains to be done to further verify the performance of these indices.

Acknowledgments. This research was supported by the grant NSC 95-2413-H-001001-MY3 from the National Science Council (Taiwan).
References
1. Freire, L., Mangin, J.F.: Motion correction algorithms create spurious brain activations in the absence of subject motion. Neuroimage 14, 709–722 (2001)
2. Hellier, P., Barillot, C., Corouge, I., Gibaud, B., Le Goualher, G., Collins, D.L., Evans, A., Malandaln, G., Ayache, N., Christensen, G.E., Johnson, H.J.: Retrospective evaluation of intersubject brain registration. IEEE Transactions on Medical Imaging 22, 1120–1130 (2003)
3. Oakes, T.R., Johnstone, T., Walsh, K.S.O., Greischar, L.L., Alexander, A.L., Fox, A.S., Davidson, R.J.: Comparison of fMRI motion correction software tools. Neuroimage 28, 529–543 (2005)
4. Woods, R.P., Grafton, S.T., Holmes, C.J., Cherry, S.R., Mazziotta, J.C.: Automated image registration: I. General methods and intrasubject, intramodality validation. Journal of Computer Assisted Tomography 22, 139–152 (1998)
5. Alpert, N.M., Berdichevsky, D., Levin, Z., Morris, E.D., Fischman, A.J.: Improved methods for image registration. Neuroimage 3, 10–18 (1996)
6. Ghahramani, S.: Fundamentals of Probability, 2nd edn. (2000)
7. Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging 16, 187–198 (1997)
8. Studholme, C., Hill, D.L.G., Hawkes, D.J.: An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition 32, 71–86 (1999)
9. Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213–220 (1968)
10. Cicchetti, D.V., Fleiss, J.L.: Comparison of the null distributions of weighted kappa and the C ordinal statistic. Applied Psychological Measurement 1, 195–201 (1977)
11. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)
12. Mechelli, A., Friston, K.J., Price, C.J.: The effects of presentation rate during word and pseudoword reading: A comparison of PET and fMRI. Journal of Cognitive Neuroscience 12, 145–156 (2000)
13. Liou, M., Su, H.R., Lee, J.D., Aston, J.A.D., Tsai, A.C., Cheng, P.E.: A method for generating reproducible evidence in fMRI studies. Neuroimage 29, 383–395 (2006)
A Ring Model for the Development of Simple Cells in the Visual Cortex

Takashi Hamada 1 and Kazuhiro Okada 2

1 National Institute of Advanced Industrial Science and Technology (AIST), Kansai Center, 1-8-31, Midorigaoka, Ikeda, Osaka 586-8577, Japan
[email protected]
2 Murata Machinery, Ltd., Communication Equipment Division, 136, Takeda-Mukaishiro-cho, Fushimi, Kyoto 612-8686, Japan
[email protected]
Abstract. A model is proposed for the development of simple receptive fields in the cat visual cortex, based on the empirical evidence that this development is driven by dark or spontaneous activities in the lateral geniculate nucleus (LGN). We assume that several cortical cells are arranged in a ring, with mutual excitatory and inhibitory connections of fixed weights; the cells also receive excitatory synapses from LGN cells, whose synaptic weights are initially set to be random and then modified according to the Hebb rule as well as to positive correlations among nearby LGN cells. Computer simulation showed that the cortical cells finally acquire two-dimensional simple receptive fields whose phases vary gradually along the ring.

Keywords: Simple cell, development, spontaneous activity, correlation, Hebb rule, model.
1 Introduction

Cells in layer 4 of the cat visual cortex have so-called simple receptive fields, in which several elongated On and Off sub-areas are alternately arranged [1]. If a slit of light positioned on an On sub-area, with the axes of their elongations aligned with each other, is turned on, or if a slit of light positioned and oriented on an Off sub-area is turned off, the cells fire. If firings to stimulation of the On sub-areas are plotted positively and those to the Off sub-areas negatively with respect to the position of the slit, the profile of the firings is described by a Gabor function [2]: a sinusoid with a phase parameter between 0 and 2π, multiplied by a Gaussian. Interestingly, if many simple cells are sampled and each of them is best described by a Gabor function, their phases are distributed not merely around 0 and π/2, i.e., as those of cosine and sine Gabor functions, but uniformly between 0 and 2π [3]. Based on this evidence, we have already proposed a neural model for simple receptive fields, in which several simple cells are arranged in a ring with mutual excitatory or inhibitory connections of fixed weights [4]; the phases of their receptive fields thereby vary gradually with rotation along the ring.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 219–227, 2008. © Springer-Verlag Berlin Heidelberg 2008
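The one-dimensional Gabor profile described above, a sinusoid with an explicit phase parameter multiplied by a Gaussian, can be sketched as follows (the width, spatial frequency and sampling are illustrative assumptions):

```python
import numpy as np

def gabor_profile(x, sigma=1.0, freq=1.0, phase=0.0):
    """1-D Gabor function: a sinusoid with phase `phase` in [0, 2*pi),
    multiplied by a Gaussian envelope of width `sigma`."""
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * freq * x + phase)
    return envelope * carrier
```

With phase 0 this gives the even (cosine-type) profile, and with phase π/2 the odd (sine-type) profile; intermediate phases interpolate between the two.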
Meanwhile, such simple receptive fields are known to be basically organized before the eyes naturally open in kittens [5]. This suggests that the organizing process does not require any visual experience. Indeed, experimental studies suggested that the process could be driven by correlations in dark or spontaneous firings in the lateral geniculate nucleus (LGN) [6, 7]. A pioneering theoretical study [8] assumed the correlation to have the shape of a "Mexican hat", i.e., the firings are correlated positively among nearby cells and negatively among farther cells. Assuming correlations of the "Mexican hat" type as well as the ring structure, we have theoretically shown that cortical cells in the ring, whose synaptic weights from the LGN are initially set to be random, develop simple receptive fields with phases gradually varying between 0 and 2π along the ring [9]. That study, however, was concerned only with the one-dimensional profiles of receptive fields perpendicular to their preferred orientations. We here propose a model, again with the ring structure, for the development of two-dimensional profiles of simple receptive fields. Computer simulation revealed that the cortical cells develop simple receptive fields whose phases vary gradually with rotation along the ring only if the correlation is assumed to have a Gaussian shape, as recently shown experimentally in the developing LGN of ferrets [10], but fail to develop such receptive fields if the "Mexican hat" shape is assumed for the correlation.
2 Methods

2.1 Structure of the Model

For the sake of simplicity, we consider only a set of cortical cells whose receptive fields will be the same as to orientation, ocular dominance and retinal position, but different as to phase. We assume that several cortical cells Cp, p=1~n, are arranged in a ring (the upper part in Fig. 1a) with mutual intra-cortical connections I_{p,q} (p, q=1~n):

I_{p,q} = cos( 2π(p − q) / n )   for p ≠ q
I_{p,q} = 0.1                    for p = q .    (1)
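As a minimal illustration, the interaction matrix of (1) can be built as follows (a sketch in Python with 0-based indices; the function name is ours):

```python
import numpy as np

def ring_interaction(n):
    """Intra-cortical interaction of equation (1): cos(2*pi*(p - q)/n)
    between distinct cells on the ring, 0.1 on the diagonal."""
    p, q = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    interaction = np.cos(2 * np.pi * (p - q) / n)
    np.fill_diagonal(interaction, 0.1)
    return interaction

I = ring_interaction(5)
```

With n = 5, adjacent cells interact with cos(2π/5) ≈ 0.31 (excitatory) and nearly opposed cells with cos(4π/5) ≈ −0.81 (inhibitory), matching the description in the text.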
The interaction is excitatory between nearby cells and inhibitory between diagonally opposed cells on the ring (see 2.4 for detail). The ring structure is meant to be merely functional. We consider two two-dimensional arrays of LGN cells, one for On-center cells and the other for Off-center cells, each of size m × m (the lower part in Fig. 1a). Each of the On LGN cells yields an excitatory synapse to each of the cortical cells: the synapses from the On array to a cortical cell Cp (p=1,…, or n) thereby compose a two-dimensional m × m list of weights W^{on}_p. Similarly W^{off}_p for the Off LGN cells.
A Ring Model for the Development of Simple Cells in the Visual Cortex
Fig. 1. The model. a: Cortical cells in a ring with interactions Ip,q and arrays of On and Off LGN cells. b: Correlated spontaneous activities among LGN cells of the same type (upper) and of the different types (lower) in one of the trials with correlations in (2) and (3).
2.2 Correlation of Spontaneous Activities in LGN

Although correlations of spontaneous activities have been physiologically studied in the adult retina of cats [11] and in the developing LGN of ferrets [10], those in the developing LGN of cats have not yet been studied physiologically. We thereby tested two types of functions for the correlations. The function tested first for correlations between cells of the same type, i.e. among On-center cells or among Off-center cells, is the Gaussian in (2), where r is the separation between the centers of their receptive fields and σ_c the standard deviation of the Gaussian. Thus, the spontaneous activities of two LGN cells of the same type are assumed to be more positively correlated if their receptive fields are closer to each other:

C_{same}(r) = exp( − r² / (2σ_c²) ) .    (2)
The correlation between the spontaneous activities of LGN cells of different types, i.e. one On-center cell and the other Off-center cell, is assumed to follow the function (2) multiplied by −0.3:

C_{diff}(r) = −0.3 × C_{same}(r) .    (3)
Thus, LGN cells of different types are negatively and weakly correlated only if their receptive fields are close to each other.
The function tested second for correlations between LGN cells of the same type has the shape of a "Mexican hat":

C_{same}(r) = exp( − r² / (2σ_c²) ) − 0.1 × exp( − r² / (2(3σ_c)²) ) .    (4)
Thus, the spontaneous activities of two cells of the same type are positively correlated if r is small, but negatively correlated at larger values of r. The correlation between cells of different types is assumed to be the function (4) multiplied by −0.3. The developmental process is assumed to be an iteration of many (usually more than 5000) trials, in each of which a set of xy coordinates for a point as the center of correlations is randomly selected on either of the two LGN arrays, and correlated activities spread around the point on that array according to the same-type function. Additionally, correlated activities spread around the point with the same coordinates on the other array according to the different-type function. Fig. 1b illustrates such correlated spontaneous activities in one of the trials, where the correlations in (2) and (3) are used. The point is incidentally selected at {x, y}={7, 6} on the On array, and the correlated activity is thereby positive on the On array and negative on the Off array. In another trial, the point would be positioned differently, the correlation on one of the arrays would incidentally be positive, and that on the other array would be negative. The correlated spontaneous activities on the On and Off arrays in the i-th trial are named spont^{on}_i and spont^{off}_i, respectively.
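One trial of the correlated spontaneous activity described above can be sketched as follows (assuming, as in Fig. 1b, that the same-type profile (2) is placed around the chosen point on the selected array and the different-type profile (3) on the other array; function and variable names are ours):

```python
import numpy as np

def trial_activity(m=15, sigma_c=0.5, rng=None):
    """One developmental trial: a random center is selected on one of the
    two m x m LGN arrays; activity spreads as C_same(r) of (2) on that
    array and as C_diff(r) = -0.3 * C_same(r) of (3) on the other."""
    rng = rng or np.random.default_rng(0)
    cx, cy = rng.integers(0, m, size=2)
    x, y = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    c_same = np.exp(-r2 / (2 * sigma_c ** 2))
    c_diff = -0.3 * c_same
    if rng.random() < 0.5:          # the center may fall on either array
        return c_same, c_diff       # spont_on, spont_off
    return c_diff, c_same

spont_on, spont_off = trial_activity()
```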
2.3 Initial Synaptic Weights

At an earlier stage of development, LGN axons are attracted toward a cortical neuron, presumably according to a gradient of diffusible molecules from the cortical cell [12]. Hence, we suppose that the closer an LGN cell is positioned to the center of the array, the stronger the synaptic weight from that LGN cell to one of the cortical cells tends to be. To describe this tendency, an arbor function A(r) is defined:

A(r) = exp( − r² / (2σ_a²) ) .    (5)

It is a Gaussian with respect to the distance r of an LGN cell from the center of the array, with standard deviation σ_a. The list of initial synaptic weights W^{on}_{q, i=0} (q=1~n) is thereby assumed to be an m × m list of random numbers between 0 and 1, Rand, multiplied by A(r):

W^{on}_{q, i=0} = Rand × A(r) .    (6)

Similarly for W^{off}_{q, i=0}. Fig. 2a shows W^{on}_{q, i=0} for q=1~n in the case n=5.
2.4 Modification of the Synaptic Weights

The activity of a cortical cell in a trial is determined by the inputs from the LGN cells as well as from the other cortical cells in the ring. The input from the LGN cells to a cortical cell Cq (q=1,…, or n) in the i-th trial, pre_{q,i}, is described by the spontaneous activity on the On LGN array multiplied by W^{on}_{q,i} plus the spontaneous activity on the Off array multiplied by W^{off}_{q,i}:

pre_{q,i} = W^{on}_{q,i} × spont^{on}_i + W^{off}_{q,i} × spont^{off}_i .    (7)

The final activity of Cp in the i-th trial, post_{p,i}, is the sum of the inputs from the LGN cells plus the inputs from the other cortical cells in the ring, which is described as:

post_{p,i} = Σ_{q=1}^{n} I_{p,q} × pre_{q,i} .    (8)

Note that I_{p,q} (p, q=1~n) is defined so that the inputs from the LGN are only weakly weighted by 0.1, while the inputs from the other cortical cells are weighted by a cosine function of amplitude 1 (see equation (1)). Such a design of I_{p,q} is based on the anatomical evidence that the synaptic boutons a layer 4 cell receives from LGN cells account for only a few percent of the synapses on its cell body and dendrites [13], which means that the synaptic influences from the other cortical cells are much stronger than those from the LGN cells. Synapses in the visual cortex are known to be modifiable during development [12]. We thereby assume that the synapses are modified according to the Hebb rule: a synaptic weight is strengthened when the pre-synaptic activity is correlated with the post-synaptic activity, and weakened otherwise. Changes in W^{on}_p between the i-th and (i+1)-th trials are then hypothesized:
W^{on}_{p, i+1} = W^{on}_{p, i} + λ × A(r) × pre_{p,i} × post_{p,i} .    (9)

where λ is a constant which determines the rate of development. Similarly for W^{off}_{p, i+1}. We assume that the total synaptic weights over each cortical cell are conserved:

W^{on}_{p, i+1} + W^{off}_{p, i+1} = const .    (10)
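A sketch of one trial's weight change under (9) and the joint constraint (10) follows. Note the assumptions: the Hebbian term is taken to multiply the activity on the corresponding array, and the constraint (10) is enforced by rescaling the summed On+Off weights; the argument names pre_on, pre_off and the rescaling procedure are our reading, not the authors' exact method.

```python
import numpy as np

def hebb_step(w_on, w_off, arbor, pre_on, pre_off, post, lam=0.1):
    """Hebbian change of (9) for the On and Off weight arrays of one
    cortical cell, followed by a rescaling that conserves the joint
    On + Off total as required by (10)."""
    total = (w_on + w_off).sum()                 # value to conserve
    w_on = w_on + lam * arbor * pre_on * post
    w_off = w_off + lam * arbor * pre_off * post
    scale = total / (w_on + w_off).sum()         # joint constraint (10)
    return w_on * scale, w_off * scale

w_on0 = np.full((3, 3), 0.5)
w_off0 = np.full((3, 3), 0.5)
arbor = np.ones((3, 3))
w_on1, w_off1 = hebb_step(w_on0, w_off0, arbor,
                          pre_on=np.ones((3, 3)),
                          pre_off=np.zeros((3, 3)), post=1.0)
```

Because the constraint is joint, strengthening the On weights here implicitly weakens the Off weights after rescaling, which is the competition the text relies on.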
Note that the On and Off synaptic weights are conserved jointly [14], not separately as in [10]. The formula in (9) is computed under the constraint of (10) in each trial, and the trials are iterated until the values converge.

2.5 Visual Responses

After the synaptic weights have converged, the visual responses of the cortical cells are computed. We assume the receptive field of an LGN cell positioned at r_0 on the array has a Gaussian form, where σ_r is the standard deviation of the distribution:

R_{r_0} = exp( − (r − r_0)² / (2σ_r²) ) .    (11)

A light spot at a position r_0 yields inputs from the LGN to Cq, previs^{q}_{r_0}, as:

previs^{q}_{r_0} = R_{r_0} × W^{on}_q − R_{r_0} × W^{off}_q .    (12)

The final visual activity in Cp due to the spot at r_0, postvis^{p}_{r_0}, results from I_{p,q}:

postvis^{p}_{r_0} = Σ_{q=1}^{n} previs^{q}_{r_0} × I_{p,q} .    (13)

The visual receptive field of cell C_p is a spatial summation of postvis^{p}_{r_0} with respect to r_0.
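The visual response of (11)-(13) can be sketched as follows (weights stored as an n × m × m array per cell type; treating the products in (12) as sums over the array is our reading of the notation, and the function name is ours):

```python
import numpy as np

def visual_response(w_on, w_off, interaction, p, r0, sigma_r=1.5):
    """Response of cortical cell p to a light spot at r0 = (x, y):
    R_{r0} is the Gaussian LGN receptive field of (11); previs of (12)
    weighs it by the On minus the Off synapses; postvis of (13)
    combines all cells through the ring interaction."""
    n, m, _ = w_on.shape
    x, y = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    r2 = (x - r0[0]) ** 2 + (y - r0[1]) ** 2
    R = np.exp(-r2 / (2 * sigma_r ** 2))
    previs = (R * w_on).sum(axis=(1, 2)) - (R * w_off).sum(axis=(1, 2))
    return float((interaction[p] * previs).sum())
```

Scanning r0 over the array and collecting the responses yields the cell's visual receptive field, as stated in the text.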
2.6 Parameters

The parameters were set as follows, if not noted otherwise: n=5, m=15, σ_a=3, σ_c=0.5, σ_r=1.5, λ=0.1.
3 Results

3.1 Development of Simple Receptive Fields

Each of the subfigures in Fig. 2a is the initial synaptic weight W^{on}_p for p=1,…, or n, which is random numbers multiplied by a Gaussian envelope. Note that although these subfigures, as well as those in the other columns of the figure, are vertically arranged in order of p, they actually constitute a ring structure. Fig. 2b shows the synaptic weights, W^{on}_p in the left column and W^{off}_p in the right, after 10000 iterations when the Gaussians in (2) and (3) are used as the correlation functions. Note firstly that each of W^{on}_p and W^{off}_p (p=1,…, or n) is segregated, i.e. strengthened in some sub-areas and weakened in the other sub-areas. Secondly, the boundaries between the sub-areas
Fig. 2. a: Initial W^{on}. b: W^{on} (left) and W^{off} (right) after trials with Gaussian correlations. c: Visual receptive fields resulting from the synaptic weights in b and I_{p,q}. Bars are added to clarify the translations of the receptive fields along the column. d: Visual receptive fields after trials with the Mexican-hat correlations. White arrows are added to clarify the rotations of the fields along the column. The subfigures in each of the columns are actually arranged in a ring.
are roughly straight. Thirdly, the configuration of the sub-areas is similar between two adjacent cells on the ring, but mostly reversed between diagonally opposed cells, so that the boundaries gradually translate with rotation along the ring: in the case of Fig. 2b, the boundaries gradually translate toward the lower right along the downward direction in the column. Besides, W^{on}_p and W^{off}_p for the same cell C_p (p=1,…, or n), i.e. the two subfigures in one of the rows of Fig. 2b, are reversed images of each other: sub-areas which are strongly innervated by synapses from On LGN cells are weakly innervated by synapses from Off LGN cells, and vice versa. Fig. 2c shows the visual receptive fields, i.e. the distribution of visual responses due to the synaptic weights in Fig. 2b as well as to the intra-cortical interaction I_{p,q}, which again show the properties described above for W^{on}_p and W^{off}_p. For example, the receptive fields are composed of On and Off sub-areas, each of which corresponds to the sub-areas in W^{on}_p and W^{off}_p. Besides, the configurations of the sub-areas gradually translate with rotation along the ring.

3.2 Development of Labyrinthine Receptive Fields

If the correlations have the shape of the "Mexican hat" in (4), the boundaries between the sub-areas after a sufficient number of trials become labyrinthine. Thus, the model fails to develop simple receptive fields. Interestingly, if the value of λ is additionally increased to 1.0, the labyrinth almost always rotates along the ring. In the case of Fig. 2d, the visual receptive fields rotate clockwise along the downward direction in the column. Note that the third property described above again holds, i.e. the configuration of the sub-areas is similar between two adjacent cells on the ring and reversed between diagonally opposed cells. The rotation along the ring is a compromise between the labyrinthine configurations and the third property.
4 Discussion

If the correlations of spontaneous activities among same-type LGN cells were assumed to have the shape of a "Mexican hat", labyrinthine visual receptive fields emerged, presumably due to the inhibitory periphery of the hat. However, if the correlations were assumed to have the shape of a Gaussian, simple receptive fields often emerged. As a matter of fact, correlations with the shape of a Gaussian were experimentally observed in the developing LGN of ferrets [10]. The same authors also theoretically showed that the Gaussian correlations result in simple receptive fields only if the On and Off synaptic weights are constrained separately [10]. This separate constraint is biologically less plausible, however, than the joint constraint [14]. The present study showed that the Gaussian correlation plus the joint constraint result in simple receptive fields. We assumed an intra-cortical interaction in which nearby cells on the ring mutually excite each other and diagonally opposed cells on the ring mutually inhibit each other. These interactions could be due to excitatory and inhibitory synaptic connections, but otherwise
due to near diffusion of some excitatory substances and far diffusion of some inhibitory ones. The latter possibility is reminiscent of a chemical explanation for Turing's reaction-diffusion equation [15], which also produces stripes and labyrinths [16]. The configurations of the receptive fields after a sufficient number of trials were either linearly translated, as shown in Fig. 2c, or rotated, as in Fig. 2d, along the ring. These configurations could be interpreted as neural representations on the ring of translational and rotational motions, respectively.
References

1. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154 (1962)
2. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70, 1297–1300 (1980)
3. DeAngelis, G.C., Ohzawa, I., Freeman, R.D.: Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. I. General characteristics and postnatal development. J. Neurophysiol. 69, 1091–1117 (1993)
4. Hamada, T., Yamashima, M., Kato, K.: A ring model for spatiotemporal properties of simple cells in the visual cortex. Biol. Cyb. 77, 225–233 (1997)
5. Blakemore, C., Van Sluyters, R.C.: Innate and environmental factors in the development of the kitten's visual cortex. J. Physiol. 248, 663–716 (1975)
6. Chapman, B., Godecke, I.: Cortical cell orientation selectivity fails to develop in the absence of On-center retinal ganglion cell activity. J. Neurosci. 20, 1922–1930 (2000)
7. Sengpiel, F., Kind, P.C.: The role of activity in development of the visual system. Curr. Biol. 12, 818–826 (2002)
8. Miller, K.D.: A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. Neurosci. 14, 409–441 (1994)
9. Hamada, T., Kato, K., Okada, K.: A model for development of Gabor-receptive fields in simple cortical cells. NeuroReport 7, 745–748 (1996)
10. Ohshiro, T., Weliky, M.: Simple fall-off pattern of correlated neural activity in the developing lateral geniculate nucleus. Nature Neurosci. 9, 1541–1548 (2006)
11. Mastronarde, D.N.: Correlated firing of retinal ganglion cells. Trends Neurosci. 12, 75–80 (1989)
12. Goodman, C.S., Shatz, C.J.: Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell 72/Neuron 10(suppl.), 77–98 (1993)
13. Ahmed, B., Anderson, J.C., Douglas, R.J., Martin, K.A.C., Nelson, J.C.: Polyneuronal innervation of spiny stellate neurons in cat visual cortex. J. Comp. Neurol. 341, 39–49 (1994)
14. Willshaw, D.J., von der Malsburg, C.: How patterned neural connections can be set up by self-organization. Biol. Cyb. 58, 63–70 (1988)
15. Turing, A.M.: The chemical basis of morphogenesis. Phil. Trans. Royal Soc. London B 237, 37–72 (1952)
16. Shoji, H., Iwasa, Y.: Labyrinthine versus straight-striped patterns generated by two-dimensional Turing systems. J. Theor. Biol. 237, 104–116 (2005)
Practical Recurrent Learning (PRL) in the Discrete Time Domain

Mohamad Faizal Bin Samsudin, Takeshi Hirose, and Katsunari Shibata

Department of Electrical and Electronic Engineering, Oita University, 700 Dannoharu, Oita 870-1192, Japan
[email protected]
Abstract. One of the authors has proposed a simple learning algorithm for recurrent neural networks, which requires computational cost and memory capacity of practical order O(n²) [1]. The algorithm was formulated in the continuous time domain, and it was shown that a sequential NAND problem was successfully learned by the algorithm. In this paper, the authors name the learning "Practical Recurrent Learning (PRL)", and the learning algorithm is simplified and converted into the discrete time domain for easy analysis. It is shown that a sequential EXOR problem and a 3-bit parity problem, as non-linearly-separable problems, can be learned by PRL, even though the learning performance is often quite inferior to that of BPTT, which is one of the most popular learning algorithms for recurrent neural networks. Furthermore, the learning process is observed and the character of PRL is shown. Keywords: Recurrent Neural Network (RNN), Supervised Learning, Practical Recurrent Learning (PRL), BPTT, Short-Term Memory.
1 Introduction

When we think of the higher functions in humans, such as logical thinking and conversation, it is easily noticed that memory plays an important role in these functions. Accordingly, the need for RNNs is expected to grow drastically in the near future as the desire for higher functions increases. Conventionally, two popular learning algorithms for recurrent neural networks have been proposed. One is BPTT (Back Propagation Through Time) [2] and the other is RTRL (Real Time Recurrent Learning) [3]. In BPTT, all the past states of the network are stored using O(nT) of memory, where n is the number of neurons and T is the present time step, and the learning is done by tracing back to the past using the memory. The order of the computational cost is O(n²T). The traced-back time step is often truncated at a constant number when T becomes large, but it is difficult to know the sufficient number of steps. On the other hand, in RTRL, the influence of each connection weight on the output of each neuron is kept in O(n³) of memory, and the order of the computation of the influence is as large as O(n⁴). BPTT is not practical in the sense that the learning must be done by tracing back to the past. Even if special hardware is developed, iteration of learning for the trace-back is necessary. RTRL is not practical in the sense that the required

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 228–237, 2008. © Springer-Verlag Berlin Heidelberg 2008
order O(n³) of memory capacity and O(n⁴) of computational cost are larger than O(n²), which is the order of the number of connections in a neural network. Even if each connection has some memory, a memory on the connection would need O(n) size, which means that the size of each memory grows with the size of the neural network. S. Hochreiter and J. Schmidhuber have proposed a special network architecture that has some memory cells. In each memory cell, there is a linear unit with a fixed-weight self-connection that enables constant, non-vanishing error flow within the memory cell [4]. They used a variant of RTRL, and only O(n²) of computational cost is required. However, a special structure is necessary, and it cannot be applied to general recurrent neural networks. Therefore, a practical learning algorithm for general recurrent neural networks that needs O(n²) or less memory and O(n²) or less computational cost is strongly required. Then Practical Recurrent Learning (PRL) was proposed in the continuous time domain. In this paper, PRL is simplified and converted into the discrete time domain for easy analysis, and the learning performance is compared to BPTT.
2 Practical Recurrent Learning (PRL)

Here, PRL is explained using an Elman-type recurrent neural network as shown in Fig. 1.

Fig. 1. An Elman-type recurrent neural network (output layer (3rd layer), hidden layer (2nd layer), input layer (1st layer))
2.1 PRL in the Continuous Time Domain [1]

This section roughly describes PRL in the continuous time domain as proposed in [1]. The forward calculation is the same as in a conventional neural network: each hidden or output neuron calculates the weighted sum of its inputs, and then the non-linear function f is applied to get the output. Here, a sigmoid function whose value range is from −0.5 to 0.5 is used. In the output layer, the error signal is calculated by

δ_j^{(3)} = Tr_j(t) − x_j^{(3)}(t)    (1)
where Tr: training signal, x_j^{(3)}: output of the output unit. Differing from regular BP, the derivative of the output function, f'_j^{(3)}, is not included. As in regular BP, the error signal in the hidden layer, δ_i^{(2)}, is calculated from the δ_j^{(3)} in the upper layer as described by the following equations:

δ_i^{(2)} = Σ_j v_ji δ_j^{(3)}(t)    (2)

(d/dt) v_ji = w_ji^{(3)}(t) f'(S_j^{(3)}(t)) − v_ji (d/dt) x_j^{(3)}(t)    (3)
where w_ji^{(3)}: connection weight (i-th hidden unit to j-th output unit), S_j^{(3)}: the net value of the j-th neuron in the output layer. f' is included in this equation on behalf of the f' that disappears from Eq. (1), so that f' is applied when the output changes. Then, in order to modify the value of a weight without tracing back to the past, it is considered that the following information should be held: (a) the latest outputs of the pre-synaptic neurons, (b) the outputs of the pre-synaptic neurons that changed recently among all the inputs to the post-synaptic neuron, and (c) the outputs of the pre-synaptic neuron that caused the change of the post-synaptic neuron's output. Corresponding to (a), (b), (c), three variables p(t), q(t), r(t) that hold the past information in various ways are introduced, and they are always modified according to the following differential equations:

τ_j (d/dt) p_ji(t) = −p_ji(t) + x_i(t) f'(S_j(t))    (4)

(d/dt) q_ji(t) = ( x_i(t) f'(S_j(t)) − q_ji(t) ) Σ_i |(d/dt) x_i(t)|    (5)

(d/dt) r_ji(t) = ( x_i(t) f'(S_j(t)) − r_ji(t) ) |(d/dt) x_j(t)|    (6)

Using the three variables, each connection weight is modified. The following equation is an example, but the details can be seen in [1]:

dw_ji(t) = ( p_ji(t) + q_ji(t) + r_ji(t) ) δ_j(t)    (7)
Among the three variables, r_ji(t) is considered to be a particularly important variable with respect to the learning of a problem that needs past information from before a long time lag. Fig. 2 shows an example of the temporal change of the variable r_ji(t) according to the input signal x_i(t) and the output signal x_j(t). As shown in Fig. 2, the important character is that r_ji(t) holds the information about the output of the pre-synaptic neuron that caused the change of the post-synaptic neuron's output. This variable ignores the inputs while the output does not change. Accordingly, the variable is expected to keep past and important information without tracing back to the past.
[Fig. 2 panels: input x_i(t), output x_j(t), and the value of the variable r_ji(t), each plotted against time]
Fig. 2. An example of the variable rji(t) transition. From equation (11), variable rji(t) integrates the value of input xi(t) when the output xj(t) changes, and holds the information of the previous state when the output does not change.
2.2 PRL in the Discrete Time Domain

In order to make the analysis of PRL learning easy, the PRL learning method in the discrete time domain is introduced here. The method of learning is similar to the conventional Back Propagation method in the sense that each connection weight is modified according to the product of the propagated error signal δ of the post-synaptic (upper) neuron and a signal that represents the output x_i of the pre-synaptic (lower) neuron. Furthermore, to make the learning process simple, the conventional BP method is used for the learning of the connection weights between the hidden layer and the output layer, and the PRL learning method is used only between the input layer and the hidden layer. In the output layer, the error signal δ_j^{(3)} is calculated as

δ_j^{(3)} = ∂E(t)/∂S_j^{(3)}(t) = ( ∂E(t)/∂x_j^{(3)}(t) ) · ( ∂x_j^{(3)}(t)/∂S_j^{(3)}(t) ) = ( Tr_j(t) − x_j^{(3)}(t) ) f'(S_j^{(3)}(t)) .    (8)
Same as in the conventional Back Propagation method, the modification of the connection weights is calculated by

Δw_ji^{(3)} = η δ_j^{(3)} x_i^{(2)} .    (9)
Each neuron in the hidden layer is trained by PRL, and the signal δ_j^{(2)} is calculated as

δ_j^{(2)} = Σ_k δ_k^{(3)} · w_kj^{(3)}(t) .    (10)

From the equation above, f'(t) is not multiplied as in the conventional BP method, because f'(t) is included in the variable r_ji(t) as shown in equation (11). Considering
that the variable r_ji(t) does not change when the output does not change, and integrates the input's value when the output changes, it is calculated as

r_ji(t) = r_ji(t−1) ( 1 − |Δx_j^{(2)}(t)| ) + x_i^{(2)}(t) f'(S_j^{(2)}(t)) |Δx_j^{(2)}(t)|    (11)

where Δx_j(t) = x_j(t) − x_j(t−1). Then, the modification of each connection weight in the hidden layer is calculated using only the variable r_ji by

Δw_ji^{(2)}(t) = η δ_j^{(2)} r_ji(t) .    (12)
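The update (11) and the weight change (12) for a single input-to-hidden connection can be sketched as below, taking |Δx_j| as the magnitude of the hidden output change and using the sigmoid with range −0.5 to 0.5 mentioned in Sect. 2.1 (function names are ours):

```python
import math

def f(s):
    """Sigmoid with output range -0.5 to 0.5, as used in the paper."""
    return 1.0 / (1.0 + math.exp(-s)) - 0.5

def df(s):
    """Derivative of f; its maximum value is 0.25 at s = 0."""
    y = f(s) + 0.5
    return y * (1.0 - y)

def update_r(r_prev, x_i, s_j, dx_j):
    """Equation (11): r holds its previous value while the hidden output
    does not change (|dx_j| = 0) and integrates the gated input
    x_i * f'(S_j) when it does."""
    g = abs(dx_j)
    return r_prev * (1.0 - g) + x_i * df(s_j) * g

def delta_w(eta, delta_j, r_ji):
    """Equation (12): the weight change uses only r, with no traceback."""
    return eta * delta_j * r_ji
```

Note how no past state beyond r itself is stored, which is what keeps the memory and computational cost at O(n²).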
3 Simulation of EXOR and 3-Bit Parity Problems

3.1 Simulation of EXOR Problem

In the previous work [1], it was shown that a sequential NAND problem could be learned by PRL, but a sequential EXOR problem could not. Here, a sequential EXOR problem in a fixed pattern order was learned by PRL in the discrete time domain. At first, the sequential EXOR logic function is explained. The EXOR problem is a logical operation on two operands that results in a logical value of 1 if and only if exactly one of the operands has the value 1 and the other has the value 0. The network architecture used in this paper is the same as shown in Fig. 1, except that it contains 1 output, 20 hidden units and 3 input signals. The input 1 is considered a signal to mark the starting time of a pattern presentation, and it is always 1 at t=0. As shown in Table 1, the value 0 or 1 is entered into the input 2 at t=5 and into the input 3 at t=15. At the other times, the signal is always 0. The training signal is given when t=time_lag (the interval from the starting time to the time when the training signal is given), and the time_lag is set to 20 unless mentioned particularly. The parameter setup is shown in Table 2. As shown in Table 2, we used the value 4.0 for the initial connection weight of the self-feedback connection to prevent the propagated error value from diverging or vanishing in the BPTT method, considering that the maximum derivative of the output function is 0.25. All the variables r are reset to 0 at t=0.

Table 1. The timing of inputs and training signal in the learning of one pattern

Time, t  |  0 | 1~4 |   5    | 6~14 |  15    | 16~time_lag
Input 1  |  1 |  0  |   0    |  0   |   0    |      0
Input 2  |  0 |  0  | 0 or 1 |  0   |   0    |      0
Input 3  |  0 |  0  |   0    |  0   | 0 or 1 |      0
(At t = time_lag, the training signal was given.)

Table 2. Parameter setup

Initial weight value for self-feedback:            4.0
Initial weight value for the other feedback:       0.0
Initial weight value (input layer-hidden layer):   Random number (-1.0~1.0)
Initial weight value (hidden layer-output layer):  0.0
Termination condition: 30000 iterations (1 pattern per iteration)
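The pattern timing of Table 1 can be sketched as an input generator (the coding of the training value, e.g. the ±0.4 used later in Sect. 3.2, is omitted here; names are ours):

```python
def exor_pattern(bit2, bit3, time_lag=20):
    """Input sequence of Table 1: input 1 marks the start at t=0,
    input 2 carries its bit at t=5, input 3 at t=15; all signals are 0
    otherwise.  The EXOR target is given at t = time_lag."""
    seq = [(1 if t == 0 else 0,
            bit2 if t == 5 else 0,
            bit3 if t == 15 else 0)
           for t in range(time_lag + 1)]
    return seq, bit2 ^ bit3

seq, target = exor_pattern(1, 1)
```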
3.1.1 Simulation Result

Table 3 shows the simulation result when the EXOR problem was tried. Successful learning is defined as the state in which the difference from the training signal is less than 0.1 for the last 4 iterations before the end of the learning. From the simulation results, it is shown that the sequential EXOR problem, as a non-linearly-separable problem, can be learned by PRL successfully, as in the case of BPTT. Moreover, we recognized that the learning performance of both learning methods improved when the learning rate for the feedback connections was smaller than the learning rate for the other connections in the network.

Table 3. Simulation result when the learning rates on the network were varied

Learning rate | LR for feedback connections | Success Rate PRL (/100 times) | Success Rate BPTT (/100 times)
1.0 | 1.0  |   1 |   6
1.0 | 0.5  |  12 |  25
1.0 | 0.1  |  96 |  95
1.0 | 0.05 | 100 | 100
1.0 | 0.01 | 100 | 100
0.5 | 0.5  |  42 |  27
0.5 | 0.1  |  94 |  92
0.5 | 0.05 | 100 |  98
0.5 | 0.01 | 100 | 100
0.1 | 0.1  |  94 |  84
0.1 | 0.05 |  94 |  82
0.1 | 0.01 |  60 |  75
In addition, Table 4 summarizes the comparison of both methods when time_lag is extended to 100; the timing of the inputs is the same as shown in Table 1. In terms of learning ability, the conventional BPTT performs better than PRL, even though the training time of PRL is far smaller than that of BPTT.

Table 4. Simulation result when the time lag is extended to 100

Time Lag | Success Rate PRL (/100 times) | Success Rate BPTT (/100 times) | Training time PRL (sec) | Training time BPTT (sec)
 20 | 100 | 100 |  4 |  8
 40 |  90 | 100 |  9 | 19
 60 |  79 | 100 | 13 | 33
 80 |  70 | 100 | 17 | 49
100 |  68 | 100 | 22 | 69
The sequential EXOR problem, as a non-linearly-separable problem, can be learned successfully to some extent by PRL in the discrete time domain, unlike its continuous counterpart. The reason for the failure of the learning in the continuous time domain is not clear, but it may be the difficulty of setting the training signal: the value of the training signal was not given at a single moment; rather, a shaped training signal over some duration was given in [1].
3.2 Simulation of 3-Bit Parity Problems

This section presents the learning performance of PRL in comparison to BPTT on a sequential 3-bit parity problem in random pattern order. In the 3-bit parity problem, the training signal is -0.4 when the number of signals whose value is 1 among the 3 given inputs (except for the input 1) is even, and the training signal is 0.4 when the number is odd. The task is considered more difficult than EXOR because the number of inputs is larger, and it might be difficult for the variable r_ji to keep the past information. We used the same network architecture (refer to Fig. 1), but there are 4 input signals: 1 input as a starter signal and 3 inputs that are used to calculate the parity signal. The input 1 is entered at t=0 and its value is always 1. Time_lag is set to 20, and the inputs 2, 3 and 4 are entered at t=5, 10 and 15, respectively, with values chosen randomly from 0 and 1. The parameter setup was the same as in the previous section, but the termination condition is that the state with a squared error of less than 10⁻³ continues for 100 patterns of learning. Furthermore, the random pattern order is employed here to eliminate the possibility of memorizing the pattern order during the learning.

3.2.1 Simulation Results

The result of the simulation for the 3-bit parity problem in random pattern order is shown in Table 5. Even though no traceback is done in PRL learning, this 3-bit parity problem is learned by PRL to some extent. However, BPTT outperforms PRL in success rate and average number of iterations.

Table 5. The comparison result of learning success rate and average number of iterations
Learning rate | LR for feedback connections | Success rate PRL (/100 times) | Success rate BPTT (/100 times) | Avg. iterations PRL | Avg. iterations BPTT
1 | 0.001 | 75 | 100 | 133003 | 17494
1 | 0.003 | 81 | 100 | 119909 | 11393
1 | 0.01  | 62 | 100 | 107270 |  7929
1 | 0.03  | 26 |  96 | 119020 |  7710
1 | 0.1   | 30 |   1 | 108248 |  6033
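The target coding described for the 3-bit parity task can be written down directly (a trivial sketch; the function name is ours):

```python
def parity_target(bits):
    """Training signal of Sect. 3.2: 0.4 if an odd number of the 3 parity
    inputs is 1, -0.4 if an even number (including zero) is 1."""
    return 0.4 if sum(bits) % 2 == 1 else -0.4
```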
Here, we focused on the big difference in the average number of iterations to find the reason for the inferiority. First, the transition of the output neuron's output just after the learning process begins is observed, as shown in Fig. 3, for both methods. In order to make a comparison, the initial values of the connection weights and the pattern order are the same for both methods. Fig. 3 shows that there is a big difference between both methods in the transition of the output when the value of input 2 is 1. The output seemed to increase drastically upon the presentation of input 2 in the case of BPTT, but in the case of PRL, little change of the output is seen despite the presence of input 2. For example, the transitions of the output in the cases of pattern P5 and pattern P4, where a circle is put in Fig. 3, show the difference between both methods.
Practical Recurrent Learning (PRL) in the Discrete Time Domain
Fig. 3. The transition of the output in the PRL and BPTT methods. The horizontal axis indicates time and the vertical axis indicates the output value. P3 indicates Pattern 3, for example.
Fig. 4 shows the connection weights from the hidden neurons to the output neuron and the output of each hidden neuron when input 2 is 1 at t=5 in the case of pattern P5. The initial values of the connection weights from the hidden neurons to the output are set to 0. As shown in Fig. 4, the sign of the connection weight between hidden 20 and the output neuron is positive in the case of PRL, while it is negative in the case of BPTT. As a result, the output stays almost the same in the case of PRL while it increases in BPTT.
Fig. 4. Connection weights from the hidden layer to the output layer and the outputs of the hidden neurons at t=5
Next, the changes of the connection weights from inputs 1, 2, 3 and 4 to hidden 20 are observed. As shown in Fig. 5, the transitions of these connection weights in the case of PRL are far smaller than in the case of BPTT. Considering that the signs of the connection weight from hidden 20 to the output are opposite between the two methods, it is not a problem that the change of the connection weight from input 1 to hidden 20 in PRL also goes in the opposite direction to the case of BPTT.
M.F.B. Samsudin, T. Hirose, and K. Shibata
Next, the transition of the variable r over one cycle for pattern P5 is shown in Fig. 6. The values are very small but, as expected, they change at the time when the corresponding input is 1. Even though the values decrease a little when another input arrives, they keep the information until the end phase of the learning process. In order to compensate for the small variable r and to promote the change of the connection weights from inputs 1, 2, 3 and 4 to the hidden neurons, the learning rate for these connections is raised. The simulation results are shown in Table 6.
Fig. 5. The change of the connection weights from inputs 1, 2, 3, and 4 to hidden 20 for both methods in the early phase of the learning process
Fig. 6. The value of the variable r from inputs 1, 2, 3, and 4 to hidden 20 in the case of P5
Table 6. Comparison of learning success rate and average number of iterations

Learning rate: 1
Learning rate from inputs 1, 2, 3, 4 to hidden neurons: 3

Learning rate for       Learning success rate (/100 times)    Average number of iterations
feedback connections          PRL        BPTT                       PRL        BPTT
0.001                          53         100                    110107       22530
0.003                          59         100                     96847       12943
0.01                           58         100                     83112        7360
0.03                           47         100                     86625        5720
0.1                            50           6                     87693        4670
As shown in Table 6, even when the learning rate from inputs 1, 2, 3, and 4 to the hidden neurons is set higher, BPTT still outperforms PRL in terms of both success rate and average number of iterations. Table 6 also shows a characteristic of BPTT: learning becomes more successful when the learning rate is set small. PRL does not share this characteristic, because its learning success rate does not depend on the learning rate for the feedback connections. More experiments and analysis are required to examine whether the learning performance of practical recurrent learning can be improved.
4 Conclusion

By formulating the PRL learning method in the discrete time domain, it was shown that the sequential EXOR problem and the 3-bit parity problem, as non-linearly-separable problems, can be learned by PRL, even though PRL is practical compared with BPTT and RTRL, requiring only O(n^2) memory and O(n^2) computational cost. However, the learning performance of PRL is inferior to BPTT. A large difference is seen in the weight transitions between PRL and BPTT, even though the variable r keeps the past information as expected. Further analysis and experiments are needed to develop and improve the performance of this learning method.
Acknowledgment

A part of this research was supported by JSPS Grants-in-Aid for Scientific Research #15300064 and #19300070.
References

[1] Shibata, K., Okabe, Y., Ito, K.: Simple Learning Algorithm for Recurrent Networks to Realize Short-Term Memories. In: Proc. of IJCNN (Int'l Joint Conf. on Neural Networks) 1998, pp. 2367–2372 (1998)
[2] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Parallel Distributed Processing 1, 318–362 (1986)
[3] Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 270–280 (1989)
[4] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9, 1735–1780 (1997)
Learning of Bayesian Discriminant Functions by a Layered Neural Network

Yoshifusa Ito (1), Cidambi Srinivasan (2), and Hiroyuki Izumi (1)

(1) Department of Policy Science, Aichi-Gakuin University, Nisshin, Aichi-ken, 470-0195 Japan
{ito,hizumi}@psis.aichi-gakuin.ac.jp
(2) Department of Statistics, University of Kentucky, Lexington, Kentucky 40506, USA
[email protected]
Abstract. Learning of Bayesian discriminant functions is a difficult task for ordinary one-hidden-layer neural networks, because the teacher signals are dichotomic random samples. When the neural network is trained, the parameters, i.e., the weights and thresholds, are usually all supposed to be optimized. However, those included in the activation functions of the hidden-layer units are optimized at the second step of BP learning. We often experience difficulty in training such 'inner' parameters when the teacher signals are dichotomic. To overcome this difficulty, we construct one-hidden-layer neural networks with a smaller number of inner parameters to be optimized, fixing some components of the parameters. This inevitably increases the number of hidden-layer units, but the network learns the Bayesian discriminant function better than ordinary neural networks.

Keywords: Bayesian, layered neural network, learning, quadratic form.
1 Introduction
We construct one-hidden-layer neural networks individually having a smaller number of inner parameters to be optimized. The goal of this paper is to show that such neural networks can learn the Bayesian discriminant function better than ordinary neural networks. The inner parameters are tentatively those included in the activation functions of the hidden-layer units, i.e., the connection weights between the input and hidden units and the thresholds. We use an approximation theorem in [3] to construct the network. Generally, learning of a conditional expectation from dichotomic random samples is a difficult task for neural networks. The posterior probability, which is used as a Bayesian discriminant function, is an example of such conditional expectations. Our neural network is designed to overcome the difficulties in learning with dichotomic random samples. In the case of approximation of a deterministic function, say f, the teacher signals are pairs (x, f(x)). Hence, the network can approximate the target function by directly minimizing the deviations of its outputs F(x) from the sample signals f(x). However, in the case of learning the conditional expectation, the teacher signals are (x, ξ(x, θ)) and the target function is the conditional

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 238–247, 2008.
© Springer-Verlag Berlin Heidelberg 2008
expectation E[ξ(x, ·)|x], where ξ is a dichotomic random variable and θ stands for the category. Accordingly, the network has to calculate the local expectations E[ξ(x, ·)|x], and hence the learning is difficult. If the network tries to reduce the deviation of its output F from ξ each time it receives a pair (x, ξ(x, θ)), the output does not converge. There have been many theoretical considerations on learning of the conditional expectation [2], [4], [5], [8-10], but works treating simulations are rare. We have been trying to construct neural networks which can learn Bayesian discriminant functions [4-7]. However, the learning was successful only when the state-conditional probability distributions were simple. We have finally learned that it is almost impossible for ordinary one-hidden-layer neural networks to learn higher-dimensional Bayesian discriminant functions. Hence, we now think that it is unavoidable to simplify the parameter space of the network so that the network can overcome the difficulties in learning with dichotomic random teacher signals. In this paper, we simplify the parameter space by decreasing the number of inner parameters to be optimized, though this increases the number of hidden-layer units. It is a general belief that the smaller the number of hidden-layer units is, the better the training of a neural network proceeds. Our method contradicts this, because the problems in learning are untangled by increasing the number of hidden-layer units. We illustrate a few simulation results.
2 Preliminaries
In this paper we treat the two-category, normal-distribution case similarly to [2]. The categories are denoted by θ1 and θ2 respectively and we set Θ = {θ1, θ2}. Let R^d be the d-dimensional Euclidean space (R = R^1) and let x ∈ R^d be the patterns to be classified. Denote by N(μi, Σi), i = 1, 2, the normal distributions, where μi and Σi are the mean vectors and the covariance matrices. They are the distributions of the patterns x from the respective categories. The state-conditional probability density functions are

p(x|θi) = (1 / √|2πΣi|) exp(−(1/2)(x − μi)^t Σi^{-1}(x − μi)),   i = 1, 2.   (1)
We suppose that the covariance matrices are not degenerate and, hence, Σi as well as Σi^{-1} are positive definite. Let P(θ1) and P(θ2) be the prior probabilities and let P(θ1|x) and P(θ2|x) be the posterior probabilities of the patterns from the respective categories. Then, by the Bayesian relation, we have

P(θ1|x) / P(θ2|x) = P(θ1)p(x|θ1) / (P(θ2)p(x|θ2)).   (2)
This ratio can be used as a Bayesian discriminant function [1]. The logistic function σ is defined by σ(t) = (1 + e^{-t})^{-1}. Since this is a strictly monotone increasing function, the logistic transform of the logarithm of (2):
σ(log(P(θ1|x) / P(θ2|x))) = P(θ1|x)   (3)

is also a Bayesian discriminant function [1]. In the case where the state-conditional probability distributions are normal, the log ratio of the posterior probabilities is a quadratic form:

g1(x) = log(P(θ1|x) / P(θ2|x))
      = −(1/2){(x − μ1)^t Σ1^{-1}(x − μ1) − (x − μ2)^t Σ2^{-1}(x − μ2)}
        + log(P(θ1) / P(θ2)) − (1/2) log(|Σ1| / |Σ2|).   (4)

The activation function of the output unit of our neural network is the logistic function and the network is trained so as to approximate (3). Hence, the inner potential of the output unit approximates the quadratic form (4). The Mahalanobis distance with respect to the normal distribution N(μi, Σi) is defined by di(x − y) = (x − y)^t Σi^{-1}(x − y), i = 1, 2. Hence,

g2(x) = −(1/2){(x − μ1)^t Σ1^{-1}(x − μ1) − (x − μ2)^t Σ2^{-1}(x − μ2)}   (5)

can be a Mahalanobis discriminant function. This can be obtained by removing the constant

log(P(θ1) / P(θ2)) − (1/2) log(|Σ1| / |Σ2|)   (6)

from (4), as stated in [7].
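The identity σ(g1(x)) = P(θ1|x) implied by (3) and (4) can be checked numerically. The sketch below, with illustrative means, covariances and priors of our own choosing, computes the posterior both directly from Bayes' rule and via the quadratic form:

```python
import numpy as np

def gauss(x, mu, S):
    # d-dimensional normal density N(mu, S), Eq. (1)
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))
    return np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff) / norm

def g1(x, mu1, S1, mu2, S2, P1, P2):
    # Quadratic-form log posterior ratio, Eq. (4)
    q1 = (x - mu1) @ np.linalg.inv(S1) @ (x - mu1)
    q2 = (x - mu2) @ np.linalg.inv(S2) @ (x - mu2)
    return (-0.5 * (q1 - q2) + np.log(P1 / P2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)))

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative parameters (our own choice)
mu1, S1 = np.array([1.0, 0.0]), np.array([[2.0, 1.0], [1.0, 1.0]])
mu2, S2 = np.array([0.0, 0.0]), np.eye(2)
P1, P2 = 0.7, 0.3
x = np.array([0.5, -0.3])

# Posterior directly from Bayes' rule, Eq. (2)
num = P1 * gauss(x, mu1, S1)
post = num / (num + P2 * gauss(x, mu2, S2))

# Eq. (3): the logistic transform of g1 equals the posterior
assert abs(sigma(g1(x, mu1, S1, mu2, S2, P1, P2)) - post) < 1e-9
```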
3 Difficulty of Learning Posterior Probabilities
A one-hidden-layer neural network outputs a linear sum of the form

Σ_{i=1}^{n} ai g(wi · x + ti) + a0,

where a0, ai and ti are constants and wi = (wi1, ..., wid) are vectors. Since these constants are to be modified during training, we call them parameters. Among them we call wi1, ..., wid and ti inner parameters, and ai and a0 outer parameters. We sometimes identify an approximation formula with the neural network based on it. The network above has n(d+2)+1 constants to be optimized. Among them n(d+1) are inner parameters and n+1 are outer parameters. Figure 1 shows the difficulty in learning with dichotomic random samples. The space of patterns is one-dimensional.
Learning of Bayesian Discriminant Functions by a Layered Neural Network
241
The dots in Figure 1 are random samples from {(x, ξ(x, θ))}, x ∈ R, θ ∈ Θ = {θ1, θ2}, where ξ(x, θ1) = 1 and ξ(x, θ2) = 0. The probability of the event ξ(x, θ) = 1 is P(θ1|x) = E[ξ(x, θ1)|x]. For simple figurative representation, it is based on one-dimensional normal distributions. The smooth curve is the posterior probability, which the neural network has to approximate.

Fig. 1. Dichotomic samples and the posterior probability. The abscissa is the x-axis.

In the higher-dimensional case, the network has to approximate a (hyper)surface. If the teacher signals are {(x, P(θ1|x))}, P(θ1|x) = E[ξ(x, θ)|x], then the learning is an ordinary function approximation by a neural network. However, the teacher signals are {(x, ξ(x, θ))} for our neural network. Comparing the two sets of teacher signals, one may understand that approximation of P(θ1|x) using the random samples from {(x, ξ(x, θ))} is more difficult.

For approximation of a quadratic form (4), Funahashi [2] has proposed an approximation formula

Σ_{i=1}^{2d} ai g(wi · x + ti) + a0.   (7)

Since d is the dimension of the space of the pattern x, this approximation formula has 2d(d+2)+1 constants to be optimized. Among them, 2d(d+1) constants are inner constants. Actually an approximation formula

Σ_{i=0}^{d} ai g(wi · x) + a0   (8)

having a smaller number of activation functions can approximate the quadratic form [3], [5]. In this formula, the number of constants to be optimized is (d+1)^2 + 1 and that of the inner constants is d(d+1). Since one of the activation functions is to approximate a linear form in x, the approximation formula (8) can be replaced by

Σ_{i=1}^{d} ai g(wi · x) + w0 · x + a0.   (9)

The values of the constants are of course distinct in (8) and (9), though the same notation is used. In (9), the numbers of activation functions and parameters are decreased to d and (d+1)^2 respectively. Hence, we used (9) in [6] and [7], expecting that the parameter space would be simplified remarkably. Nevertheless, the network based on (9) was successful only in the case where the probability distributions are simple. In particular, training of the inner parameters was very difficult. This convinced us that there is no way to overcome the difficulties other than decreasing the number of inner parameters.
4 Modified One-Hidden-Layer Neural Network
In this section we construct a neural network having more hidden-layer units but fewer parameters to be optimized. There are unit vectors uk, k = 1, ..., d(d+1)/2, in R^d such that (uk · x), k = 1, ..., d, are linearly independent and (uk · x)^2, k = 1, ..., d(d+1)/2, are also linearly independent. Here, we illustrate an example. Suppose that uk, k = 1, ..., d(d+1)/2, are distinct unit vectors in R^d such that, for k = 1, ..., d, the k-th element is 1 and the others are 0, and, for k = d+1, ..., d(d+1)/2, two elements are equal to 1/√2 and the others are 0. Then, they satisfy the condition [3]. Hence, any quadratic form Q(x) in R^d can be expressed as

Q(x) = Σ_{k=1}^{d(d+1)/2} a2k (uk · x)^2 + Σ_{k=1}^{d} a1k (uk · x) + a0.
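For d = 2, this construction gives the three unit vectors e1, e2 and (e1+e2)/√2, whose squared projections span x1^2, x2^2 and x1x2. The sketch below recovers the coefficients a2k, a1k, a0 of an arbitrary example quadratic form (our own choice) by solving a linear system and checks the representation at an unseen point:

```python
import numpy as np

# d = 2: unit vectors from the construction in the text
u = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)]])

# Target quadratic form (arbitrary illustrative example)
def Q(x):
    return 3 * x[0]**2 + 2 * x[0] * x[1] - x[1]**2 + 4 * x[0] - 5 * x[1] + 7

# Basis: (u_k . x)^2 for k = 1..3, (u_k . x) for k = 1..2, and the constant 1
def features(x):
    proj = u @ x
    return np.concatenate([proj**2, proj[:2], [1.0]])

# Solve for the 6 coefficients using 6 generic sample points
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]], dtype=float)
A = np.array([features(p) for p in pts])
b = np.array([Q(p) for p in pts])
coef = np.linalg.solve(A, b)

# The representation reproduces Q exactly at an unseen point
x = np.array([3.0, -2.0])
assert abs(features(x) @ coef - Q(x)) < 1e-8
```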
Though this expression can be generalized to polynomials of higher degree [3], this is enough for this paper. Let u be a unit vector in R^d. For a probability measure μ in R^d, we denote by μu its projection onto the line {tu : −∞ < t < ∞}. Then, we have the theorem below.

Theorem 1. Let μ be a probability measure on R^d and let g ∈ C^N(R) (N ≥ 2) be such that g^(i)(0) ≠ 0 (1 ≤ i ≤ 2). Suppose that there exists a function g0 on R such that, for any δ (|δ| < 1),

|g^(i)(δt) t^i| < g0(t),   0 ≤ i ≤ 2,

and, for any unit vector u in R^d, g0 ∈ L^p(R, μu) (1 ≤ p ≤ ∞). Then, for any quadratic form Q and any ε > 0, there exist coefficients aik and constants δi (0 ≤ i ≤ 2, 1 ≤ k ≤ d(d+1)/2) for which

‖ ∂^α Q / ∂x^α − ∂^α Q̄ / ∂x^α ‖_{L^p(R^d, μ)} < ε,   |α| ≤ N,

where

Q̄(x) = Σ_{k=1}^{d(d+1)/2} a2k g(δ2 uk · x) + Σ_{k=1}^{d} a1k g(δ1 uk · x) + a0   (10)

and uk are unit vectors satisfying the condition above.

This theorem is a special case of a theorem in [3]. Though we do not have enough space to prove it here, one can see that (10) comes from the expression of the quadratic form Q above. The condition on g and μ is to suppress the L^p-norm in a neighborhood of infinity. If the activation function and its derivatives are bounded and the probability measure is rapidly decreasing, this condition is always satisfied.
Since the second sum in (10) is to approximate a linear form, the approximation formula can be modified to
Q̄(x) = Σ_{k=1}^{d(d+1)/2} a2k g(δ uk · x) + Σ_{k=1}^{d} a1k xk + a0.   (11)
When δ is infinitesimal, (11) can coincide with the quadratic form Q. This approximation formula can be realized by a one-hidden-layer neural network having direct connections between the input and output layers. Figure 2 illustrates such a neural network. Since the vectors uk are fixed beforehand, the number of parameters to be optimized in (11) is (1/2)(d^2 + 3d + 4). Moreover, only one of them is an inner parameter; the others are outer parameters. This must make the parameter space very simple. We performed simulations to confirm the usefulness of the network. The network has d(d+1)/2 hidden-layer units. The linear sum Σ_{k=1}^{d} a1k xk is input to the output unit via the direct connections. Hence, the inner potential of the output unit of the neural network illustrated in Figure 2 can approximate (10). If the activation function of the output unit is the logistic function, then, by (3) and (4), it can approximate the posterior probability P(θ1|x) [2].

Fig. 2. A one-hidden-layer neural network having direct connections between the input layer and the output unit. D = d(d+1)/2.
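A minimal sketch of a network realizing (11) might look as follows. The class name, initialization and the shifted-logistic hidden activation g (chosen so that g'(0) and g''(0) are nonzero, as Theorem 1 requires) are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def g(t):
    # Shifted logistic: an illustrative activation with g'(0) != 0 and g''(0) != 0
    return sigma(t + 1.0)

class ModifiedNet:
    """Sketch of the network realizing Eq. (11); all names are illustrative."""
    def __init__(self, d, delta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed unit vectors: d axis vectors, then d(d-1)/2 vectors with two 1/sqrt(2) entries
        U = [row for row in np.eye(d)]
        for i in range(d):
            for j in range(i + 1, d):
                v = np.zeros(d)
                v[i] = v[j] = 1.0 / np.sqrt(2.0)
                U.append(v)
        self.U = np.array(U)                          # shape (d(d+1)/2, d)
        self.delta = delta                            # the single inner parameter
        self.a2 = rng.normal(0.0, 0.1, len(self.U))   # outer: hidden -> output
        self.a1 = rng.normal(0.0, 0.1, d)             # outer: direct input -> output
        self.a0 = 0.0

    def forward(self, x):
        hidden = g(self.delta * (self.U @ x))                 # fixed directions, scaled by delta
        potential = self.a2 @ hidden + self.a1 @ x + self.a0  # inner potential ~ Eq. (11)
        return sigma(potential)                               # logistic output ~ P(theta1 | x)

net = ModifiedNet(d=2)
assert net.U.shape == (3, 2)   # D = d(d+1)/2 = 3 hidden units for d = 2
y = net.forward(np.array([0.5, -0.5]))
assert 0.0 < y < 1.0
```

Only delta would be trained as an inner parameter; a0, a1 and a2 are the outer parameters.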
5 Training of the Neural Network
Let F(x, w) be the output of the neural network. Training of the network is based on the proposition below by Ruck et al. [9]. As we have repeatedly used this proposition [4-7], we omit details here.

Proposition 2. Set

E(w) = ∫_{R^d} Σ_{i=1}^{2} (F(x, w) − ξ(x, θi))^2 P(θi) p(x|θi) dx.   (12)

Then,

E(w) = ∫_{R^d} (F(x, w) − E[ξ(x, ·)|x])^2 p(x) dx + ∫_{R^d} V[ξ(x, ·)|x] p(x) dx.   (13)
If ξ(x, θ1) = 1 and ξ(x, θ2) = 0, then E[ξ(x, ·)|x] = P(θ1|x). The integrand of the second integral is the conditional variance of ξ and is independent of the weight
vector w. Hence, when E(w) is minimized, the output F(x, w) is expected to approximate P(θ1|x), provided the network has the capability of approximating the posterior probability. Accordingly, learning of the network is carried out by minimizing

En(w) = (1/n) Σ_{k=1}^{n} (F(x^(k), w) − ξ(x^(k), θ^(k)))^2,   (14)

where {(x^(k), ξ(x^(k), θ^(k)))}_{k=1}^{n}, (x^(k), θ^(k)) ⊂ R^d × Θ, is the training set. This method of training has been treated by many authors [2,4-10]. In the case of ordinary approximation of functions, the quantity to be minimized is

En(w) = (1/n) Σ_{k=1}^{n} (F(x^(k), w) − f(x^(k)))^2.   (15)
It is obvious that minimization of (14) is more difficult than that of (15).
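The difference between minimizing (14) and (15) can be made concrete with a one-dimensional toy problem: fitting a logistic model to 0/1 teacher signals drawn with probability P(θ1|x). All model and problem details below are our own illustrative choices; the point is that the fit approaches the conditional expectation, not the 0/1 samples themselves:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)

# 1-D toy problem: true posterior P(theta1|x) = sigma(2x - 1);
# teacher signals xi are dichotomic 0/1 draws, as in Eq. (14)
n = 5000
x = rng.normal(0.0, 1.0, n)
xi = (rng.random(n) < sigma(2.0 * x - 1.0)).astype(float)

# Model F(x) = sigma(w x + b); minimize E_n(w) of Eq. (14) by batch gradient descent
w, b = 0.0, 0.0
for _ in range(5000):
    F = sigma(w * x + b)
    common = 2.0 * (F - xi) * F * (1.0 - F)   # squared-error gradient through the logistic
    w -= np.mean(common * x)
    b -= np.mean(common)

# The fitted output tracks E[xi|x] = P(theta1|x), not the individual 0/1 signals
assert sigma(w * 1.5 + b) > 0.65       # true posterior at x=1.5 is about 0.88
assert sigma(w * (-1.5) + b) < 0.35    # true posterior at x=-1.5 is about 0.02
```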
6 Simulations
Unlike ordinary one-hidden-layer neural networks, the network based on (11) can learn two-dimensional Bayesian discriminant functions. We have confirmed this by simulation. Figs. 3a and 3b illustrate the probability density functions of the normal distributions N(μ1, Σ1) and N(μ2, Σ2) used in one of the simulations, where μ1 = (1, 0), μ2 = (0, 0) and

Σ1 = ( 2 1 ; 1 1 ),   Σ2 = ( 1 0 ; 0 1 )

respectively. These are the state-conditional probability density functions of the categories θ1 and θ2 respectively. Fig. 3c illustrates the posterior probability of the category θ1 in the case of the prior probabilities P(θ1) = 0.7 and P(θ2) = 0.3. As stated in Section 2, this can be used as a Bayesian discriminant function. The neural network was trained with 1000 teacher signals.
Fig. 3. a and b are the state-conditional probability density functions of Categories θ1 and θ2 respectively, and c is the posterior probability
Fig. 4 illustrates an example of the learning process. Fig. 4a is the output of the network with the initial values of the parameters. It converged to Fig. 4c via Fig. 4b. The unit vectors are u1 = (1, 0), u2 = (0, 1) and u3 = (1/√2, 1/√2). During learning, the parameters a0, a11, a12, a21, a22, a23 were optimized, but the value
Fig. 4. Learning process. The target is the surface illustrated in Fig. 3c.
of the constant δ did not change much. This probably indicates the difficulty of training the inner parameter. Even this neural network is not always successful in learning: its success depends on the choice of the initial values of the parameters. In particular, if the initial output of the network is a flat surface, learning rarely proceeds. One may recognize slight differences between the Bayesian discriminant function obtained theoretically (Fig. 3c) and that obtained by simulation (Fig. 4c). These are probably because the teacher signals are random variables and the inner parameter δ is not infinitesimal. The classification capabilities of the two discriminant functions (Fig. 3c and Fig. 4c) are compared in Table 1. By the method stated in [7], we can easily convert the Bayesian discriminant function (4) to the Mahalanobis discriminant function (5), because both are quadratic forms distinct from each other by a constant. Hence, the table also compares the capabilities of the Mahalanobis discriminant function obtained theoretically and that obtained by simulation. The test signals are distinct from those used for training. In this table, the numbers in the row "Alloc. to θ1 (resp. θ2)" are those of patterns allocated to category θ1 (resp. category θ2), and the row "Correct Alloc." gives the total number of correctly allocated patterns. The columns TS, BT, BS, MT and MS stand for the test signals (TS), the Bayesian discriminant function obtained theoretically (BT, Fig. 3c) and obtained by simulation (BS, Fig. 4c), and the Mahalanobis discriminant functions obtained theoretically (MT) and obtained by simulation (MS) respectively. Only in the column TS are all signals correctly allocated. The BT allocated 843 (resp. 157) patterns to category θ1 (resp. θ2), among which 175 (resp. 37) patterns were to be allocated to

Table 1. Classification results in the first example (see text)

                     TS      BT         BS         MT         MS
Alloc. to θ1        705    843(175)   861(187)   614( 87)   741(109)
Alloc. to θ2        295    157( 37)   139( 31)   386(178)   259(150)
Correct Alloc.     1000    788        782        735        741
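The allocation counts of the theoretical Bayesian rule (the BT column) can be mimicked by drawing labeled test patterns from the two-category mixture and allocating each pattern by the sign of g1. The sketch below uses the first example's parameters; the sampling procedure and the accepted count range are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# First example: mu1=(1,0), mu2=(0,0), Sigma1=((2,1),(1,1)), Sigma2=I, P(theta1)=0.7
mu = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
S = [np.array([[2.0, 1.0], [1.0, 1.0]]), np.eye(2)]
P = [0.7, 0.3]

def g1(x):
    # Eq. (4): quadratic-form log posterior ratio
    q = [(x - mu[i]) @ np.linalg.inv(S[i]) @ (x - mu[i]) for i in range(2)]
    return (-0.5 * (q[0] - q[1]) + np.log(P[0] / P[1])
            - 0.5 * np.log(np.linalg.det(S[0]) / np.linalg.det(S[1])))

# Draw 1000 labeled test patterns and allocate by the sign of g1 (the BT rule)
n = 1000
labels = (rng.random(n) < P[1]).astype(int)           # 0 -> theta1, 1 -> theta2
chol = [np.linalg.cholesky(S[i]) for i in range(2)]
xs = np.array([mu[c] + chol[c] @ rng.normal(size=2) for c in labels])
alloc = np.array([0 if g1(x) > 0 else 1 for x in xs])

correct = int(np.sum(alloc == labels))
# The classes overlap, so even the Bayes rule is right only about 79% of the time (cf. Table 1)
assert 700 <= correct <= 870
```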
Table 2. Classification results in the second example (see text)

                     TS      BT        BS        MT        MS
Alloc. to θ1        708    717(32)   706(26)   687(17)   664(11)
Alloc. to θ2        292    283(23)   294(28)   313(38)   336(55)
Correct Alloc.     1000    945       946       945       934
category θ2 (resp. θ1). Since the distributions of the two categories overlap, even the BT can correctly allocate only 1000 − (175 + 37) = 788 patterns among 1000. However, the allocations by the BT and BS coincide for 982 patterns among 1000 (not shown in the table), and those by MT and MS for 924. This implies that the learning is successful. Another example is illustrated in Table 2, where μ1 = (1, 1), μ2 = (−1, −2) and

Σ1 = ( 2 1 ; 1 1 ),   Σ2 = ( 1 0.5 ; 0.5 1 ).

In this case the two probability distributions are rather separated. Hence, the classifications by the respective discriminant functions are more successful. Moreover, the allocations by BT and BS coincide for 989 patterns among 1000, and those by MT and MS for 977.
7 Conclusion and Discussions
We have often experienced that ordinary layered neural networks based on the approximation formula (9) fail to approximate the posterior probability unless the probability distribution is very simple. In the case where the dimension of the patterns x is higher than or equal to 2, it appears impossible for such a network to learn the posterior probability. However, the neural network based on (11) performs the task better even in the two-dimensional case. We think that this is because the latter has a smaller number of parameters to be optimized and, more importantly, they are mostly outer parameters. Our network has more hidden-layer units than the minimum requirement. This contradicts the general belief that a neural network having a smaller number of units works better. However, the number of parameters to be trained, and whether they are inner or outer, may be more directly related to the complexity of learning. Since our network has fewer parameters to be trained and most of them are outer parameters, it is understandable that the network works better. When the network is based on (11), the error function (12) usually has no global minimum, because the output F(x, w) converges to the posterior probability P(θ1|x) as the scaling parameter δ tends to infinitesimal. Hence, we expected that δ would converge toward zero during training, but it usually stays in the neighborhood of its initial value. The network often realizes the approximation using the initially given value of δ with only a slight modification. This probably indicates the difficulty of training inner constants with random dichotomic teacher signals. This may in turn justify our policy of decreasing the number of inner parameters to be
trained, even if it increases the number of hidden-layer units. In short, the inner parameters are converted to outer parameters in our neural network. The flexibility of the basis functions characterizes neural networks; it makes them useful, powerful tools for multiple purposes. One of the standard methods of approximating functions is to use the basis functions of a complete orthonormal system, in which case only the coefficients of the basis functions are optimized. Approximation by our neural network is intermediate between these two methods, because its basis functions are flexible only to some extent. Since the space of target functions is restricted to quadratic forms, the flexibility of the basis functions can be restricted in this paper. However, the usefulness of the neural network is not limited to this special case. The approximation based on Theorem 1 is on the whole space in the L^p sense, and the original theorem in [3] concerns approximation of polynomials of arbitrary degree. Hence, by the polynomial approximation theorem, our neural network may be useful in approximating general continuous functions when the probability measure is rapidly decreasing. We think that the idea of making learning easier by converting inner parameters to outer parameters can be widely applied.
References

1. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley & Sons, New York (1973)
2. Funahashi, K.: Multilayer neural networks and Bayes decision theory. Neural Networks 11, 209–213 (1998)
3. Ito, Y.: Simultaneous approximations of polynomials and derivatives and their applications to neural networks (submitted)
4. Ito, Y., Srinivasan, C.: Multicategory Bayesian decision using a three-layer neural network. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 253–261. Springer, Heidelberg (2003)
5. Ito, Y., Srinivasan, C.: Bayesian decision theory on three-layer neural networks. Neurocomputing 63, 209–228 (2005)
6. Ito, Y., Srinivasan, C., Izumi, H.: Bayesian learning of neural networks adapted to changes of prior probabilities. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 253–259. Springer, Heidelberg (2005)
7. Ito, Y., Srinivasan, C., Izumi, H.: Discriminant analysis by a neural network with Mahalanobis distance. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 350–360. Springer, Heidelberg (2006)
8. Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation 3, 461–483 (1991)
9. Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., Suter, B.W.: The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1, 296–298 (1990)
10. White, H.: Learning in artificial neural networks: A statistical perspective. Neural Computation 1, 425–464 (1989)
RNN with a Recurrent Output Layer for Learning of Naturalness

Ján Dolinský and Hideyuki Takagi

Kyushu University, 4-9-1 Shiobaru, Minami-ku, Fukuoka, 815-8540 Japan
[email protected], [email protected]
Abstract. The behavior of recurrent neural networks with a recurrent output layer (ROL) is described mathematically, and it is shown that using ROL is not only advantageous but in fact crucial to obtaining satisfactory performance for the proposed naturalness learning. Conventional belief holds that employing ROL often substantially decreases the performance of a network or renders the network unstable, and ROL is consequently rarely used. The objective of this paper is to demonstrate that there are cases where it is necessary to use ROL. The concrete example shown models naturalness in handwritten letters.
1 Introduction
In engineering, recurrent neural networks (RNN) have not often been proposed as a promising solution. However, the difficulties with training an RNN have been overcome, and recent theoretical advances in the field have made training an RNN quicker and easier [4]. Recurrent connections have not been found to increase the approximation capabilities of the network [7]; nevertheless, we may obtain decreased complexity, network size, etc. while solving the same problem. In some applications, such as speech recognition, object detection or prediction, our classification / prediction at time t should be more accurate if we can account for what we saw at earlier times. The most common approach to modeling such systems is to use a suitably powerful feed-forward network and feed it with a finite history of the inputs, and optionally the outputs, through a sliding window. Early attempts to improve this often tedious technique resulted in various network architectures based on the feed-forward topology, with the recurrent connections set to fixed values to ensure that the backpropagation rule can be used [1][6]. Much work has been done on autonomous Hopfield networks as well as on training algorithms that can be applied to an RNN in a feed-forward fashion (i.e., BPTT). Recent theoretical advances in RNN research such as Echo State Networks (ESN) afford the modeling of fully general topologies, which were difficult to train directly with the former techniques. An interesting example is a topology where the output units have connections not only from the internal units but also from the input units and the output units, yielding a recurrent output layer (ROL). Although connections from the input units are often used, connections from the output layer are rare. In the following

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 248–257, 2008.
© Springer-Verlag Berlin Heidelberg 2008
RNN with a Recurrent Output Layer for Learning of Naturalness
chapters, we explain what behavior a ROL implies, introduce the concept of our proposed naturalness learning, and show that using a ROL not only increases the performance but is actually an intrinsic part of modeling with the proposed naturalness learning. One of the earliest RNNs in which the output activation values from the previous step were used to compute the output activation values in the next step was the Jordan network [6][5]. In the Jordan network, the activation values of the output units are fed back into the input layer through a set of extra input units called the state units. This type of network is called an output-recurrent network. Various modifications to output-recurrent networks have been proposed and successfully used for modeling difficult non-linear tasks [8]. A RNN with a ROL, in contrast to an output-recurrent network, uses the output activation values of the previous step directly to compute the output in the next step. The output activation values of the previous step can be thought of as extra hidden units. We are aware of no papers discussing applications of RNN with a ROL.
2
Dynamics of RNN with a Recurrent Output Layer
Adopting a standard perspective of system theory, we view a deterministic and discrete-time dynamical system as a function G which yields the next system output, given the input and output history: y(n + 1) = G(..., u(n), u(n + 1); ..., y(n − 1), y(n))
(1)
where u(n) is the input vector and y(n) is the output vector for the time step n. The echo-state approach enables us to approximate systems represented by G directly, without the need to convert a time series into static input patterns by the sliding-window technique [2]. Consider a discrete-time ESN [4] consisting of K input units with an activation vector u(n) = (u1(n), ..., uK(n))^t, N internal units with an activation vector x(n) = (x1(n), ..., xN(n))^t, and L output units with an activation vector y(n) = (y1(n), ..., yL(n))^t, where t denotes the transpose. The corresponding input, internal and output connection weights are collected in the N × K, N × N, and L × (K + N + L) weight matrices Win, W, Wout respectively. Optionally, an N × L matrix Wback may be used to project the output units back to the internal units. The internal units' activation is computed according to x(n + 1) = f(Win u(n + 1) + Wx(n) + Wback y(n))
(2)
where f denotes the component-wise application of the transfer (activation) function to each internal unit. The output is computed as y(n + 1) = fout(Wout(u(n + 1), x(n + 1), y(n)))
(3)
where (u(n + 1), x(n + 1), y(n)) represents the concatenated vector consisting of input, internal and output activation vectors. The concatenated vector often
250
J. Dolinsk´ y and H. Takagi
Fig. 1. Echo-state network: the dotted lines show connections which can be trained; the gray lines show connections which are optional
consists only of the input and internal activations, or of the internal activations only. Fig. 1 shows the architecture of an ESN. See [4] for further details concerning the training of an ESN. A closer look at Eq. (3) reveals that a system output y(n + 1) is constructed from the given input and output history via two distinct mechanisms: indirectly from the activation vector of the internal units x(n) (by computing x(n + 1) via Eq. (2)), and optionally directly from the activation vectors of the input units u(n + 1) and/or output units y(n). The internal units' activation x(n + 1) is computed using the input and output activations u(n + 1), y(n) and the activation x(n) of the internal units from the previous step, which recursively reflects the influence of input and output activations from previous steps. We can therefore rewrite Eq. (2) as x(n + 1) = E(..., u(n), u(n + 1); ..., y(n − 1), y(n))
(4)
where E depends on the history of the input signal u and the history of the desired output signal y itself; thus, in each particular task, E shares certain properties with the desired output and/or given input. How strongly the internal activation x(n + 1) is influenced by the activations u(n + 1), y(n) and x(n) (which recursively consist of previous input/output activations) is controlled by the size of the weights in the matrices Win, Wback and W respectively. The algebraic properties of the matrix W are particularly important for the short-term memory property of an ESN [3]. Besides using the activations of the internal units, it is sometimes advantageous to also use the activations of the input and output units directly. Although the activation vector x(n) reflects the history of the desired output and/or given input, the activation vectors u(n + 1), y(n) in Eq. (3) are used merely as another form of input. This usage corresponds to connecting the input units to the output units, and the output units to the output units themselves, directly. Direct connection of input units to output units is often used, whereas direct connection of output
units to output units is rare. It is the connecting of output units to each other that makes the output layer recurrent. A ROL implies a substantial influence of the previously generated output y(n) on the successive output y(n + 1). The activation y(n) is only an approximation of the system output at step n and is thus always generated with a certain error. This error is included in the computation of the successive output activation y(n + 1) and can easily accumulate with each update step. It is for this reason that computation using a ROL has been rare.
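The two update equations (2) and (3), including the direct output-to-output path that makes the output layer recurrent, can be sketched in NumPy as follows (a minimal illustration; the function name and the way the matrices are passed are ours, not from [4]):

```python
import numpy as np

def esn_step(x, y, u_next, W_in, W, W_back, W_out, f=np.tanh):
    """One update of an ESN with a recurrent output layer (Eqs. 2-3).

    x, y   -- internal and output activations at step n
    u_next -- input activation u(n+1)
    Returns (x_next, y_next) for step n+1.
    """
    # Eq. (2): internal state driven by input, reservoir, and output feedback
    x_next = f(W_in @ u_next + W @ x + W_back @ y)
    # Eq. (3): readout sees the concatenated (input, internal, output) vector;
    # including y here is exactly what makes the output layer recurrent (ROL)
    z = np.concatenate([u_next, x_next, y])
    y_next = W_out @ z  # f_out is taken as the identity, as in Sec. 4.2
    return x_next, y_next
```

Dropping y from the concatenated vector z recovers the more common non-ROL readout.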
3
RNN with a Recurrent Output Layer for Naturalness Learning
In this section, we demonstrate the principles of naturalness learning by showing how to express and model the unique quality of handwritten letters. We also explain why a ROL works well with naturalness learning. The writing style of any given person is very much individual and can easily be distinguished from mechanically or electronically printed text. Moreover, we can distinguish between the writing of different people. Everybody learns their alphabet in school, and while writing in one's own individual way, a person is trying to approximate the shapes of the letters as they learned them in school. We can therefore understand handwriting as turning the basic shape of a letter, as learned in school, into the writer's particular, unique, handwritten form. A human can then be seen as a 'filter' which adds his or her own characteristics to the shapes of those basic letters. To explain the term naturalness, we must first define our terminology. Let us refer to the letters used in textbooks (either printed or cursive) as the font letters (fontL). We view handwriting as the process of turning a fontL into its handwritten form. The unique quality of the handwritten letter (handL) can then be understood as the difference between the handL and the fontL, expressed as a 2-D displacement vector field of evenly spaced points along the strokes of the font and its respective handwritten version (see Fig. 2).¹ We refer to this difference as naturalness. In other words, adding the naturalness to the fontL will result in a handL of a unique form. We can thus assume a relation between the fontL and the naturalness. This assumption enables us to model naturalness by a system which employs certain characteristics of the fontL as its input. The task of naturalness learning is to find and learn the relation between font letters and naturalness.
Speaking in terms of naturalness learning, the target system (handwritten letters) resembles the basic system (font letters), with its behavior (the shape of handwritten letters) deviating from the basic system to a certain extent. Learning and modeling naturalness using a RNN with a ROL produced interesting results. As mentioned in Section 2, computing with a ROL is prone to
¹ A displacement vector is not the only means of expressing naturalness; it can be expressed using an arbitrary mechanism.
Fig. 2. Hiragana letter /ka/. Naturalness expressed by 2-D displacement vector field. Font letter strokes shown in black, handwritten strokes shown in gray.
the accumulation of error from previous steps. This phenomenon has been found harmful in many tasks. With naturalness learning, however, a ROL performs well. In the handwriting task, we found that bringing the activation y(n) into play via the update Eq. (3) always helped the network generate y(n + 1) with greater accuracy. An intuitive explanation is as follows. The way a person writes the first half of a handwritten stroke influences, to a certain extent, how the second part is going to look; i.e., distortion of a certain part of a stroke usually implies some other distortion to a successive part of the stroke. It is therefore reasonable to assume that a short sequence of points on a handwritten stroke influences where the next point will appear, with the very last point of the sequence having the greatest influence. The same holds for the naturalness extracted as the difference between handL and fontL. On the other hand, it is not only the very last point which influences the position of the next point of a stroke, but a short sequence of the previous points. Thus, a backprojection matrix Wback was used so as to ensure that a recent short sequence of generated output ..., y(n − 1), y(n) is also reflected in y(n + 1) via the activation x(n + 1) (see Eq. (2) and Eq. (3)). The same principle holds for the input activations u(n + 1) extracted from fontL strokes. The update Eq. (2) ensures that a short sequence ..., u(n), u(n + 1) is reflected in the activation x(n + 1), which is in turn used to compute y(n + 1) via the update Eq. (3). The activation u(n + 1) is also used directly in Eq. (3) to ensure that the very last point of the input sequence has significant influence in the computation of y(n + 1). The fact that the naturalness is being modeled, instead of the target system, allows us to control the amount of naturalness being added to the font letters. A weight of value 1 will render generated letters as close to a person's handwriting as possible.
A value of, say, 0.6 will reduce the naturalness to 60%, providing us with neater handwritten letters. Values close to 0 will render generated letters very close to the font letters. It is also possible to combine several individuals' naturalness (e.g., 40% of person A's naturalness with 60% of person B's naturalness).
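The weighting and mixing described above can be sketched as follows (an illustrative helper of our own, assuming strokes are stored as n × 2 point arrays and each writer's naturalness as a matching displacement field):

```python
import numpy as np

def blend_naturalness(font_pts, naturalness_list, weights):
    """Add a weighted mix of naturalness fields to font-letter points.

    font_pts         -- (n, 2) points of a fontL stroke
    naturalness_list -- list of (n, 2) displacement fields, one per writer
    weights          -- e.g. [0.4, 0.6] mixes 40% of writer A's naturalness
                        with 60% of writer B's; a single weight of 1.0
                        reproduces that writer's handL as closely as possible
    """
    mix = sum(w * d for w, d in zip(weights, naturalness_list))
    return font_pts + mix
```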
The naturalness learning approach is not limited to the handwriting task. We believe one can generate natural-looking movements for parts of the human body with an inverse kinematics algorithm employed as the basic system. Generating emotional human speech, with synthesized speech used as the basic system, might be another interesting application of the naturalness learning approach.
4
Experiments
In this chapter, we demonstrate how the naturalness learning approach copes with the handwriting task. The letters used in the experiments are symbols of the Japanese syllabary - hiragana. The fontL were extracted from the Bitstream Vera Sans font onto a 250×250 pixel canvas. Every stroke of a given letter was turned into a set of Bézier curves - a path. Every path was then evenly spaced into a set of points. Let each P^{fl}_{ij} be the set of points represented by an n_{ij} × 2 matrix, where i denotes the index of the letter, j the index of the stroke within the letter, and n_{ij} is the number of points. P^{fl}_{ij}(k) = (x_k, y_k) therefore represents the k-th point of the j-th stroke of the i-th letter (the k-th row of the matrix P^{fl}_{ij}). The handL were first scanned and then appropriately scaled so as to ensure they fit the canvas. Every handL stroke was then aligned to its fontL stroke counterpart. This alignment ensures that the displacement vector field expresses only a shape transformation between a pair of fontL and handL strokes. In order to split strokes of handL, the same procedure was applied as with the fontL, saving the points for each stroke into P^{hl}_{ij}. The spacing interval along each stroke (path) of a handL was, however, also adjusted so that the number of points matched those in the corresponding fontL stroke.
4.1
Data Specification
The input signals were extracted from the points of the fontL's strokes as follows. Let D^{fl}_{ij}(k) be the difference vector P^{fl}_{ij}(k + 1) − P^{fl}_{ij}(k). Then the inertia for the k-th point of the j-th stroke of the i-th letter is given by inertia_{ij}(k) = D^{fl}_{ij}(k)
(5)
with each inertia_{ij}(k) being the k-th row of the (n_{ij} − 1) × 2 matrix inertia_{ij}. The inertia can be thought of as a representation of the movement of an imaginary pen which 'wrote' the font letter. The curvature for the k-th point of the j-th stroke of the i-th letter is given by

curv_{ij}(k) = \frac{D^{fl}_{ij}(k) \binom{0}{1}}{\sqrt{D^{fl}_{ij}(k)\, D^{fl}_{ij}(k)^t}}   (6)
with each curv_{ij}(k) being the k-th row of the (n_{ij} − 1) × 1 matrix curv_{ij}. Figure 3 illustrates the geometrical meaning of (5) and (6). Each pair of matrices curv_{ij} and inertia_{ij} was merged into a single (n_{ij} − 1) × 3 matrix U_{ij}, with each column normalized into the interval (−1, 1). In order to partially erase the transient dynamics from the previous stroke, a zero sequence of size n_{gap} × 3 was inserted before every U_{ij}, resulting in the final input matrix U. The naturalness, which serves as the (2-dimensional) output signal, is represented by the 2-D displacement vector field. The 2-D displacement vector field for the j-th stroke of the i-th letter is given by Y_{ij} = P^{hl}_{ij} − P^{fl}_{ij}
(7)
with Y_{ij}, P^{hl}_{ij} and P^{fl}_{ij} each being n_{ij} × 2 matrices. The last row of every Y_{ij} was dropped to ensure each pair Y_{ij} and U_{ij} has the same length of n_{ij} − 1. Each column of Y_{ij} was scaled down to the interval (−1, 1). A zero sequence of size n_{gap} × 2 was inserted before every Y_{ij}, resulting in the final output matrix Y.
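The construction of the input and output rows for one stroke pair can be sketched as below (an illustration: `stroke_features` is our name, the global per-column normalization to (−1, 1) is left out, and the curvature follows our reading of Eq. (6) as the sine of the segment angle):

```python
import numpy as np

def stroke_features(P_fl, P_hl, n_gap=16):
    """Build input/output rows for one matched stroke pair (Eqs. 5-7).

    P_fl, P_hl -- (n, 2) matched points of the font and handwritten stroke
    Returns (U, Y): (n_gap + n - 1) x 3 input and x 2 output matrices,
    each preceded by a zero 'gap' sequence as in Sec. 4.1.
    """
    D = np.diff(P_fl, axis=0)            # Eq. (5): inertia, (n-1) x 2
    norms = np.linalg.norm(D, axis=1)
    curv = (D[:, 1] / norms)[:, None]    # Eq. (6), as we read it: sine of segment angle
    U = np.hstack([D, curv])             # (n-1) x 3 input rows
    Y = (P_hl - P_fl)[:-1]               # Eq. (7): naturalness, last row dropped
    U = np.vstack([np.zeros((n_gap, 3)), U])
    Y = np.vstack([np.zeros((n_gap, 2)), Y])
    return U, Y
```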
4.2
Network Parameters
A 300-unit network was used with activation function f = RBF. The internal weights in the matrix W were randomly assigned the values 0, 0.2073, and −0.2073 with probabilities 0.95, 0.025, and 0.025 respectively. For a 300 × 300 matrix W, this implies a spectral radius of ≈ 0.85, providing for a relatively long short-term memory [4]. Three input and two output units were attached. The input weights were randomly drawn from a uniform distribution over (−1, 1) with probability 0.9, or set to 0 otherwise. With such an input matrix, the network is strongly driven by the input activations because many elements of the matrix Win are non-zero. The network had output feedback connections, which were randomly set to one of the three values 0, 0.1, and −0.1 with probabilities 0.9, 0.05, and 0.05. Such a setup of the feedback connections makes the network only marginally excited by previous output activations. The activation function for the output units was the identity, fout(x) = x.
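Generating such a sparse internal weight matrix and checking its spectral radius can be sketched as follows (the function name and seed handling are ours; for these parameters the radius comes out near the value quoted above):

```python
import numpy as np

def make_reservoir(n=300, value=0.2073, p_nonzero=0.05, seed=0):
    """Sparse internal weight matrix W as in Sec. 4.2: entries are 0 with
    probability 0.95 and +/-0.2073 with probability 0.025 each (a sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.choice([0.0, value, -value],
                   size=(n, n),
                   p=[1 - p_nonzero, p_nonzero / 2, p_nonzero / 2])
    # spectral radius = largest eigenvalue magnitude; governs short-term memory
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W, radius
```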
Fig. 3. Geometrical meaning of the input data: inertia_{ij}(k) represents the difference vector between the two successive points P^{fl}_{ij}(k + 1), P^{fl}_{ij}(k); curv_{ij}(k) is the sine of the angle φ
Fig. 4. Testing with the training data: letters produced by the RNN with/without ROL, with teacher forcing switched off from n = 301
Fig. 5. Testing with the testing data: letters produced by the RNN with/without ROL
4.3
Training and Testing
The training data was made from the letters shown in Fig. 4, resulting in a 3027 × 3 input matrix Utrain and a 3027 × 2 output matrix Ytrain, which were prepared according to the data specification with ngap = 16. Eq. (2) was used for updating, with u(n), y(n) being the transposed n-th rows of the matrices Utrain and Ytrain respectively. The first 300 steps were discarded, and the network internal states and input unit states x(n), u(n) were collected from n = 301 through n = 3355. The output weights Wout were computed using the internal states and input unit states only. The training errors of the first and second output units were msetrain,1 ≈ 9.5 × 10−4 and msetrain,2 ≈ 2.8 × 10−3 respectively. Making the output layer recurrent (ROL) and computing Wout using not only the internal/input states x(n), u(n) but also the output activation values
y(n − 1) reduced the training errors msetrain,1 and msetrain,2 down to ≈ 6 × 10−4 and ≈ 2.0 × 10−3 respectively. A visual comparison is shown in Fig. 4. The testing data was made from the letters shown in Fig. 5, resulting in a 6045 × 3 input matrix Utest and a 6045 × 2 output matrix Ytest, with ngap = 16. For the non-ROL topology the test errors were found to be msetest,1 ≈ 1.2 × 10−2, msetest,2 ≈ 3.5 × 10−2. Using a ROL reduced the test errors to msetest,1 ≈ 0.9 × 10−3, msetest,2 ≈ 3.0 × 10−2. The errors msetest,i provide only a rough indication of network performance. A visual comparison between these two trials is shown in Fig. 5. We can observe that the network also produces appropriate naturalness for letters on which it had not been trained.
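Since ESN readout training reduces to a linear regression from the collected states onto the teacher outputs, the with/without-ROL comparison can be sketched as follows (a simplification of our own: the ROL regressor uses the teacher-forced y(n − 1), the washout steps are discarded, and per-output training MSE is reported):

```python
import numpy as np

def train_readout(states, inputs, outputs, washout=300, use_rol=False):
    """Least-squares readout training, a sketch of the procedure in Sec. 4.3.

    states, inputs, outputs -- T x N, T x K, T x L matrices of collected
    activations x(n), u(n) and teacher outputs y(n).  With use_rol=True the
    regressors also include y(n-1), making the output layer recurrent.
    """
    Z = np.hstack([inputs, states])
    if use_rol:
        y_prev = np.vstack([np.zeros((1, outputs.shape[1])), outputs[:-1]])
        Z = np.hstack([Z, y_prev])
    Z, Y = Z[washout:], outputs[washout:]  # discard transient steps
    W_out, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    mse = np.mean((Z @ W_out - Y) ** 2, axis=0)  # per-output training error
    return W_out.T, mse
```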
5
Discussion
5.1
Network
Here we try to provide insight into why the setup from Section 4 works best. The network has information concerning the shape of the strokes in fontL and handL via its history of input and output activations. The matrices Win and W from the setup in Sec. 4 make the network strongly driven by the (finite) history of the input. Moreover, using the most recent input activation u(n + 1) in Eq. (3) directly makes the impact even stronger. It is most likely the case that the last several points of the path along which a fontL stroke has come have a substantial impact on the naturalness, and thus also on where the handL stroke is going to continue. An intuitive explanation is that a human, while writing in one's own individual way, is trying to 'approximate' a font letter shape as memorized in school. This finding is in line with the basic idea of naturalness learning: to model a target system (handL) by means of a basic system (fontL) and its difference (naturalness) from the target system. The feedback weights Wback from the setup in Section 4 only slightly excite the network with the output history. Surprisingly, using the most recent output activation y(n) in Eq. (3) directly (= ROL) always improved the performance. A plausible explanation is that the already written part of a handL stroke influences how the part yet to be written will look, with the naturalness of the previous step having the highest relevance to the naturalness generated in the next step (i.e., distortion of a certain part of a stroke usually implies some other distortion to a successive part of the stroke). Feedback connections with larger weights (up to +1) and no ROL were also tested. With this setup the network training error was about the same as in Sec. 4, but driving the network with the testing data resulted in worse performance or made the network unstable. To check the robustness of the advantage of ROL, we tested the setup in Sec.
4 with a range of different starting internal weights W. In all cases ROL enabled better performance.
5.2
Data Structure
The naturalness in this paper is represented as a 2-D displacement vector field (Fig. 2). We are, however, not strictly bound to this representation. The naturalness can also be represented as a set of parameters in a system which represents the difference between the target (handL) and the basic (fontL) system. The input data characterizing the basic system (fontL) was made position-independent so as to ensure that the same stroke will be represented by the same data regardless of its starting position. This significantly reduces the complexity of the handwriting task because, while separate strokes are substantially different in shape, some of their small parts are often similar. The short-term memory of a RNN makes distinguishing between identical/similar short stroke sequences possible, because the RNN also accounts for the points before the identical/similar stroke part.
6
Conclusion
In the handwriting task we showed that, by modeling the target system by means of a basic system and its difference from the target system, a substantial relevance is revealed between the difference produced in step n and that produced in step n + 1. In such a case the usage of a ROL turned out to be advantageous. We would like to confirm these findings by applying naturalness learning to other tasks as well. Modeling the unique individualistic quality of human motion is the next step in confirming the feasibility of both naturalness learning and the usability of a ROL for tasks formulated in terms of naturalness learning.
References
1. Elman, J.L.: Finding structure in time. Cognitive Science: A Multidisciplinary Journal 14(2), 179–211 (1990)
2. Jaeger, H.: The "echo state" approach to analysing and training recurrent neural networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 148 (December 2001)
3. Jaeger, H.: Short term memory in echo state networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 152 (March 2002)
4. Jaeger, H.: Supervised training of recurrent neural networks, especially with ESN approach. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 159 (October 2002)
5. Jordan, M.I.: Serial order: a parallel distributed processing approach. Tech. Rep. 8604, Univ. of California at San Diego, Inst. for Cognitive Science (May 1986)
6. Jordan, M.I.: Attractor dynamics and parallelism in a connectionist sequential machine. In: Eighth Annual Conf. of the Cognitive Science Society, Amherst, MA, USA, pp. 531–546 (August 1986)
7. Krose, B., van der Smagt, P.: Recurrent networks. In: An Introduction to Neural Networks, 8th edn., Ch. 5, Univ. of Amsterdam (November 1996)
8. Wang, Y.-C., Chien, C.-J., Teng, C.-C.: Direct adaptive iterative learning control of nonlinear systems using an output-recurrent fuzzy neural network. IEEE Trans. on SMC-B 34(3), 1348–1359 (2004)
Using Generalization Error Bounds to Train the Set Covering Machine Zakria Hussain and John Shawe-Taylor Centre for Computational Statistics and Machine Learning Department of Computer Science University College, London {z.hussain,j.shawe-taylor}@cs.ucl.ac.uk
Abstract. In this paper we eliminate the need for the parameter estimation associated with the set covering machine (SCM) by directly minimizing generalization error bounds. First, we consider a sub-optimal greedy heuristic algorithm termed the bound set covering machine (BSCM). Next, we propose the branch and bound set covering machine (BBSCM) and prove that it finds a classifier producing the smallest generalization error bound. We further justify the BBSCM algorithm empirically with a heuristic relaxation, called BBSCM(τ), which guarantees a solution whose bound is within a factor τ of the optimal. Experiments comparing against the support vector machine (SVM) and SCM algorithms demonstrate that the proposed approaches can lead to some or all of the following: 1) faster running times, 2) sparser classifiers and 3) competitive generalization error, all while avoiding the need for parameter estimation.
1
Motivation
Two algorithms that use very different mechanisms in order to build their classifiers are the support vector machine (SVM) [2,6,3] and the set covering machine (SCM) [4]. Taking the binary classification task as an example, the first, more familiar algorithm finds a hyperplane in feature space that maximizes the margin between the two classes of examples. The second looks for a classifier constructed from a small subset of data-derived decision functions consistent with the set of positive examples. Fundamentally these two algorithms differ in their approaches to solving a classification problem. However, as was argued in [4], it may be more effective (for some learning tasks) to build a classifier from a small subset of data-derived decision functions as opposed to a maximum-margin separating hyperplane. In this paper, we demonstrate a method of model selection for the SCM without the need for parameter estimation. Instead, we apply the generalization error bound directly to the algorithm in order to determine its output classifier. First we apply the bound directly to the SCM using a greedy approach and call
the heuristic the bound set covering machine (BSCM). Next we look to globally optimize the bound by proposing the branch and bound set covering machine (BBSCM), and prove that the algorithm will indeed find the hypothesis with the smallest generalization error bound. This algorithm turns out to be too slow experimentally, and so we introduce a relaxation called BBSCM(τ) and prove that its solutions are within a factor τ of the optimal. The structure of this paper is as follows. In the following section we describe the set covering machine (SCM) and state two generalization error bounds for the SCM. Section 3 describes the bound set covering machine (BSCM). In Section 4 we discuss in detail the branch and bound set covering machine (BBSCM). Section 5 describes a heuristic derived from the BBSCM algorithm, called BBSCM(τ). Section 6 proves the optimality of the algorithms proposed. The experimental Section 7 discusses results for several UCI repository data sets, with Section 8 concluding the paper.
2
The Set Covering Machine
Definition 1. Let S = {(x1, y1), . . . , (xm, ym)} be a sample where xi ∈ X = P ∪ N is a training example and yi ∈ {−1, +1} its classification. We use the convention x = (x1, . . . , xn) for each training example and refer to X as the training set. Let P be the set of positive (+1) and N the set of negative (−1) training examples in the conjunction case, and let P be the set of negative (−1) and N the set of positive (+1) training examples in the disjunction case.
The set covering machine tries to build a conjunction or disjunction of data-dependent features (data-derived decision functions) in order to classify future test examples. The set of data-derived decision functions we will use is the following set of data-dependent balls.
Definition 2. For a training example xi ∈ X with label yi ∈ {−1, 1} and (real-valued) radius ρ, let fi,ρ be the following data-dependent ball centered around xi:
fi,ρ(x) = yi if d(x, xi) ≤ ρ, and ȳi otherwise,
where ȳi is the complement of yi and d(x, xi) is the distance between x and xi. Furthermore, we define a center as a point xi ∈ X and restrict a border to a point xj ∈ P. The radius ρ is defined as:
ρ = d(xi, xj) + α if xi ∈ P, and ρ = d(xi, xj) − α if xi ∈ N,
where α is a small positive real number. For any ball h ∈ H, let ν(h) be the set of N examples correctly classified by h and let π(h) be the set of P examples misclassified by h. Given these definitions we have the following description of the usefulness of a data-dependent ball.
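A data-dependent ball in the sense of Definition 2 can be sketched as follows (an illustration with a Euclidean distance d; `ball_predict` is our name):

```python
import numpy as np

def ball_predict(x, center, y_center, radius):
    """A data-dependent ball f_{i,rho}: inside the ball the center's +-1
    label is output, outside the ball its complement is output."""
    if np.linalg.norm(x - center) <= radius:
        return y_center
    return -y_center  # complement of a +-1 label
```

The radius would be set from a border point xj as d(center, xj) ± α, as in the definition above.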
Definition 3. The usefulness (or utility) U of h is expressed as: U (h) = |ν(h)| − p|π(h)|
(1)
where p is a small positive real number. Finally, when we discuss a P example we will mean an example from the set P. Similarly, an N example will denote an example from the set of N examples. Using these definitions we can now describe how to train the SCM with data-dependent balls.
2.1
Training
The SCM algorithm uses a greedy approach to try to completely classify the set of N examples whilst misclassifying a small number of P examples. Let N be the set of N examples still to be covered and let P be the set of P examples that have been misclassified. Initially, the hypothesis B ← ∅, N ← N and P ← ∅. At the first iteration the SCM algorithm looks for the ball hi that maximizes the utility value U(hi). After the ball hi with the highest usefulness is added, the hypothesis becomes B ← B ∪ {hi}, N ← N \ ν(hi) and P ← P ∪ π(hi). This is repeated until no more N examples are left to cover (N = ∅) or until the early stopping criterion |B| ≤ s is satisfied. Clearly the algorithm is greedy, as it only adds a ball h to the hypothesis B if it maximizes the utility U(h). Once the SCM has output its hypothesis B, it can be used to make predictions.
2.2
Testing
Given a hypothesis B containing a conjunction or disjunction of data-dependent balls hi, we can classify an example x as follows:
B(x) = y if hi(x) = y for all i = 1, . . . , |B|, and ȳ otherwise, (2)
where ȳ is the complement of y and, from Definition 1, y = 1 for a conjunction and y = −1 for a disjunction.
2.3
Generalization Error Bounds
We now state a loss bound for the SCM, that will be used for the theoretical and experimental section of the paper. We also use a second bound but only reference it in order to save space. Theorem 1. [4] Suppose a SCM finds a solution given by a set B of features with R = R(B) = |B| features, R+ = R+ (B) of which are centered around P examples, with kp = kp (B) the number of P training errors and kn = kn (B) the number of N training errors on a sample of size m > 2R + kp + kn . Then with
Using Generalization Error Bounds to Train the Set Covering Machine
261
probability 1 − δ over random draws of training sets, the generalization error of the resulting classifier can be bounded by

\epsilon(m, R, R^+, k_p, k_n) \le 1 - \exp\left(\frac{-1}{m - 2R - k_p - k_n}\left[\ln\binom{m}{2R} + \ln\binom{2R}{R^+} + \ln\binom{m - 2R}{k_p + k_n} + \ln\frac{2m^2}{\delta}\right]\right) (3)

The second bound we will use is Equation 10 of [5], which is slightly tighter than the above bound. These bounds are tight and non-trivial (i.e., always less than 1). Exploiting this fact, we will apply the generalization error bound directly to obtain classifiers for the SCM and, with it, remove the need for parameter estimation in the SCM. Because of its greedy application of the bound, we call this first heuristic the bound set covering machine (BSCM).
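Reading the bound of Theorem 1 in terms of binomial coefficients, it can be evaluated directly (a sketch under that reading, using only the standard library; `scm_bound` is our name and m > 2R + kp + kn is assumed):

```python
from math import comb, exp, log

def scm_bound(m, R, R_plus, kp, kn, delta=0.05):
    """Evaluate the Theorem 1 generalization error bound (our reading of
    Eq. 3); assumes m > 2R + kp + kn and R_plus <= 2R."""
    ln_sum = (log(comb(m, 2 * R))              # choose the 2R ball points
              + log(comb(2 * R, R_plus))       # choose the R+ P-centered balls
              + log(comb(m - 2 * R, kp + kn))  # place the training errors
              + log(2 * m ** 2 / delta))       # confidence term
    return 1 - exp(-ln_sum / (m - 2 * R - kp - kn))
```

A larger hypothesis or more training errors both shrink the denominator and grow the logarithmic sum, so the bound degrades, which is what the BSCM exploits when it trades balls against errors.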
3
The Bound Set Covering Machine
In this variant of the SCM we allow the algorithm to be driven by one of the generalization error bounds given by Equation (3) or Equation 10 of [5]. For simplicity and to save space we describe the remaining work only with Theorem 1, although the bound of [5] can be applied with the same reasoning. Given any hypothesis B and ball h, we can calculate the risk bound of adding ball h to B as ε(B ∪ {h}) using ε(m, R, R+, kp, kn). This is true for adding any ball h. Therefore, similarly to the SCM, the bound set covering machine (BSCM) algorithm can be described as follows. Initially, hypothesis B ← ∅, N ← N, P ← ∅ and best bound ε* ← 1. At the first iteration, the BSCM algorithm looks for the ball hi that minimizes the generalization error bound ε(B ∪ {hi}). The ball hi that minimizes ε(B ∪ {hi}) is added to the hypothesis: B ← B ∪ {hi}, N ← N \ ν(hi), P ← P ∪ π(hi) and best bound ε* ← ε(B ∪ {hi}). This is repeated until no new loss bound ε(B ∪ {h}), for any new ball h, can be found such that ε(B ∪ {h}) < ε*. Finally, the resulting hypothesis B can be used to classify new test examples using (2). Note that the BSCM has eliminated the need for the penalty parameter p and the soft-stopping parameter s that were present in the SCM. It is clear that the BSCM heuristic is greedy in its solution in much the same way as the SCM, with the difference that it looks to minimize the generalization error bound as opposed to maximizing the utility function for each ball. However, the nature of a greedy approach implies that we are not always guaranteed a globally optimal classifier. Therefore, in the following section, we present a branch and bound approach to solving the set covering machine that ensures a much stronger result: that of global optimality.
4
The Branch and Bound Set Covering Machine
In this section we use a branch and bound approach to enumerate all possible hypotheses that can be generated from the set of data-dependent balls. This is
done by evaluating the bound every time a new ball can be added to the current hypothesis. If a hypothesis cannot be extended with a ball such that the new bound is smaller than the best bound currently found, then there is no need to include this ball in the current hypothesis. Note that the choice of balls can also be thought of in terms of a branch and bound search tree, where any ball at a particular depth d is found from its parent ball at depth d − 1 and can produce child balls at level d + 1. The motivation for such a strategy for solving the set covering machine is that if the function to be minimized is the generalization error bound, then we are assured of finding a hypothesis with the smallest generalization error bound.
4.1
Algorithm
The algorithm relies on the functions addball and createtable. We detail each function below and include its pseudocode. Before describing the functions in detail we give some notation. Let S be the sample containing training set X and label set Y, H the set of data-dependent balls, and T the machine type, which can be either a conjunction or a disjunction of balls. As earlier, let π(h) be the set of P examples misclassified by h and ν(h) the set of N examples correctly classified by h. For any hypothesis B and any ball hi ∉ B let Bi = B ∪ {hi}. The generalization error bound of B is given by ε(B), where ε(B) = ε(m, R, R+, kp, kn); for the same hypothesis B the best potential (bp) generalization error bound η(B) is given by η(B) = ε(m, R + 1, R+, kp, 0). So the bp generalization error bound η(B) is the bound ε(Bi) for Bi = B ∪ {hi} if a single ball hi can be added to hypothesis B such that all of the remaining N examples are covered and none of the remaining P examples misclassified. Contrast this with the generalization error bound ε(B), which is simply the bound given for hypothesis B. Algorithm BBSCM: The input of the algorithm is the sample S that contains training set X = P ∪ N, the machine type T that is either a 'conjunction' or a 'disjunction', and H, the set of data-dependent balls computed from training set X. The algorithm contains local variables B, N, P, ε∗ and global variable ℓ. Initially B ← ∅ is the empty hypothesis that we start with, N contains all the N training examples that are still to be covered, P is empty because no P examples are initially misclassified, and ε∗ is the best generalization error bound found so far for any hypothesis B, initially set to 1. The global variable ℓ is set to the
Using Generalization Error Bounds to Train the Set Covering Machine
263
Algorithm 1. BBSCM(S, T, H)
Input: S, T, H
B ← ∅, N ← N, P ← ∅, ε∗ ← 1, ℓ ← |H|
Call: addball(B, ε∗, N, P)
Output: a T = 'conjunction' or T = 'disjunction' of data-dependent balls B∗ ⊆ H

Function addball(B, ε∗, N, P)
Data: consider all h ∈ H \ B; order according to ε({h} ∪ B) → (h1, ε1, η1), ..., (hℓ, εℓ, ηℓ)
1:  for i = 1, ..., ℓ do
2:    Btemp ← B, Ntemp ← N, Ptemp ← P
3:    if ηi < ε∗ then
4:      B ← B ∪ {hi}, N ← N \ ν(hi), P ← P ∪ π(hi)
5:      if εi < ε∗ then
6:        B∗ ← B, ε∗ ← εi
7:        call createtable(ε∗, m)
8:      end if
9:      found ← false, R ← |B|, A ← −|N|
10:     while ¬found do
11:       R ← R + 1, A ← A + |Ntemp| − |N|, kn ← table(R, |P|)
12:       if kn = −1 or A ≥ −kn then
13:         found ← true
14:       end if
15:     end while
16:     if kn ≠ −1 then
17:       call addball(B, ε∗, N, P)
18:     end if
19:   end if
20: end for
number of candidate balls for the current B. Using the above inputs and variables, algorithm BBSCM recursively calls the function addball(B, ε∗, N, P) (see below). Finally, the output of algorithm BBSCM is a conjunction or disjunction of balls (classifier/hypothesis) B∗ that can be used for classification using equation (2). Function addball: This function adds each possible ball hi to the current hypothesis B and checks whether its generalization error bound εi is smaller than the best generalization error bound ε∗ found so far. If so, then the value of B∗ is replaced with B and the best risk bound ε∗ is replaced with εi (line 6). Also at this stage the function createtable is called (line 7) to construct table (see the description of createtable below). On line 11, if table(R, |P|) returns kn = −1 then this indicates that there is no bound for R and |P| that is smaller than ε∗. If table(R, |P|) is a non-negative integer kn, then there is a possibility of finding a ball to add to the current
hypothesis that will give a smaller risk bound than ε∗, provided there exists a set of R additional balls that leave no more than kn N examples uncovered and no additional P examples misclassified. If kn ≥ 0, then line 12 checks whether a larger number of N examples can be covered using R balls (see Lemma 2). If so, then the procedure calls itself recursively (line 17) until all balls in H have been enumerated. Function createtable: The local variable table is an m × m array, where m = |X| is the number of training examples. Initially all values in table are set to −1. For R balls and kp misclassifications (on the P examples), createtable calculates the largest number kn of N examples that may remain uncovered while the bound stays below the best bound ε∗ found so far (line 5). The function returns table.

Function createtable(ε∗, m)
1:  Initialize: table ← −1, R+ ← 0, kpfound ← true, kp ← −1
2:  while kpfound do
3:    kpfound ← false, kp ← kp + 1, Rfound ← true, R ← 0
4:    while Rfound do Rfound ← false, R ← R + 1, kn ← 0
5:      while ε(m, R, R+, kp, kn) < ε∗ do
6:        kn ← kn + 1
7:      end while
8:      if kn > 0 then
9:        kpfound ← true, Rfound ← true
10:     end if
11:     table(R, kp) ← kn − 1
12:   end while
13: end while

5 BBSCM(τ)
BBSCM(τ) allows a trade-off between the accuracy of the classifier and speed. In function createtable(ε∗, m) the while condition tests ε(m, R, R+, kp, kn) < ε∗. In the BBSCM(τ) heuristic this condition becomes ε(m, R, R+, kp, kn) < τ · ε∗, allowing the BBSCM algorithm to ignore solutions whose generalization error bounds are not within a factor τ of the best found so far. Clearly, setting τ = 1 gives the BBSCM algorithm; however, as mentioned above, this algorithm is too slow for data sets with m > 100 training examples. Therefore we would like to set τ < 1. Setting τ close to 1 may cause the heuristic to be too slow, but it creates hypotheses that have small generalization error bounds similar to those for τ = 1. Setting τ close to 0 will speed up the solution but may not create a large enough search space for BBSCM(τ) to find hypotheses with relatively small generalization error bounds. Hence, setting τ
is unlike setting a regularization parameter, since it trades accuracy against time: a bigger τ is always better in terms of generalization error bounds, but costs more in terms of computational time.
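For illustration, the createtable routine with the τ-scaled condition can be sketched as below; `eps` is a hypothetical stand-in for the bound ε(m, R, R+, kp, kn), assumed monotonically increasing in kn so the inner loop terminates, and τ = 1 recovers the exact BBSCM table.

```python
def createtable(eps, eps_star, m, tau=1.0, r_plus=0):
    """table[R][kp] = largest kn with eps(m, R, r_plus, kp, kn) < tau * eps_star,
    or -1 if even kn = 0 gives a bound that is not below the threshold."""
    table = [[-1] * (m + 1) for _ in range(m + 1)]
    kp, kp_found = -1, True
    while kp_found and kp < m:
        kp_found, kp = False, kp + 1
        r, r_found = 0, True
        while r_found and r < m:
            r_found, r = False, r + 1
            kn = 0
            # count how many N examples may stay uncovered below the threshold
            while eps(m, r, r_plus, kp, kn) < tau * eps_star:
                kn += 1
            if kn > 0:
                kp_found = r_found = True
            table[r][kp] = kn - 1
    return table
```

With a toy bound such as `eps = lambda m, R, rp, kp, kn: (R + kp + kn) / m`, shrinking τ shrinks the table entries and hence the search space explored by addball.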
6 Theory
Before looking at the first theorem, we present the following lemma that allows the theorem to be proved. Note that because of space constraints we only use Theorem 1 in this section, although the results also hold for other SCM bounds. Also, we only give a sketch proof of the main theorem; the full proofs of all the results in this section can be found in the longer version of the paper.

Lemma 1. The generalization error bound given by equation (3) (Theorem 1) is monotonically increasing in the second parameter.

Theorem 2. Let ε∗ be the smallest generalization error bound currently found for hypothesis B∗ by the BBSCM algorithm. For any hypothesis Bi that gives η(Bi) > ε∗ there exists no extension B̂ ⊇ Bi such that the generalization error bound ε(B̂) ≤ ε∗.

This theorem states that, for any hypothesis Bi, if the bp generalization error bound ηi is worse than the best bound ε∗, then there is no need to try to cover any more N examples from this ball, as there will never be a bound smaller than the best ε∗ found so far.

Lemma 2. Let U be a set covered by A1, ..., Ak. For any V ⊆ U there exists j such that |Aj ∩ V| ≥ |V|/k.

Lemma 3. Suppose createtable(ε∗, m) has been executed and kn = table(R, kp) ≥ 0. It follows that ε(m, R, R+, kp, kn + 1) > ε∗ for R+ ≥ 0.

Using Lemmas 2 and 3 we can now prove that the BBSCM(τ) algorithm will only disregard a ball for inclusion if it cannot lead to a hypothesis with a generalization error bound smaller than the one currently found by BBSCM(τ), hence giving solutions that are within a factor τ of the optimal.

Theorem 3 (main theorem). If algorithm BBSCM(τ) outputs hypothesis B∗ with generalization error bound ε∗, then there exists no hypothesis B̂ such that the generalization error bound ε(B̂) < τ · ε∗.

Proof (sketch). We prove this result by contradiction. We consider a hypothesis whose generalization error bound is smaller than the one found by BBSCM(τ). Next we claim and prove, using Lemmas 2 and 3, that if this hypothesis does indeed have a smaller loss bound than the BBSCM(τ) solution, then it should have been chosen by BBSCM(τ). However, this hypothesis was not chosen by the BBSCM(τ) algorithm. Hence, a contradiction.
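Lemma 2 is a pigeonhole statement, and it is easy to check exhaustively on a small example (the cover below is an arbitrary toy choice, not from the paper):

```python
from itertools import combinations

U = set(range(12))
# k = 3 sets that cover U (arbitrary illustrative choice)
cover = [set(range(0, 5)), set(range(4, 9)), set(range(8, 12))]

def lemma2_holds(V, cover):
    """True iff some A_j in the cover satisfies |A_j ∩ V| >= |V| / k."""
    k = len(cover)
    return any(len(A & V) * k >= len(V) for A in cover)

assert set().union(*cover) == U        # the A_j really do cover U
# brute-force check over every subset V of U
assert all(lemma2_holds(set(V), cover)
           for r in range(len(U) + 1)
           for V in combinations(sorted(U), r))
```

The lemma follows because each element of V lies in at least one Aj, so the intersection sizes sum to at least |V| and the largest must be at least |V|/k.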
Theorem 4. If algorithm BBSCM outputs a hypothesis B∗ then its generalization error bound ε∗ will be globally optimal (i.e., the smallest bound possible). Proof. Apply Theorem 3 with τ = 1.
These theoretical results show that the BBSCM is guaranteed to find the hypothesis B∗ with the smallest generalization error bound ε∗.
7 Experiments
Experiments were conducted on seven standard UCI repository data sets [1], described in [4]. All examples with contradictory labels or whose attributes contained unknown values were removed (this considerably reduced the Votes data set). We compared the results against the SVM and the SCM. The SVM was equipped with a Gaussian kernel and the SCM used the L2 metric to construct its set of data-dependent balls. All training sets were evaluated using 10-fold cross validation, and both the SVM and the SCM were trained with an extra 10-fold cross validation split in order to evaluate the regularization parameters for each algorithm. Note that the running times include this parameter tuning phase. In contrast, the BSCM and BBSCM(τ) experiments were conducted using only one 10-fold cross validation split, as there were no parameters to tune. For the SVM, we report in Table 1 the number of support vectors (SVs), the number of seconds (time) spent training and tuning parameters, and the error (err) of the classifiers averaged over all 10 sets for various values of the Gaussian width parameter γ and regularization parameter C. For the SCM we report the machine type (T), where "c" denotes a conjunction and "d" denotes a disjunction of balls, that gave the smallest CV error. To compare against the number of support vectors in the SVM we also report the average number of balls (b) for each classifier over the 10 folds. For the BSCM and BBSCM(τ) we also give the generalization error bound (thm) used for each heuristic, where "(1)" denotes Theorem 1 and "(2)" denotes the theorem of [5] given by their Equation 10. Finally, we chose "c" and "d" for those machines that gave the smallest generalization error bound for each CV split. Furthermore, for the BBSCM(τ) we used only τ = 0.1, as the search space grew too large for any larger value.
From Table 1 we observe that the BSCM and BBSCM(τ ) heuristics are competitive with the SVM and SCM in terms of generalization error for the BreastW, Votes, Pima, Haberman and Glass data sets. Also, both heuristics are considerably sparser than the SVM. Finally, the BSCM exhibits faster running times when compared with the SVM and SCM.
8 Conclusion
We have proposed the branch and bound set covering machine (BBSCM) algorithm and two heuristics BSCM and BBSCM(τ ) along with theoretical results
Table 1. SVM and SCM model-selection results using 10-fold cross validation

Data Set | SVM (SVs / time / err)   | SCM (T / b / time / err)  | BSCM (thm / T / b / time / err) | BBSCM(τ) (thm / T / b / err)
BreastW  | 79.2 / 506.1 / .035      | c / 2 / 658.4 / .0395     | (2) / c / 1.5 / 46.7 / .0335    | (2) / c / 1 / .0351
Votes    | 18.5 / 8.8 / .1033       | c / 1 / 5.4 / .1154       | (2) / d / 1.5 / 0.1 / .14       | (2) / c / 1 / .1538
Pima     | 398.1 / 11071.9 / .246   | c / 5.9 / 1091.9 / .2721  | (2) / c / 2.8 / 99.4 / .2579    | (2) / c / 1 / .2891
Haberman | 136.4 / 5129.2 / .2413   | d / 13 / 272.5 / .2789    | (1) / c / 1 / 25.5 / .2585      | (2) / c / 1 / .2585
Bupa     | 199.4 / 3083.1 / .2694   | d / 32.6 / 128 / .3478    | (2) / d / 4.1 / 43.2 / .3769    | (2) / d / 1 / .3913
Glass    | 92.4 / 183.8 / .2383     | c / 3.8 / 57.9 / .2331    | (2) / c / 1 / 1.2 / .2379       | (2) / c / 1 / .2393
Credit   | 325.9 / 7415.1 / .2545   | d / 3.7 / 831.2 / .317    | (1) / c / 1 / 53.3 / .3247      | (2) / c / 1 / .3247
and experimental evidence to show that the algorithm is competitive with the SCM and SVM. The main novelty of this paper is that we have eliminated the need for the parameters in the SCM by using tight generalization error bounds to train the algorithm. We have shown theoretical results that guarantee that the BBSCM algorithm will find a classifier with the smallest generalization error bound. Furthermore, we have also shown that heuristics that try to minimize the loss bound directly are competitive with traditional cross-validation model selection techniques. We believe this helps bring together the ideas of learning theory and practical machine learning: using tight bounds directly to carry out model selection. This, we hope, is a first step towards learning machines that need little human intervention in terms of tuning regularization parameters, with (hopefully) no degradation in generalization ability. Future work involves further pruning the search space of the BBSCM algorithm to make larger data sets tractable. The development and use of tighter loss bounds in the BBSCM, BSCM and BBSCM(τ) algorithms is also a future research direction.
Acknowledgements This work was supported in part by the IST programme of the European Community, under the PASCAL network of excellence.
References
1. Blake, C., Merz, C.: UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, CA (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152. ACM Press, New York (1992)
3. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
4. Marchand, M., Shawe-Taylor, J.: Learning with the set covering machine. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 345–352 (2001)
5. Marchand, M., Sokolova, M.: Learning with decision lists of data-dependent features. Journal of Machine Learning Research 6, 427–451 (2005)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
Model of Cue Extraction from Distractors by Active Recall Adam Ponzi Laboratory for Dynamics of Emergent Intelligence, RIKEN BSI, Saitama, Japan
[email protected]
Abstract. Cues are informative signals that animals must use to make decisions in order to obtain rewards, usually after intervening temporal delays, typified in the cue-action-reward task. In behavioural experiments the cue is often clearly distinguished from other stimuli, for example by a salience such as brightness; in the real world, however, animals face the problem of recognizing real cues among other distracting environmental stimuli. Furthermore, once the cue is recognized it must cause the animal to make a certain action to obtain reward. The animal therefore faces a compound chicken-and-egg problem: first it must recognize the cue, and then it must learn that the cue should initiate a certain action. But how can the animal recognize the cue before it has learned the action to obtain reward, since in this initial learning stage the cue is only partially predictive of reward? Here we present a simple neural network model of how animals extract cues from background distractor stimuli, all presented with equal salience, based on successive testing of different stimulus-action allocations over several trials. A stimulus is selected and gated into working memory to drive an action, and is then reactivated in the end period to be reinforced if correct. If the stimulus is not reinforced over several trials it is suppressed and a different stimulus is selected. If the stimulus is a real cue but drives the incorrect action, its cue-action allocation is suppressed. This mechanism is enhanced by the property of cue mutual exclusion in trials, which also provides a simple model of bottom-up attention and pop-out. The model is based on the cortical and hippocampal projections to the dopamine system through the striatum, and includes a model of salience-gated working memory and a reinforcement and punishment system based on dopamine feedback balance.
We illustrate the model by numerical simulations of a rat learning to navigate a T-maze and show how it deterministically discovers the correct cue-action allocations.
1 Introduction

The basal ganglia are well known to be involved in reward learning mechanisms [1,2], and cortico-basal ganglia loops are critical for the learning of rewarded cued procedures and in cued working memory tasks [3]. Dopamine is thought to play the role of a reward prediction error [1,4]: the burst firing of dopamine cells is increased by unexpected rewards and reduced if an expected reward is omitted. The striatum, which is the main input structure of the basal ganglia, receives a strong input from the midbrain dopaminergic system, and the prominent striatal projection neurons, the medium spiny neurons, are also known to reflect reward expectation themselves [5,6]. Dopamine firing
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 269–278, 2008.
© Springer-Verlag Berlin Heidelberg 2008
270
A. Ponzi
can also be triggered by novel stimuli that do not involve reward [7]. Rapid detection of a cue change by striatal neurons has been studied by Pasupathy [3]. The basal ganglia, the prefrontal cortex and the midbrain dopamine nuclei projecting to them are also strongly implicated in working memory [8]. Persistent neural activity in recurrent circuits plays a central role in the maintenance of information in working memory [9]. Some theories suggest a gating role for phasic dopamine release, such that dopamine release is required for read-in to working memory [10,11,12]. Seemingly paradoxically, dopamine can suppress or enhance striatal activity [13], and can extend the duration of enhanced activity. In spatial working memory tasks some hippocampal neurons ('splitter cells' or 'episodic cells') fire selectively depending on the context of a specific recent response or future goal [14,15,18]. For example, during performance of a spatial alternation task, many hippocampal pyramidal neurons fire selectively after right-turn or left-turn trials, even though the rat is running in the same direction through the same location on the stem of the maze on both types of trials. Ferbinteanu et al. [17] describe retrospective and prospective coding in hippocampal neural assemblies in spatial working memory tasks. The origin of journeys influenced firing even when rats made detours, showing that recent memory modulated neuronal activity more than spatial trajectory. Diminished retrospective and prospective coding was observed in error trials, suggesting this signal was important for task performance. Mulder et al. [19] recorded hippocampal output structures associated with the motor system (nucleus accumbens and ventromedial caudate nucleus) in rats solving a plus-maze. They found a variety of responses, including neurons that fired continuously from the moment the rat left one location until it arrived at a goal site, or at an intermediate place such as the maze centre.
They suggest their results support the view that the ventral striatum provides an interface between limbic and motor systems, permitting contextual representations to trigger movements and have an impact on action sequences in goal directed behaviour. Barnes et al. [20] have also described task and expert neurons in dorsal striatum.
2 Model

This model is composed of two parts: the cue detection part and the action selection and reinforcement part. The model is an extension of the model presented in [21,22]. In those papers we describe a cue response task typical of primate studies, which consists of three stages: stage 1 is the cue presentation stage; stage 2, the action selection stage; and stage 3, the reward stage, where each stage is separated by delay periods. In the work presented here, this model is re-motivated as a spatial working memory temporal credit assignment model typical of rat maze tasks, described in Fig. 1(a). In stage 1 a cue is presented together with some distractors. Here we suggest this corresponds to the spatial view of the animal from the initial location. The spatial views from each of the two initial locations have some parts in common and some differences. For example, a visual cue may be visible from one location but not the other, or the same cue may be visible but in a different location with respect to the animal's head direction or other background cues. Here we simplify these possibilities and represent the view simply as an activation of several units of the input 'P' layer. In initial position 'A'
a fixed set of background distractors is activated simultaneously with a cue. In initial position 'B' the same fixed set of background distractors is activated, but the cue from location 'A' is not activated and instead a different unit is activated. In stage 2 an action must be selected by the animal from some possible alternatives. The correct action to choose depends on the cue presented in stage 1, i.e. the initial location. Here the animal must turn right at the junction if it started in one location and left if it started in the other. In stage 3 reward is given to the animal if the action made in stage 2 is the correct one. The basic idea of the model is that the cue presented in layer P in stage 1 is gated into working memory in a recurrent layer 'Q', see Fig. 1(b). P refers to primary, since this layer simply reflects the external environment, while Q is the layer after P. The working memory of the cue is reactivated in layer Q during the action selection stage 2. The reactivation is itself driven by an external trigger signal which is also presented in layer P and given by the external environment, see Fig. 1(a). During stage 2 the reactivated cue drives a winner-take-all (WTA) action selection system in layer 'M', which activates the action that is driven strongest by the cue. Layer Q is connected to layer M in an all-to-all fashion, so the action j activated in layer M when cell i is activated in layer Q depends on the synaptic weights J^{QM}_{ij} between layers Q and M. The action j selected is given by the strongest weight max_j {J^{QM}_{i=cue,j}}. As explained in [21,22], the cue in working memory in layer Q is again reactivated in stage 3, and reward is given simultaneously if the action made in stage 2 is correct. The reactivation is itself driven by another end signal in the P layer, see Fig. 1(a). Since the second reactivation activates the same Q layer cell i as stage 2, this also reactivates the
Fig. 1. (a) Task structure described in the text. The animal runs around the track and the task is simplified to three stages, shown as solid boxes on the track. The cue presentation stage 1 is at the initial location, the action selection stage 2 (T_junction) is when the junction is sensed, and the end stage 3 (T_end) is when the end of the track is sensed. Reactivations of stimuli presented in stage 1 occur in stages 2 and 3, together with activation of the action-selecting M (MSN) cells (G(t) = 1) in the latter halves of these stages, depicted by hatched boxes. Here the task is depicted as an alternation task, but in this paper we study the more difficult task where the animal is removed from the end location and randomly placed at one of the two initial locations each trial. (b) Anatomy of the model system described in the text. The primary P and recurrent Q layers are suggested to be part of cortex or hippocampus, which projects to striatum, while the M layer, where the cells coding for actions are located, may be striatum medium spiny neurons (MSN). The N layer cell is not shown but is also suggested to be striatum projecting to thalamus.
same action j. If reward is present the synapse J^{QM}_{ij} is strengthened, while if reward is absent the same synapse is weakened. This process therefore targets a particular synapse for reinforcement or punishment. Regardless of whether reward is received or not, after a longer delay the next trial starts with one of the cues, chosen at random, presented together with some distractors. In [21,22] we show that the system can easily find the correct cue-to-action allocation by an exploratory process using the punishment signal to depress selected but unwanted actions. When the correct cue-action allocation is discovered it is stabilized by dopamine negative feedback, which limits the growth in the weights J^{QM}_{ij}. A drawback of the model presented in [21,22] was that the distractors and cue presented in stage 1 had to be artificially distinguished by an incentive salience level, which was set high for the cue and low for the distractors, which were presented simultaneously. In fact the working memory was a winner-take-all system, so that the unit activated with maximum salience during stage 1 was the unit gated into working memory to be reactivated in stages 2 and 3. Only if the cue had sufficiently high relative salience was it gated into working memory. If its salience was too low, a distractor could be gated in instead. Here we address this defect and therefore consider the problem of cue extraction from among distractors, where all stimuli are presented externally with the same salience, i.e. we allow the system to discover the cue and give it a high salience, while the distractors are given a low salience. The cue extraction model described here comprises the P and Q layers shown in Fig. 1(b), together with the thalamus T and the dopamine system. The Q layer activities q_i(t) are given by

dq_i/dt = −k1 q_i + k2 g(Σ_{j≠i} w_ij q_j) f(T(t) − T_K) + k3 x_i P_i f(T_K − T(t)).   (1)
In this equation the q_i(t) can be considered firing rates or membrane potentials. The first term on the RHS represents the exponential decay of q_i(t) back to zero with rate k1 when there is no activation. The second term models the effect of the all-to-all modifiable recurrent collaterals with weights w_ij(t), where g(x) is the sigmoidal function which provides the non-linearity and limits the activation of this term. The weights w_ij(t) are given by

dw_ij/dt = k5 q_i − k4 q_i q_j − k6 w_ij.   (2)

The combination of Eq. 1 and Eq. 2 describes a winner-take-all system, as described in more detail in [21,22]. The k6 term is an exponential decay which reverts the w_ij(t) to zero between trials. The third term in Eq. 1 is the one-to-one input from the primary P layer. In this third term P_i(t) = 1, 0 are the activities of P layer cells which respond directly to the external environment and are unity when active and zero at other times. As described above, during the cue presentation period one of the two cue cells is set to unity while the other is set to zero, e.g. P_cueA = 1, P_cueB = 0, and the distractors are all set to unity. Here we have five distractors and two cues, so that the total number of P cells and Q cells is N^P = N^Q = 7. The factor x_i(t) in the third term of Eq. 1 is the modifiable
salience of input P_i. The input with the maximum salience among those presented is the input gated into working memory. We wish x_i to be high for the cues and low for the distractors, and it is this variable which was artificially and externally fixed in the previous modeling [22]; we describe it below. The factors f(T(t) − T_K) and f(T_K − T(t)) in the second and third terms of Eq. 1 model the reactivation and the down-gating of the P layer input, respectively. T(t) is the activity of the thalamus, T_K is its baseline activity, and f(x) = x when x > 0 and f(x) = 0 otherwise is a positivity function. When the thalamus activity is above baseline, T(t) > T_K, the recurrent collateral term is activated, which causes the reactivation of the cue presented in stage 1. At these times the input from the P layer is down-gated so that any input active at reactivation times does not interfere with the reactivation of the previously presented cue. On the other hand, when T(t) ≤ T_K there is no reactivation and the activity of the Q layer cells is driven and determined by their topographic inputs P_i. The thalamus activity is given by

dT/dt = −T(t) + T_K + T_junction + T_end.   (3)
This describes activation of the thalamus only when the animal is at the junction, stage 2 (T_junction = 1), and at the end, stage 3 (T_end = 1), see Fig. 1(b), where reward is located. In fact we suggest the animal has already learned to make these reactivations and that they are represented in striatal cells projecting to thalamus. In the discussion section we describe how the T_end signal can be learned by the animal. The important variables are the saliences x_i. Here we suggest that the x_i can be appropriately modeled by

dx_i/dt = τ_x q_i (D(t) − D_K).   (4)
Here τ_x is a slow timescale generating learning over many trials, and the term (D(t) − D_K) is the excess dopamine level D(t) over its baseline D_K, given by

dD/dt = −D(t) + D_K − k7 Σ_i q_i(t) + k_R R(t) + other terms.   (5)
This equation includes an inhibitory term, the sum over the Q layer activities, and an excitatory term R(t) with factor k_R which describes the primary reward, activated in stage 3 during the end period (where T_end = 1) when the animal makes a correct action in stage 2. It is easy to understand how the pair of equations Eq. 4 and Eq. 5 produce the desired behaviour. Suppose the cue has a higher salience x_cue than the distractors it is presented with. Then it will drive the corresponding q_cue during the cue presentation period, stage 1, more than the other q_i are driven, and it will therefore be gated into working memory and reactivated at the action selection stage 2 and the end stage 3. Suppose this cue also drives the correct action; then dopamine will be positively activated in stage 3 by the R(t) term during the second reactivation of q_cue. Eq. 4 will be positive for the reactivated q_cue and that particular x_cue will be increased by LTP. Therefore the next trial it occurs, the cue q_cue will be even more likely to be gated into working memory. Indeed the salience x_cue for this cue has a positive feedback
and would grow without bound, except for the fact that during the cue presentation period LTD is generated by the inhibition of dopamine by the term −k7 Σ_i q_i(t) in Eq. 5, which depends on the magnitude of x_cue and will therefore limit the growth of the salience x_cue to a fixed value. Therefore we see that there is a stable attractor state for the cue. However, the situation is different for the distractors. Suppose one of the distractors has the maximum salience x_dist1 among the presented stimuli. This q_dist1 will be gated into working memory, drive a given action in stage 2, and be reactivated at the end period 3. However, since this is not a cue, there is only a fifty-fifty chance that it is driving the correct action for the cue it happens to be presented with, and primary reward will therefore be found only half the time. During trials where primary reward is not found, the reactivation at the end period 3 will actually inhibit dopamine, Eq. 5, and cause a suppression of x_dist1, Eq. 4, during the reactivation. x_dist1 will wander around depending on chance sequences of trials without a stable fixed value, but with a long-time average value well below what can be attained for x_cue. After some time x_dist1 will drop sufficiently for another distractor, x_dist2, to become maximal, be gated into working memory and then tested over several trials, eventually being suppressed. Once x_cue has been selected, however, if it drives the correct action it will quickly attain its maximum stable attractor value. We now also describe the action selection system presented previously in [21,22]. The system is described by the M layer, which takes an all-to-all projection from the Q layer and forms a winner-take-all system:

dM_i/dt = −M_i + f( Σ_j J^{QM}_{ij}(t) q_j − k2 Σ_{j=1}^{N^M} M_j + k3 M_i ) G(t)   (6)

The M units represent the actions, and here we have two of them, N^M = 2, corresponding to right and left turns. This equation describes a standard winner-take-all system.
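A minimal Euler-integration sketch of this winner-take-all behaviour, with G(t) = 1 (actions enabled) and with step size and constants that are illustrative assumptions, not values from the paper:

```python
def wta_run(inputs, k2=1.0, k3=0.5, dt=0.05, steps=400):
    """Euler-integrate dM_i/dt = -M_i + f(I_i - k2*sum(M) + k3*M_i),
    with f(x) = max(x, 0); inputs[i] stands for the synaptic drive
    sum_j J^QM_ij * q_j reaching action unit i."""
    M = [0.0] * len(inputs)
    for _ in range(steps):
        total = sum(M)  # global inhibition shared by all action units
        M = [m + dt * (-m + max(I - k2 * total + k3 * m, 0.0))
             for m, I in zip(M, inputs)]
    return M

M = wta_run([1.0, 0.6])  # the gated cue drives action 0 harder than action 1
```

With these constants the more strongly driven unit settles at a positive fixed point while the global inhibition pushes the other unit to zero, selecting the action with the largest weight.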
The term G(t) is set to unity when actions are allowed to be selected and zero otherwise, see Fig. 1(a). We suggest it is represented in striatum in a similar way to T_junction and T_end in Eq. (3); the reason it is different from T_junction and T_end will be described in more detail below. The actual action selected is given by the sign of F(t), fixed by integration of the M units over the junction period T_junction = 1,
dF(t)/dt = T_junction ( Σ_{i=1}^{N^M/2} M_i(t) − Σ_{i=N^M/2+1}^{N^M} M_i(t) ) − (1 − T_junction) F(t).   (7)
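As a concrete illustration, the winner-take-all dynamics of Eq. (6) and the readout of Eq. (7) can be sketched with a simple forward-Euler integration. This is a minimal sketch: the rectifying choice of f, the parameter values, and the Euler scheme (rather than the Runge-Kutta used for the reported simulations) are assumptions made here for brevity.

```python
import numpy as np

def f(u):
    # rectifying activation; the paper's exact f is not specified here (assumption)
    return np.maximum(u, 0.0)

def wta_step(M, Jq, G, k2=1.0, k3=0.5, dt=0.01):
    """One forward-Euler step of the M-layer winner-take-all dynamics, Eq. (6).
    Jq is the feed-forward drive sum_j J_ij^QM q_j; G gates the whole update."""
    return M + dt * G * (-M + f(Jq - k2 * M.sum() + k3 * M))

def readout_step(F, M, T_junction, dt=0.01):
    """One forward-Euler step of the action readout F(t), Eq. (7); the first
    half of the M units votes for one action, the second half for the other."""
    half = len(M) // 2
    dF = T_junction * (M[:half].sum() - M[half:].sum()) - (1 - T_junction) * F
    return F + dt * dF
```

With unit 0 receiving the stronger drive, its activity wins the competition and F(t) integrates to a positive sign, selecting the corresponding turn.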
As described, the synaptic weights J^{QM}_{ij} from the Q to the M layer are updated by three-way Hebbian learning [23,24], which is reinforced in the presence of dopamine and depressed without dopamine,

dJ^{QM}_{ij}/dt = −J^{QM}_{ij}(t) + f( (D(t) − D_K) M_i q_j + J^{QM}_{ij}(t) ).   (8)
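A sketch of this dopamine-gated three-way update, assuming the same rectification f as in Eq. (6) and forward-Euler integration; the parameter values are placeholders, not the paper's.

```python
import numpy as np

def hebb_step(J, M, q, D, D_K=0.1, dt=0.01):
    """One forward-Euler step of the three-way Hebbian rule, Eq. (8):
    weights J_ij^QM grow when dopamine D exceeds the baseline D_K while
    M_i and q_j are co-active, and are depressed when D is below baseline."""
    f = lambda u: np.maximum(u, 0.0)   # rectification, assumed as in Eq. (6)
    return J + dt * (-J + f((D - D_K) * np.outer(M, q) + J))
```

Only the weight between co-active pre- and post-synaptic units changes, and the sign of the change follows the dopamine signal relative to baseline.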
The dopamine system Eq. (5) is extended to include a negative feedback projection from the M cells, −k_8 Σ_j M_j(t). As described in [21,22], this system can find and stabilize
Model of Cue Extraction from Distractors by Active Recall
275
the correct cue-action allocation provided the cues are known. Together with the cue extraction model described above, the cue is successfully extracted and bound to the correct action. We now illustrate this model by numerical simulations integrated by fourth-order Runge-Kutta.
3 Results

The behaviour of the model is best illustrated by studying examples of its time series. Fig. 2 shows a time series of the Q layer activities q_i(t) for all seven cells, two cues and five distractors, over a 100-trial experiment. This figure illustrates successive cue testing, where the stimulus with maximum salience x_max is gated into working memory, drives an action, and is then reactivated at the end period and reinforced if it drove the correct action and punished if not. Fig. 3 shows in more detail time series of individual trials late in learning, after the correct cues have been found. This system is able to extract the correct cues and find the correct cue-action allocations. If a cue is extracted once and happens to drive the correct action, its salience is easily strengthened as described above and the action it is driving is reinforced as described in more detail in [21,22]. Should a cue be extracted but not drive the correct action, the situation is different, however. If there is no positive dopamine signal at the end period, not only will the correct action not be reinforced, but the salience will also not be strengthened, even though it is in fact the correct cue and is simply not driving the correct action. A parsimonious point about this model is that this situation is easily
Fig. 2. (a) Time series of 100 trials of the Q layer activities, q_i(t), overlaid on top of each other, each consisting of three stages separated by intertrial delay periods. In this figure the three trial stages of cue presentation, action selection reactivation, and end period reactivation are difficult to see in each trial. Both cues are found by around trial number 80. Also shown is the dopamine modulator. (b) Magnification of panel (a) showing trials 6 to 18, illustrating successive cue testing. During trials 6 to 12 one particular distractor has maximum salience and is gated into working memory. On some trials it drives the correct action in stage 2 and on some not, depending on which particular cue it is presented with, and it is therefore not consistently reinforced, as can be seen from the dopamine time series. By trial 12 this distractor has been punished sufficiently for its salience to become non-maximal, and another distractor becomes the maximally salient one, until trial 18, when it too has been punished sufficiently. In trial 18 one of the correct cues becomes the dominant stimulus for the first time.
Fig. 3. (a) Time series of Q layer activities q_i(t) for two consecutive trials after the cues have been recognised, in two different panels, one for each cue type. (b) A similar figure in log scale for clarity. The selective activation of a single cue at the presentation stage 1, the action selection stage 2 and the end period stage 3 is clear. The other q_i(t) are only weakly activated and invisible near zero. Also shown is the dopamine D(t) time series, which is depressed below baseline during the cue presentation stage 1 and reactivation stage 2. It is excited above baseline during the end stage 3 by primary reward because these cues are driving the correct actions. Also shown are the two M layer action selection cells, M_1(t) and M_2(t). Each of the two cues activates a different M cell as required, while the other is suppressed. The M cells are activated in the end stage 3 for reinforcement as well as in the action selection stage 2. Notice that the M cells are activated late in the two reactivation periods to allow the cue reactivation to reach its fixed point level. This is modelled by setting the term G(t) to unity in Eq. (6) late in the action selection periods.
dealt with. In fact, supposing the correct cue is reactivated at stage 2 but drives the wrong action, in this model the cue is still reactivated by the end signal at stage 3, but there is no primary reward and in fact the reactivation suppresses dopamine via the inhibitory projection in Eq. (5). This negative dopamine signal targets the action selected by the cue for punishment. Since the cue in fact selects the incorrect action, this is an appropriate behaviour of the model. Although the cue will not be given salience by this action, the next time the cue is extracted from the distractors by successive testing it will more likely drive the correct action. Once it has driven the correct action, its salience will be forever strengthened and the correct action reinforced and stabilized. Indeed, this situation is shown in Fig. 2. One of the two cues is found for the first time at trial 18; however, as can be seen in the below-baseline dopamine trace during the end stage 3 of trial 18 in Fig. 2(b), it happens to be driving the incorrect action. However, the negative dopamine is sufficient to depress the incorrect cue-action allocation, so that at a later trial, when this correct cue is again tested, it drives the correct action. The other cue is similarly extracted and tested and its cue-action allocation reversed, and by trial 80 the two cues form a stable attractor state and remain at high salience for the rest of the experiment, while the distractors are suppressed, as can be seen in Fig. 2(a). Another parsimonious point of this model is that the cues are preferentially picked out over the distractors, or 'pop out', due to the fact that they exclude each other in trials. Exclusion means that cues are not presented during the stimulus presentation stage 1 as often as the distractors are. As described above, the combination of Eqs. (4) and (5) means that q_i(t) cells active during the cue presentation stage cause inhibition of dopamine
Model of Cue Extraction from Distractors by Active Recall
277
D(t) in Eq. (5), which causes LTD in their corresponding x_i(t). Cues which are not active do not receive this LTD punishment. This exclusion property of the cues means that the excluded cue is favoured over the distractors and is picked out more readily, as a form of novelty response.
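The successive-testing mechanism can be caricatured at the level of whole trials: the maximally salient stimulus is gated, and its salience receives LTP when reward follows and LTD otherwise. The sketch below compresses the continuous dynamics of Eqs. (4)-(5) into one update per trial; the learning rates, the salience ceiling, the assumption that a gated cue always drives the correct action, and the fifty-fifty reward for gated distractors are all simplifying assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stim = 7                          # stimuli 0,1 are cues; 2..6 are distractors
x = rng.uniform(0.4, 0.6, n_stim)   # saliences x_i
eta_ltp, eta_ltd = 0.05, 0.3        # LTD deliberately stronger (assumption)

for trial in range(1000):
    cue = rng.integers(2)                        # one cue presented per trial
    present = [cue] + list(range(2, n_stim))     # the other cue is excluded
    gated = max(present, key=lambda i: x[i])     # maximal salience enters working memory
    rewarded = (gated == cue) or rng.random() < 0.5   # distractors rewarded by chance
    if rewarded:
        x[gated] += eta_ltp * (1.0 - x[gated])   # LTP toward a ceiling
    else:
        x[gated] -= eta_ltd * x[gated]           # LTD
```

Distractors are tested, inconsistently rewarded and suppressed one after another, while each cue, once gated, is reinforced on every presentation and settles at high salience.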
4 Discussion

We have presented a model of cue extraction within a general approach to the problem of temporal credit assignment and shown that, though simple, the model behaves as desired, with several parsimonious aspects. We have described cue extraction and action binding by successive testing, enhanced by cue mutual exclusion. We describe constant distractors, but note that extraction will also occur if distractors are sampled randomly each trial, provided distractors have a non-zero probability of co-occurrence with each other in any given trial. Even in the case where the distractors have exclusive sets, since they are not predictive of reward, the system will still work. One drawback of the model presented above is the necessity for an end signal to drive the final stage 3 reactivation. At times when primary reward is found, the stage 3 reactivation could in principle be driven by primary reward; the problem with this idea, however, is that no reactivation would occur when primary reward was not found, in contradiction to the requirement described above for discovery of the correct action from the punishment signal. To deal with this difficulty we extend the model to include another set of cells N. These cells learn to fire at the reward locations even when there is no reward and are used to drive the thalamus reactivations. Here we only have one and it is given by dN/dt = −N + Σ_j J^{PN}_j(t) P_j, where the weights J^{PN}_j are given by dJ^{PN}_j/dt = (D(t) − D_K) N P_j. It is activated by projection from the primary P layer representing the external environment, and LTP occurs only in the presence of excess dopamine. Accordingly, the dopamine signal Eq. (5) is modified further to include an inhibitory projection −k_9 N(t) from this cell. Therefore, when the N cell has been bound to a given location represented by the P_j by positive R(t), and the animal returns to this location when primary reward is absent (since the cue was wrong), the N unit acts to generate a negative dopamine signal.
This enhances the action punishment behaviour described above when reward is not found. Furthermore, we modify the thalamus signal T(t), Eq. (3), by replacing the end signal by this N cell activation, dT/dt = −T(t) + T_K + T_junction + N(t), so that the locations where reward has been experienced activate replays even if reward is not found on subsequent visits. We do not implement this model here, but as shown in [25] it works well without the necessity for an end signal. In [25] we also include working memory maintenance of actions as well as cues to allow rapid reversal of cue-action allocations without any noise, accounting for some of the known properties of the dopamine signal [1]. In future work we will show how to produce the junction thalamus signal in an internally defined way. We have not specified the location of the P and Q layers, but anywhere in cortex or hippocampus is possible. At a certain level the Q layer may even represent the hippocampus (CA3), while the P layer may correspond to cortex projecting to hippocampus, the M and N regions to the nucleus accumbens, and the dopamine system to the VTA. Our 'gating' of PFC layer input into hippocampus would agree with the suggestions of Lisman and Grace [7] concerning control of behaviourally significant information flow.
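The N-cell extension can be sketched in the same forward-Euler style; the initial weights and parameter values are assumptions made for illustration.

```python
import numpy as np

def n_cell_step(N, J_PN, P, D, D_K=0.1, dt=0.01):
    """Forward-Euler step of the reward-location cell N and its input weights:
    dN/dt = -N + sum_j J_j^PN P_j and dJ_j^PN/dt = (D - D_K) N P_j,
    so the weights grow only with excess dopamine while N and P_j are co-active."""
    N_new = N + dt * (-N + J_PN @ P)
    J_new = J_PN + dt * (D - D_K) * N * P
    return N_new, J_new
```

With excess dopamine at a rewarded location, only the weight from the active P unit is strengthened, binding N to that location.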
The reactivation of the cue in the Q layer at the junction stage 2 for action selection means that, after learning, the cell that is firing at the junction depends on where the animal is coming from and, after learning of the task, also on where the animal is going to. This may therefore correspond to the hippocampal splitter cells [14,15] described in the introduction. Since the plus maze in both allocentric and egocentric conditions will obey similar principles, where however the cue presentation stage corresponds to the spatial view of the animal from the cross junction, which depends on the direction the animal approaches from, encoded in the place cells, this work may also explain the plus maze retrospective and prospective coding [17]. Furthermore, the striatal MSN neurons show response characteristics of context or task neurons, such as are found in the dorsal striatum [20] and ventral striatum [19], described in the introduction. These can be both the M action selecting cells and the N reactivation generating cells. N cells may also generate the specific spatial response characteristic of expert neurons at the T-maze junction seen in Barnes et al. [20].
References
1. Schultz, W.: J. Neurophysiol. 80(1), 1–27 (1998)
2. Wickens, J.: 8, R77–R109 (1997)
3. Pasupathy, A., Miller, E.K.: Nature 433, 873 (2005)
4. Kawagoe, R., Takikawa, Y., Hikosaka, O.: J. Neurophysiol. 91(2), 1013–1024 (2004)
5. Hollerman, J.R., Tremblay, L., Schultz, W.: J. Neurophysiol. 80, 947–963 (1998)
6. Nicola, S.M., Surmeier, J., Malenka, R.C.: Ann. Rev. Neurosci. 23, 185 (2000)
7. Lisman, J.E., Grace, A.A.: Neuron 46, 703 (2005)
8. Goldman-Rakic, P.S.: Neuron 14(3), 477–485 (1995)
9. Gruber, A.J., Dayan, P., Gutkin, B.S., Solla, S.A.: J. Comput. Neurosci. 20, 153–166 (2006)
10. Cohen, J.D., Braver, T.S., Brown, J.W.: Curr. Opin. Neurobiol. 12(2), 223–229 (2002)
11. Dreher, J.C., Guigon, E., Burnod, Y.: J. Cog. Neurosci. 14(6), 853–865 (2002)
12. Frank, M.J., Loughry, B., O'Reilly, R.C.: Cog. A. & B. Neurosci. 1(2), 137–160 (2001)
13. Kiyatkin, E.A., Rebec, G.V.: J. Neurophysiol. 75(1), 142–153 (1996)
14. Hasselmo, M.E., Eichenbaum, H.: Neural Networks 18, 1172 (2005)
15. Wood, E.R., Dudchenko, P.A., Robitsek, R.J., Eichenbaum, H.: Neuron 27, 623–633 (2000)
16. Kunec, S., Hasselmo, M.E., Kopell, N.: J. Neurophysiol. 94 (2005)
17. Ferbinteanu, J., Shapiro, M.L.: Neuron 40, 1227–1239 (2003)
18. Anderson, M.I., Jeffery, K.J.: J. Neurosci. 23, 8827–8835 (2003)
19. Mulder, A.B., Tabuchi, E., Wiener, S.I.: European J. Neurosci. 19, 1923–1932 (2001)
20. Barnes, T.D., Kubota, Y., Hu, D., Jin, D.Z., Graybiel, A.M.: Nature 437, 1158 (2005)
21. Ponzi, A.: Proceedings of IJCNN 2007 (2007)
22. Ponzi, A.: Neural Networks (to appear, 2008)
23. Reynolds, J.N.J., Hyland, B.I., Wickens, J.R.: Nature 413, 67 (2001)
24. Nakahara, H., Amari, S., Hikosaka, O.: Neural Computation 14, 819–844 (2002)
25. Ponzi, A.: Proceedings of ICCN 2007 (2007)
26. Ponzi, A.: IEICE Technical Report 103(163), 19–24 (2006)
PLS Mixture Model for Online Dimension Reduction

Jiro Hayami and Koichiro Yamauchi

Graduate School of Information Science and Technology, Hokkaido University, Sapporo Hokkaido 060-0814, Japan
Abstract. This article presents an online learning method for modeling high-dimensional input data. This method approximates a nonlinear function by summing up several local linear functions. Each linear function is represented as the weighted sum of a small number of dominant variables, which are extracted by the partial least squares (PLS) regression method. Moreover, a radial function, which represents the respective input area of each linear function, is also redefined using the dominant variables. This article also presents an online deterministic annealing expectation maximization (DAEM) algorithm, which includes a temperature control mechanism for acquiring the most suitable system parameters. Experimental results show the effective learning behavior of the new method.

Keywords: Partial Least Squares (PLS) Method, PLS Mixture Model, Dimension Reduction, Online DAEM Algorithm.
1 Introduction
In real-world applications, learning systems usually have to perform online learning of high-dimensional data. In such learning tasks, the learning machines may have to overcome the following problems.
– The learning machine must achieve a quick response to newly presented samples.
– High-dimensional data usually have multicollinearity or include irrelevant dimensions. If the learning machines learn them, they normally waste a huge amount of resources, so somehow they need to reduce the amount of resources for learning such datasets.
To solve these problems, we propose a variation of the mixture of experts (ME). The ME used here is a mixture of local linear models. Each linear model is tuned by the learning samples distributed in a local narrow area, so that it shows relatively quick responses to newly presented learning samples. Moreover, in the proposed method, each local linear model and its corresponding kernel reduce the number of useless dimensions by using both the partial least squares method (PLS) (e.g. [1]) and the online DAEM algorithm proposed later in sec. 4.4. PLS is specialized for selecting dimensions from multi-dimensional patterns which have multicollinearity. Using PLS, the proposed system achieves online dimension selection not only for each linear model but also for the corresponding Gaussian kernel.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 279–288, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Related Work
Schaal et al. have proposed Receptive Field Weighted Regression (RFWR) [2]. This has a similar architecture to our model, but it does not reduce useless dimensions. Later, RFWR was improved to Locally Weighted Projection Regression (LWPR) [3]. The improved model reduces useless dimensions of each linear model by using the PLS method, but does not reduce the dimensions of each kernel. Ishii et al. have constructed an online EM (expectation maximization) algorithm on a normalized Gaussian network, which has a similar architecture to that of RFWR [4]. However, their model may not reduce the input dimensions if the data have multicollinearity.
3 PLS Regression
The PLS regression method extracts latent variables, which correlate with both inputs and outputs, and approximates the output function. Let x = (x_1 · · · x_D) be the input vector and let D = dim(x) be its dimension. In this paper, the output of the learning machine is a one-dimensional value y. Now, let N and r be the number of learning samples and the number of latent variables, respectively. Then, the sample matrices can be written as X = (x_1 · · · x_N)^T, y = (y_1 · · · y_N)^T. We need to standardize the learning samples before PLS regression. In the standardization process, the averaged vector is subtracted from each learning sample. The input and output matrices are represented using latent vectors t_i and the two loading matrices P and q as follows:

X − X̄ = t_1 p_1^T + t_2 p_2^T + · · · + t_r p_r^T + E = T P^T + E   (1)

y − ȳ = t_1 q_1 + t_2 q_2 + · · · + t_r q_r + f = T q + f,   (2)

where E and f denote residual errors, and X̄ and ȳ denote the average input vector and the average output value. In the initial state, the number of latent variables is zero. The latent variables are increased one by one until the residual errors are small enough to ignore. For example, let's consider the case where there is one latent variable. First of all, we have to generate the standardized input and output matrices: X̃ = X − X̄ and ỹ = y − ȳ. We assume that the latent variable t_1 is generated by a linear transformation:

t_1 = X̃ w_1.   (3)

Now, we derive the vector w_1 so as to maximize ỹ^T t_1 subject to ‖w_1‖ = 1 using the Lagrangian technique:
w_1 = X̃^T ỹ / ‖X̃^T ỹ‖   (4)

The loading matrices p_1 and q_1 can be derived by least squares estimation:

p_1 = X̃^T t_1 / (t_1^T t_1),   q_1 = ỹ^T t_1 / (t_1^T t_1)   (5)

The residual errors E and f are

E = X − X̄ − t_1 p_1^T,   f = y − ȳ − t_1 q_1.   (6)

If we replace X̃ and ỹ by E and f, we can derive the next latent variable by executing Eqs. (3)-(6) again. Now we can describe the output value y using the latent variables as follows [5]:

f(x) = ȳ + (x − x̄) W (P^T W)^{−1} q,   (7)

where P = (p_1 · · · p_r), q = (q_1 · · · q_r)^T and W = (w_1 · · · w_r).
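The batch procedure of Eqs. (3)-(7) can be sketched as a short PLS1 routine. This is a sketch under the mean-centring standardization described above; the variable names are ours.

```python
import numpy as np

def pls1_fit(X, y, r):
    """Batch PLS1 regression following Eqs. (3)-(7): extract r latent
    variables from mean-centred data and return a prediction function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f_res = X - x_mean, y - y_mean          # standardized X~, y~
    W, P, q = [], [], []
    for _ in range(r):
        w = E.T @ f_res
        w /= np.linalg.norm(w)                 # Eq. (4)
        t = E @ w                              # Eq. (3)
        tt = t @ t
        p = E.T @ t / tt                       # Eq. (5)
        qk = f_res @ t / tt
        E = E - np.outer(t, p)                 # Eq. (6): deflation
        f_res = f_res - t * qk
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    B = W @ np.linalg.inv(P.T @ W) @ q         # regression vector of Eq. (7)
    return lambda x: y_mean + (x - x_mean) @ B
```

With r equal to the input dimension, this reduces to ordinary least squares; with smaller r it regresses on the dominant latent directions only.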
3.1 Online PLS Method
So far, we have explained the PLS method using all the learning samples: X = (x_1 · · · x_N)^T, y = (y_1 · · · y_N)^T. To execute the PLS method in an online manner, we would need a buffer for storing all the learning samples. However, this approach not only wastes a huge amount of storage space but also has a high computational cost. To solve this problem, the modified PLS method [6][7] derives the latent variables from two covariance matrices: Σ^{Xy} and Σ. Here, Σ^{Xy} is the covariance matrix of x and y, given by Σ^{Xy} = X̃^T ỹ / N, while Σ denotes the covariance matrix of x, given by Σ = X̃^T X̃ / N, where X̃ and ỹ are the standardized matrices X̃ = X − X̄ and ỹ = y − ȳ. Therefore, w_r, p_r and q_r are sequentially derived as below:

w_r = Σ^{Xy}_r / ‖Σ^{Xy}_r‖,   p_r = Σ_r w_r / c_r,   q_r = (Σ^{Xy}_r)^T w_r / c_r,   (8)

where c_r = w_r^T Σ_r w_r. To get the next (the (r+1)-th) new latent variable, we just repeat the above equations using the updated covariance matrices:

Σ^{Xy}_{r+1} ← Σ^{Xy}_r − c_r p_r q_r,   Σ_{r+1} ← Σ_r − c_r p_r p_r^T.   (9)

Σ^{Xy}_1 and Σ_1 can be derived in an online manner, so the latent variables can also be derived in an online manner:

Σ^{Xy}_1 = ⟨xy^T⟩/⟨1⟩ − ⟨x⟩⟨y⟩^T/⟨1⟩²,   Σ_1 = ⟨xx^T⟩/⟨1⟩ − ⟨x⟩⟨x⟩^T/⟨1⟩²,   (10)

where the notation ⟨x⟩ denotes an approximated average of x, which will be defined in Eq. (21). Equations (8) and (9) are repeated until ‖Σ^{Xy}_r‖ < ε or ‖Σ_r‖ < ε, where ε is a small threshold. In this study, we executed the online PLS method combined with a new online learning method, namely the 'online DAEM' algorithm described in 4.4.
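A sketch of this covariance-based recursion, Eqs. (8)-(9). For a one-dimensional output, Σ^{Xy} is a vector; the ε cutoff plays the role of the thresholds above. The function name is ours.

```python
import numpy as np

def pls_from_cov(Sxy, Sxx, r, eps=1e-12):
    """Derive PLS weight/loading vectors directly from the covariance
    matrices, following Eqs. (8)-(9); no sample buffer is needed."""
    W, P, q = [], [], []
    for _ in range(r):
        if np.linalg.norm(Sxy) < eps or np.linalg.norm(Sxx) < eps:
            break
        w = Sxy / np.linalg.norm(Sxy)       # Eq. (8)
        c = w @ Sxx @ w
        p = Sxx @ w / c
        qk = Sxy @ w / c
        Sxy = Sxy - c * qk * p              # Eq. (9): deflate the covariances
        Sxx = Sxx - c * np.outer(p, p)
        W.append(w); P.append(p); q.append(qk)
    return np.column_stack(W), np.column_stack(P), np.array(q)
```

Because the deflation acts on the covariances themselves, the same routine works whether the covariances were computed from a batch or accumulated online via Eq. (10).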
4 Proposed System
The proposed system architecture is similar to that of normalized Gaussian networks (NGnet) [8]. The PLS regression method is embedded in the learning method for reducing dimensions. In particular, the PLS regression method is applied not only to the learning method for the linear regression model but also to that for the kernel. In this paper, we call the kernel the 'PLS kernel.' Section 4.1 explains the whole system structure. Sections 4.2 and 4.3 explain the linear function and the PLS kernel, respectively. Section 4.4 explains the 'online DAEM' algorithm, which determines the PLS parameters. Section 4.5 explains the unit manipulation procedure.

4.1 Outline

The proposed system consists of linear functions and corresponding Gaussian kernel functions, as shown in Fig. 1. Each Gaussian kernel function determines the respective input area for the corresponding linear function. Let f_c(x) be the output of each model unit and g(x|c) be the kernel function. Then, the ultimate output is

ŷ = Σ_c f_c(x) g(x|c) / Σ_c g(x|c),   (11)

where g(x|c) denotes the kernel function. It is expressed by

g(x|c) = exp( −(x − x̄_c) Σ_c^{−1} (x − x̄_c)^T / 2 ).   (12)

Fig. 1. PLS mixture model

Note that the kernel function above does not depend on the scaling of the data. By this modification, we cannot deal with the network as a statistical model.

4.2 Linear Function

In this model, the PLS regression method explained in section 3 is applied to construct the linear function. As a result, the linear function overcomes the problem of multicollinearity. The output of the c-th unit is

f_c(x) = ȳ_c + (x − x̄_c) W_c (P_c^T W_c)^{−1} q_c,   (13)

where q_c denotes the loading matrix of the c-th unit.
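Eqs. (11)-(13) amount to the following prediction routine. This is a sketch; the per-unit dictionary layout and field names are ours, with B standing for W_c (P_c^T W_c)^{-1} q_c.

```python
import numpy as np

def mixture_predict(x, units):
    """Normalized output of Eq. (11): a softly gated sum of local linear
    models f_c (Eq. (13)) weighted by Gaussian kernels g(x|c) (Eq. (12))."""
    num = den = 0.0
    for u in units:
        d = x - u["centre"]
        g = np.exp(-0.5 * d @ u["inv_cov"] @ d)    # Eq. (12)
        fc = u["y_mean"] + d @ u["B"]              # Eq. (13)
        num += fc * g
        den += g
    return num / den
```

Near a unit's centre its kernel dominates the normalization, so the output reduces to that unit's local linear model.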
4.3 PLS Kernel

To define the kernel function, we have to calculate the inverse of the covariance matrix Σ_c. However, if there are too few samples, Σ_c cannot be a regular matrix, so Σ_c^{−1} cannot be derived. To overcome this problem, the proposed method derives a pseudo-inverse matrix of Σ_c. Let Σ*_{PLSc} be the pseudo-inverse matrix:

Σ*_{PLSc} = W_c (W_c^T Σ_c W_c)^{−1} W_c^T   (14)

|x − x̄_c|_{PLS} = (x − x̄_c) Σ*_{PLSc} (x − x̄_c)^T   (15)

g(x|c) = exp( −|x − x̄_c|_{PLS} / 2 ),   (16)

where W_c is the c-th unit's matrix defined in Eq. (7). Then we can obtain the Mahalanobis distance in the input space: |x − x̄_c|_{PLS}. In this paper, we call the kernel defined in Eq. (16) the 'PLS kernel.' Figures 2, 3 and 4 show examples of the PLS kernel where the numbers of latent variables are 0, 1, and 2, respectively. If the number of latent variables is zero, the PLS kernel is a uniform distribution. In the case of one latent variable, the PLS kernel has variance along the 1st PLS direction but has a uniform distribution along the other directions. In the case of two latent variables, the PLS kernel has two variances along the two PLS directions. Therefore, the respective area of the PLS kernel is larger when there are fewer latent variables. Note that this causes the problem that the units allocated in early steps of the learning become dominant in the network, which degrades its generalization ability. To overcome this problem, the system restricts the width of the PLS kernel according to the number of learning iterations of the unit. Therefore, in the early steps of the learning, the shape of the PLS kernel is a hyper-sphere, but it gradually changes to an oval-shaped hyper-ellipse reflecting the PLS directions. Therefore,

Σ*_c = λ(t)_c k_c I_D + (1 − λ(t)_c) Σ*_{PLSc},   (17)

where

λ(t)_c = 1 / ( 1 + exp( (⟨1⟩_c − t̃) / 2 ) ),

and ⟨1⟩_c denotes an approximated average of the c-th unit's respective ratio, which approximately reflects the number of parameter updates of the unit.
y
y
1
1
1
0.5
0.5
0.5 1
0-1
0
-0.5 x0
0
0.5
-0.5 1 -1
x1
1
1
0.5 0-1
0.5 0
-0.5 x0
0
0.5
-0.5 1 -1
x1
0-1
0.5 0
-0.5 x0
0
0.5
-0.5
x1
1 -1
Fig. 2. PLS kernel without Fig. 3. PLS kernel with Fig. 4. PLS kernel with two latent variables any latent variable one latent variable
⟨1⟩_c is calculated by using Eq. (21). Thus, the shape of the PLS kernel is the weighted sum of a hyper-sphere and a hyper-ellipse reflecting the latent variables. The weight for the hyper-sphere, λ(t)_c, is large in the early steps of the learning, but it gradually decreases to zero, whereas the weight for the hyper-ellipse, 1 − λ(t)_c, gradually increases to 1. Here k_c determines the initial variance of the hyper-sphere, t̃ denotes the averaged fitness value required for the weights of the 1st and 2nd terms to become 0.5, and I_D denotes the D-dimensional unit matrix. Thus, the modified PLS kernel is

g(x|c) = exp( −(x − x̄_c) Σ*_c (x − x̄_c)^T / 2 ).   (18)

4.4 Online DAEM Algorithm
The PLS mixture model is a variation of the mixture of experts (ME). There are various learning methods suitable for learning the ME. For this model, we selected a variation of the EM algorithm as the learning method because its convergence speed is fast. Although the PLS mixture model is not a statistical model, we can apply an online EM algorithm for learning it. An online EM algorithm for NGnet parameter estimation, which has a similar structure to our model, has been proposed by Ishii et al. [4]. Generally speaking, however, the quality of the resultant parameters is affected by their initial values. Ueda et al. proposed the Deterministic Annealing EM (DAEM) algorithm to overcome this problem [9]. The original DAEM algorithm, however, was not an online learning algorithm. To apply DAEM to our model, we must extend it to an online learning algorithm. The following text explains the online DAEM learning algorithm for a mixture of PLS models. The covariance matrix of x and y, Σ^{Xy}_c, the covariance matrix of x, Σ_c, the average vector of x, x̄_c, and the average of y, ȳ_c, are modified by the online DAEM algorithm. The online DAEM algorithm executes E- and M-steps alternately to update the parameters every time it receives one instance.

E-step. In the E-step, the respective ratio of each unit to the i-th instance (current instance) is calculated. Let p(c|x_i, y_i, θ^{(t)}) be the respective ratio of the c-th unit to the i-th instance:

p(c|x_i, y_i, θ^{(t)}) = ( g(x_i|c) y_c(x_i, y_i) )^τ / Σ_{m=1}^{C} ( g(x_i|m) y_m(x_i, y_i) )^τ,   (19)

where τ denotes the inverse temperature and is gradually increased as τ ← τ · β (β > 1) every time one instance is received.¹ y_c(x_i, y_i) denotes the suitability of the c-th linear unit for the current input (x_i, y_i):

¹ Although the PLS kernel function is not a probability density function, and Eq. (19) is also not a precise probability density function, they have properties similar to those of probability density functions.
y_c(x_i, y_i) = exp( −(f_c(x_i) − y_i)² / (2σ²) ).   (20)

M-step. In the M-step, the parameters are updated using the value calculated in the E-step. Let p_c^{(t)} be the respective value of module c at time t, p_c^{(t)} = p(c|x_t, y_t, θ^{(t)}), and let ⟨x⟩_c be the approximated average of x for the module c. If p_c^{(t)} > θ_learning, the c-th unit executes the procedure below:

⟨x⟩_c^{(t)} = ρ⟨x⟩_c^{(t−1)} + p_c^{(t)} x_t,   (21)

where ρ denotes the forgetting coefficient and 0 < ρ ≤ 1. Then, the parameters that maximize the likelihood for incomplete data are

Σ_c = ⟨xx^T⟩_c/⟨1⟩_c − ⟨x⟩_c⟨x⟩_c^T/⟨1⟩_c²   (22)

Σ^{Xy}_c = ⟨xy^T⟩_c/⟨1⟩_c − ⟨x⟩_c⟨y⟩_c^T/⟨1⟩_c²   (23)

x̄_c = ⟨x⟩_c/⟨1⟩_c,   ȳ_c = ⟨y⟩_c/⟨1⟩_c.   (24)

Note that the notation 't' is omitted in the above equations. Thus, x̄_c, ȳ_c, Σ_c and Σ^{Xy}_c are derived. From Σ_c and Σ^{Xy}_c, the latent variables are derived by repeating Eqs. (8) and (9).
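One online DAEM step can be sketched as follows. The E-step takes precomputed kernel values and local predictions; the accumulator dictionary and its field names are ours, only the Eq. (22) covariance is shown (Eq. (23) is analogous), and the whole posterior numerator is raised to τ, following the usual DAEM convention.

```python
import numpy as np

def responsibilities(g_vals, f_vals, y, tau, sigma=0.5):
    """E-step, Eq. (19): tempered responsibilities from kernel values g(x|c)
    and the suitabilities of Eq. (20), raised to the inverse temperature tau."""
    s = (g_vals * np.exp(-(f_vals - y) ** 2 / (2.0 * sigma ** 2))) ** tau
    return s / s.sum()

def m_step(x, y, u, p_c, rho=0.999):
    """M-step: exponential-forgetting accumulators of Eq. (21), then the
    means and covariance of Eqs. (22) and (24)."""
    u["one"] = rho * u["one"] + p_c
    u["x"] = rho * u["x"] + p_c * x
    u["y"] = rho * u["y"] + p_c * y
    u["xx"] = rho * u["xx"] + p_c * np.outer(x, x)
    u["xy"] = rho * u["xy"] + p_c * x * y
    u["mean_x"] = u["x"] / u["one"]
    u["cov_x"] = u["xx"] / u["one"] - np.outer(u["mean_x"], u["mean_x"])
```

At small τ (high temperature) the responsibilities are nearly uniform, which is what makes the annealing insensitive to initialization; as τ grows, the best-fitting unit takes over.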
4.5 Unit Manipulation
Each unit is allocated adaptively. If the suitabilities of all units for a new learning sample i are less than a threshold, a new unit is allocated. Therefore, if the condition

(∀c) ( p(x_i, y_i, c|θ) < ε )   (25)

is satisfied, a new unit N + 1 is allocated:

x̄_{N+1} = x_i,   ȳ_{N+1} = y_i,   Σ*_{N+1} = k_{N+1} I_D,   (26)

where k_{N+1} is the parameter used in Eq. (17) and ε denotes a small positive threshold. Note that the initial output function of the newly allocated linear model is f(x) = ȳ. Although unit pruning or merging strategies can also be applied to the PLS mixture model, we do not use such strategies in this paper, in order to show the effect of the PLS clearly.
5 Experiments

To examine the performance of the proposed system, we tested it using two datasets: one including irrelevant dimensions and one including correlated dimensions.
5.1 Performance on a Dataset Including Irrelevant Dimensions

The proposed PLS mixture model is good at reducing variables when there are several correlated dimensions. However, it can also reduce irrelevant dimensions if its parameters are optimized. To verify this property, we tested our model using the following synthetic dataset:

y = cos( 2(x_0 + x_1) ),   (27)

where x_0 and x_1 are independent of each other. We set both x_0 and x_1 to uniform random values in the intervals 0 ≤ x_0 ≤ 1 and 0 ≤ x_1 ≤ 1, respectively.
Fig. 5. PLS kernels after 1000 observations

Fig. 6. PLS kernels after 5000 observations

Fig. 7. RMSE, number of units versus number of iterations
Note that x_0 and x_1 can be converted to

t_0 = (x_0 + x_1)/√2,   t_1 = (x_0 − x_1)/√2.   (28)
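The claim that y depends on t_0 only can be checked numerically (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.uniform(0.0, 1.0, 100)
x1 = rng.uniform(0.0, 1.0, 100)
t0 = (x0 + x1) / np.sqrt(2.0)    # Eq. (28)
t1 = (x0 - x1) / np.sqrt(2.0)
y = np.cos(2.0 * (x0 + x1))      # Eq. (27)
```

Substituting x_0 + x_1 = √2 t_0 gives y = cos(2√2 t_0), with no t_1 dependence at all.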
From these equations, we can see that t_0 is related to y but t_1 is not. Therefore, y is a one-dimensional function in nature. PLS-kernel distributions after 1000 and 5000 observations are shown in Figs. 5 and 6. We can see that, in the early steps of the learning, the PLS kernels were sphere shaped, but they became ellipse shaped in the later steps of the learning. In particular, the widths along t_0 became narrow, while those along t_1 became wide. This means that the kernels are forced to ignore the irrelevant t_1. Using this strategy, the model achieves a high generalization ability.

5.2 Performance on a Multicollinearity Dataset
To verify the effect of PLS kernels, we tested the proposed model using a multicollinearity dataset. This is a 10-dimensional dataset including two essential dimensions, x_0 and x_1, which have the following relation to y:

y = max{ exp(−10x_0²), exp(−50x_1²), 1.25 exp(−5(x_0² + x_1²)) },   (29)

where x_0 and x_1 are independent of each other. We set them to uniform random values in the intervals −1 ≤ x_0 ≤ 1 and −1 ≤ x_1 ≤ 1, respectively. The other dimensions correlate with x_0 and x_1 as follows:

x_2 = x_0,   x_4 = (x_0 − x_1)/√2,   x_6 = (x_0 − 2x_1)/√5,   x_8 = (2x_0 − x_1)/√5,
x_3 = x_1,   x_5 = (x_0 + x_1)/√2,   x_7 = (x_0 + 2x_1)/√5,   x_9 = (2x_0 + x_1)/√5.

We also compared the performance of our model with that of an NGnet using an online EM algorithm (NG-OEM) [4] without pruning, merging, or dividing strategies.² For these two models, we performed five trials of 10000 iterations, and their root mean square errors (RMSEs) were averaged over the five trials. The number of units was also examined in the same manner as the RMSE. The performances of PLS-ODAEM (our model) and NG-OEM are shown in Fig. 7. We can see that the number of units was smaller for PLS-ODAEM than for NG-OEM. The convergence speed of the RMSE of PLS-ODAEM was slower than that of NG-OEM, but the ultimate RMSE of PLS-ODAEM was less than that of NG-OEM.
² There are cases where the pruning and merging strategies also produce a dimension-reduction effect. To investigate the PLS effect clearly, we examined the performances without the pruning and merging strategies.
J. Hayami and K. Yamauchi

6 Conclusion
In this paper, we proposed a PLS mixture model using an online DAEM algorithm. The new model reduces the number of input dimensions using the partial least squares (PLS) method. PLS was applied to both the linear model and the Gaussian kernel, and the parameters were optimized by the online DAEM algorithm. Strictly speaking, this model does not yet satisfy the conditions for a statistical model. However, if we restrict λ(t)c in Eq.(17) to be larger than 0, we can use p(x|c) = g(x|c)/∫g(x|c)dx instead of g(x|c) to make our method satisfy the conditions. This model can be applied not only to a multicollinearity dataset but also to datasets that include irrelevant variables. We plan to improve the model by adding pruning, merging, and dividing strategies.
References

1. Wold, H.: Soft modeling by latent variables: the nonlinear iterative partial least squares approach. In: Gani, J. (ed.) Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, pp. 520–540. Academic Press, London (1975)
2. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10(8), 2047–2084 (1998)
3. Vijayakumar, S., D'Souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17(12), 2602–2634 (2005)
4. Sato, M.-a., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12, 407–432 (2000)
5. Manne, R.: Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 2, 187–197 (1987)
6. Lindgren, F., Geladi, P., Wold, S.: The kernel algorithm for PLS. Journal of Chemometrics 7(1), 45–59 (1993)
7. Ter Braak, C.J.F., De Jong, S.: Comments on the PLS kernel algorithm. Journal of Chemometrics 8(2), 169–174 (1994)
8. Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In: Tesauro, G., Touretzky, D., Leen, T. (eds.) Advances in Neural Information Processing Systems, vol. 7, pp. 633–640. MIT Press, Cambridge (1995)
9. Ueda, N., Nakano, R.: Deterministic annealing variant of the EM algorithm. Advances in Neural Information Processing Systems 7, 545–552 (1995)
Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise

Chi Sing Leung¹, Pui Fai Sum², and Tien-Tsin Wong³

¹ Department of Electronic Engineering, City University of Hong Kong
[email protected]
² Institute of Electronic Commerce, National Chung Hsing University, Taiwan
³ Department of Computer Science and Engineering, The Chinese University of Hong Kong
Abstract. In neural networks, network faults can appear in different forms, such as node faults and weight faults. One kind of weight fault is due to the hardware or software precision. This kind of weight fault can be modelled as multiplicative weight noise. This paper analyzes the capacity of a bidirectional associative memory (BAM) affected by multiplicative weight noise. Assuming that the weights are corrupted by multiplicative noise, we study how many pattern pairs can be stored as fixed points. Since capacity is not meaningful without considering the error correction capability, we also present the capacity of a BAM with multiplicative noise when there are some errors in the input pattern. Simulations have been carried out to confirm our derivations.
1 Introduction
Associative memories have a wide range of applications, including content addressable memory and pattern recognition [1,2]. An important feature of associative memories is the ability to recall the stored patterns from partial or noisy inputs. One form of associative memory is the bivalent additive bidirectional associative memory (BAM) [3] model. There are two layers, F_X and F_Y, of neurons in a BAM. Layer F_X has n neurons and layer F_Y has p neurons. A BAM is used to store pairs of bipolar patterns (X_h, Y_h), where h = 1, 2, ..., m; X_h ∈ {+1, −1}^n; Y_h ∈ {+1, −1}^p; and m is the number of patterns stored. We shall refer to these pattern pairs as library pairs. The recall process is an iterative one starting with a stimulus pair (X^(0), Y^(0)). After a number of iterations, the patterns in F_X and F_Y should converge to a fixed point, which is desired to be one of the library pairs. BAM has three important features [3]. Firstly, BAM can perform both heteroassociative and autoassociative data recall: the final state in layer F_X represents the autoassociative recall, while the final state in layer F_Y represents the heteroassociative recall. Secondly, the initial input can be presented in either of the two layers. Lastly, BAM is stable during recall. In other words, for any connection matrix, a BAM always converges to a stable state. Several methods have been proposed to improve its capacity [4,5,6,7,8].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 289–298, 2008. © Springer-Verlag Berlin Heidelberg 2008
Although the capacity of BAM has been intensively studied under ideal conditions [9,10,11,12,13], practical realizations of BAM may encounter inaccuracy in the stored weights. All the previous studies assume that the stored weight matrix is noiseless. However, this is not always the case when a BAM is used in real applications. One kind of weight fault is due to the hardware or software precision [14,15]. For example, in a digital implementation, when we use a low-precision floating-point format, such as 16-bit half-float [16], to represent the trained weights, truncation errors are introduced. The magnitude of the truncation errors is proportional to that of the trained weights. Hence, truncation errors can be modelled as multiplicative weight noise [17,18]. This paper focuses on the quantitative impact of multiplicative weight noise on the BAM capacity. We study how many pattern pairs can be stored as fixed points when multiplicative weight noise is present. Since capacity is not meaningful without considering the error correction capability, we also present the capacity of a BAM with multiplicative noise when there are some errors in the input pattern. The rest of this paper is organized as follows. Section 2 introduces the BAM model and multiplicative weight noise. Section 3 presents the capacity analysis of a BAM with multiplicative weight noise. Simulation examples are given in Section 4. Finally, we conclude our work in Section 5.
2 BAM with Multiplicative Weight Noise

2.1 BAM
The BAM, as proposed by Kosko [3], is a two-layer nonlinear feedback heteroassociative memory in which m library pairs (X_1, Y_1), ..., (X_m, Y_m) are stored, where X_h ∈ {−1, 1}^n and Y_h ∈ {−1, 1}^p. There are two layers of neurons in a BAM; layer F_X has n neurons and layer F_Y has p neurons. The connection matrix between the two layers is denoted as W. The encoding equation, as proposed by Kosko, is given by

W = Σ_{h=1}^{m} Y_h X_h^T,  (1)

which can be rewritten as

w_ji = Σ_{h=1}^{m} x_ih y_jh,  (2)

where X_h = (x_1h, x_2h, ..., x_nh)^T and Y_h = (y_1h, y_2h, ..., y_ph)^T. The recall process employs interlayer feedback. An initial pattern X^(0) presented to F_X is passed through W and thresholded, and a new state Y^(1) in F_Y is obtained, which is then passed back through W^T and thresholded again, leading to a new state X^(1) in F_X. The process repeats until the state of the BAM converges. Mathematically, the recall process is

Y^(t+1) = sgn(W X^(t)), and X^(t+1) = sgn(W^T Y^(t+1)),  (3)
where sgn(·) is the sign operator:

sgn(x) = +1 if x > 0; −1 if x < 0; state unchanged if x = 0.

Using an element-by-element notation, the recall process can be written as:

y_j^(t+1) = sgn( Σ_{i=1}^{n} w_ji x_i^(t) ), and x_i^(t+1) = sgn( Σ_{j=1}^{p} w_ji y_j^(t+1) ),  (4)

where x_i^(t) is the state of the ith F_X neuron and y_j^(t) is the state of the jth F_Y neuron. The above bidirectional process produces a sequence of pattern pairs (X^(t), Y^(t)): (X^(1), Y^(1)), (X^(2), Y^(2)), .... This sequence converges to one of the fixed points (X_f, Y_f), and this fixed point ideally should be one of the library pairs or nearly so. A fixed point (X_f, Y_f) has the following properties:

Y_f = sgn(W X_f) and X_f = sgn(W^T Y_f).  (5)
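Kosko's encoding (1)-(2) and the bidirectional recall (3)-(5) can be sketched as follows. This is our illustration; the function names and the small dimensions are our own choices.

```python
# Sketch: Kosko encoding and bidirectional recall for a bipolar BAM.
import numpy as np

def encode(X, Y):
    # X: m x n, Y: m x p bipolar library patterns; W = sum_h Y_h X_h^T (p x n)
    return Y.T @ X

def sgn(u, prev):
    # sign operator of (3)-(4); a zero keeps the previous state
    return np.where(u > 0, 1, np.where(u < 0, -1, prev))

def recall(W, x0, n_iter=20):
    # iterate Y = sgn(W X), X = sgn(W^T Y) until (practically) converged
    x = x0.copy()
    y = np.ones(W.shape[0])
    for _ in range(n_iter):
        y = sgn(W @ x, y)
        x = sgn(W.T @ y, x)
    return x, y

rng = np.random.default_rng(1)
n = p = 64
X = rng.choice([-1, 1], size=(3, n))
Y = rng.choice([-1, 1], size=(3, p))
W = encode(X, Y)
xf, yf = recall(W, X[0])   # should settle at a fixed point (5)
```

With m = 3 pairs, far below the capacity discussed next, the recall from a stored pattern settles at a fixed point satisfying (5).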
Hence, a library pair can be retrieved only if it is a fixed point. One of the advantages of Kosko's encoding method is the ability of incremental learning, i.e., the ability to encode new library pairs into the model based on the current connection matrix only. With Kosko's encoding method, a BAM can correctly store only up to min(n,p)/(2 log min(n,p)) library pairs. When the number of library pairs exceeds that value, a library pair may not be stored as a fixed point.

2.2 Multiplicative Weight Noise
In some electrical circuits, inaccuracy occurs in the implementation of trained weights. Errors in the weights may be caused by quantization error due to the limited number of bits used to store the trained weights, or by percentage error due to voltage perturbation. In digital implementations, such as DSPs and FPGAs, the trained weights are usually stored in floating-point format. Floating-point representation of real numbers is more desirable than integer representation because floating-point provides a large dynamic range. When we use a low-precision floating-point format, such as 16-bit half-float [16], to represent the trained weights, truncation errors are introduced. The magnitude of the truncation errors is proportional to that of the trained weights [19]. Hence, the truncation errors can be modelled as multiplicative weight noise. An implementation of a weight w_ji is denoted by w̃_ji. Under multiplicative weight noise, each implemented weight deviates from its nominal value by a random percentage, i.e.,

w̃_ji = w_ji + b_ji w_ji,  (6)

where the b_ji's are identical independent zero-mean random variables with variance σ_b². The density function of the b_ji's is symmetrical.
3 Analysis on BAM with Multiplicative Weight Noise

3.1 Capacity
We will investigate the BAM's memory capacity in the presence of multiplicative weight noise. The following assumptions and notations are used.

– The dimensions, n and p, are large. Also, p = rn, where r is a positive constant.
– Each component of the library pairs (X_h, Y_h) is a ±1 equiprobable independent random variable.
– EU_{j,h} is the event that sgn(Σ_{i=1}^{n} w̃_ji x_ih) is equal to y_jh (the j-th component of the library pattern Y_h). Also, EU̅_{j,h} is the complement event of EU_{j,h}.
– EV_{i,h} is the event that sgn(Σ_{j=1}^{p} w̃_ji y_jh) is equal to x_ih (the i-th component of the library pattern X_h). Also, EV̅_{i,h} is the complement event of EV_{i,h}.

With the above assumptions and the multiplicative weight noise, we introduce Lemma 1 and Lemma 2. They will assist us in deriving the capacity of a BAM with multiplicative weight noise.

Lemma 1: The probability Prob(EU̅_{j,h}) is approximately equal to

Q( √( n / ((1 + σ_b²) m) ) )

for j = 1, ..., p and h = 1, ..., m, where Q(z) = (1/√(2π)) ∫_z^∞ exp(−u²/2) du.
Proof: Event EU̅_{j,h} means that sgn(Σ_{i=1}^{n} w̃_ji x_ih) ≠ y_jh. From (2) and (6), we have

Σ_{i=1}^{n} w̃_ji x_ih = Σ_{i=1}^{n} w_ji (1 + b_ji) x_ih = Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' ) (1 + b_ji) x_ih
= n y_jh + Σ_{i=1}^{n} ( Σ_{h'≠h} y_jh' x_ih' ) x_ih + Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' ) b_ji x_ih.  (7)

Without loss of generality, we consider the library pair (X_h, Y_h) having all components positive: X_h = (1, ..., 1)^T and Y_h = (1, ..., 1)^T. This consideration is commonly used [13] and does not affect our results; we can easily verify this by use of conditional probability. Now, (7) becomes

Σ_{i=1}^{n} w̃_ji x_ih = n + Σ_{i=1}^{n} ( Σ_{h'≠h} y_jh' x_ih' ) + Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' ) b_ji  (8)
= n + Σ_{i=1}^{n} α_ji + Σ_{i=1}^{n} β_ji,  (9)

where α_ji = Σ_{h'≠h} y_jh' x_ih' and β_ji = ( Σ_{h'=1}^{m} y_jh' x_ih' ) b_ji. The α_ji's are independent identical zero-mean random variables (i.e., E[α_ji] = 0 and E[α_ji α_ji'] = 0 for i ≠ i'), and the variance of the α_ji's, denoted Var[α_ji], is equal to (m − 1). The b_ji's are independent identical zero-mean random variables, and they are independent of the ( Σ_{h'=1}^{m} y_jh' x_ih' )'s. Hence, the β_ji's are independent identical zero-mean random variables, with E[β_ji] = 0, E[β_ji β_ji'] = 0 for i ≠ i', and Var[β_ji] = σ_b² m.

For large n, the summations Σ_{i=1}^{n} α_ji and Σ_{i=1}^{n} β_ji tend to normal with variances equal to (m − 1)n and σ_b² m n, respectively. Besides, the sum of two normal random variables is still a normal random variable. Hence, (9) becomes

Σ_{i=1}^{n} w̃_ji x_ih = n + Σ_{i=1}^{n} α_ji + Σ_{i=1}^{n} β_ji = n + α + β.  (10)

Note that the β_ji's are independent of the ( Σ_{h'≠h} y_jh' x_ih' )'s. We have E[αβ] = 0, E[α + β] = 0, and Var[α + β] = (m − 1)n + σ_b² m n. Event EU̅_{j,h} means that α + β < −n. Hence,

Prob(EU̅_{j,h}) ≈ Q( n / √((m − 1)n + σ_b² m n) ) = Q( √( n / ((m − 1) + σ_b² m) ) ).

For large m, this is approximately Q( √( n / ((1 + σ_b²) m) ) ). (Proof completed)
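As a quick numeric sanity check (ours, with arbitrarily chosen n, m, and σ_b²), the exact-variance expression and the large-m approximation at the end of the proof agree closely:

```python
# Compare Q(n / sqrt((m-1)n + sigma^2 m n)) against Q(sqrt(n / ((1+sigma^2) m))).
import math

def Q(z):
    # upper-tail probability of the standard normal
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n, m, s2 = 1024, 60, 0.2
exact = Q(n / math.sqrt((m - 1) * n + s2 * m * n))
approx = Q(math.sqrt(n / ((1 + s2) * m)))
```

For m = 60 the two tail probabilities already differ by only about ten percent, and the gap shrinks as m grows.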
In a similar way, we can obtain Lemma 2.

Lemma 2: The probability Prob(EV̅_{i,h}) is approximately equal to

Q( √( p / ((1 + σ_b²) m) ) )

for i = 1, ..., n and h = 1, ..., m.

Now, we estimate the capacity. Let the probability that a library pair (X_h, Y_h) is a fixed point be P*:

P* = Prob( EU_{1,h} ∩ ⋯ ∩ EU_{p,h} ∩ EV_{1,h} ∩ ⋯ ∩ EV_{n,h} )
   = 1 − Prob( EU̅_{1,h} ∪ ⋯ ∪ EU̅_{p,h} ∪ EV̅_{1,h} ∪ ⋯ ∪ EV̅_{n,h} )
   ≥ 1 − p Prob(EU̅_{j,h}) − n Prob(EV̅_{i,h}).  (11)

From Lemmas 1 and 2, (11) becomes

P* ≥ 1 − p Q( √( n / ((1 + σ_b²) m) ) ) − n Q( √( p / ((1 + σ_b²) m) ) ).  (12)

Letting P_A = p Q( √( n / ((1 + σ_b²) m) ) ) and P_B = n Q( √( p / ((1 + σ_b²) m) ) ), we get

P* ≥ 1 − P_A − P_B.  (13)
If z is large,

Q(z) ≈ exp( −z²/2 − log z − (1/2) log 2π ),  (14)

which is quite accurate for z > 3. Using the approximation (14),

P_A = exp( log p − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) − (1/2) log 2π )
    = exp( log r + log n − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) − (1/2) log 2π )
    = exp( log n − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) + constant ).  (15)

Clearly, if m < n/(2(1 + σ_b²) log n), then P_A tends to zero as n tends to infinity. Similarly, as p → ∞, if m < p/(2(1 + σ_b²) log p), then P_B → 0. To sum up, for large n and p, if

m < min(n, p) / (2(1 + σ_b²) log min(n, p)),  (16)

then P* → 1. That means that if the number m of library pairs is less than min(n,p)/(2(1 + σ_b²) log min(n,p)), a library pair has a very high chance of being a fixed point. So the capacity of a BAM with multiplicative weight noise is equal to min(n,p)/(2(1 + σ_b²) log min(n,p)).
3.2 Error Correction
In this section, we will investigate the capacity of a BAM with weight noise when the initial input is a noisy version X_h^noise of a library pattern X_h. Let X_h^noise contain ρn bit errors, where ρ is the input noise level. If

Y_h = sgn(W̃ X_h^noise),  (17)
X_h = sgn(W̃^T Y_h),  (18)
Y_h = sgn(W̃ X_h),  (19)

then the noisy version X_h^noise successfully recalls the desired library pair (X_h, Y_h). Similarly, we hope that a noisy version Y_h^noise of Y_h can successfully recall the desired library pair (X_h, Y_h). We will study under what condition on m the probability of successful recall tends to one.

Define EU_{j,h}^noise as the event that sgn(Σ_{i=1}^{n} w̃_ji x_ih^noise) is equal to y_jh (the j-th component of the library pattern Y_h), and let EU̅_{j,h}^noise be its complement event. Similarly, define EV_{i,h}^noise as the event that sgn(Σ_{j=1}^{p} w̃_ji y_jh^noise) is equal to x_ih (the i-th component of the library pattern X_h), and let EV̅_{i,h}^noise be its complement event. With these definitions, we can follow the proof of Lemma 1 to obtain the following two lemmas.
Lemma 3: The probability Prob(EU̅_{j,h}^noise) is approximately equal to

Q( √( (1 − 2ρ) n / ((1 + σ_b²) m) ) )

for j = 1, ..., p and h = 1, ..., m.

Lemma 4: The probability Prob(EV̅_{i,h}^noise) is approximately equal to

Q( √( (1 − 2ρ) p / ((1 + σ_b²) m) ) )

for i = 1, ..., n and h = 1, ..., m.

Define P** as the probability that a noisy version with a ρ fraction of errors can recall the desired library pair. It is not difficult to show that

P** ≥ 1 − p ( Prob(EU̅_{j,h}) + Prob(EU̅_{j,h}^noise) ) − n ( Prob(EV̅_{i,h}) + Prob(EV̅_{i,h}^noise) ).  (20)

From Lemmas 1-4, for large n and p, if

m < (1 − 2ρ) min(n, p) / (2(1 + σ_b²) log min(n, p)),  (21)

then P** → 1. That means that, when there are ρn (or ρp) bit errors in the initial input, the capacity of a BAM with multiplicative weight noise is equal to (1 − 2ρ) min(n,p)/(2(1 + σ_b²) log min(n,p)).
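Evaluating (21) numerically (our sketch); rounding to the nearest integer reproduces the counts quoted later in Section 4.2 for n = p = 512 and σ_b² = 0.2:

```python
# Noisy-input capacity bound (21).
import math

def noisy_capacity(n, p, sigma_b2, rho):
    d = min(n, p)
    return (1.0 - 2.0 * rho) * d / (2.0 * (1.0 + sigma_b2) * math.log(d))

caps = [round(noisy_capacity(512, 512, 0.2, r)) for r in (0.03125, 0.0625, 0.125)]
# caps -> [32, 30, 26]
```

The factor (1 − 2ρ) shows the trade-off directly: every flipped input bit reduces the signal term by two units, so error tolerance is bought at a proportional cost in storable pairs.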
4 Simulation

4.1 Capacity
We consider two dimensions, 512 and 1024. For each m, we randomly generate 1000 sets of library pairs. Kosko's rule is then used to encode the matrices. Afterwards, we add multiplicative weight noise to the matrices. The variances σ_b² of the weight noise are set to 0, 0.2, and 0.4. Figure 1 shows the percentage of library pairs being successfully stored. From the figure, as the variance of the weight noise increases, the successful rate decreases. This phenomenon agrees with our expectation. From our analysis, i.e., (16), for n = p = 512, a BAM can store up to 41, 34, and 29 pairs for σ_b² equal to 0, 0.2, and 0.4, respectively. From Figure 1(a), all the corresponding successful rates are very high. Also, there is a sharp drop in the successful rate for {m > 41, σ_b² = 0}, {m > 34, σ_b² = 0.2}, and {m > 29, σ_b² = 0.4}.
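The experiment can be reproduced at a reduced scale with the following sketch. This is our code, with a much smaller dimension and fewer trials than the paper; we use Gaussian b_ji, which is one valid instance of the zero-mean symmetric noise assumed in Section 2.2.

```python
# Monte Carlo estimate of the probability that library pair 0 is a fixed
# point of the noisy weight matrix.
import numpy as np

def fixed_point_rate(n, p, m, sigma_b2, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.choice([-1, 1], size=(m, n))
        Y = rng.choice([-1, 1], size=(m, p))
        W = Y.T @ X                                    # Kosko encoding (1)
        b = rng.normal(0.0, np.sqrt(sigma_b2), W.shape)
        Wn = W * (1.0 + b)                             # multiplicative noise (6)
        # fixed-point check (5) for library pair 0
        ok = (np.sign(Wn @ X[0]) == Y[0]).all() and \
             (np.sign(Wn.T @ Y[0]) == X[0]).all()
        hits += ok
    return hits / trials

rate = fixed_point_rate(n=64, p=64, m=2, sigma_b2=0.2)
```

With m well below the bound (16) (here about 6 pairs for n = 64 and σ_b² = 0.2), the estimated rate stays close to one.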
Fig. 1. Successful rate of a library pair being a fixed point. (a) The dimension is 512. (b) The dimension is 1024. For each value of m, we generate 1000 sets of library pairs.
(a) weight noise level σ_b² = 0.2  (b) weight noise level σ_b² = 0.4  (curves: ρ = 0.03125, 0.0625, 0.125)
Fig. 2. Successful recall rate from a noise input. The dimension is 512. For each value of m, we generate 1000 sets of library pairs. For each library pattern, we generate 10 noise versions.
Similarly, from (16), for n = p = 1024, a BAM can store up to 73, 61, and 52 pairs for σ_b² equal to 0, 0.2, and 0.4, respectively. From Figure 1(b), all the corresponding successful rates are also high. Also, there is a sharp drop in the successful rate for {m > 73, σ_b² = 0}, {m > 61, σ_b² = 0.2}, and {m > 52, σ_b² = 0.4}. To sum up, the simulation results are consistent with our analysis (16).

4.2 Error Correction
The dimension is 512. We consider two weight noise levels, σ_b² = 0.2 and 0.4, and three input error levels, ρ = 0.03125, 0.0625, and 0.125. For each m, we randomly generate 1000 sets of library pairs. Kosko's rule is then used to encode the matrices. Afterwards, we add multiplicative weight noise to the matrices. For each library pair, we generate ten noisy versions. We then feed the noisy versions as initial inputs and check whether the desired library pair can be recalled
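A reduced-scale sketch of this test (our code; Gaussian weight noise, and a single recall pass following (17)-(19)):

```python
# Monte Carlo estimate of the successful recall rate from a noisy input.
import numpy as np

def noisy_recall_rate(n, p, m, sigma_b2, rho, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    n_flip = int(rho * n)
    hits = 0
    for _ in range(trials):
        X = rng.choice([-1, 1], size=(m, n))
        Y = rng.choice([-1, 1], size=(m, p))
        W = Y.T @ X
        Wn = W * (1.0 + rng.normal(0.0, np.sqrt(sigma_b2), W.shape))
        x = X[0].copy()
        flip = rng.choice(n, size=n_flip, replace=False)
        x[flip] *= -1                       # rho*n bit errors in the input
        y1 = np.sign(Wn @ x)                # eq. (17)
        x1 = np.sign(Wn.T @ y1)             # eq. (18)
        y2 = np.sign(Wn @ x1)               # eq. (19)
        hits += (x1 == X[0]).all() and (y2 == Y[0]).all()
    return hits / trials

rate = noisy_recall_rate(n=64, p=64, m=2, sigma_b2=0.2, rho=0.0625)
```

Again, with m far below the bound (21), the recall rate stays near one even with flipped input bits.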
(a) weight noise level σ_b² = 0.2  (b) weight noise level σ_b² = 0.4  (curves: ρ = 0.015625, 0.03125, 0.0625)
Fig. 3. Successful recall rate from a noise input. The dimension is 1024. For each value of m, we generate 1000 sets of library pairs. For each library pattern, we generate 10 noise versions.
or not. Figures 2 and 3 show the successful recall rate. From the figures, as the input error level ρ increases, the successful rate decreases. This phenomenon agrees with our expectation. From our analysis, i.e., (21), for the dimension n = p = 512 and weight noise level σ_b² = 0.2, a BAM can store up to 32, 30, and 26 pairs for the input error level ρ equal to 0.03125, 0.0625, and 0.125, respectively. From Figure 2(a), all the corresponding successful rates are high. Also, there is a sharp drop in the successful recall rate for {m > 32, ρ = 0.03125}, {m > 30, ρ = 0.0625}, and {m > 26, ρ = 0.125}. For the other weight noise levels and dimension, we observed similar phenomena.
5 Conclusion
We have examined the statistical storage behavior of a BAM with multiplicative weight noise. The capacity of a BAM is m < min(n,p)/(2(1 + σ_b²) log min(n,p)). When the number of library pairs is less than that value, the chance of a library pair being a fixed point is very high. Since we expect a BAM to have a certain error correction ability, we have also investigated the capacity of a BAM with weight noise when the initial input is a noisy version of a library pattern. We show that if m < (1 − 2ρ) min(n,p)/(2(1 + σ_b²) log min(n,p)), a noisy version with ρn (or ρp) errors has a high chance of recalling the desired library pair. Computer simulations have been carried out to confirm these results. The results presented here can be extended to the Hopfield network: by adopting the approach set out above, we can easily obtain the result for the Hopfield network by replacing min(n, p) with n in the above equations.
Acknowledgement The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grants (Project No. CityU 115606) and (Project No. CUHK 416806).
References

1. Kohonen, T.: Correlation matrix memories. IEEE Transactions on Computers 21, 353–359 (1972)
2. Palm, G.: On associative memory. Biological Cybernetics 36, 19–31 (1980)
3. Kosko, B.: Bidirectional associative memories. IEEE Trans. Syst., Man, Cybern. 18, 49–60 (1988)
4. Leung, C.S.: Encoding method for bidirectional associative memory using projection on convex sets. IEEE Trans. Neural Networks 4, 879–881 (1993)
5. Leung, C.S.: Optimum learning for bidirectional associative memory in the sense of capacity. IEEE Trans. Syst., Man, Cybern. 24, 791–796 (1994)
6. Wang, Y.F., Cruz, J.B., Mulligan, J.H.: Two coding strategies for bidirectional associative memory. IEEE Trans. Neural Networks 1, 81–92 (1990)
7. Lenze, B.: Improving Leung's bidirectional learning rule for associative memories. IEEE Trans. Neural Networks 12, 1222–1226 (2001)
8. Shen, D., Cruz, J.B.: Encoding strategy for maximum noise tolerance bidirectional associative memory. IEEE Trans. Neural Networks 16, 293–300 (2005)
9. Leung, C.S., Chan, L.W.: The behavior of forgetting learning in bidirectional associative memory. Neural Computation 9, 385–401 (1997)
10. Leung, C.S., Chan, L.W., Lai, E.: Stability and statistical properties of second-order bidirectional associative memory. IEEE Transactions on Neural Networks 8, 267–277 (1997)
11. Wang, B.H., Vachtsevanos, G.: Storage capacity of bidirectional associative memories. In: Proc. IJCNN 1991, Singapore, pp. 1831–1836 (1991)
12. Haines, K., Hecht-Nielsen, R.: A BAM with increased information storage capacity. In: Proc. of the 1988 IEEE Int. Conf. on Neural Networks, pp. 181–190 (1988)
13. Amari, S.: Statistical neurodynamics of various versions of correlation associative memory. In: Proc. of the 1988 IEEE Int. Conf. on Neural Networks, pp. 181–190 (1988)
14. Burr, J.: Digital neural network implementations. In: Neural Networks, Concepts, Applications, and Implementations, vol. III. Prentice Hall, Englewood Cliffs, New Jersey (1991)
15. Holt, J., Hwang, J.-N.: Finite precision error analysis of neural network hardware implementations. IEEE Transactions on Computers 42(3), 281–290 (1993)
16. Lam, P.M., Leung, C.S., Wong, T.T.: Noise-resistant fitting for spherical harmonics. IEEE Transactions on Visualization and Computer Graphics 12(2), 254–265 (2006)
17. Bernier, J.L., Ortega, J., Rodriguez, M.M., Rojas, I., Prieto, A.: An accurate measure for multilayer perceptron tolerance to weight deviations. Neural Processing Letters 10(2), 121–130 (1999)
18. Bernier, J.L., Diaz, A.F., Fernandez, F.J., Canas, A., Gonzalez, J., Martin-Smith, P., Ortega, J.: Assessing the noise immunity and generalization of radial basis function networks. Neural Processing Letters 18(1), 35–48 (2003)
19. Sripad, A., Snyder, D.: Quantization errors in floating-point arithmetic. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 456–463 (1978)
Fuzzy ARTMAP with Explicit and Implicit Weights

Takeshi Kamio¹, Kenji Mori¹, Kunihiko Mitsubori², Chang-Jun Ahn¹, Hisato Fujisaka¹, and Kazuhisa Haeiwa¹

¹ Dept. of Systems Engineering, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima-shi, Hiroshima, 731-3194, Japan
² Dept. of Electronics and Computer Systems, Takushoku University, 815-1, Tatemachi, Hachioji-shi, Tokyo, 193-0985, Japan
[email protected]
Abstract. ARTMAP is one of the famous supervised learning systems. Many learning methods for ARTMAP have been proposed since it was devised as a system to solve the stability-plasticity dilemma. AL-SLMAP was implemented by slightly modifying FCSR, the original learning method for fuzzy ARTMAP (FAM). Although AL-SLMAP can solve pattern recognition problems in a noisy environment more effectively than FCSR, it is less suitable for region classification problems than FCSR. This means that AL-SLMAP has some problems which do not exist in FCSR. In this paper, we propose a learning method for FAM with explicit and implicit weights to overcome these problems. Keywords: ARTMAP, FAM, Explicit weight, Implicit weight.
1 Introduction
Adaptive resonance theory neural network (ART) is an unsupervised learning system which can generate and grow categories by comparing the similarity between an input and memories, with the fineness of classification controlled by the "vigilance parameter". In contrast, ARTMAP is a supervised learning system which consists of a learning ART (ARTa), a supervising ART (ARTb), and a map field (MF). ARTa and ARTb classify sample data and recognition codes, respectively. MF maps each category in ARTa to the corresponding one in ARTb. An erroneous mapping is corrected by the process called "match tracking" (MT). Many learning methods for ART and ARTMAP have been proposed since they were devised as systems to solve the stability-plasticity dilemma. Here, we discuss learning methods for fuzzy ARTMAP (FAM) [1], which can classify not only binary but also analog inputs. Fast commit slow recode (FCSR) [1] is the original learning method for FAM. Since FCSR has little ability to reduce the influence of input noise, it may generate many categories unnecessarily and decrease the recognition performance. To solve this category proliferation problem, J.S. Lee et al. proposed AL-SLMAP [2], which is the combination of average learning (AL) [2] and slow

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 299–308, 2008. © Springer-Verlag Berlin Heidelberg 2008
learning of map field weight vectors (SLMAP) [3]. They also reported that AL-SLMAP could solve the character recognition problem in a noisy environment more effectively than FCSR [2]. We have investigated the ability of AL-SLMAP in detail, since we were very interested in it from the viewpoint of simplicity and performance. As a result, we have confirmed that AL-SLMAP cannot inhibit category proliferation for the character recognition problem in a highly noisy environment [4], and that AL-SLMAP is less suitable for the region classification problem than FCSR. These facts made us realize that AL-SLMAP has problems in category selection, weight update, and MT. In this paper, we propose a novel FAM model and its learning method. Each node of ARTa has a weight set for each individual recognition code. When a new node is created in ARTa, all its weight sets define its category; that is to say, they work as explicit weights. As the learning proceeds, each node of ARTa acquires weight sets which do not define its category. Such weight sets are called implicit weights. Our learning method uses both explicit and implicit weights to solve the problems in AL-SLMAP. Finally, it is shown by simulations that our learning method is better than FCSR and AL-SLMAP.
2 Fuzzy ARTMAP Learned by AL-SLMAP

2.1 System
Here, we review fuzzy ARTMAP (FAM) [1] and AL-SLMAP [2]. First, we explain the behavior of fuzzy ART. As shown in Fig. 1, it has an attentional subsystem (AS) and an orienting subsystem (OS). AS consists of an input layer (F0), a matching layer (F1), and a category layer (F2). If a normalized original input a ∈ [0,1]^n is given, F0 provides F1 with A ∈ [0,1]^{2n}. A is the complement code of a (i.e., A ≡ [a, a^c] = [a_1, ..., a_n, 1 − a_1, ..., 1 − a_n]). After A goes through F1 and reaches F2, F2 node j calculates the choice strength T_j:

T_j = |A ∧ U_j| / (α + |U_j|),  (j = 1, ..., m),  (1)

where (p ∧ q)_i = min(p_i, q_i), |p| = Σ_i |p_i|, U_j ∈ ℝ^{2n} is the bottom-up weight vector, α > 0 is the choice parameter, and m is the number of F2 nodes. The node J with the maximal choice strength is activated in F2. If several nodes have the maximal choice strength, the node with the minimal index is selected from them. The F2 activity y ∈ ℝ^m satisfies y_J = 1 and y_{j≠J} = 0. The activated F2 node J provides F1 with D_J ∈ ℝ^{2n}, which is the same as the top-down weight vector, and then the F1 activity x becomes A ∧ D_J. Then, OS calculates the matching degree S_J from A and x:

S_J = |A ∧ D_J| / |A|.  (2)

In addition, OS compares S_J with the vigilance parameter ρ ∈ [0,1]. If S_J ≥ ρ, node J has the category for the present input a. The category is defined by U_J and D_J. If S_J < ρ, node J is reset. The reset node J remains inactivated
Fig. 1. Fuzzy ART
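The complement coding, choice strength (1), and matching degree (2) just described can be sketched as follows. This is our code; the all-ones weight rows model uncommitted F2 nodes.

```python
# Sketch of the fuzzy ART front-end computations (1)-(2).
import numpy as np

def complement_code(a):
    # A = [a, a^c] = [a_1..a_n, 1-a_1..1-a_n]
    return np.concatenate([a, 1.0 - a])

def choice_strengths(A, U, alpha=0.001):
    # T_j = |A ^ U_j| / (alpha + |U_j|); ^ is elementwise min, |.| is the L1 norm
    num = np.minimum(A, U).sum(axis=1)
    return num / (alpha + U.sum(axis=1))

def matching_degree(A, D_J):
    # S_J = |A ^ D_J| / |A|
    return np.minimum(A, D_J).sum() / A.sum()

a = np.array([0.2, 0.9])
A = complement_code(a)              # [0.2, 0.9, 0.8, 0.1]; |A| = n = 2
U = np.ones((3, 4))                 # three uncommitted nodes
T = choice_strengths(A, U)
S = matching_degree(A, np.ones(4))  # all-ones D_J matches anything: S = 1
```

A handy property of complement coding is visible here: |A| is always n, so the matching degree of an uncommitted node is exactly 1 and the node can always pass the vigilance test.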
Fig. 2. Fuzzy ARTMAP (FAM)
until the input a changes. The above processes are iterated until ART finds an F2 node with S_J ≥ ρ. If all F2 nodes are reset in a learning period, a new node is added to F2. Therefore, ρ can be regarded as the fineness of classification.

Next, we explain the behavior of FAM. As shown in Fig. 2, FAM consists of a learning ART (ARTa), a supervising ART (ARTb), and a map field (MF). In the learning period, ARTa receives a sample a ∈ [0,1]^{n_a}, which may contain noise, and ARTb receives the corresponding recognition code b ∈ [0,1]^{n_b}. The vigilance parameter of ARTa (i.e., ρ_a) is set to the baseline value ρ_a0 whenever ARTa receives a new sample. If the category of ARTa is designated by F_2^a node J, ARTa provides F^ab in MF with W_J^ab ∈ ℝ^{m_b}, which is the same as the map field weight vector. If the category of ARTb is designated by F_2^b node K, ARTb
provides F^ab with y^b ∈ ℝ^{m_b}, which satisfies y_K^b = 1 and y_{k≠K}^b = 0. After receiving W_J^ab and y^b, F^ab checks the mapping from ARTa to ARTb by

|x^ab| ≥ ρ_ab |y^b|,  x^ab ≡ y^b ∧ W_J^ab,  (3)

where x^ab is the F^ab activity and ρ_ab ∈ [0,1] is the vigilance parameter of MF. If Eq.(3) is true, MF judges that the mapping is correct. In the case of an erroneous mapping, MF executes the match tracking (MT). MT resets F_2^a node J by increasing ρ_a as follows:

ρ_a = S_J^a + ε,  (4)

where S_J^a is the matching degree of F_2^a node J and ε is an arbitrary small positive value. Therefore, ARTa selects another node or generates a new node after MT. The above processes are iterated until Eq.(3) is satisfied. This means that MT can correct an erroneous mapping. When the mapping is deemed correct, AL-SLMAP [2] updates the weight vectors as follows:

D_J^{a(new)} = (1 + c_J)^{-1} A + (1 − (1 + c_J)^{-1}) D_J^{a(old)},
U_J^{a(new)} = β_a (A ∧ D_J^{a(new)}) + (1 − β_a) U_J^{a(old)},  (5)

D_K^{b(new)} = β_b (B ∧ D_K^{b(old)}) + (1 − β_b) D_K^{b(old)},
U_K^{b(new)} = β_b (B ∧ U_K^{b(old)}) + (1 − β_b) U_K^{b(old)},  (6)

W_J^{ab(new)} = β_ab (y^b ∧ W_J^{ab(old)}) + (1 − β_ab) W_J^{ab(old)},  (7)

where c_J is the count of selections of F_2^a node J in correct mappings, and β_a, β_b, β_ab ∈ (0,1] are learning rates. Note that β_b must be set to 1 whenever F_2^b node K learns for the first time. All the components of the weight vectors are set to 1 before the first update. Eqs.(5), (6), and (7) are based on AL [2], FCSR [1], and SLMAP [3], respectively. The reason why FCSR updates D_K^b and U_K^b is that FCSR can optimally learn the recognition codes, which are noiseless data. After the supervised learning is finished, ARTa can be used as a recognition system. This is because W_j^ab gives a mapping from ARTa to ARTb. However, if W_j^ab has more than one component satisfying W_jk^ab ≥ ρ_ab, F_2^a node j must be deleted before using ARTa as a recognition system.

2.2 Characteristics
First, we discuss the characteristics of the weight vectors created by AL-SLMAP (i.e., Eqs.(5)-(7)). D_j^a converges on the average of the samples which have been classified into F_2^a node j. U_j^a approximates the distribution of the samples; it is expressed by a hyper-rectangle in the input space. D_k^b and U_k^b immediately come to correspond to the k-th recognition code b_k. Recognition codes b_k denote noiseless data. Although the initialized W_j^ab relates F_2^a node j to all the F_2^b nodes, W_j^ab finally relates F_2^a node j to only one F_2^b node.
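The two update forms used in Eqs.(5)-(7) can be sketched in Python. This is an illustrative sketch, not code from the paper: the function names and toy vectors are ours, and the fuzzy AND (∧) is the component-wise minimum, as is standard in fuzzy ART.

```python
import numpy as np

def al_update(D_old, A, c_J):
    """Average learning, first line of Eq.(5): with c_J counting the
    previous correct selections of the node, D converges on the sample mean."""
    r = 1.0 / (1.0 + c_J)
    return r * A + (1.0 - r) * D_old

def slow_update(w_old, x, beta):
    """Fuzzy-min recoding, the form shared by Eqs.(5)-(7):
    new = beta * (x AND old) + (1 - beta) * old."""
    return beta * np.minimum(x, w_old) + (1.0 - beta) * w_old

w = np.ones(4)                       # weights initialized to 1
x = np.array([0.2, 0.8, 1.0, 0.5])   # an input vector
w_fast = slow_update(w, x, 1.0)      # beta = 1: commits to x AND w in one step
w_slow = slow_update(w, x, 0.02)     # small beta: slow recoding as in Eq.(7)
```

With beta = 1 the weight commits immediately (fast learning), while a small beta such as the β_ab = 0.02 used later moves the weight only slightly per presentation.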
Fuzzy ARTMAP with Explicit and Implicit Weights
303
Next, let us consider the result of Ref.[2]. Ref.[2] shows that AL-SLMAP can inhibit the category proliferation for the character recognition problem which consists of noisy samples and noiseless recognition codes. This result must be obtained by satisfying the following conditions:
(a) Since Eq.(7) makes W_j^ab learn slowly, the occurrence of MT is reduced. As a result, F_2^a node j can get a lot of samples.
(b) When W_j^ab has just related F_2^a node j to only F_2^b node k, most of the samples learned by F_2^a node j correspond to the recognition code b_k.
(c) Since Eq.(5) averages the samples, the influence of noise is eliminated from D_j^a and U_j^a.
(d) Since D_j^a and U_j^a become a good category, unnecessary MT can hardly occur.
However, we have found two disappointing facts about AL-SLMAP. One is that AL-SLMAP cannot inhibit the category proliferation for the character recognition problem in a highly noisy environment [4]. The other is that AL-SLMAP is less suitable for the region classification problem than FCSR. From these facts, we have noticed three problems. The first problem is as follows: when F_2^a node j learns a certain set of samples corresponding to the identical recognition code, the presentation order of the samples influences U_j^a. The second problem is that the elimination of noise from D_j^a and U_j^a is incomplete because condition (b) is not always satisfied. The third problem is a potential drawback of MT [4]. To solve these problems, we change the choice strength for ARTa, propose FAM with explicit and implicit weights, and modify MT by using implicit weights.
3 FAM with Explicit and Implicit Weights

3.1 Choice Strength for ARTa
As the distance between a sample and a category becomes smaller and the size of the category grows larger, the choice strength given by Eq.(1) becomes larger. However, we think that Eq.(1) is unsuitable as the choice strength for ARTa. To verify this opinion, we made F_2^a node j learn a certain set of samples corresponding to the identical recognition code. In each experiment, the samples were given in a different order. The simulation results show the following: D_j^a was always the same vector, and although U_j^a became a different vector, the fluctuation of U_j^a was small. From these results and the characteristics of Eq.(1) mentioned above, we have judged that U_j^a may define an improper category. To solve this problem, the distance between a sample and a category should be estimated not by A ∧ U_j^a but by A ∧ D_j^a. In this case, the bottom-up weight should be defined not by a vector U_j^a but by a scalar U_j^a, because only its size is needed. Therefore, we change the definition of the choice strength for ARTa as follows:

T_j^a = |A ∧ D_j^a| / (α + U_j^a),  (j = 1, ..., m_a).  (8)
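Eq.(8) is straightforward to express in code. The sketch below is ours (the function name and default α are illustrative); ∧ is the component-wise minimum and |·| the component sum, as is usual in fuzzy ART.

```python
import numpy as np

def choice_strength(A, D, U, alpha=0.1):
    """Modified choice strength for ARTa (Eq.(8)): the distance is measured
    against the average vector D, while the category size enters only
    through the scalar U."""
    return np.sum(np.minimum(A, D)) / (alpha + U)
```

When A lies at the category average (A = D), the numerator reduces to |A|, so the node with the smallest size scalar U wins among equally close categories.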
Fig. 3. F_2^a node j of our proposed FAM
3.2 Explicit and Implicit Weights
When W_j^ab has just related F_2^a node j only to F_2^b node k, D_j^a and U_j^a should be created by the samples corresponding only to b_k. To achieve this, we propose FAM with explicit and implicit weights. As shown in Fig.3, F_2^a node j has a weight set (d_jk^a, u_jk^a) for each F_2^b node k. D_j^a and U_j^a are calculated by

D_j^a = Σ_k g_jk n_jk d_jk^a / Σ_k g_jk n_jk,  (9)

U_j^a = Σ_k g_jk n_jk u_jk^a / Σ_k g_jk n_jk,  (10)

where the sums run over k = 1, ..., m_b and n_jk is the count of updates of (d_jk^a, u_jk^a). All the components of d_jk^a are set to 1 and u_jk^a is set to 2n_a before the first update. Also, g_jk is given by

g_jk = 1 if W_jk^ab ≥ ρ_ab, and g_jk = 0 otherwise.  (11)

Eqs.(9)-(11) show that (D_j^a, U_j^a) is defined only by the weight sets (d_jk^a, u_jk^a) with g_jk = 1. Therefore, we call a weight set (d_jk^a, u_jk^a) with g_jk = 1 an explicit weight and a weight set (d_jk^a, u_jk^a) with g_jk = 0 an implicit weight. If MF judges that the mapping from F_2^a node j to F_2^b node k is correct, (d_jk^a, u_jk^a) is updated by

d_jk^a(new) = n_jk^{-1} A + (1 − n_jk^{-1}) d_jk^a(old),  (12)

u_jk^a(new) = n_jk^{-1} |d_jk^a(new) ∧ A| + (1 − n_jk^{-1}) u_jk^a(old).  (13)
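The combination rule of Eqs.(9)-(11) can be sketched as follows. This is our illustrative reading of the equations (function name, array layout, and test values are ours): only explicit weight sets (g_jk = 1) contribute, each weighted by its update count n_jk.

```python
import numpy as np

def combine_weight_sets(d, u, n, W_ab, rho_ab):
    """Assemble (D_j^a, U_j^a) for one F_2^a node j from its per-node
    weight sets, Eqs.(9)-(11).
    d: (m_b, dim) average vectors d_jk^a, u: (m_b,) size scalars u_jk^a,
    n: (m_b,) update counts n_jk, W_ab: (m_b,) map field weights of node j."""
    g = (W_ab >= rho_ab).astype(float)               # Eq.(11): explicit-weight gate
    norm = np.sum(g * n)
    D = np.sum((g * n)[:, None] * d, axis=0) / norm  # Eq.(9)
    U = np.sum(g * n * u) / norm                     # Eq.(10)
    return D, U
```

When only one weight set is explicit, (D_j^a, U_j^a) reduces to that single set, which is exactly the situation the proposal aims at.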
Eqs.(12) and (13) mean that (d_jk^a, u_jk^a) is updated by the samples corresponding only to b_k. That is to say, when W_j^ab has just related F_2^a node j only to F_2^b node k, D_j^a and U_j^a are created by the samples corresponding only to b_k. When ARTa is used as a recognition system, F_2^a node j must be deleted if W_j^ab has more than one component which satisfies W_jk^ab ≥ ρ_ab. Furthermore, all the implicit weights should be deleted from the viewpoint of resource costs.

3.3 Match Tracking
After the original MT resets the activated F_2^a nodes by increasing ρ_a, the erroneous mapping from ARTa to ARTb is corrected. As a result, even if there are available F_2^a nodes besides the reset ones, a new node may be forcibly generated. However, there is the possibility that the erroneous mapping can be corrected by a restricted MT which resets the activated F_2^a nodes without increasing ρ_a. These facts illustrate that the original MT may needlessly generate F_2^a nodes. To solve this problem, we propose a modified MT using implicit weights. From now on, we call the original MT "MTup" and the restricted MT "MTfix".
The modified MT is the combination of MTfix and forcible node generation. The former is used to inhibit the increment of F_2^a nodes. The latter is used to correct erroneous mappings which cannot be solved by MTfix. MTfix is the same as MTup except that it uses ρ_a = ρ_a0 instead of Eq.(4). The forcible node generation is executed as follows. It is assumed that F_2^a node j and F_2^b node k are activated when the first erroneous mapping happens for the t-th input set (a, b). At this point, the implicit weight (d_jk^a, u_jk^a) is updated by Eqs.(12) and (13), and this update increases z_jk by 1. That is to say, z_jk counts the updates of the weight set (d_jk^a, u_jk^a) after it becomes an implicit weight. Next, the occurrence rate of erroneous mappings L(t) is calculated by

L(t) = r/P_R if t ≥ P_R, and L(t) = 1 otherwise,  (14)

where r is the number of input sets which give rise to erroneous mappings in the period [t − P_R + 1, t]. Moreover, the change of L(t) is evaluated every P_L input sets. If L(t) − L(t − P_L) > 0, the modified MT judges that there are erroneous mappings which cannot be solved by MTfix. In this case, the following condition is checked for F_2^a nodes with only one explicit weight:

Z_j / τ ≥ χ,  (15)

where Z_j is the maximal z_jk of the implicit weights in F_2^a node j, τ is the summation of z_jk over all the implicit weights in the F_2^a layer, and χ ∈ (0, 1] is the standard for forcible node generation. If F_2^a node J has the implicit weight (d_JK^a, u_JK^a) satisfying Eq.(15), then a new F_2^a node J' is generated as follows:

(d_J'k^a, u_J'k^a, n_J'k, W_J'k^ab, z_J'k) = (d_JK^a, u_JK^a, 1, 1, 0) if k = K, and (1, 2n_a, 0, 1, 0) otherwise.  (16)
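The check of Eq.(15) can be sketched as below. This is our illustrative reading (function name, array layout, and test values are ours): Z_j is taken over the implicit weights of each node with exactly one explicit weight, while τ sums z over all implicit weights in the layer.

```python
import numpy as np

def forcible_generation_check(z, g, chi):
    """Eq.(15) over the F_2^a layer. z[j, k]: update counts z_jk; g[j, k]:
    1 for explicit, 0 for implicit weight sets. Returns the indices of
    nodes that qualify for forcible generation."""
    implicit = (g == 0)
    tau = np.sum(z[implicit])            # total implicit updates in the layer
    picked = []
    for j in range(z.shape[0]):
        if np.sum(g[j]) != 1:            # only nodes with exactly one explicit weight
            continue
        Zj = np.max(z[j][implicit[j]]) if np.any(implicit[j]) else 0
        if tau > 0 and Zj / tau >= chi:
            picked.append(j)
    return picked
```

A node whose implicit weight has accumulated a large share of the layer's erroneous-mapping updates is judged responsible for mappings that MTfix cannot fix.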
However, if the same F_2^a node J has already satisfied Eq.(15) at the last check of Eq.(15), node J is modified instead of generating a new F_2^a node:

(d_Jk^a, u_Jk^a, n_Jk, W_Jk^ab, z_Jk) = (d_Jk^a, 2n_a, 1, 1, 0) if g_Jk = 1, and (1, 2n_a, 0, 1, 0) otherwise.  (17)

This is because such an F_2^a node J may provide the categories around it with bad influences. After completing these processes, the modified MT executes MTfix. Even if erroneous mappings happen again for the t-th input set, the modified MT executes only MTfix.
4 Simulation Results
Simulations have been carried out to demonstrate the effectiveness of our proposed method (PM). For the alphabet character recognition problem, PM is compared with FCSR [1] and AL-SLMAP [2]. The main difference between FCSR and AL-SLMAP is the weight update method for D_J^a, U_J^a, and W_J^ab. In the case of FCSR, D_J^a and U_J^a are updated by Eq.(6); however, (A, D_J^a, U_J^a, β_a) must be given to Eq.(6) instead of (B, D_K^b, U_K^b, β_b). Also, W_J^ab is updated by Eq.(7) with β_ab = 1.
Fig. 4. Alphabet characters
The original patterns of the alphabet characters are shown in Fig.4. Each pattern is illustrated by a (7×7)-pixel image. The pixel values are set to 0 for white pixels and 1 for black ones. In the learning period, ARTa receives noisy patterns (i.e., sample data) a ∈ [0, 1]^{7×7} and ARTb receives the corresponding recognition codes b ∈ [0, 1]^{26}. A noisy pattern a is constructed by inverting some pixels in a randomly selected original pattern. The number of inverted pixels depends on the Hamming distance (HD). In a recognition code b, one element is set to 1 and the others are set to 0. For instance, the code b corresponding to the character "A" is [1, 0, ..., 0]. The quantity of learning data N_L is 20000. In the test period, ARTa receives noisy patterns (i.e., test data) a. The quantity of test data N_T is 50000. We estimate each learning method by the learning time T_L (sec.), the number of generated F_2^a nodes m_a, and the recognition rate for test data R_T. The parameters of each learning method are as follows. In the case of FCSR, α_a = 0.1, β_a = 0.2, ρ_a0 = 0.5, α_b = 1, β_b = 1, ρ_b = 1, β_ab = 1, and ρ_ab = 1. In the case of AL-SLMAP, α_a = 0.1, β_a = 0.2, ρ_a0 = 0.5, α_b = 1, β_b = 1, ρ_b = 1, β_ab = 0.02, and ρ_ab = 0.75. They are the same as in Ref.[2]. In the
Fig. 5. Simulation results for the alphabet character recognition problem: (a) the learning time T_L, (b) the number of F_2^a nodes m_a, and (c) the recognition rate R_T for test data, each plotted against HD for PM, AL-SLMAP, and FCSR.
case of PM, α_a = 0.1, ρ_a0 = 0.5, α_b = 1, β_b = 1, ρ_b = 1, β_ab = 0.02, ρ_ab = 0.75, P_R = 1000, P_L = 800, and χ = 0.08.
Fig.5 illustrates T_L, m_a, and R_T. Figs.5(a) and 5(b) show that T_L of PM is the largest of the three methods and that each method has the same m_a when HD = 0. This result was predicted before executing the simulations, because PM needs the highest calculation cost per F_2^a node. However, even when HD becomes large, PM keeps m_a constant, whereas m_a in the other learning methods increases rapidly. As a result, PM can finish the learning much faster than the others when HD is large. Moreover, Figs.5(b) and 5(c) show that PM has better m_a and R_T. As HD becomes larger, this tendency becomes stronger. Therefore, we have concluded that PM can inhibit the category proliferation and keep high recognition performance in a highly noisy environment.
5 Conclusions
AL-SLMAP is one of the most useful learning methods for fuzzy ARTMAP (FAM) from the viewpoint of simplicity and performance. However, AL-SLMAP has problems in category selection, weight update, and match tracking. To solve these problems, we have proposed FAM with explicit and implicit weights. Simulation results have shown that our proposed method (PM) is better than FCSR and AL-SLMAP in terms of category proliferation and recognition performance when the sample data contain a large amount of noise. In the future, we will try to reduce the learning calculation cost of our FAM. Moreover, we have to compare PM with other learning methods and apply PM to more practical problems.
References
1. Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., Rosen, D.B.: Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. Neural Networks 3(5), 698–713 (1992)
2. Lee, J.S., Yoon, C.G., Lee, C.W.: Improvement of recognition performance for the fuzzy ARTMAP using average learning and slow learning. IEICE Trans. Fundamentals E81-A(3), 514–516 (1998)
3. Carpenter, G.A., Grossberg, S., Reynolds, J.H.: A fuzzy ARTMAP nonparametric probability estimator for nonstationary pattern recognition problems. IEEE Trans. Neural Networks 6(6), 1330–1336 (1995)
4. Kamio, T., Nomura, K., Mori, K., Fujisaka, H., Haeiwa, K.: Improvement of fuzzy ARTMAP by controlling match tracking. In: Proc. International Symposium on Nonlinear Theory and its Applications, pp. 791–794 (2006)
Neural Network Model of Forward Shift of CA1 Place Fields Towards Reward Location Adam Ponzi Laboratory for Dynamics of Emergent Intelligence, RIKEN BSI, Saitama, Japan
[email protected]
Abstract. In a recent experimental paper, I. Lee et al. [1] showed that the firing patterns of CA1 complex-spike neurons gradually shifted forward across trials toward prospective goal locations within a recording session over multiple trials. Here we propose a simple model of this result based on the phenomenon of awake reverse sequence replay [2], which occurs when the animal pauses at the reward location. The model is based on the CA3-CA1 anatomy, with modulation of CA3-CA1 synaptic plasticity by feedback from CA3-projecting CA1 interneurons. Sequence replays, which are generated in CA3 by removal of septal inhibition on CA1 interneurons, are recoded into the synaptic weights of individual CA1 cells. This produces spatially extended CA1 firing fields, whose response provides a value function on experienced paths towards goal locations. Simulations show that the CA1 firing fields show a positive shift in center of mass towards reward locations over many trials, with a negative shift in the first few trials, and development of positive skew.
1 Introduction
It is well known that the hippocampus is involved in remembering episodic events in particular environments [3, 4, 5, 1]. The spatial information of events is represented by place cells, which show strong firing correlations with particular locations. However, place coding in the hippocampus also interacts with nonspatial factors such as task demands and/or the reinforcement schedule, especially in complex memory tasks. It has been demonstrated that the place cells in the hippocampus are influenced by changes in the physical environmental context [6, 7, 8, 9, 5], but other studies have also shown that changes in the physical context are not necessary to alter the spatial firing pattern. One important factor is the reward or goal position, and place cells change their firing characteristics when the pattern of reinforcement is altered in the same environment [10, 11, 12, 13]. In addition, task demands modulate the place cell responses in the absence of physical changes in the environment [14, 15, 16]. For example, [16] have shown that as a rat traverses the stem of a modified T-maze continuously while alternating between different goal locations, CA1 place cells fire more strongly in association with a particular trial type (i.e., left-to-right or right-to-left trials) of the alternation task. A recent work [1] which studied firing patterns over multiple trials revealed that the spatial firing correlates of CA1 place cells gradually shifted forward across trials, via the stem, toward prospective goal locations within a recording session. The results
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 309–316, 2008. © Springer-Verlag Berlin Heidelberg 2008
310
A. Ponzi
suggest that the shift in a reference frame bound to physical objects [7] is not necessary to produce a systematic shift in firing locations of hippocampal neurons. Instead, the results imply that a goal-oriented, cognitive reference frame can significantly influence the place cell characteristics of the hippocampal neurons, especially when animals need to parse a given physical space into multiple cognitive maps according to the mnemonic task demands [1].
Here we describe a simple neural network model of this forward shift phenomenon. The model is an extension of a previous modeling [26, 27] of CA3-generated reverse sequence replay [2], to include a projection to CA1 which describes the forward shift phenomenon. As suggested in the discussion in [1], the forward shift is hypothesized to result from the reverse replay which has been observed to occur at the reward location [2]. In the present model, replays generated in CA3 at the reward location are recoded into the CA3-CA1 synaptic weights of a CA1 cell, which therefore represents the trajectory of the animal preceding the replay. Moreover, the replay strength of replayed place cells depends on their distance back along the animal's path from the reward location, so that the CA1 cell can represent a value function whose firing rate reflects the distance to the reward location. In order to ensure that the CA1 weights reflect the replay strength, we suggest that LTP/LTD in the CA3-CA1 connections is modulated by an interneuron, so that LTP occurs simultaneously with CA3 replay while LTD occurs at other times. This regulation is hypothesized to be controlled by the CA3-projecting CA1 interneurons [17, 18, 19], which are in turn regulated by septal GABAergic input. With the exception of the CA3 projection, these cross-field projecting cells are remarkably similar to the trilaminar neurons, particularly in regard to their dendritic morphology, laminar distribution of local (CA1) axons, and a projection through the fimbria.
According to the laminar specificity of their dendritic tree, they are likely to be driven primarily by the local collaterals of CA1 pyramidal cells in a feedback manner. They exert their inhibitory effects on the dendritic tree of pyramidal cells in the CA1 region and also, most notably, in area CA3. The inhibition mediated by backprojection cells, therefore, is in a direction opposite to the excitatory dentate gyrus-CA3-CA1 axis. In fact, it is suggested in [17] that cross-regional timing of action potentials by these interneurons may be important to secure population synchrony of principal cells in distributed networks and may allow a coordinated induction of synaptic plasticity [17, 18, 19].
2 Model The model system is depicted in Fig.1(a). The CA3 system consists of a set of N pyramidal cells. In general each pyramidal cell receives inhibition from the population of various types of interneuron [17], proximal excitation from the dentate gyrus [22], mid-dendritic excitation from other pyramidal cells via the recurrent collaterals [22], and excitation from the enthorhinal cortex. Here we omit all these connections except for one-to-one excitation which carries signals generated either from dentate-gyrus or entorhinal cortex. These signals carry external environmental information such as
Fig. 1. (a) Anatomy of model system described in text. The solid lines denote excitatory connections, the dashed lines are inhibitory and the dot-dashed lines are modifyable excitatory connections. (b) Circular track task, animal runs clockwise. The rectangles represent the topographic place specific inputs Ii = 0, 1 from the dentate gyrus or EC to CA3 pyramidal cells. They are overlapping in this example for generality. Replays are generated by activating the septal inhibition everytime the animal runs throught the location marked R for reward.
landmarks etc., or internally generated position-specific signals. The pyramidal cells are then described by activities p_i(t), simply given by

dp_i/dt = −k_1 p_i + k_2 g(w_i) f(H_X − H(t)) + k_3 I_i(x(t)) |cos(Θ x(t))|.  (1)

Here the activities p_i(t) may represent membrane potentials or firing rates; we do not model spiking explicitly. The I_i(x(t)) is a one-to-one topographic input which, as explained above, can be either place-specific input entering the CA3 pyramidal cell via the dentate gyrus, assumed to be formed from path integration of internal motion signals on the entorhinal grid [23], or sensory-type information from the entorhinal cortex on the perforant path. In this model, each pyramidal cell i is activated by a single place-specific input for simplicity. x(t) is the position of the animal, which fixes I_i. For completeness this signal is considered modulated by the theta rhythm through the factor |cos(Θ x(t))|, where Θ is a constant and |x| denotes the absolute value of x. The w_i(t) in Eq.1 are internal 'excitability' variables; they are simply given by

dw_i/dt = k_5 p_i − k_6 w_i.  (2)
These are the variables which generate the reverseness of the replay [2], as has been described in [26] and will be explained below. They are driven by the firing p_i and must decay slowly across the spatial environment; this decay is controlled by the parameter k_6. Both the p_i(t) and the w_i(t) are necessary because the cell has localized firing given by p_i, while the w_i(t) reflect internal memory across the whole environment. The function g(x) in Eq.1 is a sigmoidal function which is used to limit the activity of the k_2 term in Eq.1. It is only necessary to avoid fine tuning of parameters, and is set
to be approximately linear in the region of w_i(t) used in these simulations. Notice that although we suggest a simpler model with topographic inputs here, it would not be problematic to replace the I_i in Eq.1 with Σ_j I_ij for a set of inputs j to cell i.
The term f(H_X − H(t)) models the modulation by the interneuron H cells (see Fig.1(a)), whose firing rate H(t) will be described below. Here f(x) is the positivity function: f(x) = x for x > 0 and f(x) = 0 for x ≤ 0. This term means that when the H cell firing rate is above its baseline H_X the cell soma is inhibited, but when the H cell activity drops below its baseline H_X, the soma can become activated. The strength of this replay reactivation depends on the excitability w_i(t) at the time of the reactivation. Therefore the cell firing in Eq.1 can be driven by two different factors: the input I_i(x(t)) and a reactivation by removal of H interneuron inhibition. Notice that we do not include the CA3 recurrent collaterals. It is hypothesized that these are used to produce forward-replay theta sequences [20,21] encoded by time-asymmetric Hebbian learning; since we only address reverse replay here, we do not require them.
Unlike the CA3 pyramidal cells, the CA1 cells do not have strong recurrent collaterals, and we model them as a winner-take-all (WTA) system. They are given by

dq_i/dt = −q_i + f(Σ_j u_ij p_j − B(t) + q_i),  (3)

where q_i are the activities of the CA1 pyramidal cells and u_ij(t) are modifiable weights for the all-to-all projection from CA3 to CA1. In the CA1 system we also include inhibition from the CA1 basket cell, which produces a WTA in the absence of strong recurrent collaterals. The basket cell is simply modeled as B(t) = Σ_i q_i, i.e., activation by the pyramidal cell population. We describe the update of the projection weights u_ij(t) by

du_ij/dt = −u_ij(t) + f((H_X − H(t)) p_i q_j + u_ij(t)).  (4)
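The sign gating in Eq.(4) can be seen in a one-step Euler sketch. This is our illustration (function name, step size, and test values are ours): when H(t) drops below its baseline H_X the Hebbian term is positive (LTP during replay); when H(t) is above baseline the term vanishes and the decay dominates (LTD outside replay).

```python
import numpy as np

def du_ij(u, p_i, q_j, H, H_X=1.0, dt=0.01):
    """One Euler step of the CA3-CA1 weight update, Eq.(4)."""
    f = lambda x: np.maximum(x, 0.0)   # positivity function of the paper
    return u + dt * (-u + f((H_X - H) * p_i * q_j + u))
```

With active pre- and postsynaptic cells, a weight of 0.5 grows when H = 0.2 (replay) and shrinks when H = 1.5 (outside replay).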
In Eq.4 we suggest that the u_ij(t) weights are modified by the postsynaptic q_j and presynaptic p_i firing rates, and that whether LTP or LTD occurs is controlled by the H(t) cell firing. This can occur through modification of backpropagating EPSPs in the dendritic tree of the CA1 pyramidal cells. The reason for this is to ensure that LTP occurs at the same time as replay activity in CA3, while LTD occurs outside replay activity. It is this LTP modulation that allows the CA1 cells to inherit the CA3 replayed activity and produces their broad firing fields, which develop over trials. Furthermore, the LTD outside the replay activity produces a stabilization of the firing over many trials and a normalization over the environment. Such a modulation of learning has been included in [25, 27, 26].
CA3-projecting CA1 interneuron H cells are known within the hippocampal region [17]. Their cell bodies and axons rest primarily in the stratum oriens, while their dendrites may extend across the strata to stratum radiatum and stratum lacunosum-moleculare and to CA3. In addition, the H cell population receives excitatory input from the active pyramidal cells and is regulated by an inhibitory GABAergic septal
input [24, 22]. The medial septum diagonal band of Broca (MSDB) projects to the H cells and related O-LM cells in stratum radiatum and oriens of CA1 and CA3 [22]. The H cells are simply modeled by a single unit,

dH/dt = −H(t) + H_X + k_6 Σ_i q_i(t) − k_S S(t).  (5)
They are activated by a projection from the CA1 pyramidal cells q_i, and their firing is inhibited by activity from the septum S(t). When the pyramidal cell input and the septal input are zero, the H cell activity decays exponentially to its baseline level H_X. The septal activity generates the replays as explained in [26]. The septal signal itself is considered to be possibly generated by reward signals from the hypothalamus or from the thalamus.
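To see how Eqs.(1), (2) and (5) produce reverse replay, the following self-contained Euler-integration sketch can help. All constants, the three-field track, the timing of the septal burst, and taking g as the identity are our illustrative assumptions, not parameter values from the paper; theta modulation and the CA1-to-H drive are omitted for brevity.

```python
import numpy as np

f = lambda x: np.maximum(x, 0.0)   # positivity function

dt = 0.01
k1, k2, k5, k6, kS, HX = 1.0, 1.0, 1.0, 0.2, 1.0, 1.0
p = np.zeros(3)   # CA3 pyramidal activities, Eq.(1); g taken as linear
w = np.zeros(3)   # excitability variables, Eq.(2)
H = HX            # H interneuron, Eq.(5); CA1 feedback omitted here

for step in range(1000):                # 10 s of model time
    t = step * dt
    I = np.zeros(3)
    if t < 9.0:
        I[int(t // 3.0)] = 1.0          # places 0, 1, 2 visited in order
    S = 5.0 if t >= 9.0 else 0.0        # septal burst at the reward site
    p += dt * (-k1 * p + k2 * w * f(HX - H) + I)
    w += dt * (k5 * p - k6 * w)
    H += dt * (-H + HX - kS * S)
```

When the septum silences the H cell, f(H_X − H) becomes positive and each cell is reactivated in proportion to its excitability w_i, which decays since the cell's place field was visited; the most recently visited cell therefore replays strongest, giving the reverse ordering.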
3 Results The model is best understood by studying examples of its time series. In this paper we only consider the circular environment depicted in Fig.1(b).
Fig. 2. (a) Activity of a CA1 pyramidal cell on early trials. (b) Activity of the same CA1 pyramidal cell as (a) on a successive pair of trials late in learning.
The activation of a CA1 cell around the track for a successive pair of trials early in learning is shown in Fig.2(a), for repeated traversals around the track shown in Fig.1(b). The firing rate is quite flat throughout a whole trial. The periodic modulations are due to the periodic theta driving of the inputs I_i(t) in Eq.1, while the periodic peaks are due to the contributions of different CA3 cells, each activated topographically as depicted in Fig.1(b). The large activation at the end of each trial is caused by the replay reactivation event in CA3. The same cell for a pair of later trials is shown in Fig.2(b). Here the firing rate clearly increases across the trial, and this produces the forward shift of the center of mass towards the reward location. Furthermore, the increase in skew in the direction of motion towards the reward is clear.
Fig. 3. Time series of activity of CA3 cells at a replay event. Each CA3 cell is shown as a solid line with a different symbol. Also shown is the H cell time series as a dashed line.
How this occurs is very easy to understand. As explained elsewhere [26] and shown in Fig.3, the replay reactivates the most recently experienced place cells with the strongest magnitude. This creates a kind of reverseness. According to the model Eqs.(1),(5), the replay occurs when the H interneuron is inhibited by the activation of the septum S(t). The septum is suggested to be activated at reward locations or at locations where the animal pauses. This H cell inhibition below its baseline H_X = 1 is also shown in Fig.3. This is also when LTP occurs at the CA3-CA1 synapses, according to Eq.(4). Therefore the CA3 replay is recoded into a CA1 cell, whose synaptic weights u_ij now represent the replay strength. This CA1 cell can be said to provide a value function along the trajectory leading to the reward location, since its firing rate depends on the distance to the reward location.
How the center of mass (COM) of this CA1 cell varies across trials is shown in Fig.4(a). The center of mass for each trial is calculated according to

COM = (1/Q) Σ_{i=1}^{N} q(t_i) x(t_i),  (6)
Fig. 4. (a) Variation of center of mass of CA1 cells versus trial number. (b) Variation of total firing rate of CA1 cell versus trial number.
where

Q = Σ_{i=1}^{N} q(t_i).  (7)
In these equations, t_i is the time of the i-th iteration of the Runge-Kutta integrator, and there are N iterations in each trial. Therefore Q is the total firing rate for each trial, while COM is the total firing rate weighted by the position x(t_i) of the animal on the track at time t_i in each trial. As shown in Fig.4(a), the COM initially decreases rapidly and then slowly increases. We also show the change in total firing rate Q for each trial in Fig.4(b), which increases across trials and then slowly stabilizes later. In fact, on the first replay event all the synaptic weights of the CA1 cells have values given by previous experience, which has no relevance to the track in the simulation. The first replay event causes positive learning at these synapses, and they are then slowly adjusted over trials to reflect the replayed strengths measuring the distance to the reward location in the current context and specific environment. This is what causes the drop in the COM over the first few trials and the increase afterwards. Indeed, this is as observed in [1] when tasks are switched. Fig.4(a) also reveals a small 6-trial backwards-and-forwards periodic variation in the center of mass. This is a quasiperiodicity induced by the irrational ratio between the track length, which determines the reward replay location, and the imposed theta oscillation period in Eq.1. However, it may be relevant to real animal behaviour by counting laps around the track.
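Eqs.(6)-(7) amount to a firing-rate-weighted average of position over the trial, which can be sketched directly (the function name and test values are ours):

```python
import numpy as np

def com_and_q(q, x):
    """Per-trial summary of a CA1 cell, Eqs.(6)-(7): q[i] is the firing
    rate at integrator step i, x[i] the animal's track position there."""
    Q = np.sum(q)              # Eq.(7): total firing in the trial
    COM = np.sum(q * x) / Q    # Eq.(6): firing-weighted mean position
    return COM, Q
```

A forward shift of the firing field toward the reward shows up as an increase of COM across trials, while Q tracks the overall growth of the field.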
4 Discussion We have presented a simple model of the forward shift in COM of CA1 place fields based on sequence replay at the reward location. The model relies on a coordination between replays and LTP at the CA3-CA1 synapses which is proposed to be provided by the CA3 projecting CA1 interneurons [17, 18] regulated by the septum. The model reproduces the gradual positive shift in COM. In future work we will address how the model behaves under task switching, for example from the circular track to the alternation task or T-Maze. A more detailed explanation of the reverse replay modeling is given in [26].
References
1. Lee, I., Griffin, A.L., Zilli, E.A., Eichenbaum, H., Hasselmo, M.E.: Gradual translocation of spatial correlates of neuronal firing in the hippocampus toward prospective reward locations. Neuron 51, 639–650 (2006)
2. Foster, D.J., Wilson, M.A.: Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature 440(7084), 615–617 (2006)
3. Eichenbaum, H.: A cortical-hippocampal system for declarative memory. Nat. Rev. Neurosci. 1, 41–50 (2000)
4. McClelland, J.L.: Complementary learning systems in the brain. A connectionist approach to explicit and implicit cognition and memory. Ann. N Y Acad. Sci. 843, 153–169 (1998)
5. O'Keefe, J., Nadel, L.: The hippocampus as a cognitive map. Oxford University Press, Oxford (1978)
6. Anderson, M.I., Jeffery, K.J.: Heterogeneous modulation of place cell firing by changes in context. J. Neurosci. 23, 8827–8835 (2003)
7. Gothard, K.M., Skaggs, W.E., Moore, K.M., McNaughton, B.L.: Binding of hippocampal CA1 neural activity to multiple reference frames in a landmark-based navigation task. J. Neurosci. 16, 823–835 (1996)
8. Lee, I., Yoganarasimha, D., Rao, G., Knierim, J.J.: Comparison of population coherence of place cells in hippocampal subfields CA1 and CA3. Nature 430, 456–459 (2004)
9. Leutgeb, S., Leutgeb, J.K., Treves, A., Moser, M.B., Moser, E.I.: Distinct ensemble codes in hippocampal areas CA3 and CA1. Science 305, 1295–1298 (2004)
10. Breese, C.R., Hampson, R.E., Deadwyler, S.A.: Hippocampal place cells: stereotypy and plasticity. J. Neurosci. 9, 1097–1111 (1989)
11. Fyhn, M., Molden, S., Hollup, S., Moser, M.B., Moser, E.: Hippocampal neurons responding to first-time dislocation of a target object. Neuron 35, 555–566 (2002)
12. Kobayashi, T., Nishijo, H., Fukuda, M., Bures, J., Ono, T.: J. Neurophysiol. 78, 597–613 (1997)
13. Markus, E.J., Qin, Y.L., Leonard, B., Skaggs, W.E., McNaughton, B.L., Barnes, C.A.: Interactions between location and task affect the spatial and directional firing of hippocampal neurons. J. Neurosci. 15, 7079–7094 (1995)
14. Bower, M.R., Euston, D.R., McNaughton, B.L.: Sequential-context-dependent hippocampal activity is not necessary to learn sequences with repeated elements. J. Neurosci. 25, 1313–1323 (2005)
15. Ferbinteanu, J., Shapiro, M.L.: Prospective and retrospective memory coding in the hippocampus. Neuron 40, 1227–1239 (2003)
16. Wood, E.R., Dudchenko, P.A., Robitsek, R.J., Eichenbaum, H.: Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron 27, 623–633 (2000)
17. Freund, T., Buzsaki, G.: Interneurons of the hippocampus. Hippocampus 6, 347–470 (1996)
18. Sik, A., Ylinen, A., Penttonen, M., Buzsaki, G.: Inhibitory CA1-CA3-hilar region feedback in the hippocampus. Science 265, 1722–1724 (1994)
19. Somogyi, P., Klausberger, T.: Defined types of cortical interneurone structure space and time in the hippocampus. J. Physiol. 562.1, 9–26 (2005)
20. Foster, D.J., Wilson, M.A.: Hippocampal theta sequences. Hippocampus (2007), doi:10.1002/hipo.20345
21. Lisman, J.E.: Relating hippocampal circuitry to function: recall of memory sequences by reciprocal dentate-CA3 interactions. Neuron 22, 233–242 (1999)
22. Johnston, D., Amaral, D.G.: In: Shepherd, G.M. (ed.) The Synaptic Organization of the Brain. Oxford University Press, Oxford (1998)
23. Hafting, T., Fyhn, M., Molden, S., Moser, M.-B., Moser, E.I.: Microstructure of a spatial map in the entorhinal cortex. Nature 436, 801 (2005)
24. Freund, T., Antal, M.: GABA-containing neurons in the septum control inhibitory interneurons in the hippocampus. Nature 336, 170–173 (1988)
25. Ponzi, A.: Dynamical system model of spatial reward learning. IEICE Technical Report 103, 163, 19–24 (2006)
26. Ponzi, A.: Model of balance of excitation and inhibition in hippocampal sharp wave replays and application to spatial remapping. In: Proceedings of IJCNN 2007 (2007)
27. Ponzi, A.: Simple model of hippocampal splitter cells. Japan Neural Network Society (JNNS) abstract (2006)
A New Constructive Algorithm for Designing and Training Artificial Neural Networks

Md. Abdus Sattar1, Md. Monirul Islam1,2,*, and Kazuyuki Murase2,3

1 Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
2 Department of Human and Artificial Intelligence Systems, Graduate School of Engineering, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan
[email protected]
3 Research and Education Program for Life Science, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan
Abstract. This paper presents a new constructive algorithm, called the problem dependent constructive algorithm (PDCA), for designing and training artificial neural networks (ANNs). Unlike most previous studies, PDCA puts emphasis on architectural adaptation as well as function level adaptation. The architectural adaptation is done by automatically determining the number of hidden layers in an ANN and the number of neurons in each hidden layer. The function level adaptation is done by training each hidden neuron with a different training set. PDCA uses a constructive approach to achieve both architectural and function level adaptation. It has been tested on a number of benchmark classification problems in machine learning and ANNs. The experimental results show that PDCA can produce ANNs with good generalization ability in comparison with other algorithms.

Keywords: Artificial neural networks (ANNs), architectural adaptation, function level adaptation, constructive approach, generalization ability.
1 Introduction
Artificial neural networks (ANNs) have been widely used in many application areas. Many issues, such as the selection of training data, training algorithm and architecture, have to be addressed and resolved when using ANNs [8]. Among them, the proper selection of an ANN architecture is of great interest because the performance of an ANN depends greatly on its architecture. There have been many attempts at designing and training ANNs, such as various constructive, pruning and evolutionary approaches (see the review papers [4], [12] and [20]). The main problem of most existing approaches is that they can design either single hidden layered ANNs or multiple hidden layered ANNs with one neuron in each hidden layer [1], [5]-[7]. It is, however, quite difficult to decide
* Corresponding author.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 317–327, 2008. © Springer-Verlag Berlin Heidelberg 2008
in advance whether a problem can be solved efficiently by using single hidden layered or multiple hidden layered ANNs. It is therefore necessary to devise an algorithm that is able to design both single and multiple hidden layered ANNs depending on problem complexity. This paper proposes a new constructive algorithm, called the problem dependent constructive algorithm (PDCA), for designing and training feedforward ANNs. PDCA determines automatically not only the number of hidden layers in an ANN, but also the number of neurons in each hidden layer. It uses a constructive approach with a layer stopping criterion for determining them. PDCA's emphasis on training different hidden neurons with different training sets can increase the efficiency of determining an ANN's architecture automatically. PDCA differs from previous work on designing and training ANNs in a number of aspects. First, it is an algorithm that can design both single and multiple hidden layered ANNs depending on the complexity of a given problem. This approach is quite different from most existing algorithms (e.g., [1] and [17]), which try to solve problems by using either single or multiple hidden layers but not both. Although single hidden layered ANNs are universal approximators [3], multiple hidden layered ANNs are superior to single hidden layered ANNs for some problems [18]. Second, all existing algorithms train the hidden neurons of an ANN with the same training set. In contrast, PDCA creates a new training set, based on the performance of the existing ANN architecture, whenever a new neuron is added. Although this approach is used by the boosting algorithm [15] for designing ANN ensembles, to the best of our knowledge this is the first attempt to use the concept in designing single ANNs. Third, most existing algorithms (e.g., [1], [5]-[7] and [9]) do not have any effective mechanism for stopping the addition of neurons in hidden layers.
Consequently, they either use only one neuron in each hidden layer, resulting in very deep architectures with long propagation delays, which are also not suitable for VLSI implementation, or they use a predefined and fixed number of neurons for all hidden layers [10]. The problem of using a fixed number of neurons lies in the difficulty of selecting an appropriate number for a given problem. To address these problems, PDCA uses a layer stopping criterion that determines automatically the number of neurons in each hidden layer. The rest of the paper is organized as follows. Section 2 describes PDCA in detail. Section 3 presents the results of our experimental study. Finally, Section 4 concludes the paper with a brief summary and a few remarks.
2 PDCA
In order to determine automatically the number of hidden layers in an ANN and the number of neurons in each hidden layer, PDCA uses incremental training in association with a layer stopping criterion. In our incremental training, hidden layers and hidden neurons are added to the ANN architecture one by one in a constructive fashion during training. The layer stopping criterion is used to decide when to add a new hidden layer by stopping the growth, i.e., the addition of neurons, of a hidden layer. To obtain an efficient solution, PDCA trains each hidden
neuron in an ANN with a different training set and stops the ANN construction process automatically. Although any kind of ANN and activation function can be used with PDCA, in this work we used PDCA to design feedforward ANNs with the sigmoid activation function. The feedforward ANNs considered here are generalized multilayer perceptrons. In such an architecture, the first hidden layer receives only the network inputs (I), while the other hidden layer(s) receive I plus the outputs of the preceding hidden layer(s). The output layer receives signals from all hidden layers, and only from them. The major steps of PDCA are summarized in Fig. 1 and explained as follows.

Step 1. Create an initial ANN architecture consisting of three layers, i.e., an input layer, a hidden layer and an output layer. The number of neurons in the input and output layers is the same as the number of inputs and outputs
[Fig. 1. Flowchart of PDCA]
of a given problem, respectively. Initially, the hidden layer contains only one neuron. Randomly initialize the connection weights of the ANN within a certain range and label the hidden layer with I.

Step 2. Create a new training set for the newly added hidden neuron based on the performance of the existing ANN architecture. PDCA uses the AdaBoost.M2 algorithm [16], a variant of the boosting algorithm [15], for creating training sets. It is important to note here that the original training set is used for training the initial architecture.

Step 3. Partially train the ANN with the backpropagation learning algorithm for a certain number of training epochs. This training phase is known as the initial training of the existing ANN architecture. The number of epochs, τ, is specified by the user. Partial training means that an ANN is trained for a fixed number of epochs regardless of whether it has converged or not.

Step 4. Check the termination criterion for stopping the ANN construction process. If the criterion is satisfied, go to Step 12. Otherwise continue.

Step 5. Compute the ANN error, E, on the training set. If E has decreased by at least a threshold amount after the τ training epochs, go to Step 3 for further training of the existing architecture; it is assumed here that the training process is progressing well and the existing architecture should be trained further. Otherwise continue with the final training of the existing architecture.

Step 6. Add a small amount of noise to the input and output connection weights of the previously added neuron in the I-labeled hidden layer. Partially train the ANN with the backpropagation learning algorithm for τ epochs. This training phase is known as the final training of the existing ANN architecture.

Step 7. Check the termination criterion for stopping the ANN construction process. If the criterion is satisfied, go to Step 12. Otherwise continue.

Step 8. Compute E on the training set. If E has decreased by at least a threshold amount after the τ training epochs, go to Step 6 for further training of the existing architecture; it is assumed here that the final training phase is progressing well and the existing ANN should be trained further. Otherwise continue by modifying the existing architecture through the addition of hidden neurons or layers.

Step 9. Check the criterion for stopping the growth of the I-labeled hidden layer. If the criterion is satisfied, stop the construction of the I-labeled hidden layer by freezing the input and output connectivities of the previously added neuron in that layer and continue. Otherwise go to Step 11 to add a neuron to the hidden layer. Freezing, first introduced in [1], means that the frozen connection weights will not be trained, i.e., changed, when the ANN is trained in future.

Step 10. Replace the label of the I-labeled hidden layer by the label F. Add a new hidden layer above the existing hidden layer(s) of the ANN. Initially the new hidden layer contains one neuron and is labeled with I. The connection weights of the neuron are initialized in the same way as described in Step 1; go to Step 2.

Step 11. Add one neuron to the I-labeled hidden layer and freeze the input and output connectivities of the previously added neuron in this layer. The
connection weights of the newly added neuron are initialized in the same way as described in Step 1; go to Step 2.

Step 12. The existing ANN architecture is the final architecture for the given problem.

As PDCA trains only one neuron at a time, i.e., the newly added one, other nonlinear optimization methods, such as BFGS or other quasi-Newton methods [17], which are computationally more expensive but converge faster, can easily be used in PDCA for training ANNs. Although the design of ANNs could be formulated as a multi-objective optimization problem, PDCA uses a very simple cost function, the ANN error. The processes and criteria incorporated in PDCA at different stages are described briefly in the following subsections.
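The twelve steps above can be condensed into a control-flow sketch mirroring Fig. 1. All routine names here are illustrative placeholders, not identifiers from the paper:

```
ann       := initial network with one I-labelled hidden neuron       # Step 1
train_set := original training set                                   # Step 2
loop:
    repeat                                                           # initial partial training
        partial_train(ann, train_set, tau)                           # Step 3
        if termination_criterion(ann): return ann                    # Steps 4, 12
    until E no longer drops by the threshold                         # Step 5
    repeat                                                           # final partial training
        perturb_last_neuron(ann)                                     # Step 6
        partial_train(ann, train_set, tau)
        if termination_criterion(ann): return ann                    # Steps 7, 12
    until E no longer drops by the threshold                         # Step 8
    if layer_stopping_criterion(ann):                                # Step 9
        freeze_layer(ann); relabel I -> F; add new I-labelled layer  # Step 10
    else:
        freeze_last_neuron(ann); add neuron to I-labelled layer      # Step 11
    train_set := boosted_training_set(ann)                           # Step 2
```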
2.1 Termination Criterion
PDCA uses a criterion based on both training and validation errors to decide when the training process of an ANN is to be stopped. To describe the criterion formally, let Eva(τ) and Eopt(τ) be the validation error at training epoch τ and the lowest validation error obtained in epochs up to τ, respectively. The generalization loss, GL, at epoch τ can be defined by the following equation [11]:

GL(τ) = Eva(τ) / Eopt(τ) − 1    (1)

A high generalization loss is one obvious reason to stop training, because it directly indicates overfitting. However, it is desirable not to stop the training process if the training error Etr is still decreasing very rapidly. To formalize this notion, let a training strip of length k be a sequence of k epochs numbered n+1, ..., n+k, where n is divisible by k. The training progress in a training strip, Pk, measures how much larger the average training error of the strip is than the minimum training error during the strip. It can be defined by the following equation [11]:

Pk(τ) = ( Σ_{τ′=τ−k+1}^{τ} Etr(τ′) ) / ( k · min_{τ′=τ−k+1}^{τ} Etr(τ′) ) − 1    (2)

PDCA terminates the training process when GL(τ)/Pk(τ) > α, where α is a user-specified positive number. The reason for using both training and validation data in the termination condition is to better anticipate the behavior on test data.
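As a concrete illustration, Eqs. (1) and (2) and the stopping test translate into a few lines of Python. The function and variable names are ours, and the handling of the degenerate case Pk(τ) = 0 is not specified in the paper:

```python
def generalization_loss(e_va, e_opt):
    """GL(tau) = E_va(tau) / E_opt(tau) - 1, Eq. (1)."""
    return e_va / e_opt - 1.0

def training_progress(strip_errors):
    """P_k(tau), Eq. (2): the sum of training errors over the strip divided
    by (k times the minimum training error in the strip), minus 1."""
    k = len(strip_errors)
    return sum(strip_errors) / (k * min(strip_errors)) - 1.0

def should_stop(e_va, e_opt, strip_errors, alpha):
    """PDCA stops training when GL(tau) / P_k(tau) > alpha."""
    return generalization_loss(e_va, e_opt) / training_progress(strip_errors) > alpha
```

Note that a large Pk (training error still falling fast) suppresses stopping even when GL is already positive, which is exactly the intent of the combined criterion.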
2.2 Layer Stopping Criterion
PDCA uses a simple criterion for deciding when to stop the growth of an I-labeled hidden layer. The criterion is based on the contribution of the neurons in a hidden layer. The contribution, Ck, of a neuron k at any training epoch is

Ck = 100 ( 1/E − 1/Ek )    (3)
where E is the network error and Ek is the network error excluding neuron k. The layer stopping criterion stops the growth of an I-labeled hidden layer when its contribution to the ANN, measured after the addition of each hidden neuron, fails to improve after the addition of a certain number of neurons, given by the parameter mh below. In other words, the growth of a hidden layer stops when the following holds:

Ck(m + mh) ≤ Ck(m),  m = 1, 2, . . .    (4)
where mh (mh > 0) is a user-specified positive integer. If mh = 0, then every hidden layer of an ANN can consist of only one hidden neuron, as in CCA [1]. In PDCA, each hidden layer can consist of several neurons because mh greater than zero is used. It is worth mentioning here that no neurons are added to a hidden layer after its growth process has been stopped. Furthermore, the neurons in a hidden layer whose growth has been stopped are not trained any more.
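Equations (3) and (4) translate directly into code. The sketch below (our own naming) keeps a history of the contribution measured after each neuron addition and checks the mh-step improvement condition:

```python
def neuron_contribution(e, e_k):
    """C_k = 100 * (1/E - 1/E_k), Eq. (3): E is the network error and
    E_k the network error with neuron k excluded."""
    return 100.0 * (1.0 / e - 1.0 / e_k)

def stop_layer_growth(contributions, m_h):
    """Eq. (4): stop growing the I-labelled layer once the contribution,
    measured after each neuron addition, has not improved over the
    latest m_h additions (m_h > 0)."""
    if len(contributions) <= m_h:
        return False            # not enough additions observed yet
    return contributions[-1] <= contributions[-1 - m_h]
```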
2.3 Creation of New Training Sets
The AdaBoost.M2 algorithm [16], which was proposed for training ANN ensembles, is used in PDCA to create different training sets for different hidden neurons of an ANN. It maintains a probability distribution D over the original training set T. Initially, D = 1/M for every example, where M is the number of examples in T. The algorithm trains the first ANN of the ensemble using the original training set T. After training the first ANN, D is updated so that the probability of incorrectly classified examples is increased and that of correctly classified examples is decreased. A new training set T′ is created, based on the updated D, by sampling M examples at random with replacement from T. The second ANN of the ensemble is then trained with T′. This process is repeated for the other ANNs in the ensemble. The strategy used in AdaBoost for training the ANNs of an ensemble can easily be incorporated in PDCA for training the hidden neurons of an ANN. This is because PDCA trains the hidden neurons of an ANN one by one, which is similar to training the ANNs of an ensemble one after another in AdaBoost. In addition, PDCA trains only one neuron at a time, i.e., a newly added hidden neuron, by freezing the input and output connectivities of the previously added neuron. This, too, is similar to training one ANN of the ensemble with the AdaBoost algorithm. The use of different training sets at different stages of the training process of an ANN facilitates functional adaptation in PDCA.
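The reweight-and-resample idea can be sketched as follows. This is a simplified stand-in, not the exact AdaBoost.M2 weight update; the factor `beta` and all names are illustrative:

```python
import random

def update_distribution(d, correct, beta=0.5):
    """Down-weight correctly classified examples by `beta` (an illustrative
    factor, not the exact AdaBoost.M2 update), leave misclassified ones
    untouched, then renormalise so the weights form a distribution."""
    d = [w * (beta if ok else 1.0) for w, ok in zip(d, correct)]
    total = sum(d)
    return [w / total for w in d]

def resample(examples, d, m, rng=random):
    """Create a new training set T' of size m by sampling with replacement
    from T according to the distribution D."""
    return rng.choices(examples, weights=d, k=m)
```

After the update, hard examples occupy a larger share of the new training set, so the next neuron concentrates on what the current network gets wrong.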
3 Experimental Studies
This section evaluates the performance of PDCA on several benchmark classification problems. Table 1 summarizes the characteristics of the problems, which display considerable diversity in the number of examples, attributes and classes. Detailed descriptions of all these problems can be obtained from [13], except for iris and letter, which can be obtained from the UCI Machine Learning Repository.
Table 1. Characteristics of experimental data sets

Data set   Input attributes  Output classes  Training examples  Validation examples  Testing examples
Cancer            9                 2               350                 175                174
Card             14                 2               345                 173                172
Diabetes          8                 2               384                 192                192
Glass             9                 6               107                  54                 53
Gene            120                 3              1588                 794                793
Iris              9                 2               107                  54                 53
Letter           16                26             10000                5000               5000
Thyroid          21                 3              3600                1800               1800
A. Experimental Setup

Table 1 shows the partitioning and the number of examples in each partition for the data sets of the different problems. The partitions are named the training set, the validation set, and the testing set. In all data sets, the first M examples were used for the training set, the following N examples for the validation set, and the final P examples for the testing set. It should be kept in mind that such partitions do not necessarily represent the optimal ones in practice. While the training set and the testing set are used, respectively, to train ANNs and to evaluate their generalization ability, the union of the training and validation sets is used to determine whether to add hidden layers or neurons, or to stop training altogether. In all experiments, one bias neuron with a fixed input of +1 is connected to all hidden layers and to the output layer. The logistic sigmoid function is used for the neurons in the hidden and output layers.
3.1 Experimental Results
Table 2 shows the results of PDCA over 50 independent runs on several classification problems. The generalization performance of the produced ANNs is measured by the testing error rate (TER), which refers to the percentage of wrong classifications produced by the trained ANNs. The number of hidden layers (HL) and hidden neurons (HN) refer, respectively, to the total number of hidden layers and hidden neurons in the produced ANNs. SD refers to the standard deviation of the results produced by PDCA over the 50 independent runs. It is quite natural that complex architectures will be needed for solving difficult problems and simple architectures for solving easy ones. PDCA shows this characteristic, as can be seen from Table 2. For example, ANNs produced by PDCA had on average 2.84 hidden layers for the glass problem, but only 1.06 hidden layers for the cancer problem. It is well known that glass and cancer are among the most difficult and the easiest problems, respectively, in the realm of machine learning [13]. This is also apparent from the error rates achieved by PDCA: 0.323 for the glass problem and 0.009 for the cancer problem.
Table 2. Performance of PDCA based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for eight classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean/SD)   HN (Mean/SD)   Epochs (Mean/SD)   TER (Mean/SD)
Cancer      1.06 / 0.7     3.36 / 1.2      289.34 / 20      0.009 / 0.003
Card        1.54 / 1.1     2.14 / 1.3      195.82 / 18      0.116 / 0.021
Diabetes    2.04 / 1.2     4.20 / 1.5      220.76 / 27      0.183 / 0.032
Glass       2.84 / 1.5     4.54 / 1.6      311.50 / 30      0.323 / 0.035
Gene        1.00 / 1.1     1.28 / 0.9      135.90 / 17      0.092 / 0.021
Iris        1.46 / 0.8     2.46 / 1.0      125.30 / 11      0.035 / 0.007
Letter      2.76 / 1.3    24.32 / 3.8     8037.84 / 280     0.128 / 0.010
Thyroid     2.04 / 0.9     4.28 / 1.7      260.34 / 16      0.011 / 0.001
Table 3. Performance of PDCAWL based on number of hidden neurons (HN), number of epochs and testing error rate (TER) for four classification problems. The results were averaged over 50 independent runs.

Problem    HN (Mean/SD)   Epochs (Mean/SD)   TER (Mean/SD)
Diabetes    5.64 / 1.3      261.32 / 25      0.214 / 0.035
Glass       6.58 / 1.6      370.90 / 30      0.364 / 0.029
Letter     30.64 / 4.2     8590.18 / 291     0.167 / 0.021
Thyroid     6.86 / 1.4      301.42 / 16      0.015 / 0.001
Table 4. Performance of PDCAST based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for five classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean/SD)   HN (Mean/SD)   Epochs (Mean/SD)   TER (Mean/SD)
Cancer      1.20 / 0.9     4.13 / 1.2      315.77 / 23      0.013 / 0.004
Glass       2.78 / 1.5     4.17 / 1.4      301.33 / 23      0.298 / 0.030
Gene        1.61 / 1.3     3.13 / 1.6      290.17 / 19      0.128 / 0.023
Iris        1.45 / 0.7     2.53 / 1.0      130.84 / 9       0.038 / 0.007
Letter      2.82 / 1.8    36.93 / 4.1     9220.02 / 311     0.176 / 0.035
In order to observe the effect of the layer stopping criterion, a new set of experiments was carried out in which PDCA without the layer stopping criterion (PDCAWL) was used; the results are shown in Table 3. Since PDCAWL does not use the layer stopping criterion, it is only able to design single hidden layered ANNs. Comparing Table 3 with Table 2 indicates the positive effect of using the layer stopping criterion in PDCA. To observe the effect of using different training sets in PDCA, another set of experiments was carried out in which all newly added neurons were trained
with the same training set. We call this version of PDCA PDCAST. The results of this experiment are shown in Table 4. The effect of using different training sets can easily be understood when the results of PDCAST (Table 4) are compared with those of PDCA (Table 2). It is observed that PDCA produces more compact ANN architectures and requires fewer epochs in comparison with its counterpart PDCAST. A t-test based on the number of hidden neurons, the number of epochs and the testing error rate indicated that PDCA was significantly better than its counterparts PDCAWL and PDCAST at the 95% confidence level.
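For reference, a two-sample t statistic of the kind used in such comparisons can be computed as below. This is Welch's unequal-variance form, written in plain Python; the paper does not specify which t-test variant was actually used:

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples, e.g. the TERs of
    50 runs of one algorithm versus 50 runs of another."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)
```

The statistic is then compared against the critical value of the t distribution at the chosen confidence level.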
4 Comparison
This section compares the performance of PDCA with CCA [1] and aCasper [19]. The reason for selecting these algorithms is that, like PDCA, they are able to design multiple hidden layered ANNs. CCA and aCasper use the quickprop [2] and RPROP [14] algorithms, respectively, for training ANNs. It is known that the quickprop and RPROP algorithms are faster than the backpropagation algorithm used in PDCA. In this section, we therefore compare PDCA with CCA and aCasper in terms of average TER and HN over 50 independent runs. Table 5 compares PDCA's results with those produced by CCA and aCasper on six classification problems. It can be seen from the table that PDCA achieved the smallest TER for five out of six problems and the second smallest, next to aCasper [19], for the remaining one. The improved performance of PDCA can be attributed to a couple of factors: determining automatically the number of hidden layers in an ANN and the number of neurons in each hidden layer, and using different data for training different hidden neurons.
Table 5. Comparison among PDCA, CCA [1], and aCasper [19] in terms of average testing error rate (TER) and number of hidden neurons (HN)

Algorithm         Cancer   Card    Diabetes  Glass   Gene    Thyroid
PDCA       HN      3.36    2.14     4.20     4.54    1.28     4.28
           TER     0.009   0.116    0.183    0.323   0.092    0.011
CCA        HN      5.18    1.07     9.78     8.07    2.73    25.04
           TER     0.019   0.137    0.245    0.347   0.133    0.030
aCasper    HN      4.86    0.12     3.02     4.18    0.00     4.64
           TER     0.018   0.135    0.231    0.306   0.117    0.016

5 Conclusions
This paper proposes a new constructive algorithm to design as well as train ANNs. Neither the number of hidden layers in an ANN nor the number of neurons in hidden layers needs to be predefined and fixed. They are determined automatically during the training of an ANN. Experiments have been carried out on several benchmark classification problems to evaluate how well PDCA
performed on different problems in comparison with other ANN design algorithms. In almost all cases, PDCA outperformed the others. However, our experimental study appeared to reveal one weakness of PDCA in dealing with the glass problem, which has a small number of training examples. It would be interesting in the future to analyze PDCA more rigorously in order to gain more insight into when PDCA is most likely to perform well and for what kinds of problems. The use of different activation functions, like the one used in [8], in association with different training sets would also be an interesting future research topic.

Acknowledgement. MMI is currently a Visiting Associate Professor at the University of Fukui, supported by a Fellowship from the Japan Society for the Promotion of Science (JSPS). This work was in part supported by grants to KM from JSPS, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.
References

1. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 524–532. Morgan Kaufmann, San Mateo, CA (1990)
2. Fahlman, S.E.: An empirical study of learning speed in back-propagation networks. Tech. Report CMU-CS-88-162, Carnegie Mellon University (1988)
3. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989)
4. Kwok, T.Y., Yeung, D.Y.: Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8, 630–645 (1997)
5. Lehtokangas, M.: Modified cascade-correlation learning for classification. IEEE Transactions on Neural Networks 11, 795–798 (2000)
6. Lehtokangas, M.: Fast initialization for cascade-correlation learning. IEEE Transactions on Neural Networks 10, 410–413 (1999)
7. Lehtokangas, M.: Modeling with constructive backpropagation. Neural Networks 12, 707–716 (1999)
8. Ma, L., Khorasani, K.: Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Transactions on Neural Networks 16, 821–833 (2005)
9. Monirul Islam, Md., Murase, K.: A new algorithm to design compact two-hidden-layer artificial neural networks. Neural Networks 14, 1265–1278 (2001)
10. Phatak, D.S., Koren, I.: Connectivity and performance tradeoffs in the cascade correlation learning architecture. IEEE Transactions on Neural Networks 5, 930–935 (1994)
11. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11, 761–767 (1998)
12. Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Networks 4, 740–747 (1993)
13. Prechelt, L.: PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
14. Riedmiller, M., Braun, H.: RPROP - A fast adaptive learning algorithm. In: Proc. of the 1992 International Symposium on Computer and Information Sciences, Turkey, November 1992, pp. 279–285 (1992)
15. Schapire, R.E.: The strength of weak learnability. Machine Learning 5, 197–227 (1990)
16. Schwenk, H., Bengio, Y.: Boosting neural networks. Neural Computation 12, 1869–1887 (2000)
17. Setiono, R., Hui, L.C.K.: Use of quasi-Newton method in a feed forward neural network construction algorithm. IEEE Transactions on Neural Networks 6, 273–277 (1995)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers versus three. IEEE Transactions on Neural Networks 8, 251–255 (1997)
19. Treadgold, N.K., Gedeon, T.D.: Exploring constructive cascade networks. IEEE Transactions on Neural Networks 10, 1335–1350 (1999)
20. Yao, X.: A review of evolutionary artificial neural networks. International Journal of Intelligent Systems 8, 529–567 (1993)
Effective Learning with Heterogeneous Neural Networks

Lluís A. Belanche-Muñoz

Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected]
Abstract. This paper introduces a class of neuron models accepting heterogeneous inputs and weights. The neuron model computes a user-defined similarity function between inputs and weights. The neuron transfer function is formed by composing an adapted logistic function with the power mean of the partial input-weight similarities. The resulting neuron model is capable of dealing directly with mixtures of continuous quantities (crisp or fuzzy) and discrete quantities (ordinal, integer, binary or nominal). There is also provision for missing values. An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out using several real-world benchmarking problems. The network is compared to a standard radial basis function network and to a multi-layer perceptron, and is shown to learn from non-trivial data sets with superior generalization ability in most cases, at a comparable computational cost. A further advantage is the interpretability of the learned weights.
1
Introduction
In many important real-world domains, objects are described by a mixture of continuous and discrete variables, usually containing missing information and characterized by an underlying uncertainty or imprecision. For example, in the well-known UCI repository [1] over half of the problems contain explicitly declared nominal attributes, let alone other data types, usually unreported. In the case of artificial neural networks (ANNs), this heterogeneous information has to be encoded in the form of real-valued quantities, although in most cases there is enough domain knowledge to characterize the nature of the variables. We present a general framework for dealing with data heterogeneity in ANNs under the conceptual cover of similarity, in which a class of neuron models accepting heterogeneous inputs and weights computes a user-defined similarity function between these inputs and weights. The similarity function is defined by composing a specific power mean of the partial input-weight similarities with a logistic function. The resulting neuron model then accepts mixtures of continuous and discrete quantities, with explicit provision for missing information. Other data types are possible by extension of the model.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 328–337, 2008. © Springer-Verlag Berlin Heidelberg 2008
An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out to illustrate the validity of the approach, using several benchmarking problems (classification and regression), selected as representative because of their diversity in the richness and kind of data heterogeneity. The network is compared to a standard radial basis function neural network and to a multi-layer perceptron, and shown to learn from non-trivial data sets with a generalization ability that is in most cases superior to both networks. A further advantage of the approach is the interpretability of the learned weights. The paper is organized as follows. Section 2 further motivates the basis of the approach and reviews some widespread ways of coping with data heterogeneity in ANNs; Section 3 introduces material on similarity measures; Section 4 builds the heterogeneous neural network and proposes specific similarity measures. Finally, Section 5 presents experimental work.
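The neuron model just outlined (power-mean aggregation of per-feature input-weight similarities, followed by a logistic) can be sketched as follows. This is our own minimal formulation: the per-feature similarity functions are assumed to be precomputed values in [0, 1], and the steepness k and centring of the logistic are illustrative stand-ins for the paper's "adapted logistic function":

```python
import math

def power_mean(values, p):
    """Generalised (power) mean of the partial similarities, each in [0, 1];
    p = 1 gives the arithmetic mean, p -> 0 the geometric mean."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

def heterogeneous_neuron(partial_sims, p=1.0, k=4.0):
    """Aggregate the input-weight similarities with a power mean, then
    apply a logistic centred on 0.5 (illustrative adaptation)."""
    s = power_mean(partial_sims, p)
    return 1.0 / (1.0 + math.exp(-k * (2.0 * s - 1.0)))
```

The aggregated similarity plays the role that the weighted sum plays in a classical neuron, so the unit's activity has a direct meaning: how similar the input is to the weight vector.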
2
Motivation
The action of a feed-forward ANN may be viewed as a mapping through which points in the input space are transformed into corresponding points in an output space. The task is to find structure in the data, transforming it from the input space to a new hidden space (the space spanned by the hidden units) in such a way that the problem is simpler when projected to this space. This processing is repeated for all subsequent hidden layers. It seems clear that adequate transformations will require fewer units and therefore a hidden space of lower dimension. This in turn leads to simpler mappings which are less likely to overfit. The integration of heterogeneous data sources in information processing systems has been advocated elsewhere [2]. In this sense, a shortcoming of existing neuron models is the difficulty of adding prior knowledge to the model. Current ANN practice assumes that an input may be faithfully represented as a point in Rn, and that the geometry of this space captures the meaningful relations in input space. There is no particular reason why this should be the case. Knowledge representation influences the practical success of a learning task [3], but has received little attention in the ANN community. The form in which data are presented to the network is of primary relevance. In essence, the task of the hidden layer(s) is to find a new, more convenient representation for the problem given the data representation chosen, a crucial factor for a successful learning process that can have a great impact on generalization ability [4]. As Anderson and Rosenfeld put it, "prior information and invariances should be built into the design of an ANN, thereby simplifying the network design by not having to learn them" [5]. Further, the activity of the units should have a well-defined meaning in terms of the input patterns [6]. Without the aim of being exhaustive, commonly used methods (see, e.g., [7,4]) are the following: ordinal variables are coded as real values or using a thermometer code; nominal variables are coded using a 1-out-of-c or c−1 representation.
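For concreteness, the two codings just mentioned can be sketched as follows (a minimal illustration, not tied to any particular library):

```python
def thermometer(value, levels):
    """Thermometer code for an ordinal value: one unit per level,
    set to 1 up to (and including) the value's rank."""
    rank = levels.index(value)
    return [1.0 if i <= rank else 0.0 for i in range(len(levels))]

def one_out_of_c(value, categories):
    """1-out-of-c code for a nominal value: one indicator unit per category."""
    return [1.0 if c == value else 0.0 for c in categories]

thermometer("medium", ["low", "medium", "high"])  # -> [1.0, 1.0, 0.0]
one_out_of_c("red", ["red", "green", "blue"])     # -> [1.0, 0.0, 0.0]
```

Note how both codings multiply the number of inputs by the number of distinct values, which is precisely the dimensionality increase criticized below.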
L.A. Belanche-Muñoz
Missing information is difficult to handle, especially when the lost parts are of significant size. Entire examples can be removed, or the missing values can be "filled in" with the mean, median or nearest neighbour, or signalled by adding another input equal to one exactly when the value is absent. Statistical approaches need to model the input distribution itself [4], or are computationally intensive [8]. Vagueness and uncertainty are considerations usually put aside. Although these encodings are intuitive, their precise effect on network performance is not clear, because of the change in input distribution, the (sometimes acute) increase in input dimension, and other subtler mathematical effects derived from imposing an order or a continuum where there was none. This issue can dramatically worsen generalization, due to the curse of dimensionality. In this scenario, the use of hidden units that can adequately capture the desired mapping without adding input dimensions is thus of great practical importance.
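The simple "fill-in" strategies listed above can be sketched as follows (mean imputation plus an absence-indicator input; an illustrative fragment, not the paper's method):

```python
def impute_mean_with_indicator(column):
    """Replace missing entries (None) by the column mean of the present
    values, and append an indicator column that is 1 exactly where the
    value was absent."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    filled = [v if v is not None else mean for v in column]
    indicator = [0.0 if v is not None else 1.0 for v in column]
    return filled, indicator

filled, ind = impute_mean_with_indicator([1.0, None, 3.0])
# filled == [1.0, 2.0, 3.0], ind == [0.0, 1.0, 0.0]
```

Note that the indicator column is itself one extra input dimension per variable with missing values, illustrating the dimensionality cost discussed above.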
3
Similarity Measures
Let us represent patterns belonging to a space X ≠ ∅ as a vector xi of n components, where each component xik represents the value of a particular feature k. A similarity measure is a unique number expressing how "alike" two patterns are, given these features. A similarity measure may be defined as a function s : X × X → Is ⊂ R. Assume that s is upper bounded, exhaustive and total. This implies that Is is upper bounded and that sup Is exists. Define smax ≡ sup Is and define smin ≡ inf Is if it exists. Let us denote by sij the similarity between xi and xj, that is, sij = s(xi, xj). A similarity measure may be required to satisfy the following axioms, for any xi, xj ∈ X:

Axiom S1. (Reflexivity) s(xi, xi) = smax. This implies sup Is ∈ Is.
Axiom S2. (Consistency) s(xi, xj) = smax =⇒ xi = xj.
Axiom S3. (Symmetry) s(xj, xi) = s(xi, xj).
Axiom S4. (Boundedness) s is lower bounded when ∃a ∈ R such that s(xi, xj) ≥ a for all xi, xj ∈ X. This is equivalent to asking that inf Is exists.
Axiom S5. (Closedness) a lower bounded function s is closed if there exist xi, xj ∈ X such that s(xi, xj) = smin. This is equivalent to asking that inf Is ∈ Is.

The similarity axioms enumerated above ensure a consistent definition of such functions, but should be taken as desiderata: some similarity relations may fulfill part or all of them [9]. Other properties (like transitivity) may be of great interest in some contexts, but are not relevant for this work. The only requirement expressed so far about the set X is the existence of an equivalence relation. However, elements in this set can be simple or complex (i.e. composed of one or more variables). For atomic elements the similarity can be trivially computed, but for complex elements we need to define a way to combine the partial similarities sijk = sk(xik, xjk) obtained for each variable k into a meaningful overall value. This combination has an important semantic role and is not a trivial choice.
We use in this work the concept of an A-average [10], as
follows. Let [a, b] be a non-empty real interval. Call A(x1, …, xn) the A-average of x1, …, xn ∈ [a, b] for every n-place real function A fulfilling:

Axiom A1. A is continuous, symmetric and strictly increasing in each xi.
Axiom A2. A(x, …, x) = x.
Axiom A3. For any positive integer k ≤ n:

A(x1, …, xn) = A(yk, …, yk, x_{i_{k+1}}, …, x_{i_n}),  with yk repeated k times,

where yk = A(x_{i_1}, …, x_{i_k}) and (i1, …, in) is a permutation of (1, …, n).

The means defined by these axioms fulfill Cauchy's property of means, namely that min xi ≤ A(x1, …, xn) ≤ max xi, with equality only if x1 = x2 = … = xn. The proof is straightforward using axioms A1 and A2. This result is even stronger than idempotency; indeed, the previous bounds impose idempotency. A further interesting property of these means is that we can add averaged elements to an A-average without changing the overall result. Formally, A(x1, …, xn) = A(z1, …, zm, x1, …, xn) if and only if A(z1, …, zm) = A(x1, …, xn). As a consequence, if y = A(x1, …, xn), then A(x1, …, xn, y) = y.

Theorem. Let f : [a, b] → R be a continuous, strictly monotone mapping and let g be the inverse function of f. Then

A(x1, …, xn) ≡ g( (1/n) Σ_{i=1}^{n} f(xi) )

is a well-defined A-average for all n ∈ N and xi ∈ [a, b] [10]. An important class of A-averages is obtained by choosing f(z) = z^q, q ∈ R:

Mq(x1, …, xn) = ( (1/n) Σ_{i=1}^{n} (xi)^q )^{1/q}

Several well-known means are found for particular values of q: the arithmetic mean for q = 1, the geometric mean in the limit q → 0, the quadratic mean for q = 2 and the harmonic mean for q = −1. A property of these means is that Mq(x1, …, xn) ≥ Mq′(x1, …, xn) if and only if q > q′, with equality only if x1 = x2 = … = xn. We shall refer to these as the power means.
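As a quick sanity check of the power means and their properties (monotonicity in q, idempotency), consider this sketch, which treats the geometric mean as the limit q → 0:

```python
import math

def power_mean(xs, q):
    """Power mean M_q of a non-empty list of positive numbers.

    q = 1 gives the arithmetic mean, q = -1 the harmonic mean,
    q = 2 the quadratic mean; the geometric mean is the limit q -> 0,
    handled here as a special case.
    """
    n = len(xs)
    if q == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** q for x in xs) / n) ** (1.0 / q)

xs = [0.2, 0.5, 0.9]
# M_q is non-decreasing in q: harmonic <= geometric <= arithmetic <= quadratic
assert power_mean(xs, -1) <= power_mean(xs, 0) <= power_mean(xs, 1) <= power_mean(xs, 2)
# Idempotency: M_q(x, ..., x) = x
assert abs(power_mean([0.7, 0.7, 0.7], 3) - 0.7) < 1e-9
```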
4
Heterogeneous Neural Networks
4.1
The Heterogeneous Neuron Model
We develop in this section neuron models allowing for heterogeneous and imprecise inputs, defined as a mapping h : Ĥ^n → Rout ⊆ R. Here R denotes the reals and Ĥ^n is a cartesian product of an arbitrary number n of source sets Ĥ(k), k = 1, …, n. These source sets may include extended reals R̂k = Rk ∪ {X}
(where each Rk is a real interval), extended families of fuzzy sets F̂k = Fk ∪ {X}, and extended finite sets of the form Ôk = Ok ∪ {X} and M̂k = Mk ∪ {X}, where the Ok have a full order relation while the Mk have not. The special symbol X extends the source sets and denotes the unknown element (missing information), behaving as an incomparable element w.r.t. any ordering relation. According to this definition, neuron inputs are vectors composed of n elements, among which there may be reals, fuzzy sets, ordinals, categorical values and missing data. Consider now a similarity function s : Ĥ^n × Ĥ^n → Is in Ĥ^n, where we take Is ⊆ [0, 1] for simplicity. This function is formed by combination of n partial similarities sk : Ĥ(k) × Ĥ(k) → Is. The sk measures are normalized to a common real interval ([0, 1] in this case) and computed according to different formulas for different variables (possibly but not necessarily determined by variable type alone). A neuron model that is both similarity-based and handles data heterogeneity and missing values can then be devised as follows. Let Γi(x) be the function computed by the i-th neuron, where x ∈ Ĥ^n, with weight vector μi ∈ Ĥ^n and smoothing parameter γi. Then

Γi(x) = š(γi Mq([sk(xk, μi,k)]_{k=1}^{n})),  q ∈ R, γi ∈ (0, 1]

The activation function š is any sigmoid-like automorphism (a monotonic bijection) in [0, 1]. Similarities that cannot be computed (because one or both of their arguments are missing) are regarded as ignorance and must contribute neither in favour of nor against the overall measure. To this end, let C = {k1, …, km} be the set of indexes of the similarities that could not be computed, and define

sX(x, μi) = Mq([sk(xk, μi,k)]_{k=1, k∉C}^{n}),

that is, the power mean of the successfully computed similarities. Thanks to the previous properties of A-averages,

Mq([sk(xk, μi,k)]_{k=1}^{n}) = Mq([sk(xk, μi,k)]_{k=1, k∉C}^{n}, sX(x, μi), …, sX(x, μi)),

with sX(x, μi) repeated m times.
4.2
Heterogeneous Neural Networks
Heterogeneous neural networks are neural architectures built out of the previously defined neuron models, thus allowing for heterogeneous or missing inputs. A feed-forward architecture, with a hidden layer composed of heterogeneous neurons and a linear output layer, is a straightforward choice, thus conforming a hybrid structure. The heterogeneous neural network computes the function:

fhnn(x) = Σ_{i=1}^{h} αi Γi(x)   (1)
where h > 0 is the number of (heterogeneous) hidden neurons and {αi } is the set of mixing coefficients. The HNN thus keeps linearity in the output layer and interpretability in the hidden layer. It can be naturally seen as a generalization of the RBF. This is so because the response of hidden neurons is localized: centered at a given object (the neuron weight, where response is maximum), falling down
as the input is less and less similar to that center. The general training procedure for the heterogeneous neural network (HNN for short) is based on evolutionary algorithms, due to the missing information and the likely non-differentiability of the similarity measure. Specifically, the training procedure used in this work is based on a Breeder Genetic Algorithm (BGA) [11], a method midway between Genetic Algorithms (GA) and Evolution Strategies. While in GA selection is stochastic and meant to mimic Darwinian evolution, BGA selection is driven by a deterministic breeding mechanism, an artificial method in which only the best individuals (usually a fixed percentage τ of the total population size μ) are selected to be recombined and mutated, as the basis to form a new generation. The BGA used here does not need any coding scheme and is able to cope with heterogeneous or missing genes [12].
4.3
Heterogeneous Similarity Measures
We present in this section specific measures defined to take values in the interval [0, 1] for the sake of illustration; many variations may be superior by making better use of available expert knowledge.

Ordinal variables. It is assumed that the values of the variable form a linearly ordered space (O, ≼). Let xik, xjk ∈ O with xik ≼ xjk, let Plk be the fraction of values of variable k that take on the value xlk, and let the sums run over all ordinal values xlk such that xik ≼ xlk and xlk ≼ xjk [13]:

sijk = 2 log(Pik + … + Pjk) / (log Pik + … + log Pjk)   (2)
Nominal variables. It is assumed that no partial order exists among these values and the only possible comparison is equality. The basic similarity measure for these variables is the overlap: let N be a nominal space and xik, xjk ∈ N; then sijk = 1 if xik = xjk and 0 otherwise.

Continuous variables. Let x, y ∈ Γ = [r−, r+] ⊂ R, r+ > r−. The standard metric in R is a metric in Γ. Therefore, for any two values xik, xjk ∈ R:

sijk = ŝR( |xik − xjk| / sup_{x,y∈Γ} |x − y| )   (3)

where ŝR : [0, 1] → [0, 1] is a decreasing continuous function. The family ŝR(z) = (1 − z^β)^α, 0 < β ≤ 1, α ≥ 1, is used here, with α = 2, β = 1.

Integer variables. Given that N ⊂ R, any distance-based similarity in R is also valid in N. A flexible choice does not limit the range of integer values (assumed positive for convenience). In this case, a self-normalizing distance-based similarity sijk = ŝN(|xik − xjk|) is indicated, where ŝN : [0, ∞) → (0, 1] is a decreasing continuous function. In particular, the function ŝN(z) = 1/(1 + z) can be used.
Binary variables. In the data analysis literature there are many similarity measures defined on collections of binary variables (see e.g. [14]). This is mostly due to the uncertainty over how to accommodate negative (i.e. false-false) matches. The present situation is that of comparing a single binary variable rather than a whole vector. In general, one should know which of the two matches is the really relevant one (true-true or false-false); treating the variable as purely nominal can therefore result in a loss of relevant information. Since this meta-knowledge is usually not available, we use in this work a frequency-based approach, as follows. Let xik, xjk be two binary values and let Plk again be the fraction of values of variable k that take on the value xlk. We define sijk = h(1 − Pik, 1 − Pjk), where h(x, y) = 2xy/(x + y) is the harmonic mean of x and y. This measure compares the agreement on the rarity of each value: the similarity is higher the less frequent the values are.

Fuzzy variables. For variables representing fuzzy sets, similarity relations from the point of view of fuzzy theory have been defined elsewhere [15], and different choices are possible. In possibility theory, the possibility expresses the likeliness of co-occurrence of two vague propositions, with a value of one standing for absolute certainty. For two fuzzy sets Ã, B̃ of a reference set X, it is defined as:

Π_Ã(B̃) = sup_{u∈X} μ_{Ã∩B̃}(u) = sup_{u∈X} min(μ_Ã(u), μ_B̃(u))

In our case, if Fk is an arbitrary family of fuzzy sets and xik, xjk ∈ Fk, the similarity relation sijk = Π_{xik}(xjk) can be used. Notice that this measure indeed fulfills axioms S1, S2 and S3.
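To make the measures of this section concrete, the following sketch implements some of the partial similarities together with the neuron of Section 4.1. The rescaled-logistic activation and the way γ enters the squashing are illustrative choices of ours: the paper only requires š to be a sigmoid-like automorphism of [0, 1].

```python
import math

MISSING = None  # plays the role of the unknown element X

def s_continuous(x, y, lo, hi, alpha=2, beta=1):
    """Eq. (3): s = (1 - z**beta)**alpha, with z the range-normalized distance."""
    z = abs(x - y) / (hi - lo)
    return (1.0 - z ** beta) ** alpha

def s_nominal(x, y):
    """Overlap measure for nominal values."""
    return 1.0 if x == y else 0.0

def s_integer(x, y):
    """Self-normalizing measure s = 1/(1+|x-y|) for non-negative integers."""
    return 1.0 / (1.0 + abs(x - y))

def s_binary(x, y, freq):
    """Frequency-based measure: harmonic mean of the rarities 1-P(x), 1-P(y)."""
    a, b = 1.0 - freq[x], 1.0 - freq[y]
    return 2.0 * a * b / (a + b) if a + b > 0 else 0.0

def power_mean(xs, q):
    n = len(xs)
    if q == 0:  # geometric mean, the limit q -> 0
        return math.exp(sum(math.log(max(x, 1e-12)) for x in xs) / n)
    return (sum(x ** q for x in xs) / n) ** (1.0 / q)

def squash(z, gamma):
    """A sigmoid-like automorphism of [0, 1]: a logistic curve rescaled so
    that squash(0) = 0 and squash(1) = 1; gamma controls the steepness."""
    f = lambda t: 1.0 / (1.0 + math.exp(-t / gamma))
    return (f(z - 0.5) - f(-0.5)) / (f(0.5) - f(-0.5))

def neuron(x, mu, partials, q=1, gamma=1.0):
    """Heterogeneous neuron: squashed power mean of the computable partial
    similarities. Pairs with a missing argument are skipped, which by the
    A-average property leaves the mean unchanged."""
    sims = [s(a, b) for s, a, b in zip(partials, x, mu)
            if a is not MISSING and b is not MISSING]
    return squash(power_mean(sims, q), gamma)
```

With partial measures such as [s_continuous on [0, 1], s_nominal, s_integer], an input identical to the weight vector yields a response of 1, decaying as the input drifts away from the weight, which mimics the localized, RBF-like behaviour described in Section 4.2.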
5
An Experimental Comparison
A number of experiments are carried out to illustrate the validity of the approach, using nine benchmarking problems, selected as representatives because

Table 1. Basic features of the data sets. 'Missing' refers to the percentage of missing values, p to the number of examples and n to that of variables. Types are: C (continuous), N (nominal), I (integer), D (ordinal), B (binary) and F (fuzzy). 'Output' is R for regression or a number indicating the number of classes.

Name                 p     n   Data Types        Output  Missing
Annealing            898   28  6C,19N,3I         6       none
Audiology            226   69  8D,61N            4       2.3%
Boston Housing       506   13  11C,1I,1B         R       none
Credit Screening     690   15  6C,6N,3B          2       0.6%
Heart Disease        920   13  3C,5N,3I,2B       2       16.2%
Horse Colic          364   20  6D,5N,2I,5C,2B    3       26.1%
South African Heart  462   9   7C,1I,1N          2       none
Servo Data           167   4   2I,2N             R       none
Solar Flares         1066  9   4N,3D,2B          2       none
of the richness in data heterogeneity, taken from [7] and [1]. Their main characteristics are displayed in Table 1. For every data set, the available documentation is analysed to assess the most appropriate treatment. Missing information is also properly identified. The HNN is compared to a standard radial basis function network (RBF) and to a multi-layer perceptron (MLP). The three networks are trained using exactly the same procedure to exclude this source of variation from the analysis. The RBF and MLP networks need a pre-processing of the data, following the recommendations in [7]. The RBF neuron model has provision for a smoothing parameter, different for every hidden neuron. The network architecture is fixed to a single layer of hidden neurons ranging from 1 to 30, plus as many output neurons as required by the task: logistic for classification problems and linear otherwise. The input variables for the RBF and MLP networks are standardized to zero mean and unit standard deviation. This is not needed by the heterogeneous neurons, because they compute a normalized measure, but is beneficial for the other networks. The output is presented for all networks in 1-out-of-c mode (that is, c outputs) for classification problems with c classes. The error measure in this case is generalized cross-entropy (GCE) [4]. For regression problems, the mean is subtracted from the continuous output and the normalized root mean square (NRMS) error is reported. Each data set is split into three parts, for training (TR), validation (VA) and test (TE), as 50%-25%-25%. The BGA task is to minimize error (either NRMS or GCE) on the TR part, until 500 generations or convergence. Then the network having the lowest error on the VA part is returned. The reported performance is the NRMS or GCE of that network on the TE part. This process is repeated ten times and the average is taken. We present performance results in Fig. 1(a) to (h). In each figure, the abscissa represents the number of neurons in the hidden layer, and the ordinate the average NRMS or GCE of that network on the TE part, as explained above. For classification problems, the abscissa value 0 is used to indicate the percentage of examples in the majority class (that is, the minimum acceptable performance, as a reference). For regression problems, this value is set to 1.0.

Fig. 1. Generalization results for the networks on (a) Audiology, (b) Boston Housing, (c) Credit Screening, (d) Heart Disease, (e) Horse Colic, (f) Annealing, (g) South African Heart and (h) Solar Flares. Each panel plots RBF, MLP and HNN performance against the number of hidden neurons. See text for an explanation of the axes.
Note that for classification problems higher values are better, whereas for regression problems lower values are better. The HNN shows enhanced performance for most of the problems, especially for those displaying higher heterogeneity or missingness, and less so when this is not the case. This is reasonable, since the HNN behaves as an RBF network when there is no heterogeneity or missingness. The interpretability of the learned weights is also enhanced, since the weights are of the same type as their matching input. We additionally present performance results for the Servo Data (Fig. 2), which are especially interesting. This is a small data set with only 4 variables (none of them continuous) and less than 200 examples.

Fig. 2. Generalization performance results for the Servo Data. Left: heavy overfit by the MLP network. Right: only RBF and HNN.

For this data set the MLP overfits heavily, due to the amplification in the number of variables introduced by the coding scheme of the nominal ones, which are many-valued.
6
Conclusions
A heterogeneous neuron that computes a similarity index in [0, 1], followed by a logistic function, has been proposed. Neural architectures making use of such neurons are termed heterogeneous neural networks. A body of experimental evidence has shown very promising results in several difficult learning tasks. These results also illustrate the importance of data preparation and of using a more adequate similarity measure that captures relations in the data. As further work, we are currently developing a clustering-based alternative training scheme.
References
1. Murphy, P.M., Aha, D.: UCI Repository of Machine Learning Databases. UCI Dept. of Information and Computer Science (1991)
2. Kaburlassos, V., Petridis, V.: Learning and Decision-Making in the Framework of Fuzzy Lattices. In: New Learning Paradigms in Soft Computing. Physica (2002)
3. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Francisco (1991)
4. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
5. Anderson, J.A., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research. MIT Press, Cambridge (1988)
6. Omohundro, S.M.: Geometric Learning Algorithms. Technical Report 89-041, Univ. of Berkeley, CA (1989)
7. Prechelt, L.: Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, Fakultät für Informatik, Univ. Karlsruhe (1994)
8. Tresp, V., Ahmad, S., Neuneier, R.: Training Neural Networks with Deficient Data. In: NIPS, vol. 6. Morgan Kaufmann, San Francisco (1994)
9. Chandon, J.L., Pinson, S.: Analyse Typologique. Masson (1981)
10. Hille, E.: Methods in Classical and Functional Analysis. Addison-Wesley, Reading (1971)
11. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive Models for the Breeder Genetic Algorithm. Evolutionary Computation 1(1) (1993)
12. Belanche, L.: Evolutionary Optimization of Heterogeneous Problems. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 475–484. Springer, Heidelberg (2002)
13. Lin, D.: An Information-Theoretic Definition of Similarity. In: Proceedings of the International Conference on Machine Learning (1998)
14. Everitt, B.: Cluster Analysis. Heinemann Books (1977)
15. Zadeh, L.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa Dept. of Computer Intelligence and Systems Science, Tokyo Institute of Technology Imaging Science and Engineering Lab., Tokyo Institute of Technology 4259 Nagatsuta-cho, Midori-ku, Yokohama, Japan {sudo,hasegawa}@isl.titech.ac.jp http://www.isl.titech.ac.jp/∼hasegawalab/
Abstract. We propose an architecture for reasoning with pattern-based if-then rules that is effective for intelligent systems, such as robots, solving varying tasks autonomously in a real environment. The proposed system can store pattern-based if-then rules of propositional logic, including conjunctions, disjunctions, negations, and implications. Naive pattern-based reasoning can store pattern-based if-then rules and make inferences using them, but it remains insufficient for intelligent systems operating in a real environment. The proposed system uses an algorithm inspired by self-incremental neural networks such as SOINN and SOINN-AM in order to achieve incremental learning, generalization, avoidance of duplicate results, and robustness to noise, which are important properties for intelligent systems. Keywords: Reasoning, neural network, intelligent agent, intelligent robot.
1
Introduction
Reasoning is an essential process of human intelligence. Although many reasoning systems have been proposed, they remain insufficient for intelligent systems that must operate in the real world, such as robots. We consider that one reason is that most of them address only symbol-based if-then rules. Several reasoning systems have been recognized for their effectiveness and have been commercialized, but they perform well only in a specific domain. Production systems, for instance, are well-established conventional reasoning systems, but they require human experts to input their knowledge in advance; consequently, such systems perform well only in the specific domain for which an expert has provided knowledge. In contrast, nobody can correctly predict the complete knowledge list for an intelligent system operating in a varying environment; such a system must therefore learn knowledge that is provided sequentially by its environment, in an incremental manner. If the intelligent system uses a conventional symbol-based reasoning M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 338–347, 2008. © Springer-Verlag Berlin Heidelberg 2008
system to make inferences using knowledge learned from its environment, it must convert learned patterns to symbols before reasoning, because what it obtains from the environment through its sensors are patterns, not symbols. This strategy seems impractical at present and might be impossible even in the future. Converting patterns to symbols remains a difficult task, despite the many studies of it as a classification problem: discriminating cats from dogs is not an easy problem for current discriminators. Problems will remain even after a classifier as powerful as a human is invented. Intelligent systems must generate new symbols when they encounter something that should be addressed as a symbol other than those they already know. Humans must think or act with or about an object to assign a symbol to it, and must form inferences using patterns obtained from the object, because it has not been symbolized yet. Intelligent systems likewise must make inferences using patterns in order to yield new symbols. Several approaches have been proposed to realize pattern-based reasoning systems. Tsukimoto proposed pattern-based reasoning in which perceptrons that have learned patterns are employed as atomic propositions of if-then rules [4]. In [3], a pattern-based reasoning system was proposed in which binary vectors are employed as atomic propositions, using a non-monotone neural network. The method proposed in [4] can deal with conjunction, disjunction and negation, whereas the one proposed by Yamane et al. [3] cannot. On the other hand, for robots, patterns should be represented with vectors rather than perceptrons, because it is difficult to determine when a new perceptron should be generated as a new proposition. Moreover, real-valued data should be handled, because sensor readings take real values, whereas [3] deals only with binary vectors.
No previous study has realized a pattern-based reasoning system that deals with conjunction, disjunction and negation while employing real-valued vectors as the representation of patterns. In this paper, we propose a novel reasoning system with both of these properties. The proposed method can learn if-then rules whose atomic propositions are patterns represented as real-valued vectors. For instance, the proposed method can learn "((A∧B)∨C)→(D∧E∧F)", where A–F are patterns represented as real-valued vectors¹. When the proposed system, having learned if-then rules, obtains a fact, it produces an inference from the fact by picking up and connecting learned if-then rules. For example, it outputs C∨D and (E∧¬F)∨D if the system, having learned (A∧B)→(C∨D) and C→(E∧¬F), obtains "A∧B" as a fact. In addition to the basic functions for pattern-based reasoning described above, some additional functions are realized by the proposed method for application to an intelligent system, such as a robot, operating in dynamic environments: incremental learning, generalization ability, avoidance of duplication of reasoning results, and robustness to noise. The proposed system achieves them through
We use ∧, ∨, ¬, →, respectively, for conjunction, disjunction, negation, and implication.
Fig. 1. Two types of noisy data
the use of a self-incremental neural network inspired by SOINN [1] and SOINN-AM [2]. Incremental learning of if-then rules is required because a user cannot provide the system with complete knowledge in advance. The proposed system can learn if-then rules that are provided incrementally, without corrupting previously learned knowledge. In [3], about 20% of inferences are incorrect after the system has incrementally learned as many new if-then rules as had previously been learned in a batch manner. The proposed system does not store knowledge that is the same as (or very similar to) previously stored knowledge. This enables it to avoid redundant memory consumption when it learns much knowledge incrementally. In a real environment, robots rarely obtain data identical to previously learned data; the proposed system can nonetheless retrieve similar learned if-then rules when different data are given. The proposed system categorizes learned if-then rules in an online manner to avoid duplication of reasoning results. If the if-then rules were not categorized, similar if-then rules would be treated as completely distinct rules, which engenders duplication of results. For example, if a system that does not categorize knowledge stores "A→B" and "A'→B'", where (A, A') and (B, B') are very similar pattern pairs, both B and B' are output even when they can be interpreted as the same result. Note that our approach categorizes if-then rules, not individual patterns. We consider that this approach leads toward intelligent systems generating symbols as a result of their own reasoning. Two types of noise exist in real environments, as shown in Fig. 1. The noise shown in Fig. 1(a) taints the original data, whereas the noise shown in Fig. 1(b) is a pattern that is independent of the original patterns. The proposed system is robust to both types of noise.
2
Proposed Method
The proposed system consists of long-term memory, short-term memory, a learning machine, and a reasoning machine, as shown in Fig. 2. The long-term memory stores if-then rules learned from environments. The short-term memory temporarily stores input data. The learning machine executes a learning algorithm when if-then rules are provided; unnecessary if-then rules are eliminated and the remaining if-then rules are categorized into several clusters in an online manner during learning. The reasoning machine executes a reasoning algorithm and outputs an OR tree as a reasoning result when a fact like A∧B is given. Not only this minimum functionality, but also incremental learning, generalization, avoidance of duplication of reasoning results, and robustness to noise are achieved through the use of
Fig. 2. Architecture of the proposed system
algorithms that are inspired by a self-incremental neural network based on SOINN [1] and SOINN-AM [2].
2.1
Learning Algorithm
The learning machine executes the learning algorithm when the system obtains an if-then rule. First, the input if-then rule is resolved into several if-then rules whose conditional and sequential parts are both conjunctions of literals. This is accomplished as follows: both the conditional part and the sequential part are transformed to disjunctive normal form, and then every combination of clauses is picked up. For example, if "(A∧B)∨(C∧¬D)→(¬E∧F)" is input, it is resolved into "(A∧B)→(¬E∧F)" and "(C∧¬D)→(¬E∧F)". Note that the conditional and sequential parts of any if-then rule can be transformed in this way, because any logical formula can be transformed into disjunctive normal form. The resolved if-then rules are stored temporarily in the short-term memory. The system would suffer a high computational load and duplication of reasoning results if all the resolved if-then rules were simply stored in the long-term memory. The steps described below let the learning machine determine which if-then rules should be stored in the long-term memory, and categorize the if-then rules in the long-term memory into several clusters using edges; both are carried out in an online manner. These steps are executed repeatedly for every resolved if-then rule. Note that a "learning datum" below denotes a single resolved if-then rule. First, the most similar if-then rule (first winner) and the second-most similar if-then rule (second winner) to the learning datum are sought among the if-then rules in the long-term memory that have the same form as the learning datum. Here, the same form means that the numbers of positive and negative propositions in the conditional part and in the sequential part are respectively identical. For example, "A∧¬B→C" is regarded as having the same form as "D∧¬E→F", but
A. Sudo et al.
is not regarded as having the same form as "¬D∧¬E→F". The similarity between same-form if-then rules is calculated using the following distance measure d:

\[
d(k, k') = \frac{1}{\sqrt{D}} \left[ \min_{\{k_1,\dots,k_M\} \in N_M} \sum_{i=1}^{M} \left\| P_i - P'_{k_i} \right\| + \min_{\{k_1,\dots,k_N\} \in N_N} \sum_{i=1}^{N} \left\| Q_i - Q'_{k_i} \right\| + \min_{\{k_1,\dots,k_{\tilde M}\} \in N_{\tilde M}} \sum_{i=1}^{\tilde M} \left\| \tilde P_i - \tilde P'_{k_i} \right\| + \min_{\{k_1,\dots,k_{\tilde N}\} \in N_{\tilde N}} \sum_{i=1}^{\tilde N} \left\| \tilde Q_i - \tilde Q'_{k_i} \right\| \right],
\]

where D represents the dimension of the patterns, N_M = {1, 2, ..., M}, N_N = {1, 2, ..., N} (and analogously N_{\tilde M}, N_{\tilde N}), and the learning datum k and an if-then rule k' stored in the long-term memory are denoted respectively as

\[
k:\ \left( \bigwedge_{i=1}^{M} P_i \right) \wedge \left( \bigwedge_{j=1}^{N} \neg Q_j \right) \rightarrow \left( \bigwedge_{i=1}^{\tilde M} \tilde P_i \right) \wedge \left( \bigwedge_{j=1}^{\tilde N} \neg \tilde Q_j \right),
\]
\[
k':\ \left( \bigwedge_{i=1}^{M} P'_i \right) \wedge \left( \bigwedge_{j=1}^{N} \neg Q'_j \right) \rightarrow \left( \bigwedge_{i=1}^{\tilde M} \tilde P'_i \right) \wedge \left( \bigwedge_{j=1}^{\tilde N} \neg \tilde Q'_j \right).
\]
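Because the index sets {k_1, ..., k_M} in d are chosen without constraint, each minimum decomposes into independent nearest-neighbour terms. A minimal sketch of the measure (function and variable names are our own, not the paper's code; patterns are NumPy vectors):

```python
import numpy as np

def group_dist(xs, ys):
    # sum of each pattern's distance to its nearest counterpart; with
    # unconstrained index choices, the minimum over index assignments
    # decomposes into these independent nearest-neighbour terms
    if not xs:
        return 0.0
    return sum(min(np.linalg.norm(x - y) for y in ys) for x in xs)

def rule_distance(k, kp, D):
    """k, kp: 4-tuples of pattern lists (positive then negated literals of
    the conditional part, then of the sequential part); D is the pattern
    dimension."""
    return sum(group_dist(a, b) for a, b in zip(k, kp)) / np.sqrt(D)

# two same-form rules: identical rules have distance 0
a = [np.zeros(4)]
b = [np.ones(4)]
k = (a, a, a, a)
```

For example, `rule_distance(k, (b, a, a, a), 4)` differs from zero only through the first group's nearest-neighbour term.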
Then, whether the learning datum is a member of the same cluster as both the first winner and the second winner is determined; the method of this determination is described at the end of this subsection. If the learning datum is not a member of the same cluster as either winner, it is added to the long-term memory as new knowledge and the learning process for this datum finishes. If the learning datum is a member of the same cluster as both winners, it is not added to the long-term memory, and the previously stored knowledge in the long-term memory is updated as follows. An edge between the first winner and the second winner is generated if it does not yet exist, and the age of the edge is set to zero. Edges are used to generate clusters of if-then rules in the long-term memory: if-then rules connected by edges are regarded as members of the same cluster. Subsequently, the age of every edge emanating from the first winner is increased by 1, and every edge whose age exceeds a parameter Λedge is removed. Whenever the number of learning data provided so far is an integer multiple of a parameter λ, the if-then rules in the long-term memory that have fewer than two edges are removed; the noise depicted in Fig. 1(b) is expected to be eliminated by this step. Finally, we describe how to determine whether a learning datum is a member of the same cluster as both the first and second winners. The similarity threshold of the first (or second) winner is calculated as

\[
s = \begin{cases} \max_{k \in N} d(k, k_w) & (\text{if } N \neq \emptyset) \\ \max_{k \in A} d(k, k_w) & (\text{if } N = \emptyset) \end{cases}
\]
Pattern-Based Reasoning System Using Self-incremental Neural Network
Fig. 3. Example of an OR tree as a result of reasoning (root: A∧B; children of the root: C and D∧¬E; children of C: F, G, and H∧¬I; child of D∧¬E: ¬J)
where N is the set of all the if-then rules connected to the first (or second) winner by an edge, A is the set of all the if-then rules in the long-term memory, and k_w denotes the first (or second) winner. The learning datum is regarded as a member of the same cluster as both winners if its distance to the first winner is less than the first winner's similarity threshold and its distance to the second winner is less than the second winner's similarity threshold.

2.2 The Reasoning Algorithm
When a fact such as A∧B is input, the system generates a tree whose root is the input fact; initially, no node other than the root exists. The distance between the fact and the conditional part of each if-then rule in the long-term memory is then calculated. If conditional parts exist whose distance is less than δr, their sequential parts are added to the tree as children of the root; if several sufficiently close sequential parts exist in the same cluster, only the nearest one from that cluster is added. The reasoning process terminates if no child is added to the tree; otherwise, the same steps are applied to the new children in place of the fact, and this is repeated as long as new children are generated. Figure 3 shows an example of an output tree. The reasoning machine outputs that tree when A'∧B' is obtained as a fact, if the long-term memory holds "A∧B→C∨(D∧¬E)", "C'→F∨G∨(H∧I)", and "D'∧¬E'→¬J", where (A, A') through (E, E') are pairs of sufficiently similar patterns. An output tree should be read as an OR tree; a deduced proposition is then a disjunctive normal form generated by combining several propositions from its nodes. If the reasoning machine outputs the tree shown in Fig. 3, then "C∨(D∧¬E)", "C∨¬J", "F∨G∨(H∧I)∨(D∧¬E)", and "F∨G∨(H∧I)∨¬J" are all correct reasoning results. A reasoning result represented as a disjunctive normal form can be interpreted as the possible states of the environment.
The system discovers the possibility of places that the system can reach from the present location if the present location is input to the system as a fact and “the supermarket or the station or the drug store” is the reasoning result.
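The tree expansion described above can be sketched as follows. This is an illustrative reading, not the paper's code: `memory` holds (cluster id, conditional part, sequential part) triples, and `dist` stands for the measure d restricted to conditional parts:

```python
from collections import deque

def reason(fact, memory, dist, delta_r):
    """Breadth-first expansion of the OR tree.
    memory: iterable of (cluster_id, condition, consequent) triples.
    Returns the tree as a dict mapping each node to its children."""
    tree = {fact: []}
    frontier = deque([fact])
    while frontier:
        node = frontier.popleft()
        best = {}  # per cluster, keep only the nearest matching rule
        for cid, cond, cons in memory:
            d = dist(node, cond)
            if d < delta_r and (cid not in best or d < best[cid][0]):
                best[cid] = (d, cons)
        for _, cons in best.values():
            if cons not in tree:  # guard against revisiting a node
                tree[node].append(cons)
                tree[cons] = []
                frontier.append(cons)
    return tree
```

With a symbolic stand-in for `dist` (0 for a match, 1 otherwise), the memory {A∧B→C, A∧B→D∧¬E, C→F} expands the fact A∧B into the tree of Fig. 3's shape. Note that an input fact far from every conditional part (such as white noise) produces no children at all, which is the behaviour exploited in the noise experiment below.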
Fig. 4. Neural Network Model of the Proposed Method (an input layer and a competitive layer; each competitive neuron holds one conditional-part literal and one sequential-part literal of the rule A∧¬B→¬C∧D)
2.3 Neural Network Model
The architecture of the proposed system shown in Fig. 2 is described in terms of knowledge engineering; the proposed system can also be interpreted as a self-incremental neural network like SOINN [1], SOINN-AM [2], and Growing Neural Gas [5]. The proposed neural network model has an input layer and a competitive layer, as shown in Fig. 4. The input layer obtains an if-then rule or a fact. New nodes are generated adaptively when the input layer obtains an if-then rule. Each if-then rule is represented by several nodes connected by edges, whereas other self-incremental neural networks represent a learned datum with a single node. These edges are of a different type from those used to generate clusters. The proposed model requires m × n neurons to represent an if-then rule whose conditional part has m literals and whose sequential part has n literals. Each neuron holds a pair of literals picked from the conditional part and the sequential part one-by-one, which means that each neuron holds two pattern vectors and their negation flags. The neurons representing A∧¬B→¬C∧D are shown in Fig. 4.
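The m × n pairing can be sketched as follows (a sketch under our own encoding assumptions: a literal is represented as a (pattern, negation-flag) pair, which is not spelled out in the paper):

```python
from itertools import product

def rule_neurons(cond_literals, cons_literals):
    # each neuron holds one (conditional, sequential) literal pair, so a
    # rule with m conditional and n sequential literals needs m*n neurons
    return list(product(cond_literals, cons_literals))

# A∧¬B → ¬C∧D, as in Fig. 4; a literal is a (pattern, negated) pair
neurons = rule_neurons([("A", False), ("B", True)],
                       [("C", True), ("D", False)])
```

Here a 2-literal conditional part and a 2-literal sequential part yield the four neurons depicted in Fig. 4.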
3 Simulation Results
In the experiment, the proposed system learned sequentially provided if-then rules whose atomic propositions were images taken in a real environment. It then produced inferences using the learned knowledge. We used 280 images of the 14 objects shown in Fig. 5, taken from 20 different angles. Each image has 56 × 46 pixels. The if-then rules the system learned were "A→B",
Fig. 5. Images for the experiment: A. Close door, B. Open door, C. Lab., D. Wall, E. Hallway, F. Elevator, G. Desk-1, H. Desk-2, I. Desk drawer-1 (close), J. Desk drawer-2 (close), K. Desk drawer-1 (open), L. Desk drawer-2 (open), M. Down stairs, N. Room plate, O. White Noise
Pattern-Based Reasoning System Using Self-incremental Neural Network
345
Table 1. Number of members in the clusters and compression rate

Cluster No. | If-then rule  | Number of members | Compression rate (%)
1           | A→B           | 87                | 78.25
2           | B→D           | 145               | 63.75
3           | B→E           | 50                | 87.5
4           | E→(C∧N)       | 64                | 84
5           | E→F           | 31                | 92.25
6           | E→M           | 146               | 63.5
7           | (C∧N)→(G∧I)   | 84                | 79
8           | (C∧N)→(H∧J)   | 111               | 72.25
9           | (G∧I)→K       | 140               | 65
10          | (H∧J)→L       | 122               | 69.5
Total       | -             | 980               | 75.5
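As a consistency check, the compression rates in Table 1 can be recomputed from the member counts; the figure of 400 learning data per resolved rule is inferred from these rates rather than stated explicitly in the table:

```python
# member counts per cluster, as in Table 1
members = [87, 145, 50, 64, 31, 146, 84, 111, 140, 122]
n_per_rule = 400  # inferred: 10 resolved rules -> 4,000 learning data total
rates = [round(100 * (1 - m / n_per_rule), 2) for m in members]
total = round(100 * (1 - sum(members) / (n_per_rule * len(members))), 2)
# rates reproduces the per-cluster column of Table 1; total reproduces 75.5
```

The per-cluster rates and the 75.5% total both match Table 1 under this assumption.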
Fig. 6. Reasoning results

Fig. 7. Ratio of correct reasoning when noisy data are given (vertical axis: ratio of correct reasoning (%); horizontal axis: SN ratio, 10 to 35; three curves: δr = 0.3, 0.2, and 0.1)
“B→D∨E”, “E→(C∧N)∨F∨M”, “(C∧N)→(G∧I) ∨ (H∧J)”, “(G∧I)→K”, and “(H∧J)→L”. They mean “the closed door → the open door”, “the open door → the wall ∨ the hallway”, “the hallway → (the laboratory ∧ the room plate) ∨ the elevator ∨ the down stairs”, “the laboratory ∧ the room plate→(the desk 1 ∧ the closed drawer 1)∨(the desk 2 ∧ the closed drawer 2)”, “(the desk 1 ∧ the
closed drawer 1) → the open drawer 1", and "(the desk 2 ∧ the closed drawer 2) → the open drawer 2". The system obtained if-then rules whose atomic propositions were selected randomly from the 20 different-angled images. Random selection provided learning data 400 times for each resolved if-then rule, i.e., 4,000 learning data in total. The parameters were set to Λedge = 100 and λ = 50. The number of if-then rules in the long-term memory after learning was 980; these rules were categorized into 10 clusters. Each cluster had members corresponding only to one of "A→B", "B→D", "B→E", "E→(C∧N)", "E→F", "E→M", "(C∧N)→(G∧I)", "(C∧N)→(H∧J)", "(G∧I)→K", or "(H∧J)→L"; for example, one cluster consisted only of if-then rules corresponding to "A→B". Consequently, the system was able to categorize if-then rules that are identical in a symbol-based sense into the same cluster. Table 1 shows the number of members of each cluster and the data compression rates. We estimated the compression rate as (1 − Nm/Nl), where Nm is the number of members of the cluster and Nl is the number of if-then rules provided to the system as learning data for that cluster. Although the learning data consisted of 4,000 if-then rules whose atomic propositions differed from one another, only 980 rules in all were stored in the long-term memory; the system eliminated 3,020 rules as being similar to previously stored rules. Note that the system obtained the learning data sequentially. These results indicate that the system learns new data without collapsing previously learned data and avoids consuming memory by eliminating similar data. We provided one learned image as a fact to test whether the system can produce inferences correctly, with the parameter set to δr = 0.3. The reasoning result is shown in Fig. 6. This tree has neither omissions nor duplications. The system was able to avoid duplication of results because it generated clusters appropriately.
In contrast, 500,376 nodes were created in the output tree when we made the system regard every if-then rule in the long-term memory as belonging to a different cluster. The reasoning result can be interpreted as follows: the system discovered where it can reach from the front of the closed door. The importance of storing if-then rules containing conjunctions is apparent in the reasoning about the contents of the closed drawers: because the two closed drawers appear almost identical, the system could not make a correct inference if it were unable to deal with conjunction. To examine whether the system can reason correctly in the presence of the type of noisy data shown in Fig. 1(b), we provided ten white noise samples, one example of which is shown in Fig. 5O. The system created no child node for the white noise, because it was able to judge that all the white noise samples consisted of unknown patterns. The system proposed in [3] cannot judge whether an input fact has been learned; it outputs a wrong pattern, entirely unrelated to the learned patterns, when it obtains an unknown pattern. We then generated noisy data by adding Gaussian noise to the 20 images of Fig. 5A to examine whether the system can reason correctly from the type of noisy data shown in Fig. 1(a). Ten noisy images were generated from each image at each noise level, which means that the system obtained 200 noisy images at each noise level. We estimated the correct reasoning ratio as "the number of
correct reasoning / the number of reasonings with the noisy image". Figure 7 shows that a boundary SN ratio exists: if the SN ratio of a fact is greater than the boundary, the system outputs the same result as it does for the original image; if the SN ratio is less than the boundary, the system regards the fact as an unknown pattern and generates no child node for it. The boundary can be configured by setting δr. The system proposed in [3] also has a boundary noise level for producing correct inferences, but the user cannot configure it.
4 Conclusion and Future Work
We proposed a novel reasoning system that can learn pattern-based if-then rules including conjunction, disjunction, and negation. The proposed system is suitable for systems operating in the real world, such as intelligent robots, because it achieves incremental learning, generalization, avoidance of duplicated results, and robustness to noise. We will develop this study in two directions. The first is applying the results of this study to the development of a robot. Using the pattern-based reasoning system, a robot is expected to learn knowledge autonomously from an environment and solve tasks by reasoning with the learned knowledge. The second is an extension of the proposed system that would allow a robot to solve more complicated problems.

Acknowledgments. This study was supported by the Industrial Technology Research Grant Program in 2004 from the New Energy and Industrial Technology Development Organization (NEDO) of Japan.
References

1. Shen, F., Hasegawa, O.: An Incremental Network for On-line Unsupervised Classification and Topology Learning. Neural Networks 19, 90–106 (2006)
2. Sudo, A., Sato, A., Hasegawa, O.: Associative Memory for Online Incremental Learning in Noisy Environments. In: Proc. of the International Joint Conference on Neural Networks (IJCNN) (accepted, 2007)
3. Yamane, K., Hasuo, T., Suemitsu, A., Morita, M.: Pattern-based reasoning using trajectory attractors. IEICE Trans. Information and Systems J90-D, 933–944 (2007)
4. Tsukimoto, H.: Pattern Reasoning: Logical Reasoning of Neural Networks. IEICE Trans. Information and Systems J83-D-II, 744–753 (2000)
5. Fritzke, B.: A growing neural gas network learns topologies. In: Proc. of Advances in Neural Information Processing Systems (NIPS), pp. 625–632 (1995)
Effect of Spatial Attention in Early Vision for the Modulation of the Perception of Border-Ownership

Nobuhiko Wagatsuma, Ryohei Shimizu, and Ko Sakai

Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Ten-nodai, Tsukuba, Ibaraki, 305-8577, Japan
[email protected],
[email protected],
[email protected] http://www.cvs.cs.tsukuba.ac.jp/
Abstract. We propose a computational model consisting of mutually linked V1, V2, and PP modules. The model reproduces the effect of attention in the determination of border-ownership (BO), which tells which side of a contour owns the border. The V2 module determines BO based on the surrounding contrast extracted by the V1 module, which can be influenced by top-down spatial attention from the PP module. We carried out simulations of the model with random-block ambiguous figures to test whether spatial attention alters BO for these meaningless stimuli. To compare these results quantitatively with human perception, we carried out psychophysical experiments corresponding to the simulations. The two sets of results showed good agreement: the perception of BO was flipped when the location of spatial attention was altered. These results suggest that spatial attention is a crucial factor in the modulation of figure direction in meaningless figures, and that the effects of spatial attention in early visual areas are crucial for this modulation.

Keywords: Attention, border ownership, figure, early vision, model, psychophysics.
1 Introduction

We have a function that focuses on the most important and salient object or location at a given moment, known as visual attention. Visual attention not only boosts our perception [1] but also alters the perception of objects, or figures, as is apparent in ambiguous figures such as Rubin's vase [2]. We propose that attention alters the contrast gain in early vision, and that the modified contrast then alters the border-ownership (BO) signals that are essential for the determination of the figure direction [3, 4]. If attention is significant, the perception of figure direction is flipped because of the modulation of the activities of BO-selective neurons; as a result, the perception of the object changes. In the case of Rubin's vase, for example, we can perceive two objects, the vase and two facing faces, depending on which side we attend to. Visual attention has two distinct modes: spatial attention and object-based attention. Both types of attention have been shown to facilitate human perception

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 348–357, 2008. © Springer-Verlag Berlin Heidelberg 2008
from a number of aspects [5]. In particular, a recent study reported that spatial attention alters contrast gain in early visual areas [6], a mechanism that has been examined by several modeling works [e.g., 7]. These models focus on the interaction between visual attention and lower-level visual functions such as contrast sensitivity; however, they cannot explain more complex perception such as figure/ground segregation. It has been reported that the majority of neurons in monkey V2 and V4 show BO selectivity: their responses change depending on which side of a border owns the contour [8]. Computational works have suggested that BO coding is determined based on the surrounding suppression/facilitation observed in early visual areas; thus, luminance contrast around the classical receptive field is crucial for the determination of BO [3, 4]. These models, however, do not reproduce the perception of BO for ambiguous figures, in which BO flips alternately. These previous studies led us to propose the following hypothesis: spatial attention alters contrast gain in early vision, and the increased contrast then modifies the activities of BO-selective neurons. Based on this hypothesis, we propose a network model consisting of mutually connected V1, V2, and PP modules. Top-down spatial attention from PP alters contrast gain in V1. The change in the contrast signal then modifies the activities of BO-selective neurons in V2, because BO is determined solely from the surrounding contrast. We carried out simulations of the model and corresponding psychophysical experiments to investigate the effect of attention on BO determination. The results of the simulations and the psychophysics show good agreement: the direction of figure was flipped by spatial attention in ambiguous stimuli. In addition, the activities of BO model cells were modified depending on the location of attention when Rubin's vase was provided.
These results suggest that the perception of the figure direction is altered when spatial attention functions in the early visual area. A number of previous studies have reported significant effects of attention in the visual areas V2 [9] and V4 [9, 10]; however, we focus on the effects in the early visual
Fig. 1. An illustration of the model architecture. This model consists of three modules, V1, V2 and PP, with mutual connections, except for PP to V2 pathway to avoid the direct influence of attention from PP to V2.
area, V1, to investigate bottom-up attention that biases BO-selective neurons. This bottom-up pathway seems to be crucial for the determination of BO, specifically for meaningless figures, because the latency of the BO signal is short [8] and the switch of figure is achieved automatically [2]. Needless to say, BO-selective neurons in V2 and V4 might be affected directly by attention; however, it is not straightforward to explain how they would alter the direction of BO. Here, we focus on the effects of attention in the early visual area that modulate afferent BO neurons automatically and rapidly.
2 The Model

In our model, spatial attention alters contrast gain in early vision, and the increased contrast modifies the activities of BO-selective neurons, which may underlie the switch of figure and ground. The model consists of three modules, V1, V2, and Posterior Parietal (PP), as illustrated in Fig. 1. Top-down and bottom-up pathways mutually link these modules, except for PP to V2; we excluded this connection to avoid a direct influence of attention on BO model cells. Each module consists of 100x100 model cells distributed retinotopically. In the absence of external input, the activity of a cell at time t, A(t), is given by

\[
\tau \frac{\partial A(t)}{\partial t} = -A(t) + \mu F(A(t)), \tag{1}
\]

where the first term on the right side is a decay, and the second term represents the excitatory, recurrent signals among the excitatory model cells. The non-linear function F(x) is given by

\[
F(x(t)) = \frac{1}{T_r - \tau \log\!\left(1 - \frac{1}{\tau x(t)}\right)}, \tag{2}
\]

where τ is a membrane time constant and T_r is the absolute refractory period. The dynamics of this equation, as well as appropriate values for the constants, have been widely studied [11].

2.1 V1 Module

The V1 module models the primary visual cortex, in which local contrast is extracted from an input stimulus and spatial attention modulates the contrast gain. The input image, Input, is a 124x124 pixel, gray-scale image with intensity values ranging between zero and one. The local contrast, C_{θω}(x,y,t), is extracted by convolving the image with a Gabor filter G_{θω}:

\[
C_{\theta\omega}(x,y,t) = Input(x,y) * G_{\theta\omega}(x,y), \tag{3}
\]

where the indices x and y are spatial positions and ω represents spatial frequency. The orientation θ was selected from 0, π/2, π, and 3π/2. The extracted contrast is modulated by spatial attention; thus, the contrast at the attended location is enhanced. The activity of a model cell in the V1 module, A^{V1}_{θωxy}, is given by
\[
\tau \frac{\partial A^{V1}_{\theta\omega xy}(t)}{\partial t} = -A^{V1}_{\theta\omega xy}(t) + I^{V1,E}_{\theta\omega xy}(t) + \mu F\!\left(A^{V1}_{\theta\omega xy}(t)\right) + I^{V1-V2}_{xy}(t) + I_o, \tag{4}
\]
where I^{V1-V2}_{xy} shows the feedback from V2 to V1, I_o is random noise, and μ represents a scaling constant. The local contrast, C_{θω}, is modulated by the feedback from PP to V1, I^{V1-PP}_{xy}, as given by the following equation [7]:

\[
I^{V1,E}_{\theta\omega xy}(t) = \frac{\left(C_{\theta\omega}(x,y,t)\right)^{\gamma I^{V1-PP}_{xy}(t)}}{S^{\delta I^{V1-PP}_{xy}(t)} + \left( \sum_{\theta\omega} \frac{1}{(2I+1)(2J+1)} \sum_{j=-J}^{J} \sum_{i=-I}^{I} C_{\theta\omega}(x+j, y+i, t) \right)^{\delta I^{V1-PP}_{xy}(t)}}, \tag{5}
\]

where S prevents the denominator from becoming zero, and γ and δ are constants. In the V1 module, spatial attention influences contrast gain; therefore, the contrast at the attended location is enhanced.

2.2 V2 Module
The V2 module models the BO-selective cells reported in V2, which determine the direction of BO. The activity of a BO model cell is determined based on the surrounding contrast signal extracted by the V1 module, as illustrated in Fig. 2 [3, 4]. Each BO model cell has a single excitatory region and a single inhibitory region, and its activity is modulated based on the location and shape of these surrounding regions. To reproduce a wide variety of BO selectivity, we implemented ten types of BO-left and BO-right model cells with distinct surrounding regions.
Fig. 2. A mechanism of BO determination [3, 4]. In the case of a BO-right cell, contrast signal in the excitatory surrounding region enhances the activity of the cell. In contrast, if contrast exists in the inhibitory region, the activity of the BO-left cell is suppressed. The dominant model cell owns the border; in this case, the BO-right cell owns it.
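A toy reading of this winner-take-all mechanism can be sketched as follows; the masks and weights are illustrative assumptions, and the paper's ten surround-region types are not modeled:

```python
import numpy as np

def bo_response(contrast, exc_mask, inh_mask, w_exc=1.0, w_inh=1.0):
    # contrast falling in the excitatory surround raises the cell's
    # activity; contrast in the inhibitory surround suppresses it
    return w_exc * contrast[exc_mask].sum() - w_inh * contrast[inh_mask].sum()

def owning_side(contrast, left_regions, right_regions):
    # the more active of the paired BO-left / BO-right cells owns the border
    l = bo_response(contrast, *left_regions)
    r = bo_response(contrast, *right_regions)
    return "left" if l > r else "right"

# contrast concentrated on the right of the border
contrast = np.zeros((5, 5)); contrast[:, 3:] = 1.0
right_exc = np.zeros((5, 5), bool); right_exc[:, 3:] = True
right_inh = np.zeros((5, 5), bool); right_inh[:, :2] = True
left_exc, left_inh = right_inh.copy(), right_exc.copy()
side = owning_side(contrast, (left_exc, left_inh), (right_exc, right_inh))
```

With the surrounding contrast on the right, the BO-right cell dominates, mirroring the situation depicted in Fig. 2.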
The activity of a BO-selective model cell is given by

\[
\tau \frac{\partial A^{V2,BO}_{xyN}(t)}{\partial t} = -A^{V2,BO}_{xyN}(t) + \mu F\!\left(A^{V2,BO}_{xyN}(t)\right) - \gamma F\!\left(A^{V2,inh}(t)\right) + I^{V2-V1,BO}_{xyN}(t) + I_o, \tag{6}
\]

where I^{V2-V1,BO}_{xyN} represents the afferent input from V1, the index BO shows left- or right-BO selectivity, and N represents the type of BO model cell, distinguished by its surround region. If BO-left model cells are more active than BO-right model cells, a figure is judged as located on the left side. The third term of the equation represents the input from an inhibitory cell that gathers signals from all model cells in the layer. The activity of the inhibitory V2 model cell is given by

\[
\tau \frac{\partial A^{V2,inh}(t)}{\partial t} = -A^{V2,inh}(t) + \mu F\!\left(A^{V2,inh}(t)\right) + \kappa \sum_{Nxy} F\!\left(A^{V2,BO}_{xyN}(t)\right), \tag{7}
\]
where κ is a constant. This inhibitory cell receives inputs from the excitatory neurons in V2 and inhibits them.

2.3 Posterior Parietal (PP) Module
The PP module encodes spatial location, with the aim of facilitating the processing of the attended location. The location of spatial attention is given explicitly in this module, which boosts the contrast gain at that location in the V1 module. The PP module receives afferent inputs from the V1 and V2 modules. The activity of an excitatory model cell in the PP module is given by

\[
\tau \frac{\partial A^{PP}_{xy}(t)}{\partial t} = -A^{PP}_{xy}(t) + \mu F\!\left(A^{PP}_{xy}(t)\right) - \gamma F\!\left(A^{PP,inh}(t)\right) + I^{PP,BT}_{xy}(t) + I^{PP,A}_{xy}(t) + I_o, \tag{8}
\]

where I^{PP,A}_{xy} represents the strength of attention with a Gaussian shape, which mimics top-down attention, the details of which are outside the focus of this model. I^{PP,BT}_{xy} represents the afferent inputs from the V1 and V2 modules to the PP module; this process can be considered a saliency map based on luminance contrast. When there is no top-down attention, the PP module is activated by the afferent signals from V1 and V2. The third term shows the input from an inhibitory PP model cell, whose activity is determined from the activities of all excitatory PP cells, as in the case of eq. (7). The PP module encodes spatial location and facilitates the processing in and around the attended location in V1. Note that spatial attention does not directly affect the BO-selective model cells in the V2 module, because we focus on the effect of spatial attention in early vision (V1).
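Before turning to the simulations, the V1 stage of eqs. (3) and (5) can be sketched as follows. This is a toy implementation, not the paper's code: the Gabor kernel, window sizes, and constants are illustrative, and the contrast is rectified so that the attention-dependent exponentiation stays real-valued:

```python
import numpy as np

def gabor(theta, omega=0.5, size=9, sigma=2.0):
    # an illustrative Gabor kernel G_{theta, omega}
    h = size // 2
    y, x = np.mgrid[-h:h + 1, -h:h + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(omega * xr)

def conv_same(img, ker):
    # eq. (3): C = Input * G, 2-D convolution with 'same' output size
    kh, kw = ker.shape
    ph, pw = kh // 2, kw // 2
    pad = np.pad(img, ((ph, ph), (pw, pw)))
    fk = ker[::-1, ::-1]  # flip the kernel for true convolution
    out = np.empty_like(img, dtype=float)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.sum(pad[r:r + kh, c:c + kw] * fk)
    return out

def gain_modulated(C, att, gamma=1.0, delta=1.0, S=1.0, I=2, J=2):
    # eq. (5): attention att = I^{V1-PP}_{xy} raises the exponents of the
    # rectified contrast and of the divisive pool of local mean contrast
    C = np.maximum(C, 0.0)  # rectification (our assumption)
    n_ch, H, W = C.shape
    pool = np.zeros((H, W))
    for ch in range(n_ch):
        p = np.pad(C[ch], ((J, J), (I, I)), mode="edge")
        for x in range(H):
            for y in range(W):
                pool[x, y] += p[x:x + 2*J + 1, y:y + 2*I + 1].mean()
    return C ** (gamma * att) / (S ** (delta * att) + pool ** (delta * att))

# contrast channels stacked over orientations; uniform attention here
img = np.random.default_rng(0).random((12, 12))
C = np.stack([conv_same(img, gabor(t)) for t in (0.0, np.pi / 2)])
att = np.ones((12, 12))
out = gain_modulated(C, att)
```

Raising `att` at the attended location increases the exponents there, which is the contrast-gain enhancement the model attributes to spatial attention.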
3 Simulation Results

We carried out simulations of the model with a variety of stimuli in order to test its characteristics in various situations. Specifically, we investigated whether the human perception of the direction of figure is reproduced in ambiguous
figures. First, we compare the simulation results with those of corresponding psychophysical experiments for ambiguous random-block figures (Fig. 3(a)). Second, we present an example of the simulation results for well-known ambiguous figures, specifically Rubin's vase (Fig. 3(b)).

3.1 Simulations and Psychophysics for Ambiguous Block Stimuli
First, we carried out simulations of the model with the block objects illustrated in Fig. 3(a). These block objects are ambiguous figures; either the right- or the left-hand object can be perceived as figure. Fig. 4 shows the simulation results for these ambiguous figures. Black and white bars represent the proportion with which the black or white object is perceived as figure, respectively. We carried out the simulations under three conditions: no attention, attending to the left object, and attending to the right object. By changing the location of spatial attention, the dominant population of BO model cells, either right or left, is switched. This result suggests that the perception of the figure direction is altered according to the attended location. To determine whether the model reproduces human perception, we carried out psychophysical experiments with settings similar to the simulations; the procedure is illustrated in Fig. 5. We presented to human subjects figures identical to those used in the simulations and measured how they perceived the BO direction. Fig. 6 shows the results of the psychophysical experiment. Subjects showed a tendency to perceive the attended object as figure. This result suggests that spatial attention influences the determination of the figure direction in these meaningless block figures. Because this tendency appeared similarly in the simulation results, we tested statistically whether there is a difference in the magnitude of the attention modulation between the model and the psychophysics. Here, we define the magnitude of the attention modulation as

\[
m = \frac{black(attn(black))}{black(attn(without))} + \frac{white(attn(white))}{white(attn(without))}, \tag{9}
\]

where m represents the magnitude of modulation, black(x) denotes the proportion with which the black object is perceived as figure under attention condition x (and white(x) likewise for the white object), and attn(y) specifies the attention condition. There was no significant difference in modulation magnitude between the model and the psychophysics (ANOVA: p = 0.6798).
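Eq. (9) can be computed directly from the measured proportions; the sample numbers below are illustrative, not the paper's data:

```python
def modulation_magnitude(black, white):
    """Eq. (9): black/white map an attention condition ('black', 'white',
    'without') to the proportion that the black (resp. white) object was
    perceived as figure."""
    return (black["black"] / black["without"]
            + white["white"] / white["without"])

# illustrative proportions only
black = {"black": 0.8, "white": 0.3, "without": 0.5}
white = {"black": 0.2, "white": 0.7, "without": 0.5}
m = modulation_magnitude(black, white)
```

A value of m above 2 indicates that attending to an object raised the proportion with which it was perceived as figure relative to the no-attention baseline.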
Fig. 3. Examples of stimuli. (a) Ambiguous, meaningless, random-block figures; either the black or the white random-block object may be perceived as figure. (b) Rubin's vase; a grey circle indicates the location and extent of the receptive field of the BO model cells.
The magnitude of the attention modulation of the model agrees with that of human perception. This result suggests that covert spatial attention could be a crucial factor in the modulation of figure direction.

3.2 Simulation Results for Ambiguous Figures: A Case of Rubin's Vase
We carried out simulations of the model with other ambiguous figures, including variations of the Necker cube, Rubin's vase (Fig. 3(b)), and others. As an example,
Fig. 4. The simulation results for ambiguous block figures. The bottom figures are the stimuli; ① and ② show the attended locations. Black and white bars exhibit the activities of BO-left and BO-right model cells, respectively. A set of three bars corresponds to three attention conditions: no attention (left), attention to area ① (middle), and attention to area ② (right). Arrows show the magnitude of the attention modulation.
Fig. 5. The procedure of the psychophysical experiment. The bottom panel presents the without-attention condition; the top panel shows the with-attention task. Three subjects were asked to report which object (left or right) was perceived as figure in a 2AFC paradigm, for 80 trials per condition.
Fig. 6. The measured human responses, with the same conventions as in Fig. 4. Error bars indicate the standard error.
Fig. 7. The simulation result for Rubin's vase. The bars show the proportions of BO-selective neurons; "X" icons indicate the location of attention. If BO-left neurons are more active than BO-right neurons at that location, the vase is considered as figure. The activities of the BO model cells are altered when the attended location is altered.
we show the results for Rubin's vase in this section. When we view Rubin's vase, we can perceive two objects alternately, a vase and two facing faces, but we cannot perceive both simultaneously [2]. If the direction of figure is to the left with respect to the border (indicated by a small circle in Fig. 3(b)), we perceive a vase; if to the right, we perceive a face. The simulation results for Rubin's vase are shown in Fig. 7. The results show that BO-left model cells are dominant regardless of the location of spatial attention. However, more than 10% of the BO model cells altered their sign when the attended location was altered: when attention is applied to the center of the image
(Fig. 7(b)), the activities of BO-left cells are increased. In contrast, when attention is applied to the right, on the face (Fig. 7(c)), the activities of BO-right cells are facilitated compared with the without-attention case (Fig. 7(a)). The early-level processing does not seem to evoke a full switch of figure in meaningful figures such as Rubin's vase, because higher-level vision might influence the processing of human faces, as described further in the Discussion. The results show that spatial attention changes the activities of BO-selective cells in a direction consistent with human perception, suggesting the importance of this modulation in early vision.
4 Discussion
Our hypothesis is that spatial attention alters contrast gain in early vision, and that the increased contrast then modifies the activities of BO-selective neurons, which may lead to the switch of figure/ground. We constructed a network model consisting of three modules, corresponding to the V1, V2, and PP cortical areas, together with mutual connections between them, including both bottom-up and top-down pathways but excluding PP to V2. We tested the model with ambiguous, random-block figures to examine whether it reproduces the attentional modulation seen in human perception. The model shows good agreement with the human psychophysics, in which subjects tended to perceive attended objects as figure. Our model reproduced this tendency: the activities of BO model cells flipped according to the location of spatial attention. This result suggests that spatial attention is a crucial factor in the modulation of figure direction, and that gain control in early vision plays an important role in this modulation.
Next, we tested the model with the Rubin's vase, one of the most famous ambiguous figures. Although our model showed a modulation of the activities of BO model cells depending on attention, it did not exhibit a flip of BO to the face side. It has been suggested that the processing of human faces is carried out by specialized neurons in higher visual areas such as IT and TEO. We suppose that feedback from such higher visual areas might modulate BO neurons strongly enough to flip the perception to the face. It should be noted that some degree of modulation from covert spatial attention still works in this particular example, perhaps because the pathway through early vision is automatic and independent of object familiarity or meaning. Our model predicts that spatial attention alters figure direction through a change in contrast sensitivity.
However, not only spatial attention but also feature-based and object-based attention should have a considerable effect on the determination of figure direction in human perception [12, 13]. Taking feature-based attention into account is expected to further our understanding of the perception of figure direction. Specifically, applying the model to the many ambiguous stimuli proposed by psychophysicists may be valuable for testing quantitatively the relation between features and BO determination. Our results provide essential and testable predictions for the fundamental problems of figure/ground segregation and attention.
References
1. Posner, M.I.: Orienting of attention. The Quarterly Journal of Experimental Psychology 32 (1980)
2. Wade, N.: The Art and Science of Visual Illusions (1982)
3. Carrasco, M., Ling, S., Read, S.: Attention alters appearance. Nature Neuroscience 7, 308–313 (2004)
4. Lee, D.K., Itti, L., Koch, C., Braun, J.: Attention activates winner-take-all competition among visual filters. Nature Neuroscience 2, 375–381 (1999)
5. Sakai, K., Nishimura, H.: Surrounding suppression and facilitation in the determination of border ownership. Journal of Cognitive Neuroscience 18, 562–579 (2006)
6. Nishimura, H., Sakai, K.: The computational model for border-ownership determination consisting of surrounding suppression and facilitation in early vision. Neurocomputing 65–66, 77–83 (2005)
7. Deco, G., Lee, T.S.: The role of early visual cortex in visual integration: a neural model of recurrent interaction. European Journal of Neuroscience 20, 1089–1100 (2004)
8. Zhou, H., Friedman, H.S., von der Heydt, R.: Coding of border ownership in monkey visual cortex. Journal of Neuroscience 20, 6594–6611 (2000)
9. Reynolds, J.H., Chelazzi, L., Desimone, R.: Competitive Mechanisms Subserve Attention in Macaque Areas V2 and V4. The Journal of Neuroscience 19, 1736–1753 (1999)
10. Reynolds, J.H., Pasternak, T., Desimone, R.: Attention Increases Sensitivity of V4 Neurons. Neuron 26, 703–714 (2000)
11. Gerstner, W.: Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation 12, 43–89 (2000)
12. Serences, J.T., Schwarzbach, J., Courtney, S., Golay, X., Yantis, S.: Control of Object-based Attention in Human Cortex. Cerebral Cortex 14, 1346–1357 (2004)
13. Mitchell, J.F., Stoner, G.R., Reynolds, J.H.: Object-based attention determines dominance in binocular rivalry. Nature 429, 410–413 (2004)
Effectiveness of Scale Free Network to the Performance Improvement of a Morphological Associative Memory without a Kernel Image Takashi Saeki and Tsutomu Miki Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Kitakyushu 808-0196, Japan
[email protected],
[email protected] http://www.lsse.kyutech.ac.jp
Abstract. In this paper, we present a new approach to the morphological associative memory (MAM) without a kernel image, reducing the network size by using a scale-free network. The MAM is more powerful than ordinary associative memories, but its weak point is that it requires a kernel image, which is susceptible to noise and hard to design. We have already presented a MAM without a kernel image as a practical model; however, that model has the drawback that its perfect recall rate is degraded. On the other hand, it has been reported that introducing a scale-free network into associative memories is effective for improving the recall rate and reducing the network size. We therefore try to reduce the network size and improve the recall rate by introducing a scale-free network. Keywords: Morphological associative memory, scale free, recall rate.
1 Introduction
Nowadays, complex networks are studied in various fields. Complex networks are huge real-world networks such as the Internet, human relationships, and the connections between brain cells. Complex networks commonly exhibit small-world, scale-free, and clustering behavior. Watts et al. defined the small-world network in 1998 [4], and recently the features of complex networks have attracted many researchers in different fields [1], [2], [3]. On the other hand, neural network models have been studied to implement human associative memory. Ordinary associative memories use the McCulloch-Pitts model. In contrast, morphological neural networks, which do not use the McCulloch-Pitts model, were derived from the theory of image algebra developed by Ritter (1990) [5]. Since then, several researchers have applied morphological neural networks to many applications [7], [8], [9], [10]. Ritter also proposed an associative memory network using the concept of morphological neural networks (MAM) (1998) [6]. The model was superior to ordinary associative memory models such as the Hopfield network in terms of computational cost, memory capacity, and perfect recall rate. Unfortunately, the model uses a kernel image, which is a disadvantage because the kernel image is difficult to calculate. The MAM has two memory
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 358–364, 2008. © Springer-Verlag Berlin Heidelberg 2008
matrices, so it needs twice the memory-matrix size of the Hopfield network. The MAM without a kernel image is inferior in noise tolerance to Ritter's model and needs twice the size of memory matrices compared with ordinary associative memories such as the Hopfield network. In this paper, we propose a new type of MAM with scale-free behavior in order to solve these problems. The proposed model shows superior performance to an ordinary fully connected MAM in terms of noise tolerance and the size of the memory matrices.
2 Scale Free Network
A scale-free network consists of nodes and links and has a power-law degree distribution: a few nodes connect to many other nodes and thus have large degree, while most nodes connect to only a few other nodes and have small degree. The BA (Barabási-Albert) model is a popular scale-free network model [2], and a complex network with scale-free behavior is easily designed using it. However, the BA model is characterized by network growth, whereas in reality many non-growing networks exist. Masuda et al. proposed the threshold graph, a non-growing scale-free model [1]. In this model, each node has a weight, and two nodes are connected by a link when the sum of their weights is greater than a threshold value. Scale-free networks designed by the threshold graph have the following features:
1. The degree distribution follows a power law.
2. Even as the network grows in scale, the average cluster coefficient does not decay to zero.
3. The cluster coefficient follows a power law.
These features are effective in reducing the size of the memory matrix.
3 Morphological Associative Memory: MAM
The MAM has a two-stage recall process, as illustrated in Fig. 1. If the two-stage recall process is regarded as one step, the MAM is an associative memory like any other. In Fig. 1, X and Y form a pair of the sth stored patterns; Y is the output pattern for the input pattern, which is either X or its corrupted version X̃.

[Figure: X̃ → 1st Pattern Recall → Ỹ → 2nd Pattern Recall → Y]
Fig. 1. Recall process of MAM
T. Saeki and T. Miki
Ordinary associative memories use learning in their memory process. The MAM, in contrast, stores patterns by computations employing two different types of memory matrices; the memory process of the MAM is therefore faster than that of ordinary associative memories. In this paper, the first stage is called "the first recall" and the latter "the second recall". The recall pattern is given by:
ỹ_j^r = ∨_{i=1..n} ( w_ji + x̃_i^r )    (1)

y_i^r = ∧_{j=1..m} ( m_ij + ỹ_j^r )    (2)

or

ỹ_j^r = ∧_{i=1..n} ( m_ji + x̃_i^r )    (3)

y_i^r = ∨_{j=1..m} ( w_ij + ỹ_j^r ) ,    (4)

where the pair of Eq. (1) and Eq. (2) is equivalent to that of Eq. (3) and Eq. (4). n and m represent the total numbers of units of an input pattern and an output pattern, respectively. x̃_j^r is the jth unit of the corrupted pattern x̃^r, and ỹ_j^r is an output of the first recall (and also an input of the second recall). y_i^r represents the ith unit of the output pattern y^r, where r indexes the stored patterns (r = 1, …, S). ∨ and ∧ are the maximum and minimum operators, respectively. The memory matrices w_ij and m_ij are calculated from the following equations (5) and (6):

w_ij = ∧_{r=1..S} ( y_i^r − x_j^r ) = ( y_i^1 − x_j^1 ) ∧ ( y_i^2 − x_j^2 ) ∧ … ∧ ( y_i^S − x_j^S )    (5)

m_ij = ∨_{r=1..S} ( y_i^r − x_j^r ) = ( y_i^1 − x_j^1 ) ∨ ( y_i^2 − x_j^2 ) ∨ … ∨ ( y_i^S − x_j^S )    (6)
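To make the max-plus/min-plus algebra concrete, here is a minimal NumPy sketch of the memory construction (Eqs. 5, 6) and the two-stage recall (Eqs. 1, 2). The function names and the tiny autoassociative patterns are our own illustration, not the authors' implementation:

```python
import numpy as np

def mam_store(X, Y):
    """Build the morphological memories from stored pairs.
    X: (S, n) input patterns, Y: (S, m) output patterns.
    Eq.(5): w_ij = min_r (y_i^r - x_j^r); Eq.(6): m_ij = max_r (y_i^r - x_j^r)."""
    D = Y[:, :, None] - X[:, None, :]          # D[r, i, j] = y_i^r - x_j^r
    return D.min(axis=0), D.max(axis=0)        # W (Eq. 5), M (Eq. 6)

def mam_recall(W, M, x):
    """Two-stage recall: max-plus with W (Eq. 1), then min-plus with M (Eq. 2)."""
    y_tilde = (W + x[None, :]).max(axis=1)     # ỹ_j = max_i (w_ji + x̃_i)
    return (M + y_tilde[None, :]).min(axis=1)  # y_i = min_j (m_ij + ỹ_j)

# Autoassociative example: noiseless inputs are recalled perfectly.
X = np.array([[1., 0., 1., 0.],
              [0., 1., 1., 0.],
              [1., 1., 0., 0.]])
W, M = mam_store(X, X)
assert all(np.array_equal(mam_recall(W, M, x), x) for x in X)
```

Note that no iterative learning is involved: both memories are computed in a single pass over the stored pairs, which is the source of the MAM's fast memory process.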
4 Scale Free Network Type MAM
Masuda et al. proposed a design method for scale-free networks using a threshold graph. Here, a scale-free network built with the threshold graph is applied to the fully connected MAM. The geometry of a scale-free network generated by the threshold graph is shown in Fig. 2, which is reproduced from Reference [1]. The threshold graph is constructed as follows:
Step 1. n nodes are given.
Step 2. A weight w_i is assigned to each node. The weight w_i is selected randomly from the probability distribution function f(w), given by

f(w) = e^(−w)    (7)
Step 3. For each pair of nodes i and j, node i is connected to node j if Eq. (8) is satisfied. Step 3 is carried out for all node pairs.

w_i + w_j ≥ θ    (8)

where the threshold value θ is a real number. Note that in this paper the MAM needs a connection from each node to itself for perfect recall, so self-connections are always retained.
Fig. 2. Geometry of scale free network [1]
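The three steps above can be sketched as follows; a minimal illustration under our own naming, using NumPy's exponential sampler for f(w) = e^(−w):

```python
import numpy as np

def threshold_graph(n, theta, seed=None):
    """Adjacency matrix of a non-growing scale-free network via the
    threshold graph of Masuda et al. [1]."""
    rng = np.random.default_rng(seed)
    w = rng.exponential(scale=1.0, size=n)   # Step 2: weights from f(w) = e^{-w}, Eq.(7)
    A = (w[:, None] + w[None, :]) >= theta   # Step 3: link i-j when w_i + w_j >= theta, Eq.(8)
    np.fill_diagonal(A, True)                # self-connections kept for perfect recall
    return A

A = threshold_graph(100, theta=1.0, seed=0)
```

A boolean mask like `A` could then be used to zero out the corresponding entries of the W and M memory matrices, which is what reduces their effective size.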
5 Experimental Results
In this paper, we investigate the perfect recall rate of the MAM that employs a scale-free network designed using the threshold graph with a suitable threshold value. Fig. 3 shows the stored patterns. Each pattern consists of 10 × 10 = 100 binary units; a unit takes '1' (black) or '0' (white). One of the stored patterns corrupted by random noise is used as the input pattern. Here, the noise is defined as flipping a unit from '1' to '0' or from '0' to '1', and affects no more than 10% of the total units of a pattern.
Fig. 3. Stored Patterns
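The noise model described above (flipping at most 10% of the 100 binary units) might be sketched as follows; the function name is our own, not from the paper:

```python
import numpy as np

def add_flip_noise(pattern, max_rate=0.10, seed=None):
    """Corrupt a binary pattern by flipping up to max_rate of its units
    ('1' -> '0' or '0' -> '1')."""
    rng = np.random.default_rng(seed)
    p = pattern.copy()
    n_flip = rng.integers(0, int(max_rate * p.size) + 1)   # at most 10% of units
    idx = rng.choice(p.size, size=n_flip, replace=False)   # distinct positions
    p.flat[idx] = 1 - p.flat[idx]                          # flip selected units
    return p

noisy = add_flip_noise(np.zeros((10, 10), dtype=int), seed=1)
```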
362
T. Saeki and T. Miki
10,000 trials are performed in the simulations. The noise-tolerance performance is estimated by the average perfect recall rate, and the network size by the average size of the memory matrices. Here, perfect recall is defined as recalling the stored pattern that corresponds to the noiseless version of the input. The size of the memory matrices is defined as the total size of the M and W matrices.
5.1 Decision of Insert Position of a Scale Free Network
The MAM has a two-stage recall process, so a scale-free network can be inserted at three locations: the first stage, the second stage, or both stages. The effective insertion position is determined by comparing the noise tolerance of the three variants. Fig. 4 shows the noise tolerance of the MAM with the scale-free network introduced in the first stage, the second stage, and both stages. Here, the threshold value is 1.
[Figure 4: perfect recall rate [%] versus noise rate [%] (0–10), with curves for Normal, First, Second, and Both]
Fig. 4. Difference of Noise Tolerance of the MAM with Different Inserting Position of a Scale Free Network: "Normal" means the noise tolerance of the fully connected MAM, "First" means introducing the scale free network in the first stage, "Second" in the second stage, and "Both" in both stages.

Table 1. Noise Tolerance and Size of Memory Matrices

                                Normal    First Stage    Second Stage    Both Stages
Average of Recall Rate [%] *     30.7        34.9            34.5           34.8
Noise Tolerance                   ―          4%↑             4%↑            4%↑
Size of Memory Matrices         20000       10291           10291            578

* Noise tolerance is measured relative to the Normal case, for noise rates under 10%.
Next, the effect on the size of the memory matrices is investigated. Table 1 shows the improvement in noise tolerance and the reduction in memory-matrix size of the MAM for the different insertion positions of the scale-free network. The results show little difference in noise-tolerance improvement among the three positions, but all are superior to the normal MAM. The ordinary MAM has two fully connected networks, whereas the MAM with the scale-free network in both stages has two sparsely connected networks; the size of the memory matrices is drastically reduced by introducing the scale-free network in both stages.
5.2 Evaluation of MAM with Scale Free Network in Both Stages
The construction of the scale-free network is controlled by the threshold value. We therefore examine how different threshold values affect the MAM employing the scale-free network in both stages. Table 2 shows the noise tolerance and the size of the memory matrices of this MAM for different threshold values.

Table 2. Noise Tolerance and Size of Memory Matrices for different threshold values
Threshold θ                      2        1       0.1      0.01
Average of Recall Rate [%] *    34.2     34.8     33.7     36.5
Size of Memory Matrices          202      578     1334     2128

* Recall rate is averaged over noise rates under 10%.
As Table 2 shows, there is little difference in noise tolerance among the threshold values; we conclude that the threshold value has no significant effect on the noise tolerance.
6 Conclusion
In this paper, we proposed a new MAM model using a scale-free network. The performance of the proposed model was confirmed by autoassociation experiments on alphabet patterns; the insertion position and the threshold value were determined by simulation. As a result, the proposed model improves the perfect recall rate by about 4% and drastically reduces the size of the memory matrices, by 97%, compared with the ordinary fully connected MAM. Designing the weight distribution to improve noise tolerance and gathering experimental evidence under more varied conditions are left as future work.
Acknowledgments. This work was supported in part by the 21st Century Center of Excellence Program "World of Brain Computing Interwoven out of Animals and Robots (PI: T. Yamakawa)", granted in 2003 to the Department of Brain Science and Engineering (Graduate School of Life Science and Systems Engineering), Kyushu Institute of Technology, by the Japan Ministry of Education, Culture, Sports, Science and Technology.
References
1. Masuda, N., Miwa, H., Konno, N.: Phys. Rev. E 70, 036124 (2004)
2. Barabási, A.-L., Albert, R.: Science 286, 509 (1999)
3. Albert, R., Barabási, A.-L.: Rev. Mod. Phys. 74, 47 (2002)
4. Watts, D., Strogatz, S.: Nature 393 (1998)
5. Ritter, G.X., Wilson, J.N., Davidson, J.L.: Comput. Vision Graphics Image Processing 49(3) (1990)
6. Ritter, G.X., Sussner, P., Diaz-de-Leon, J.L.: IEEE Trans. Neural Networks 9(2) (1998)
7. Won, Y., Gader, P.D.: Proceedings of the 1995 IEEE International Conference on Neural Networks, Perth, Australia, November 1995, vol. 4 (1995)
8. Won, Y., Gader, P.D., Coffield, P.: IEEE Trans. Neural Networks 8(5) (1997)
9. Davidson, J.L.: Image Algebra and Morphological Image Processing III. In: Proceedings of SPIE, San Diego, CA, July 1992, vol. 1769 (1992)
10. Davidson, J.L., Hummer, F.: IEEE Systems Signal Processing 12(2) (1993)
Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction
Cheng-Hung Chuang 1, Jiun-Wei Liou 2,3,*, Philip E. Cheng 2, Michelle Liou 2, and Cheng-Yuan Liou 3
1 Dept. of Computer Science and Information Eng., Asia University, Taichung, Taiwan
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
3 Dept. of Computer Science and Information Eng., National Taiwan Univ., Taipei, Taiwan
[email protected]
Abstract. This paper presents an application of a self-organizing map (SOM) model based on the image intensity gradient to the reconstruction of the cerebral cortex from MR images. Cerebral cortex reconstruction is important for much brain-science and medical research; however, it is difficult to extract deep cortical folds. In our method, we apply the SOM model based on the image intensity gradient to deform the easily extracted white matter surface and extract the cortical surface. The intensity gradient vectors are calculated from the intensities of the image data, so the proper cortical surface can be extracted from the image information itself rather than from artificial features. Simulations on T1-weighted MR images show that the proposed method reconstructs the cerebral cortex robustly. Keywords: Self-organizing map, brain science, cortical surface reconstruction, deformable surface models, active surface models, gradient vector field (GVF).
1 Introduction
Recently, thanks to advanced magnetic resonance imaging (MRI) techniques, MR images, which offer high spatial resolution and soft-tissue contrast, have great potential for use in the analysis of cognitive neuroscience, diseases (e.g., epilepsy, schizophrenia, Alzheimer's disease), and research into the anatomical structures of human brains in vivo. Modern anatomical MRI studies of human brains have concentrated on the cerebral cortex, a thin and folded layer of gray matter on the cerebral surface that contains dense neurons controlling high cortical functions such as language and information processing. Many studies have shown that cortical thickness decreases or changes in association with neurodegenerative diseases and psychiatric disorders, e.g., schizophrenia, Alzheimer's disease [1], and autism [2]. Cerebral cortex reconstruction is an on-going research field that can help other research, such as brain mapping, explore the human brain. Generally speaking, three major brain tissues can be approximately partitioned and segmented in human brains, i.e., gray matter (GM), white matter (WM),
* Corresponding author.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 365–373, 2008. © Springer-Verlag Berlin Heidelberg 2008
366
C.-H. Chuang et al.
and cerebral spinal fluid (CSF). The cerebral cortex is the thin and folded layer between the GM/WM and GM/CSF interfaces [3, 4]. Cortex reconstruction means extracting the GM/WM and GM/CSF boundaries and rebuilding the surface. Many methods have been proposed in the literature to solve this problem [1–5]; they can be roughly categorized into stochastic and morphological types. The methods using a stochastic model [1, 4] employ labeled cortical mantle distance maps or intensity distance histograms related to the GM/WM interface, so that explicit extraction of the GM/CSF interface is unnecessary. The morphological methods [2, 3, 5], on the other hand, apply dilation of the GM/WM interface to extract the accurate GM/CSF boundary surface, since the obvious GM/WM interface is easily extracted. In the latter, the cortex is usually regarded as a double-surface structure [3]: the exterior surface, following the interior surface, is deformed under some constraints to find the GM/CSF interface. The deformable surface model is a powerful method for reconstructing the cortical surface. However, it is difficult to deform the surface with correct topology inside the highly folded and buried cortex in the presence of image noise. In [5], a deformable surface model based on the gradient vector flow (GVF) [6] is proposed to overcome this problem. The model, which minimizes an energy function that is a weighted combination of internal and external energy, is basically a three-dimensional (3-D) snake model: the internal energy controls the continuity force of the surface itself, while the external energy governs the attraction force, such as image gradients, that leads the surface. Nevertheless, the computational cost of minimizing the energy function grows as the number of points on the surface increases. In [7], the self-organizing map (SOM) model [8–10] plus a layered distance map (LDM) is applied to deform the GM/WM interface to find the GM/CSF interface in segmented MR images.
The layered distance map is calculated from the extracted WM surface and the segmented GM. Unfortunately, the distance values inside the sulci are usually symmetric, which does not match the real cortical thickness. In this paper, we propose an improved method in which a SOM model based on the image intensity gradient is applied to deform the GM/WM interface to find the cortical surface in segmented MR images. In the simulations and experiments, a two-dimensional (2-D) T1-weighted MR image and 3-D T1-weighted MRI data are used for testing. In contrast with the results of the previous method [7], our new studies on T1-weighted MRI data show that the proposed method yields more precise reconstructions of the cerebral cortex.
2 The Problem
In the cerebral cortex there are many narrow and deep fissures called sulci. These concave parts sometimes contain invisible or unrecognizable CSF, which makes the reconstruction of the cerebral cortex laborious. To illustrate the problem, a T1-weighted MR image is shown in Fig. 1(a). This image shows a sulcus structure, i.e., the region of interest (ROI). Its segmented image is shown in Fig. 1(b), where the white region represents WM, the gray region GM, and the black region CSF, other tissues, and background. It is easy to extract the boundaries of the segmented GM, as shown in Fig. 1(c). The ROI image is then enlarged, as shown in Fig. 1(d), and equalized to clearly display the
ideal boundary, as shown in Fig. 1(e). The extracted and ideal boundaries in Fig. 1(c) and Fig. 1(f) are obviously different, and an interval should exist within the sulcus. If the inner (WM) boundary line is deformed outward to extract the outer (GM) boundary according to the segmentation results, it will probably fail to catch the interval, owing to the few extractable features inside the sulcus.
Fig. 1. Illustration of a missing interval within a sulcus in a T1-weighted MR image: (a) the raw image and the ROI, (b) its segmented image and the ROI, (c) extracted boundary of ROI, (d) magnified ROI, (e) equalized ROI image with ideal boundary, (f) ideal boundary of ROI.
3 Methods
Since it is difficult to partition tissues inside sulci, the ideal GM surface is also hard to extract. One popular approach is to dilate the GM/WM interface to extract the GM/CSF boundaries, because the GM/WM boundaries are obvious and easily extracted. In our former method, this dilation strategy is assisted by a layered distance map (LDM) [7]. The LDM is formed by computing the connected voxels layer by layer from the inner to the outer surface within the segmented GM. Therefore, if the sulci are full of GM, the distance values of the LDM inside the sulci are symmetric and the detected interval always lies in the center. To overcome this problem, the image intensity is employed in the new method. In a T1-weighted MR image, the intensities of WM, GM, and CSF are high, middle, and low, respectively. The GVF [6] is applied to obtain the intensity gradient flow between these three tissues. However, intensity inhomogeneity should be removed prior to GVF processing. Fortunately,
Su et al. [11] proposed a wavelet-based bias correction method that is independent of tissue classes and computationally simple.
3.1 The Image Intensity Gradient
Fig. 2 demonstrates the principle of boundary extraction inside the sulci using the GVF. The figures show two types of sulcus profile, where the curves represent the intensity profile: the horizontal axis indicates position and the vertical axis intensity. GM, WM, and CSF are segmented, and arrows denote gradient vectors. In Fig. 2(b), an interval should exist, but it cannot be detected by the segmentation. The cortical thickness a is not equal to b, i.e., the structure is asymmetric and the LDM cannot find the ideal boundary. However, from the direction of the gradient vectors, the most probable boundary (the long dashed line in Fig. 2(b)) can be easily extracted.
Fig. 2. Illustration of boundary extraction inside the sulci by using GVF: (a) the sulcus profile with a clear interval, (b) the sulcus profile with an obscure interval
The intensity GVF is computed from the bias-corrected MR images, defined as I(x, y, z). The gradient of the 3-D negative intensity map is defined first. Then the intensity GVF field is defined as the vector field V(x, y, z) = [u(x, y, z), v(x, y, z), w(x, y, z)] that minimizes the energy functional [6]

ε = ∫∫∫ μ|∇V|² + |∇f|² |V − ∇f|² dx dy dz ,    (1)

where f represents the 3-D negative intensity map, i.e. f = −I, μ is a weighting parameter, and ∇ is the gradient operator, i.e.

∇ = ( ∂/∂x , ∂/∂y , ∂/∂z ) .    (2)
In Eq. (1), the first term controls the smoothing function while the second term governs the maintaining capability. The formulation of computing V by iteratively minimizing ε is given in [6]. The final iterative solution is
⎧ u n +1 − u n = μ∇ 2 u n − (u n − f x )( f x2 + f y2 + f z2 ) ⎪ n +1 n 2 n n 2 2 2 ⎨ v − v = μ∇ v − (v − f y )( f x + f y + f z ) , ⎪ wn +1 − wn = μ∇ 2 wn − ( wn − f )( f 2 + f 2 + f 2 ) z x y z ⎩
(3)
Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction
▽
369
2
where is the Laplacian operator, n is the iteration number, and fx, fy, and fz are partial derivatives with respect to x, y, z, respectively. The initial conditions are set with
V 0 = (u 0 , v 0 , w0 ) = ∇f = ( f x , f y , f z )
(4)
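The explicit update of Eq. (3), starting from the initial condition of Eq. (4), can be sketched as follows. This is our own minimal illustration, assuming SciPy's `scipy.ndimage.laplace` for the discrete ∇² and using the parameter values reported later in the paper (μ = 0.2, n = 2):

```python
import numpy as np
from scipy.ndimage import laplace

def gvf3d(f, mu=0.2, n_iter=2):
    """Diffuse the gradient of the negative intensity map f = -I into a
    gradient vector flow field V = (u, v, w) by iterating Eq. (3)."""
    fx, fy, fz = np.gradient(f)
    mag2 = fx**2 + fy**2 + fz**2               # f_x^2 + f_y^2 + f_z^2
    u, v, w = fx.copy(), fy.copy(), fz.copy()  # initial condition, Eq. (4)
    for _ in range(n_iter):                    # explicit update, Eq. (3)
        u += mu * laplace(u) - (u - fx) * mag2
        v += mu * laplace(v) - (v - fy) * mag2
        w += mu * laplace(w) - (w - fz) * mag2
    return u, v, w

# Toy negative-intensity volume for demonstration.
f = -np.random.default_rng(0).random((8, 8, 8))
u, v, w = gvf3d(f)
```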
Basically, the main goal of this iterative process is to diffuse the gradient information so as to reduce image noise and form the vector flow over the whole volume.
3.2 The Intensity Gradient Self-organizing Map (IGSOM) Model
The SOM model, which realizes a nonlinear, ordered, and smooth mapping, is an effective algorithm for mapping between the neuron model and input data sets. In our application, the SOM model is used to deform the GM/WM interface to find the GM/CSF interface. The boundary voxels of GM/WM are defined as the neuron data set. The SOM model driven by the intensity gradient, called IGSOM, is applied to move the neuron data within the segmented GM. Finally, the GM/CSF interface is reconstructed from the converged neuron data set. The proposed IGSOM model is described in the following. First, the deformable mesh network M must be initialized; here we use the isosurface of the GM/WM interface. The mesh network M is a deformable surface inside the segmented GM, and the segmented GM is defined as a volume set G. The best matching function of the SOM is modified and rewritten as a point function P(·), i.e.

P(m_j) = m_j + V(m_j),  m_j ∈ G and m_j ∈ M ,    (5)

where m_j is a randomly selected mesh network node, j is the index of the mesh node, and V(m_j) is the GVF at node m_j. The update function of the SOM is

m_j′(k + 1) = m_j′(k) + α(k) H(D, k) [P(m_j) − m_j′(k)] ,    (6)

where m_j′ (including m_j itself) ranges over all neighbors of m_j, k = 0, 1, 2, … is the iteration number, α(k) ∈ [0, 1) is the learning rate, and H is the neighborhood function, which decreases as the distance metric D and the iteration k increase. The update function is iterated until the average variation of the input data falls below a threshold value. A Gaussian function is usually used as the smoothing kernel, i.e.

H(D, k) = exp( −D / (2σ²(k)) )    (7)

where σ(k) is the standard deviation, i.e. the width of the smoothing kernel. The distance metric D ≡ D(j′, j) is the distance from m_j′ along the surface to m_j. When j′ equals j, i.e. m_j′ = m_j, D is zero and the neighborhood function H attains its maximal value 1.
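One IGSOM update step (Eqs. 5–7) can be sketched as follows. This is our own illustration: `neighbors_of`, `gvf_at`, and `dist` are hypothetical callbacks standing in for the mesh topology, the GVF interpolation, and the on-surface distance metric, and the learning rate and kernel width are held fixed for simplicity:

```python
import numpy as np

def igsom_step(nodes, neighbors_of, gvf_at, dist, alpha=0.1, sigma=1.0, seed=None):
    """Pick a random mesh node j, form the target P(m_j) = m_j + V(m_j)
    (Eq. 5), and move j and its neighbors toward it, weighted by the
    Gaussian neighborhood H(D) = exp(-D / (2 sigma^2)) (Eqs. 6, 7)."""
    rng = np.random.default_rng(seed)
    j = rng.integers(len(nodes))
    target = nodes[j] + gvf_at(nodes[j])                # Eq. (5)
    for jp in neighbors_of(j):                          # neighbors include j itself
        H = np.exp(-dist(jp, j) / (2.0 * sigma ** 2))   # Eq. (7)
        nodes[jp] += alpha * H * (target - nodes[jp])   # Eq. (6)
    return nodes

# Toy usage: 5 mesh nodes at the origin, a constant GVF pointing in +x,
# and a trivial topology where each node's only neighbor is itself.
nodes = np.zeros((5, 3))
nodes = igsom_step(nodes,
                   neighbors_of=lambda j: [j],
                   gvf_at=lambda p: np.array([1.0, 0.0, 0.0]),
                   dist=lambda a, b: 0.0,
                   seed=0)
```

In the toy run, the single selected node moves a fraction alpha of the way toward its GVF-displaced target while all other nodes stay put.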
4 Simulation and Experiment
The first experiment uses a 2-D 181×217-pixel T1-weighted MR image with its expert segmentation result, as shown in Figs. 3(a) and 3(b). In our method, the parameter μ is set to 0.2 and the GVF iteration number n to only 2, because the MR image is clear and fewer iterations preserve the original gradient information. The update iteration count in the IGSOM model is about 100 in this case. The learning rate and the standard deviation of the Gaussian function are fixed at 0.1 and 0.6, respectively. From the expert partition result, the boundary pixels of the GM/WM (inner) and GM/CSF (outer) interfaces can be extracted, as shown in Figs. 3(c) and 3(d), respectively. In Fig. 3(d), many sulci have lost their intervals. Figs. 3(e) and 3(f) show the experimental results using the SOM+LDM model [7] and the IGSOM model, respectively. Roughly, the results are similar to each other. However, some intervals inside the sulci are over-detected in the result of the earlier method; some are marked with arrows in Fig. 3(e). These defects cause extra and deeper sulci to spread over the cortex.
Fig. 3. Experiments on a 2-D 181x217-pixel T1-weighted MR image: (a) the raw image, (b) the expert segmentation image, (c) the inner boundary image, (d) the outer boundary image, (e) the result by the SOM+LDM model, (f) the result by the IGSOM model
The experiment is then extended to 3-D 181×217×181-voxel T1-weighted MRI data. The cerebral cortex is reconstructed separately using the SOM+LDM model [7] and the IGSOM model. The GVF parameter μ and the iteration number n are the same as in the former experiment. The update iteration count in the IGSOM model is about 200 in this experiment. The learning rate and the standard deviation of the Gaussian function are again fixed, at 0.1 and 1.0, respectively. The number of mesh network nodes is 35,851. The six views (top, bottom, left, right, front, and back) of the reconstructed 3-D cortical surface are shown in Fig. 4, where the figures in the left column show the result of the SOM+LDM model and those in the right column that of the IGSOM model. Over-detection of sulci is revealed in the result of the SOM+LDM model, i.e., the left column of Fig. 4. The error is measured by the average nearest distance between the mesh vertices and the points of the segmented GM outer surface. The maximal, minimal, and mean distances are 2.35, 0.0013, and 0.42 mm for the SOM+LDM model, and 2.32, 0.0011, and 0.42 mm for the IGSOM model. Although both mean distances are about the same, the maximal and minimal distances of the SOM+LDM model are larger than those of the IGSOM model.
Fig. 4. Experiments of cortex reconstruction with 3-D T1-weighted MRI data: (a)(c)(e)(g)(i)(k) results by the SOM+LDM model, (b)(d)(f)(h)(j)(l) results by the IGSOM model, (a)(b) top view, (c)(d) bottom view, (e)(f) left view, (g)(h) right view, (i)(j) front view, (k)(l) back view
372
C.-H. Chuang et al.
Fig. 4. (continued)
5 Conclusions In this paper, the SOM model based on the image intensity gradient is applied to reconstruct the human cerebral cortex from MRI data. The method provides a good
capability to deform the GM/WM surface outward to find the GM/CSF surface; thus the asymmetric intervals inside sulci can be detected automatically. The only requirement is that the intensity bias of the MRI data be corrected beforehand. The gradient vectors computed from the image intensity help the SOM model deform the GM/WM surface to the proper position of the GM/CSF surface. With this method, even the highly folded 3-D GM/CSF surface can be reconstructed. However, in the 3-D case, one disadvantage is the heavy computational cost when the image size is huge or the number of mesh vertices is large. Improving the computational efficiency is therefore future work, as is systematically evaluating the accuracy of the reconstructed cortical surface. Acknowledgments. The 3-D MRI data presented in this paper come from the McConnell Brain Imaging Centre at McGill University.
References
1. Miller, M.I., Hosakere, M., Barker, A.R., Priebe, C.E., Lee, N., Ratnanather, J.T., Wang, L., Gado, M., Morris, J.C., Csernansky, J.G.: Labeled Cortical Mantle Distance Maps of the Cingulate Quantify Differences between Dementia of the Alzheimer Type and Healthy Aging. Proc. Natl. Acad. Sci. 100, 15172–15177 (2003)
2. Chung, M.K., Robbins, S.M., Dalton, K.M., Davidson, R.J., Alexander, A.L., Evans, A.C.: Cortical Thickness Analysis in Autism with Heat Kernel Smoothing. NeuroImage 25, 1256–1265 (2005)
3. MacDonald, D., Kabani, N., Avis, D., Evans, A.C.: Automated 3-D Extraction of Inner and Outer Surfaces of Cerebral Cortex from MRI. NeuroImage 12, 340–356 (2000)
4. Barta, P., Miller, M.I., Qiu, A.: A Stochastic Model for Studying the Laminar Structure of Cortex from MRI. IEEE Trans. Med. Imaging 24, 728–742 (2005)
5. Xu, C., Pham, D.L., Rettmann, M.E., Yu, D.N., Prince, J.L.: Reconstruction of the Human Cerebral Cortex from Magnetic Resonance Images. IEEE Trans. Med. Imaging 18, 467–480 (1999)
6. Xu, C., Prince, J.L.: Snakes, Shapes, and Gradient Vector Flow. IEEE Trans. Image Processing 7, 359–369 (1998)
7. Chuang, C.H., Cheng, P.E., Liou, M., Liou, C.Y., Kuo, Y.T.: Application of Self-Organizing Map (SOM) for Cerebral Cortex Reconstruction. International Journal of Computational Intelligence Research 3, 26–30 (2007)
8. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin (2001)
9. Liou, C.Y., Tai, W.P.: Conformal Self-Organization for Continuity on a Feature Map. Neural Networks 12, 893–905 (1999)
10. Liou, C.Y., Tai, W.P.: Conformality in the Self-Organization Network. Artificial Intelligence 116, 265–286 (2000)
11. Su, H.-R., Liou, M., Cheng, P.E., Aston, J.A.D., Chuang, C.H.: MR Image Segmentation Using Wavelet Analysis Techniques. NeuroImage 26(1) (2005), Human Brain Mapping 2005 Abstract
Feature Subset Selection Using Constructive Neural Nets with Minimal Computation by Measuring Contribution

Md. Monirul Kabir 1, Md. Shahjahan 3, and Kazuyuki Murase 1,2

1 Department of Human and Artificial Intelligence Systems, Graduate School of Engineering
[email protected]
2 Research and Education Program for Life Science
University of Fukui, Bunkyo 3-9-1, Fukui 910-8507, Japan
[email protected]
3 Department of Electrical and Electronic Engineering, Khulna University of Engineering and Technology, Khulna 9203, Bangladesh
[email protected]
Abstract. In this paper we propose a new approach to selecting a feature subset based on the contribution of input attributes in a three-layered feedforward neural network (NN). Three techniques are integrated in this method: constructive training, contribution measurement, and backward elimination. Initially, to determine a minimal NN architecture, the number of hidden neurons is determined by a constructive approach. After that, one-by-one removal of input attributes is performed to compute their contributions. Finally, a sequential backward elimination is used to generate a relevant feature subset from the original input space; the elimination process continues as long as a criterion is satisfied. To evaluate the proposed method, we applied it to four real-world benchmark problems. Experimental results confirmed that the proposed method significantly reduces irrelevant features without degrading network performance and generates feature subsets with good generalization ability. Keywords: Feature subset selection, neural network, classification, contribution.
1 Introduction
Feature subset selection (FSS) from a huge set of input attributes is a crucial task for the success of a neural network (NN). Selecting an effective model and then obtaining more descriptive data is essential for achieving good classification performance with an NN. Since the wrapper model generally outperforms the filter model [1][2], numerous attempts have been made to solve the FSS problem with wrapper approaches based on different criteria [3]-[9]. An extensive discussion of the wrapper model is given by Kohavi [3]. This approach is capable of producing a minimal set of input variables, but its cost grows exponentially in the face of many irrelevant variables. A better solution for FSS using an NN has also been reported by Setiono [4],
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 374–384, 2008. © Springer-Verlag Berlin Heidelberg 2008
Feature Subset Selection Using Constructive Neural Nets
375
but it failed to reduce the computational cost because it used a large fixed number of hidden neurons (HNs). The concept of contribution was implemented for FSS by Milne [5] and Guan [6]. Milne computes the contribution of each attribute (feature) from the weights. In contrast, Guan trains each input attribute separately in a separate constructive NN model, which again fails to reduce the computational cost. Two other solutions for FSS can be found in [7][8]. According to the above discussion, none of these works except [5] and [6] has addressed FSS by measuring contribution, either in a backward search model or in a constructive approach. Since backward search generally performs better than forward search [2], in this paper we introduce a method called contribution-based feature subset selection (CFSS) that combines three techniques: constructive training, contribution measurement, and backward elimination. It is known that the constructive approach can achieve both a compact architecture and better generalization [10]. We therefore initially use a constructive algorithm, starting with a minimal network and adding new hidden-layer neurons during training. After that, in order to find the contribution of individual inputs, one-by-one removal of input attributes is performed. In the latter part, a heuristic search, namely sequential backward elimination (BE), is used with four new criteria: a) double elimination, b) backtracking, c) single elimination, and d) subset validation. Incorporating these four steps into BE makes subset generation fast and accurate. In order to evaluate CFSS, we tested it on four real-world benchmark problems: the breast cancer, diabetes, glass, and iris problems. The experimental results show that CFSS works well on real-world classification problems and produces a compact NN architecture. The remainder of this paper is organized as follows.
Section 2 describes CFSS in details. Section 3 presents the results of experimental analysis. A short discussion and conclusions are given in Section 4 and Section 5 respectively.
2 The Proposed Method
In feature selection, wrapper approaches suffer from two major problems: high computational cost [1], and instability due to randomly changing weights [9]. In order to remedy mainly the first problem, we emphasize a compact network architecture: if the network grows larger, computation becomes expensive and network performance degrades. We therefore implemented a constructive algorithm in CFSS, and the combination of the three techniques, constructive training, contribution measurement, and backward elimination (BE), makes CFSS complete and efficient. Before describing the actual algorithm, we define some parameters used in it, such as the stopping criteria and the measurement of contribution.
2.1 Stopping Criteria (SC)
In this method we use an NN trained by the back-propagation learning (BPL) algorithm [11]. New criteria are adopted for stopping the training process early in order to improve generalization ability and reduce training cost. Three
stopping criteria used in CFSS are summarized in this section. The details of the training process and the SC are available in [12].
2.1.1 Stopping Criterion for Adding Hidden Neurons (HNs)
During the course of adding HNs, the SC is adopted as E_min > E_p-min. That is, when the last HN is added, the minimum training error E_min should be larger than the previously achieved minimum training error E_p-min.
2.1.2 Stopping Criterion for Backward Elimination (BE)
2.1.2.1 Calculation of Error Reduction Rate (ERR). During the BE process, we need to calculate the ERR as

ERR = -(E'_min - E_min) / E_min,

where E'_min is the minimum training error during backward elimination.
2.1.2.2 Stopping Criterion in Double Feature Deletion. During the course of training, irrelevant features are deleted two at a time in BE; the SC is adopted as ERR <= th AND CA' >= CA. Here, th refers to a threshold value equal to 0.05, and CA' and CA refer to the classification accuracies during BE and before BE, respectively.
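Written out, the ERR definition and the double-deletion stopping test look like this (a minimal sketch; function and argument names are ours, and the numeric values are invented):

```python
def error_reduction_rate(e_min, e_min_be):
    """ERR = -(E'_min - E_min) / E_min, where E'_min is the minimum
    training error during backward elimination."""
    return -(e_min_be - e_min) / e_min

def continue_double_deletion(e_min, e_min_be, ca_be, ca, th=0.05):
    """Double deletion continues while ERR <= th AND CA' >= CA."""
    return error_reduction_rate(e_min, e_min_be) <= th and ca_be >= ca

# error rose slightly but accuracy did not drop: keep deleting
ok = continue_double_deletion(e_min=0.10, e_min_be=0.11, ca_be=0.98, ca=0.97)
```

The single-deletion test of the next subsection is identical except for th = 0.08.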
2.1.2.3 Stopping Criterion in Single Feature Deletion. During the course of training, irrelevant features are deleted one at a time in BE; the SC is adopted as ERR <= th AND CA' >= CA, where th refers to a threshold value equal to 0.08.
2.2 Measurement of Contribution
Finding the contribution of the input attributes accurately ensures the effectiveness of CFSS. Therefore, during the one-by-one removal training, we measure the minimum training error E^i_min obtained when the ith feature is removed. The contribution of each feature is then calculated as follows. The training error difference for removing the ith feature is

E^i_diff = E^i_min - E_min,

where E_min is the minimum training error using all features. The average training error difference is

E_avg = (1/N) * sum_{i=1}^{N} E^i_diff,

where N is the number of features. The percentage contribution of the ith feature is then

Con_i = 100 * (E_avg - E^i_diff) / E_avg.

The rank of the features can now be decided according to Con_max = max(Con_i). The worst-ranking features are treated as irrelevant to the network, as they provide comparatively little contribution.
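The contribution measure translates directly into code (a sketch with our own names; the toy error values are invented for illustration):

```python
import numpy as np

def contributions(e_min, e_min_removed):
    """Percentage contribution Con_i of each feature from one-by-one
    removal training.  e_min is the minimum training error with all
    features; e_min_removed[i] is the minimum training error obtained
    with feature i removed."""
    e_diff = np.asarray(e_min_removed) - e_min   # E^i_diff
    e_avg = e_diff.mean()                        # E_avg = (1/N) sum E^i_diff
    return 100.0 * (e_avg - e_diff) / e_avg      # Con_i

# invented toy errors for three features
con = contributions(0.10, [0.12, 0.10, 0.18])
```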
2.3 Calculation of Network Connections
The number of connections C of the final network architecture can be calculated from the number of remaining input attributes x, the number of remaining hidden units h, and the number of outputs o as C = (x × h) + (h × o) + h + o.
2.3.1 Reduction of Network Connection (RNC)
Here we estimate how many connections have been removed due to feature selection. We first calculate the total numbers of connections of the network before and after BE, C_before and C_after, respectively, and then estimate

RNC = 100 * (C_before - C_after) / C_before.
2.3.2 Increment of Network Accuracy (INA)
Along with the reduction of network connections, we estimate the INA, which shows how much the network accuracy improves. We measure the classification accuracies before and after BE, CA_before and CA_after, respectively, and estimate

INA = 100 * (CA_after - CA_before) / CA_before.
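The connection count and the two percentage measures can be written directly from the formulas (a sketch; names and example sizes are ours):

```python
def connections(x, h, o):
    """C = (x*h) + (h*o) + h + o: input-to-hidden and hidden-to-output
    weights plus one bias per hidden and output unit."""
    return x * h + h * o + h + o

def rnc(c_before, c_after):
    """Percentage reduction of network connections due to feature selection."""
    return 100.0 * (c_before - c_after) / c_before

def ina(ca_before, ca_after):
    """Percentage increment of network accuracy, relative to the
    accuracy before elimination."""
    return 100.0 * (ca_after - ca_before) / ca_before

# e.g. 9 inputs / 3 hidden / 2 outputs pruned to 3 inputs / 2 hidden / 2 outputs
c0, c1 = connections(9, 3, 2), connections(3, 2, 2)
```

As a sanity check, ina(75.04, 76.13) reproduces the 1.45% accuracy-increase figure reported for the diabetes problem in Table 3.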
2.4 The Algorithm
In this framework, CFSS comprises three main steps, summarized in Fig. 1: (a) determination of the NN architecture in constructive fashion (steps 1-2), (b) measurement of the contribution of each feature (step 3), and (c) subset generation (steps 4-8). These steps depend on each other, and the entire process is carried out one step after another depending on particular criteria. The details of each step are as follows. Step 1) Create a minimal NN architecture. Initially it has three layers: an input layer, an output layer, and a hidden layer with only one neuron. The numbers of input and output neurons are equal to the numbers of inputs and outputs of the problem. Randomly initialize the connection weights between the input and hidden layers and between the hidden and output layers within the range [-1.0, +1.0]. Step 2) Train the network with the BPL algorithm and try to achieve a minimum training error E_min. Then add an HN, retrain the network from the beginning, and check the SC of Subsection 2.1.1. When it is satisfied, validate the NN on the test set to calculate the classification accuracy CA and go to Step 3. Otherwise, continue the HN selection process. Step 3) To find the contribution of the features, perform one-by-one removal training of the network: delete the ith feature in turn and save the individual minimum training error E^i_min. The rank of all features then follows from Subsection 2.2. Go to Step 4.
[Flowchart: (1) create an NN with minimal architecture and train; (2) if the SC is not satisfied, add an HN and retrain; otherwise validate the NN and calculate its accuracy; (3) perform one-by-one training and compute contributions; (4) delete double/single features, train, validate the subset and calculate accuracy; (5) if the SC fails, backtrack; (6) if single deletion has not ended, repeat; (7) further check; (8) final subset.]
Fig. 1. Overall Flowchart of CFSS. Here, NN, HN and SC refer to neural network, hidden neuron and stopping criterion respectively.
Step 4) This stage is the first step of BE for generating the feature subset. Initially, we attempt to delete the two worst-ranked features at a time to accelerate the process. Calculate the minimum training error E'_min during training, validate the remaining features on the test set, and calculate the classification accuracy CA'. If the SC of Subsection 2.1.2.2 is satisfied, continue; otherwise, go to Step 5. Step 5) Perform backtracking: the last deleted pair (or single) of features is restored, together with all associated components. Step 6) If the single-deletion process has not finished, attempt to delete features one at a time using Step 4 to filter out the unnecessary ones; otherwise go to Step 7. If the SC of Subsection 2.1.2.3 is satisfied, continue; otherwise, go to Step 5 and then Step 7.
Step 7) Before selecting the final subset, we check the remaining features once more for any irrelevant ones. The following steps accomplish this task. i) Delete the remaining features from the network one at a time in order of worst rank, retraining each time. ii) Validate the remaining features on the test set. For the next stage, save the classification accuracy CA'', the deleted feature responsible, and all its components. iii) If DF < (TF - 1), continue the deletion process; otherwise, go to the next step. Here, DF and TF refer to the number of deleted features and the total number of features, respectively. iv) Compare the values of CA'' according to CA'' >= CA. v) If better values of CA'' are available, identify the highest-ranked feature among them; otherwise, stop. Delete that higher-ranked feature together with the corresponding worst-ranked ones from the subset obtained at Step 7-i. After that, restore the components of the network obtained at Step 7-ii and stop. Step 8) Finally, we obtain the relevant feature subset with a compact network.
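The elimination loop of Steps 4-6 can be sketched as the following skeleton; the `train`, `evaluate`, and `contributions` callables stand in for the NN training machinery described above and are assumptions for illustration, not the authors' implementation.

```python
def cfss(train, evaluate, contributions, features, th_double=0.05, th_single=0.08):
    """Backward-elimination skeleton: delete the two worst-ranked
    features at a time while the stopping criterion holds, backtrack
    on failure, then switch to single deletion."""
    e_min, ca = train(features), evaluate(features)
    for step, th in ((2, th_double), (1, th_single)):
        while len(features) > step:
            ranked = sorted(features, key=lambda f: contributions()[f])
            worst = ranked[:step]                     # lowest-contribution features
            candidate = [f for f in features if f not in worst]
            e_be, ca_be = train(candidate), evaluate(candidate)
            err = -(e_be - e_min) / e_min             # ERR of Subsection 2.1.2.1
            if err <= th and ca_be >= ca:
                features = candidate                  # accept the deletion
            else:
                break                                 # backtrack: keep previous subset
    return features

# toy stand-ins: features 1 and 3 carry the accuracy, 0 and 2 are noise
con = {0: 10, 1: 90, 2: 5, 3: 80}
subset = cfss(
    train=lambda s: 0.1,
    evaluate=lambda s: 0.9 if {1, 3} <= set(s) else 0.5,
    contributions=lambda: con,
    features=[0, 1, 2, 3],
)
```

On the toy stubs, the double pass removes the two noise features and the single pass stops because accuracy would drop.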
3 Experimental Analysis
This section evaluates the performance of CFSS on four well-known benchmark problems, the breast cancer (BCR), diabetes (DBT), glass (GLS), and iris (IRS) problems, obtained from [13]. The numbers of input attributes and output classes are 9 and 2 for BCR, 8 and 2 for DBT, 9 and 6 for GLS, and 4 and 3 for IRS. All datasets are partitioned into three sets: a training set, a validation set, and a test set. The first 50% is used as the training set, the second 25% as the validation set to monitor training and stop it when the overall training error grows, and the last 25% as the test set to evaluate the generalization performance of the network; the test set is not seen by the network during training. In all experiments, two bias nodes with a fixed input of +1 are connected to the hidden and output layers. The learning rate η is set between [0.1, 0.3], and the weights are initialized to random values in [-1.0, +1.0]. Each experiment was carried out 10 times, and the presented results are the averages of these 10 runs. All experiments were run on a Pentium-IV 3 GHz desktop personal computer.
3.1 Experimental Results
Table 1 shows the average results of CFSS, including the number of selected features and the classification accuracies, for the BCR, DBT, GLS, and IRS problems before and after the feature selection process. For clarification, in this table the training error is measured as in Section 2.5 of [12], and the classification accuracy CA is the ratio of correctly classified examples to the total number of examples in the particular data set.
Fig. 2. Contribution of attributes in the breast cancer problem for a single run.
Fig. 3. Performance of the network according to the size of subset in the diabetes problem for a single run.

Table 1. Results of four problems such as BCR, DBT, GLS, and IRS. Numbers in () are the standard deviations. Here, TERR, CA, FS, HN, ITN, and CTN refer to training error, classification accuracy, feature selection, hidden neuron, iteration, and connection respectively.

Name | Feature # (before FS) | TERR (%) | CA (%) | Feature # (after FS) | TERR (%) | CA (%) | HN# | ITN# | CTN#
BCR | 9 (0.00) | 3.71 (0.002) | 98.34 (0.31) | 3.2 (0.56) | 3.76 (0.003) | 98.57 (0.52) | 2.60 | 249.4 | 18.12
DBT | 8 (0.00) | 16.69 (0.006) | 75.04 (2.06) | 2.5 (0.52) | 14.37 (0.024) | 76.13 (1.74) | 2.66 | 271.3 | 16.63
GLS | 9 (0.00) | 26.44 (0.026) | 64.71 (1.70) | 4.1 (1.19) | 26.56 (0.028) | 66.62 (2.18) | 3.3 | 1480.9 | 42.63
IRS | 4 (0.00) | 4.69 (0.013) | 97.29 (0.00) | 1.3 (0.48) | 8.97 (0.022) | 97.29 (0.00) | 3.2 | 2317.1 | 19.96

Table 2. Results of computational time for the BCR, DBT, GLS, and IRS problems

Name | Computational Time (Second)
BCR | 22.10
DBT | 30.07
GLS | 22.30
IRS | 7.90

Table 3. Results of connection reduction and accuracy increase for the BCR, DBT, GLS, and IRS problems

Name | Connection Decrease (%) | Accuracy Increase (%)
BCR | 45.42 | 0.23
DBT | 46.80 | 1.45
GLS | 27.5 | 2.95
IRS | 30.2 | 0.00
In each experiment, CFSS generates a subset containing the minimum number of relevant features available. Thereby, CFSS produces a better CA with a smaller number of hidden neurons (HNs). Table 1 shows, for example, that for BCR the network selects a minimal subset of 3.2 relevant features on average, yielding 98.57% CA while using only 2.6 hidden neurons to build the minimal network structure. The results for the remaining problems, DBT, GLS, and IRS, are broadly similar, with the exception of IRS: for the IRS problem the network cannot improve its accuracy any further after feature selection, but it is able to generate a subset of 1.3 relevant features out of 4 attributes that is sufficient to maintain the same network performance. In addition, Fig. 2 shows the arrangement of attributes during training according to their contribution for BCR; CFSS easily deletes the unnecessary ones and generates the subset {6,7,1}. In contrast, Fig. 3 shows the relationship between classification accuracy and subset size for DBT. We calculated the connections of the pruned network at the final stage according to Section 2.3, and the results are shown in Table 1. The computational time for completing the entire FSS process is given in Table 2. We then estimated the decrease in network connections and the corresponding increase in accuracy due to FSS according to Sections 2.3.1 and 2.3.2, as shown in Table 3, where the relation between connection reduction and accuracy increment can be seen.
3.2 Comparison with Other Methods
In this section, we compare the results of CFSS with those obtained by the other methods NNFS [4] and ICFS [6]. The results are summarized in Tables 4-7. Caution is warranted in this comparison, because those methods use different feature-selection techniques.

Table 4. Comparison on the number of relevant features for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR | 3.2 | 2.70 | 5 | 5
DBT | 2.5 | 2.03 | 2 | 3
GLS | 4.1 | - | 5 | 4
IRS | 1.3 | - | - | -

Table 5. Comparison on the average testing CA (%) for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR | 98.57 | 94.10 | 98.25 | 98.25
DBT | 76.13 | 74.30 | 78.88 | 78.70
GLS | 66.62 | - | 63.77 | 66.61
IRS | 97.29 | - | - | -

Table 6. Comparison on the average number of hidden neurons for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR | 2.60 | 12 | 33.55 | 42.05
DBT | 2.66 | 12 | 8.15 | 21.45
GLS | 3.3 | - | 62.5 | 53.95
IRS | 3.2 | - | - | -

Table 7. Comparison on the average number of connections for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR | 18.12 | 70.4 | 270.4 | 338.5
DBT | 16.63 | 62.36 | 42.75 | 130.7
GLS | 42.63 | - | 756 | 599.4
IRS | 19.96 | - | - | -
Table 4 shows the discriminative capability of CFSS, which reduces the dimensionality of the input layer for the four problems mentioned above. For BCR, the result of CFSS is clearly better than the two ICFS methods, though not better than NNFS. The result for DBT is average compared with the others, but for GLS the result is comparable to or better than the two ICFS methods.
The comparison of the average testing CA for all problems is shown in Table 5. The results of CFSS for BCR and GLS are better than those of NNFS and ICFS, and for DBT better than that of NNFS. The most important aspect of our proposed method, however, is the smaller number of connections of the final network, owing to the smaller number of HNs in the NN architecture. As shown in Tables 6 and 7, the numbers of HNs and connections of CFSS are much smaller than those of the other methods. Note that some data are missing from Tables 4-7, since the IRS problem was not tested by NNFS and ICFS, and the GLS problem was not tested by NNFS.
4 Discussion
This paper presents a new combinatorial method for feature selection that generates subsets with minimal computation, owing to the minimal sizes of both the hidden layer and the input layer. A constructive technique is used to achieve a compact hidden layer, and a straightforward contribution measurement leads to a reduced input layer with better performance. Moreover, the composite combination in BE of double and single elimination, backtracking, and validation helps CFSS generate subsets proficiently. The results in Table 1 show that CFSS generates subsets with a small number of relevant features while producing better performance on the four benchmark problems. The relevant subsets and the generalization performance are better than or comparable to those of the other methods, as shown in Tables 4 and 5. For a long time, wrapper approaches to FSS have been avoided because of their huge computational demand. In CFSS, the computational cost is much smaller: as seen in Table 3, FSS reduces the number of connections by 37.48% on average over the four problems, while improving network accuracy by 1.18% on average. The computational time required by CFSS to complete the entire process for the different problems is shown in Table 2. We believe these values are sufficiently low, especially for the clinical field: the system can deliver a diagnostic result to the doctor within a minute. Although an exact comparison is difficult, other methods such as NNFS and ICFS may take 4-10 times longer, since their numbers of hidden neurons and connections are much larger, as seen in Tables 6 and 7. CFSS thus provides minimal computation in feature subset selection. In addition, during subset generation, the generated subset is validated at each step, because during BE we build a composite SC that eventually finds the local minimum at which network training should be stopped.
Because such a criterion is implemented, the network performs significantly well, and there is no need for a separate final validation of the generated subset. Furthermore, the additional check for irrelevant features in BE completes CFSS. In this study we applied CFSS to datasets with up to 9 features. To obtain more relevant tests for real tasks, we intend to apply CFSS to datasets with a larger number of features in the future. Extracting rules from an NN is always in demand for interpreting how it works, and a compact NN is desirable for this purpose. As CFSS supports this requirement, efficient rule extraction from NNs is a further task for future work.
5 Conclusion
This paper presents a new approach to feature subset selection based on the contribution of input attributes in an NN. The combination of constructive training, contribution measurement, and backward elimination carries the success of CFSS. Initially, a basic constructive algorithm is used to determine a minimal and optimal NN structure. In the latter part, one-by-one removal of input attributes is adopted, which is not computationally expensive. Finally, a backward elimination with new stopping criteria is used to generate the relevant feature subset efficiently. To evaluate CFSS, we tested it on four real-world problems: the breast cancer, diabetes, glass, and iris problems. The experimental results confirmed that CFSS has a strong feature-selection capability: it removes irrelevant features from the network and generates feature subsets while producing a compact network with minimal computational cost.
Acknowledgements Supported by grants to KM from the Japanese Society for Promotion of Sciences, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.
References
1. Liu, H., Tu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005)
2. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis - An International Journal 1(3), 131–156 (1997)
3. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence 97, 273–324 (1997)
4. Setiono, R., Liu, H.: Neural Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997)
5. Milne, L.: Feature Selection Using Neural Networks with Contribution Measures. In: 8th Australian Joint Conference on Artificial Intelligence, Canberra, November 27 (1995)
6. Guan, S., Liu, J., Qi, Y.: An Incremental Approach to Contribution-Based Feature Selection. Journal of Intelligence Systems 13(1) (2004)
7. Schuschel, D., Hsu, C.: A Weight Analysis-Based Wrapper Approach to Neural Nets Feature Subset Selection. In: Tools with Artificial Intelligence: Proceedings of 10th IEEE International Conference (1998)
8. Hsu, C., Huang, H., Schuschel, D.: The ANNIGMA-Wrapper Approach to Fast Feature Selection for Neural Nets. IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics 32(2), 207–212 (2002)
9. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to Instability Problems with Sequential Wrapper-Based Approaches to Feature Selection. Journal of Machine Learning Research (2002)
10. Phatak, D.S., Koren, I.: Connectivity and Performance Tradeoffs in the Cascade Correlation Learning Architecture. Technical Report TR-92-CSE-27, ECE Department, UMASS, Amherst (1994)
11. Rumelhart, D.E., McClelland, J.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
12. Prechelt, L.: PROBEN1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases, Dept. of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
Dynamic Link Matching between Feature Columns for Different Scale and Orientation

Yasuomi D. Sato 1, Christian Wolff 1, Philipp Wolfrum 1, and Christoph von der Malsburg 1,2

1 Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe University, Max-von-Laue-Str. 1, 60438 Frankfurt am Main, Germany
2 Computer Science Department, University of Southern California, LA, 90089-2520, USA
Abstract. Object recognition in the presence of changing scale and orientation requires mechanisms to deal with the corresponding feature transformations. Using Gabor wavelets as an example, we approach this problem in a correspondence-based setting. We present a mechanism for finding feature-to-feature matches between corresponding points in pairs of images taken at different scale and/or orientation (leaving out for the moment the problem of simultaneously finding point correspondences). The mechanism is based on a macro-columnar cortical model and dynamic links. We present tests of the ability to find the correct feature transformation in spite of added noise.
1 Introduction
When trying to set two images of the same object or scene into correspondence with each other, so that they can be compared in terms of similarity, it is necessary to find point-to-point correspondences in the presence of changes in scale or orientation (see Fig. 1). It is also necessary to transform local features (unless one chooses to work with features that are invariant to scale and orientation, accepting the reduced information content of such features). Correspondence-based object recognition systems [1,2,3,4] have so far mainly addressed the issue of finding point-to-point correspondences, leaving local features unchanged in the process [5,6]. In this paper, we propose a system that can not only transform features for comparison purposes, but also recognize the transformation parameters that best match two sets of local features, each taken from one point in an image. Our eventual aim is to find point correspondences and feature correspondences simultaneously in one homogeneous dynamic link matching system, but here we take point correspondences as given for the time being. Both theoretical [7] and experimental [8] investigations suggest 2D-Gabor-based wavelets as features used in visual cortex. These are best sampled in a log-polar manner [11]. This representation has been shown to be particularly useful for face recognition [12,13], and due to its inherent symmetry it is highly appropriate for implementing a transformation system for scale and
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 385–394, 2008. © Springer-Verlag Berlin Heidelberg 2008
386
Y.D. Sato et al.
Fig. 1. When images of an object differ in (a) scale or (b) orientation, correspondence between them can only be established if the features extracted at a given point (e.g., at the dot in the middle of the images) are also transformed during comparison
orientation. The work described here is based entirely on the macro-columnar model of [14]. In computer simulations we demonstrate feature transformation and transformation parameter recognition.
2 Concept of Scale and Rotation Invariance
Let there be two images, called I and M (for image and model). The Gabor-based wavelet transform \bar{J}^L (L = I or M) has the form of a vector, called a jet, whose components are defined as convolutions of the image with a family of Gabor functions:

\bar{J}^L_{k,l}(x_0, a, \theta) = \int_{\mathbb{R}^2} I(x)\, \frac{1}{a^2}\, \psi\!\left(\frac{1}{a} Q(\theta)(x_0 - x)\right) d^2x,   (1)

\psi(x) = \frac{1}{D^2} \exp\!\left(-\frac{|x|^2}{2D^2}\right) \left[\exp(i\, x^T \cdot e_1) - \exp\!\left(-\frac{D^2}{2}\right)\right],   (2)

Q(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.   (3)

Here a simultaneously represents the spatial frequency and controls the width of the Gaussian envelope window, and D is the standard deviation of the Gaussian. The individual components of the jet are indexed by n possible orientations and (m + m_1 + m_2) possible spatial frequencies, which are θ = πk/n (k ∈ {0, 1, …, n−1}) and a = a_0^l (0 < a_0 < 1, l ∈ {0, 1, …, m+m_1+m_2−1}). Here the parameters m_1 and m_2 fix the number of scale steps by which the image can be scaled up or down, respectively; m (> 1) is the number of scales in the jets of the model domain. The Gabor jet \bar{J}^L is defined as the set {\bar{J}^L_{k,l}} of N_L Gabor wavelet components extracted from the position x_0 of the image. In order to test the ability to find the correct feature transformation described below, we add noise of strength σ_1 to the Gabor wavelet components, \tilde{J}^L_{k,l} = \bar{J}^L_{k,l}(1 + σ_1 S_{ran}), where S_{ran} are random numbers between 0 and 1. Instead of the Gabor jet \bar{J}^L itself, we employ the sum-normalized \tilde{J}^L_{k,l}. Jets can be visualized in angular-eccentricity coordinates,
Dynamic Link Matching between Feature Columns
Fig. 2. Comparison of Gabor jets in the model domain M and the image domain I (left and right, resp., in the two part-figures). Jets are visualized in log-polar coordinates: The orientation θ of jets is arranged circularly while the spatial frequency l is set radially. The jet J M of the model domain M is shown right while the jet J I from the transformed image in the image domain I is shown left. Arrows indicate which components of J I and J M are to be compared. (a) comparison of scaled jets and (b) comparison of rotated jets.
in which orientation and spatial frequency of a jet are arranged circularly and radially (as shown in Fig. 2). The components of jets that are taken from corresponding points in images I and M may be arranged according to orientation (rows) and scale (columns):

J^I = \begin{pmatrix} 0 \cdots 0 & J^I_{0,m_1} & \cdots & J^I_{0,m_1+m-1} & 0 \cdots 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 \cdots 0 & J^I_{n-1,m_1} & \cdots & J^I_{n-1,m_1+m-1} & 0 \cdots 0 \end{pmatrix},   (4)

J^M = \begin{pmatrix} J^M_{0,m_1} & \cdots & J^M_{0,m_1+m-1} \\ \vdots & & \vdots \\ J^M_{n-1,m_1} & \cdots & J^M_{n-1,m_1+m-1} \end{pmatrix}.   (5)
Let us assume the two images M and I to be compared have exactly the same structure, apart from being transformed relative to each other, and that the transformation conforms to the sample grid of jet components. Then there will be pair-wise identities between components in Eqs. (4) and (5). If the jet in I is scaled relative to the jet in M, the non-zero components of J^I are shifted along the horizontal (or, in Fig. 2(a), radial) coordinate. If the image I, and correspondingly the jet J^I, is rotated, the jet components are circularly permuted along the vertical axis in Eq. (4) (see Fig. 2(b)). When comparing scaled and rotated jets, the non-zero components of J^I are shifted along both axes simultaneously. There are model jet components of m different scales; to allow for m_1 steps of scaling the image I down and m_2 steps of scaling it up, the jet in Eq. (4) is padded on the left and right with m_1 and m_2 columns of zeros, respectively.
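A compact numerical sketch of both steps — sampling a jet via Eqs. (1)–(3) and arranging it in the padded matrix of Eq. (4) with its scale shift and orientation permutation — might look as follows (our own minimal illustration; the grid size, a_0, and all parameter values are assumptions, not the authors' code):

```python
import numpy as np

def gabor_kernel(a, theta, D=1.0, size=32):
    # Eqs. (2)-(3): psi(Q(theta) x / a) / a^2 sampled on a square pixel grid
    half = size // 2
    ys, xs = np.mgrid[-half:half, -half:half].astype(float)
    xr = (np.cos(theta) * xs + np.sin(theta) * ys) / a   # Q(theta) rotation
    yr = (-np.sin(theta) * xs + np.cos(theta) * ys) / a
    envelope = np.exp(-(xr**2 + yr**2) / (2 * D**2)) / D**2
    carrier = np.exp(1j * xr) - np.exp(-D**2 / 2)        # wave along e1, DC-free
    return envelope * carrier / a**2

def gabor_jet(image, x0, y0, n=8, scales=8, a0=0.8, sigma1=0.0, rng=None):
    # Eq. (1) at one point, for all orientations k and scales l; optional
    # multiplicative noise and sum-normalization as described in the text
    rng = np.random.default_rng(0) if rng is None else rng
    half = 16
    patch = image[y0 - half:y0 + half, x0 - half:x0 + half]
    jet = np.empty((n, scales))
    for k in range(n):
        for l in range(scales):
            jet[k, l] = np.abs(np.sum(patch * gabor_kernel(a0**l, np.pi * k / n)))
    jet *= 1.0 + sigma1 * rng.random(jet.shape)
    return jet / jet.sum()

def pad_jet(jet_I, m1, m2):
    # Eq. (4): embed the n x m jet between m1 and m2 zero columns
    n = jet_I.shape[0]
    return np.hstack([np.zeros((n, m1)), jet_I, np.zeros((n, m2))])

def transform(jet_padded, o, s):
    # scaling shifts the non-zero block by s columns, rotation permutes
    # the rows circularly by o (wrap-around), cf. Fig. 2
    return np.roll(np.roll(jet_padded, s, axis=1), o, axis=0)
```

Note that `np.roll` wraps around in both directions; for scale shifts s within the ±m_1/m_2 range permitted by the zero padding, only zero columns cross the boundary, so the shift behaves like a plain translation of the non-zero block.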
Fig. 3. A dynamic link matching network model of a pair of single macro-columns. In the I and M domains, the network consists of, respectively, NI and NM feature units (mini-columns). The units in the I and M domains represent components of the jets J I and J M . The Control domain C controls interconnections between the I and M macrocolumns, which are all-to-all at the beginning of a ν cycle. By comparing the two jets and through competition of control units in C, the initial all-to-all interconnections are dynamically reduced to a one-to-one interconnection.
3 Modelling a Columnar Network
In analogy to [14], we set up a dynamical system of variables, both for jet components and for all possible correspondences between jet components. These variables are interpreted as the activity of cortical mini-columns (here called "units", "feature units" in this case). The set of units corresponding to a jet, either in I or in M, forms a cortical macro-column (or short "column"), and the units inhibit each other, the strength of inhibition being controlled by a parameter ν, which cyclically ramps up and falls back down. In addition, there is a column of control units, forming the control domain C. Each control unit stands for one pair of possible values of relative scale and orientation between I and M and can evaluate the similarity in the activity of feature units under the corresponding transformation. The whole network is shown schematically in Fig. 3. Each domain, or column in our simple case, consists of a set of unit activities {P^L_1, …, P^L_{N_L}} in the respective domain (L = I or M). N_I = (m + m_1 + m_2) × n and N_M = m × n represent respectively the total number of units in the I and M domains. The control unit activations are P^C = (P^C_1, …, P^C_{N_C})^T, where N_C = (m_1 + m_2 + 1) × n. The equation system for the activations in the domains is given by

\frac{dP^L_\alpha}{dt} = f_\alpha(P^L) + \kappa_L C_E J^L_\alpha + \kappa_L (1 - C_E) E^{L\bar{L}}_\alpha + \xi_\alpha(t), \quad (\alpha = 1, \ldots, N_L,\ \bar{L} = \{M, I\} \setminus L),   (6)

\frac{dP^C_\gamma}{dt} = f_\gamma(P^C) + \kappa_C F_\gamma + \xi_\gamma(t), \quad (\gamma = 1, \ldots, N_C),   (7)
Fig. 4. Time course of all activities in the domains I, M and C over a ν cycle for two jets without relative rotation and scale, for a system with size m = 5, n = 8, m1 = 0 and m2 = 4. The activity of units in the M and I columns is shown within the bars on the right and the bottom of the panels, respectively, each of the eight subblocks corresponding to one orientation, with individual components corresponding to different scales. The indices k and l at the top label the orientation and scale of the I jet, respectively. The large matrix C contains, in the top version, as non-zero (white) entries all allowed component-to-component correspondences between the two jets (the empty, black, entries would correspond to forbidden relative scales outside the range set by m1 and m2 ). The matrix consists of 8 × 8 blocks, corresponding to the 8 × 8 possible correspondences between jet orientations, and each of these sub-matrices contains m×(m+m1 +m2 ) entries to make room for all possible scale correspondences. Each control unit controls (and evaluates) a whole jet to jet correspondence with the help of its entries in this matrix. The matrix in the lowest panel has active entries corresponding to just one control unit (for identical scale and orientation). Active units are white, silent units black. Unit activity is shown for three moments within one ν cycle. At t = 3.4 ms, the system has already singled out the correct scale, but is still totally undecided as to relative orientation. System parameters are κI = κM = 2, κC = 5, CE = 0.99, σ = 0.015 and σ1 = 0.1.
where ξ_α(t) and ξ_γ(t) are Gaussian noise of strength σ². C_E ∈ [0, 1] scales the ratio between the Gabor jet J^L_α and the total input E^{L\bar{L}}_α from units in the other domain \bar{L}, and κ_L controls the strength of the projection from \bar{L} to L. κ_C controls the strength of the influence of the similarity F_γ on the control units. Here we explain the other terms in the above equations. The function f_α(·) is:

f_\alpha(P^L) = a_1 P^L_\alpha \left( P^L_\alpha - \nu(t) \max_{\beta = 1, \ldots, N_L} \{P^L_\beta\} - (P^L_\alpha)^2 \right),   (8)

where a_1 = 25. The function ν(t) describes a cycle in time during which the feedback inhibition is periodically modulated:
\nu(t) = \begin{cases} \nu_{\min} + (\nu_{\max} - \nu_{\min}) \frac{t - T_k}{T_1}, & T_k \le t < T_1 + T_k \\ \nu_{\max}, & T_1 + T_k \le t < T_1 + T_k + T_{relax} \end{cases}   (9)

Here T_1 [ms] is the duration during which ν increases with t, while T_k (k = 1, 2, 3, …) are the periodic times at which the inhibition ν starts to increase. T_relax is a relaxation time after ν has been increased. We set ν_min = 0.4, ν_max = 1, T_1 = 36.0 ms and T_relax = T_1/6 in our study. Because of the increasing ν(t) and the noise in Eqs. (6) and (7), only one mini-column in each domain remains active at the end of the cycle, while the others are deactivated, as shown in Fig. 4. If we define the mean-free column activities as \tilde{P}^L_\alpha := P^L_\alpha - (1/N_L) \sum_{\alpha'} P^L_{\alpha'}, the interaction terms in Eqs. (6) and (7) are
E^{L\bar{L}}_\alpha = \sum_{\beta=1}^{N_C} \sum_{\alpha'=1}^{N_{\bar{L}}} P^C_\beta C^\beta_{\alpha\alpha'} \tilde{P}^{\bar{L}}_{\alpha'},   (10)

F_\beta = \sum_{\gamma=1}^{N_M} \sum_{\gamma'=1}^{N_I} \tilde{P}^M_\gamma C^\beta_{\gamma\gamma'} \tilde{P}^I_{\gamma'}.   (11)
Here N_C = n × (m_1 + m_2 + 1) is the number of permitted transformations between the two domains (n different orientations and m_1 + m_2 + 1 possible scales). The expression C^β_{αα'} designates a family of matrices, rotating and scaling one jet into another. If we write β = (o, s), so that o identifies the rotation and s the scaling, then we can write the C matrices as a tensor product: C^β = A^o ⊗ B^s, where A^o is an n × n matrix with a wrap-around shifted diagonal of ones and zeros otherwise, implementing rotation by o of jet components at any orientation, and B^s is an m × (m + m_1 + m_2) matrix with a shifted diagonal of ones and zeros otherwise, implementing an up- or down-shift of jet components at any scale. The C matrix implementing identity looks like the lowest panel in Fig. 4. The function E of Eq. (10) transfers feature unit activity from one domain to the other and is a superposition of all possible transformations, weighted with the activity of the control units. The function F in Eq. (11) evaluates the similarity of the feature unit activity in one domain with that in the other under mutual transformation.
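The construction of these matrices can be sketched with the Kronecker product standing in for ⊗ (our own illustration; s runs over the m_1 + m_2 + 1 allowed scale shifts):

```python
import numpy as np

def rotation_matrix(o, n):
    # A^o: n x n wrap-around shifted diagonal, rotating the n orientation
    # rows of a jet by o steps
    A = np.zeros((n, n))
    A[np.arange(n), (np.arange(n) - o) % n] = 1.0
    return A

def scaling_matrix(s, m, m1, m2):
    # B^s: m x (m + m1 + m2) shifted diagonal, selecting the scale block
    # displaced by s (s in {-m1, ..., m2}) relative to the model scales
    B = np.zeros((m, m + m1 + m2))
    B[np.arange(m), np.arange(m) + m1 + s] = 1.0
    return B

def correspondence_matrix(o, s, n, m, m1, m2):
    # C^beta = A^o (tensor) B^s, mapping the N_I = n(m+m1+m2) image-jet
    # components onto the N_M = n*m model-jet components
    return np.kron(rotation_matrix(o, n), scaling_matrix(s, m, m1, m2))
```

Applying `correspondence_matrix(0, 0, ...)` to a flattened, zero-padded image jet recovers the model jet exactly; Eq. (10) is then the control-unit-weighted superposition of such products (with C^β transposed for the direction M → I).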
Through the columnar dynamics of Eq. (6), the relative strengths of jet components are expressed in the history of feature unit activities (the strongest jet component, for instance, letting its unit stay on longest), so that the sum of products in Eq. (11), integrated over time in Eq. (7), indeed expresses jet similarities. The time course of all unit activities in our network model during a ν cycle is shown in Fig. 4. In this figure, the non-zero components of the jets used in the I and M domains are identical, without any relative rotation or scaling. At the beginning of the ν cycle, all units in the I, M and C domains are active, as indicated in Fig. 4 by the white jet-bars on the right and the bottom, and by all admissible correspondences in the matrix being on. The activity of the units in each domain is gradually reduced following Eqs. (6)–(11). In the intermediate state at t = 3.4 ms, all control units for the wrong scale have switched off, but the control units for all orientations are still on. At t = 12.0 ms, finally, only one control unit has survived, the one that correctly identifies the identity map. This state remains stable during the rest of the ν cycle.
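The winner-take-all behavior produced by Eqs. (8) and (9) over one ν cycle can be reproduced in a small simulation of a single isolated column (our own Euler sketch; parameter values follow the text, the step size is an assumption, and the noise term ξ is omitted for clarity):

```python
import numpy as np

def nu(t, T1=36.0, Trelax=6.0, nu_min=0.4, nu_max=1.0):
    # Eq. (9): linear ramp over T1 ms, then held at nu_max for Trelax ms
    # (Trelax = T1/6 in the text)
    phase = t % (T1 + Trelax)
    return nu_min + (nu_max - nu_min) * phase / T1 if phase < T1 else nu_max

def run_cycle(J, kappa=2.0, a1=25.0, dt=0.01, T1=36.0, Trelax=6.0):
    # Euler integration of Eqs. (6) and (8) for one column with C_E = 1,
    # i.e. input kappa * J only, no cross-domain term and no noise
    P = np.full(len(J), 0.5)
    t = 0.0
    while t < T1 + Trelax:
        f = a1 * P * (P - nu(t, T1, Trelax) * P.max() - P**2)
        P = P + dt * (f + kappa * J)
        t += dt
    return P
```

Running `run_cycle` on a jet leaves the unit with the largest component as the clear winner at the end of the cycle, mirroring the competition visible in Fig. 4.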
4 Results
In this section, we describe tests of the functionality of our macro-columnar model, studying the robustness of matching to noise introduced into the image jets (parameter σ_1) and into the dynamics of the units, Eqs. (6) and (7) (parameter σ), as well as robustness to the admixture of input from the other domain, Eq. (6), due to non-vanishing (1 − C_E). A correct match is one in which the surviving control unit corresponds to the correct transformation pair by which the I jet differs from the M jet. Jet noise models differences between jets actually extracted from natural images. Dynamic noise reflects signal fluctuations to be expected in units (mini-columns) composed of spiking neurons, and a non-zero admixture parameter may be relevant in a system in which transfer of jet information between the domains is desired. For both of the experiments described below, we used Gabor jets extracted from the center of real facial images. The transformed jets used for matching were extracted from images that had been scaled and rotated relative to the center position by exactly the same factors and angles that are also used in the Gabor transform. This ensures that there exists a perfect match between any two jets produced from two different rotated and scaled versions of the same face. The first experiment uses a single facial image and investigates the influence of noise in the units (σ) and of the parameter C_E. Using a pair of identical jets, we obtain temporal averages of correct matching over 10 ν cycles for each size (n = 4, 5, …, 10; m = 4, m_1 = 0 and m_2 = m − 1) of the macro-column. From these temporal averages, we calculate the sampling average over all n. The standard error is estimated as σ_d/√N_1, where σ_d is the standard deviation of the sampling average and N_1 is the sample size. Fig. 5(a) shows results for κ_I = κ_M = 1 and κ_C = 5. For strong noise (σ = 0.015 to 0.02), the correct matching probabilities for C_E = 0.98 to C_E = 0.96
Fig. 5. Probability of correct match between units in the I and M domains. (a) κI = κM = 1 and κC = 5. (b) κI = 2.3, κM = 5 and κC = 5.
Fig. 6. Probability of correct matching for comparisons of two identical faces. Six facial images are used. Our system is set with m = 4, m1 = m2 = m − 1, n = 8, κI = κM = 6.5, κC = 5.0, CE = 0.98 and σ = 0.0.
are better than those for C_E = 1. Interestingly, for these low C_E values, matching gets worse for weaker noise, collapsing in the case of σ = 0. This effect requires further investigation. For κ_I = 2.3 and κ_M = 5, however, we found that the correct matching probability takes higher values than in the case of Fig. 5(a). In particular, with the higher noise level σ = 0.015 our system even demonstrates perfect matching for C_E = 0.98, independent of n. Next, we investigated the robustness of the matching against the second type of noise (σ_1), using 6 different face types. Since the original image is resized and/or rotated with m_1 + m_2 + 1 = 7 scales and n = 8 orientations, a total of 56 images could be employed in the I domain. Here our system is set with κ_I = κ_M = 6.5, κ_C = 5.0 and C_E = 0.98. For each face image, we take temporal averages of the correct matching probability for each size and orientation of the image, in a similar manner as described above. Averages and standard errors of the temporal averages over the 6 different facial images are plotted as a function of σ_1 in Fig. 6. The result of this experiment is independent of σ as long as σ ≤ 0.02. As Fig. 6 shows, as the random noise in the jets is increased, the correct matching probability decreases smoothly for some images, or drops abruptly to around 0 at a certain σ_1 for others. We have also obtained
perfect matching, independent of σ_1, when using the same but rotated or resized face. The most important point we wish to stress is that the probability stays above 87.74% for σ < 0.02 and σ_1 < 0.08 (see Fig. 6). We can therefore say that the network model has a high recognition ability for scale and orientation invariance, independent of facial type.
5 Discussion and Conclusion
The main purpose of this communication is to convey a simple concept. Much more work needs to be done on the way to practical applications, for instance, more experiments with features extracted from independently taken images of different scale and orientation to better bring out the strengths and weaknesses of the approach. In addition to comparing and transforming local packages of feature values (our jets), it will be necessary to also handle spatial maps, that is, sets of point-to-point correspondences, a task we will approach next. Real applications involve, of course, a continuous range of transformation parameters, whereas we here had admitted only transformations from the same sample grid used for defining the family of wavelets in a jet. We hope to address this problem by working with continuous superpositions of transformation matrices for the neighboring transformation parameter values that straddle the correct value. Our approach may be seen as using brute force, as it requires many separate control units to sample the space of transformation parameters with enough density. However, the same set of control units can be used in a whole region of visual space, and in addition to that, for controlling point-to-point maps between regions, as spatial maps and feature maps stand in one-to-one correspondence to each other. The number of control units can be reduced even further if transformations are performed in sequence, by consecutive, separate layers of dynamic links. Thus, the transformation from an arbitrary segment of primary visual cortex to an invariant window could be done in the sequence translation – scaling – rotation. In this case, the number of required control units would be the sum and not the product of the number of samples for each individual transformation. A disadvantage of that approach might be added difficulties in finding correspondences between the domains. Further work is required to elucidate these issues.
Acknowledgements This work was supported by the EU project Daisy, FP6-2005-015803 and by the Hertie Foundation. We would like to thank C. Weber for help with preparing the manuscript.
References 1. Anderson, C.H., Van Essen, D.C., Olshausen, B.A.: Directed visual attention and the dynamic control of information flow. In: Itti, L., Rees, G., Tsotsos, J.K. (eds.) Neurobiology of Attention, pp. 11–17. Academic Press/Elsevier (2005) 2. Weber, C., Wermter, S.: A self-organizing map of sigma-pi units. Neurocomputing 70, 2552–2560 (2007)
3. Wiskott, L., von der Malsburg, C.: Face Recognition by Dynamic Link Matching. In: Sirosh, J., Miikkulainen, R., Choe, Y. (eds.) Lateral Interactions in the Cortex: Structure and Function, ch. 4. Electronic book (1996) 4. Wolfrum, P., von der Malsburg, C.: What is the optimal architecture for visual information routing? Neural Comput. 19 (2007) 5. Lades, M.: Invariant Object Recognition with Dynamical Links, Robust to Variations in Illumination. Ph.D. Thesis, Ruhr-Univ. Bochum (1995) 6. Maurer, T., von der Malsburg, C.: Learning Feature Transformations to Recognize Faces Rotated in Depth. In: Fogelman-Soulié, F., Rault, J.C., Gallinari, P., Dreyfus, G. (eds.) Proc. of the International Conference on Artificial Neural Networks ICANN 1995, EC2 & Cie, p. 353 (1995) 7. Daugman, J.G.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985) 8. Jones, J., Palmer, L.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58, 1233–1258 (1987) 9. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: Proc. of the Third IEEE Conference on Computer Vision and Pattern Recognition, pp. 84–91 (1994) 10. Wiskott, L., Fellous, J.-M., Krüger, N., von der Malsburg, C.: Face Recognition by Elastic Bunch Graph Matching. IEEE Trans. Pattern Anal. & Machine Intelligence 19, 775–779 (1997) 11. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Optical Soc. Am. 70, 1297–1300 (1980) 12. Yue, X., Tjan, B.C., Biederman, I.: What makes faces special? Vision Res. 46, 3802–3811 (2006) 13. Okada, K., Steffens, J., Maurer, T., Hong, H., Elagin, E., Neven, H., von der Malsburg, C.: The Bochum/USC Face Recognition System and How it Fared in the FERET Phase III Test. In: Wechsler, H., Phillips, P.J., Bruce, V., Fogelman-Soulié, F., Huang, T.S. (eds.) Face Recognition: From Theory to Applications, pp. 186–205. Springer, Heidelberg (1998) 14. Lücke, J., von der Malsburg, C.: Rapid Correspondence Finding in Networks of Cortical Columns. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 668–677. Springer, Heidelberg (2006)
Perturbational Neural Networks for Incremental Learning in Virtual Learning System Eiichi Inohira1 , Hiromasa Oonishi2 , and Hirokazu Yokoi1 1
Kyushu Institute of Technology, Hibikino 2-4, 808-0196 Kitakyushu, Japan {inohira,yokoi}@life.kyutech.ac.jp 2 Mitsubishi Heavy Industries, Ltd., Japan
Abstract. This paper presents a new type of neural network, the perturbational neural network, to realize incremental learning in autonomous humanoid robots. In our previous work, a virtual learning system was developed to realize exploration of plausible behavior in a robot's brain. Neural networks can generate plausible behavior in an unknown environment without time-consuming exploration. Although an autonomous robot should grow step by step, conventional neural networks forget prior learning when trained with a new dataset. The proposed neural network adds the outputs of a sub neural network to the weights and thresholds of a main neural network. Incremental learning and high generalization capability are realized by slightly changing the mapping of the main neural network. We show through numerical experiments with a two-dimensional stair-climbing bipedal robot that the proposed neural network realizes incremental learning without forgetting.
1 Introduction
Recently, humanoid robots such as ASIMO [1] have developed dramatically in terms of hardware and are expected to work just like humans. Although many researchers have studied artificial intelligence for a long time, humanoid robots do not yet have autonomy high enough to make experts unnecessary. Humanoid robots should accomplish missions by themselves rather than having experts give them solutions such as models, algorithms, and programs. Researchers have studied how to realize a robot with learning ability through trial and error. Such studies use so-called soft computing techniques such as reinforcement learning [2] and central pattern generators (CPG) [3]. Learning by a robot saves experts' work to some degree but takes much time. These techniques are less efficient than humans. Humans instantly act depending on a situation by using imagination and experience. Even if a human fails, the failure will serve as experience. In particular, humans know the characteristics of the environment and of their own behavior, and they simulate trial and error to explore plausible behavior in their brains. In our previous work, Yamashita et al. [4] proposed a bipedal walking control system with a virtual environment based on the motion control mechanism of primates. In that study, exploring plausible behavior by trial and error is carried
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 395–404, 2008.
© Springer-Verlag Berlin Heidelberg 2008
out in a robot's brain. However, the problem is that exploring takes much time. Kawano et al. [5] introduced learning of a relation between environmental data and optimized behavior into the previous work to save time. Generating behavior becomes fast because a trained neural network (NN) immediately outputs plausible behavior due to its generalization capability, which is the ability to obtain a desired output for an unknown input. However, a problem in this previous work is that conventional neural networks cannot realize incremental learning: a trained NN forgets prior learning when trained with a new dataset. Realizing incremental learning is indispensable for learning various behaviors step by step. We present a new type of NN, the perturbational neural network (PNN), to achieve incremental learning. A PNN consists of a main NN and a sub NN. The main NN learns with a representative dataset and never changes after that. The sub NN learns with new datasets so that the output error of the main NN for those datasets is canceled. The sub NN is connected with the main NN through additional terms in the weights and thresholds of the main NN. We expect that connections through weights and thresholds slightly change the characteristics of the main NN, which means that a perturbation is given to the mapping of the main NN, and that incremental learning does not sacrifice its generalization capability.
2 A Virtual Learning System for a Biped Robot
We assume that a robot tries to achieve a task in an unknown environment. Virtual learning is defined as learning from virtual experience in a virtual environment in a robot's brain. The virtual environment is generated from sensory information on the real environment around the robot and is used for exploring behavior fit for a task without real action. Exploring such behavior in the virtual environment is regarded as virtual experience gained by trial and error or ingenuity. Virtual learning is to memorize a relation between environment and behavior under a particular task. The benefits of virtual learning are (1) to reduce the risk of robot failure in the real world, such as falling and colliding, and (2) to enable a robot to immediately act and achieve learned tasks in similar environments. The former concerns a robot's safety and physical wear; the latter saves the work and time needed to achieve a task. We presented a virtual learning system for a biped robot [4,5]. This virtual learning system has a virtual environment and NNs for learning, as shown in Fig. 1. We assume that a central pattern generator (CPG) generates the motion of the biped robot. The CPG of a biped robot consists of neural oscillators corresponding to its joints. Behavior is then described as CPG parameters rather than joint angles. The CPG parameters for controlling a robot are called CPG input. Optimized CPG input for a plausible behavior is obtained by exploring in the virtual environment. Virtual learning is realized by NNs. A NN learns a mapping from environmental data to optimized CPG input, i.e., a mapping from real environment to robot behavior. NNs have generalization capability, which means that a desired output can be generated for an unknown input. When a NN has learned a mapping
Fig. 1. Virtual learning system for a biped robot
with representative data, it can generate desired CPG input for unknown environments, and the biped robot then achieves the task. It takes much time to train a NN, but a NN is fast at generating output. CPG input is first generated by the NN and then checked in the virtual environment. When the CPG input generated by the NN achieves the task, it is sent to the CPG. Otherwise, CPG input obtained by exploring, which takes time, is used for controlling the robot, and the NN learns with the CPG input obtained by exploring to correct its error. The combination of NN and exploring realizes autonomous learning, each covering the other's shortcomings.
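The loop just described can be sketched as follows (hypothetical interfaces: `nn`, `virtual_env`, and `explore` stand for the NN, the virtual-environment check, and the slow trial-and-error search; none of these names come from the paper):

```python
def virtual_learning_step(nn, env_data, virtual_env, explore):
    """One step of the virtual learning loop: try the NN's fast guess,
    fall back to slow exploration and retrain the NN on its result."""
    cpg_input = nn.predict(env_data)            # fast: generalization
    if virtual_env.achieves_task(env_data, cpg_input):
        return cpg_input                        # NN output already works
    cpg_input = explore(env_data, virtual_env)  # slow: trial and error
    nn.train(env_data, cpg_input)               # correct the NN's error
    return cpg_input
```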
3 Neural Networks for Incremental Learning
A key component of our virtual learning system is the NN. A virtual learning system should develop with experience in various environments and tasks. However, a multilayer perceptron (MLP) with the back-propagation (BP) learning algorithm, which is widely used in many applications, has the problem that it forgets prior learning through training with a new dataset, because its connection weights are overwritten. If all training datasets are stored in memory, prior learning can be preserved in a MLP by training with all the datasets, but training with a large volume of data takes a huge amount of time and memory. Humans, of course, can learn something new while keeping old experiences. A virtual learning system needs to accumulate virtual experiences step by step.
3.1 Perturbational Neural Networks
The basic idea of the PNN is that the NN learns a new dataset by adding new parameters to constant weights and thresholds, without overwriting them after training. A PNN consists of a main NN and a sub NN, as shown in Fig. 2. The main NN learns with representative datasets and is constant after this learning. The sub NN learns with new datasets and generates additional terms for the weights and thresholds of the main
Fig. 2. A perturbational neural network
Fig. 3. A neuron in main NN
NN, i.e., Δw and Δh. A neuron of the PNN is shown in Fig. 3. The input-output relation of a conventional neuron is given by the following equation:

z = f\left( \sum_i w_i x_i - h \right),   (1)

where z denotes the output, w_i the weights, x_i the inputs, and f(·) the activation function. A neuron of the PNN is given as follows:

z = f\left( \sum_i (w_i + \Delta w_i) x_i - (h + \Delta h) \right),   (2)
where Δw_i and Δh are outputs of the sub NN. Training of the PNN is divided into two phases. First, the main NN learns with a representative dataset. For instance, assume that the representative dataset consists of environmental data (E_A, E_B, E_C) and CPG parameters
Fig. 4. Training of NN with representative dataset
Fig. 5. Training of NN with new dataset
(P_A, P_B, P_C), as shown in Fig. 4. Training of the main NN is the same as for a NN with BP. During this phase, the sub NN learns so that Δw and Δh equal zero; then, for the representative dataset, the sub NN has no effect on the main NN. Next, the PNN learns with a new dataset. For instance, assume that the new dataset consists of environmental data E_D and CPG parameters P_D, as shown in Fig. 5. The main NN does not learn with the new dataset, but reinforcement signals are passed to the sub NN through the main NN. An output error of the main NN for E_D exists because E_D is unknown to the main NN. The sub NN learns with the new dataset so that this error is canceled.
3.2 Related Work
Some authors [6,7,8] have proposed other methods in which the Mixture of Experts (MoE) architecture is applied to incremental learning of neural networks. The MoE architecture is based on a divide-and-conquer approach: a complex mapping that a single neural network would have to learn is divided into simple mappings that a neural network can learn easily. A PNN, on the other hand, learns a complex mapping by connecting sub NNs with a main NN. A PNN is expected to have global generalization capability because it does not divide the mapping into local mappings. In the MoE architecture, an expert neural network is expected to have local generalization capability, but the system as a whole would not have global generalization capability because this is not addressed. Therefore, when generalization capability is the focus, a PNN should be used.
A PNN has the problem that it needs a large number of connections from the sub NNs to the main NN, which means that a PNN is very inefficient in resources. Although we have not yet studied the efficiency of a PNN, we expect that a PNN has room to reduce the number of connections; this will be future work.
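The two-phase procedure of Sect. 3.1 can be made concrete with a deliberately tiny stand-in: a single frozen "main" neuron (Eq. (2)) plus a linear sub NN producing Δw and Δh, trained by gradient descent to cancel the error on one new sample. This is our own minimal sketch; the paper uses full MLPs with BP, and all numbers here are illustrative:

```python
import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# Phase 1 stand-in: a main neuron whose weights are already trained and
# are frozen from here on (illustrative values).
w, h = np.array([1.0, -0.5]), 0.2

# Linear sub NN x -> (dw, dh), initialized to zero so that the main
# mapping is initially unchanged (Eq. (2) reduces to Eq. (1)).
V, c = np.zeros((2, 2)), np.zeros(2)   # dw = V @ x + c
u, d = np.zeros(2), 0.0                # dh = u @ x + d

def forward(x):
    dw, dh = V @ x + c, u @ x + d
    return sigmoid((w + dw) @ x - (h + dh))

# Phase 2: a new sample (x_new, t_new); only the sub NN parameters are
# updated, so the main NN keeps its prior weights exactly.
x_new, t_new = np.array([0.5, 1.0]), 0.9
lr = 0.5
for _ in range(500):
    z = forward(x_new)
    g = (z - t_new) * z * (1 - z)         # dL/d(pre-activation), L = (z-t)^2/2
    V -= lr * np.outer(g * x_new, x_new)  # chain rule through dw = V @ x + c
    c -= lr * g * x_new
    u += lr * g * x_new                   # dh enters the sum with a minus sign
    d += lr * g
```

After training, the frozen main neuron combined with the sub NN reproduces the new target, while the main weights w and h themselves are untouched.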
4 Numerical Experiments
4.1 Setup
We evaluate the generalization capability of the proposed NN for incremental learning through numerical experiments with a robot climbing stairs. Simplified experiments are performed because we focus on the NN for virtual learning.
Fig. 6. A two-dimensional five-link biped robot model
θ3
Neural oscillator
θ2
θ4
θ1
θ5
Fig. 7. A CPG for a biped robot
A two-dimensional five-link model of a bipedal robot is used, as shown in Fig. 6. The five joint angles defined in Fig. 6 are controlled by five neural oscillators, one per joint angle. Each link is 100 cm long. The task of the bipedal robot is to climb five stairs. The height and depth of the stairs are used as
Table 1. Dimensions of representative stairs

  Stairs   Height [cm]   Depth [cm]
  A        30            60
  B        10            70
  C        30            100
  D        10            90
Fig. 8. A PNN used for experiments: a main NN with a sub NN for the hidden layer (outputs Δw_h, Δh_h) and a sub NN for the output layer (outputs Δw_o, Δh_o)
environmental data. The robot is modeled only kinematically in the virtual environment; dynamics is ignored. The CPG shown in Fig. 7 is used to control the bipedal robot. We used Matsuoka's neural oscillator model [9] in the CPG. In this study, 16 connection weights w and five constant external inputs u0 are defined as the CPG input for controlling the bipedal robot. The CPG inputs for climbing stairs are obtained by exploring their parameter space with a GA and are optimized for each environment. The internal parameters of the CPG are also obtained by a GA but are kept constant across all environments, because exploring all CPG parameters including the internal ones would take too much time; they are optimized for walking on a horizontal plane. The four pairs shown in Table 1 are defined as the representative data for NN training. The NNs therefore have 2 inputs and 21 outputs. The following targets are compared to evaluate their generalization capability:
– CPG whose parameters are optimized for each of the three representative environments, i.e., stairs A, B, and C
– MLP trained for stairs A, B, and C (MLP-I)
– MLP trained for stairs D after being trained for stairs A, B, and C (MLP-II)
– PNN trained in the same way as the above MLPs
These targets are optimized and trained for one to four kinds of stairs. The generalization capability of each target is measured by the number of conditions, different from the representative stairs, under which the biped robot can climb five
stairs successfully. The stair height ranges from 4 cm to 46 cm and the depth from 40 cm to 110 cm. The MLPs have 30 neurons in a hidden layer. The initial weights of the MLPs are drawn uniformly at random from ±0.3. The PNN used in this paper has two sub NNs, as shown in Fig. 8. The main NN of the PNN is the same as the above MLP. The sub NN for the hidden layer has 100 neurons and 90 outputs; the sub NN for the output layer has 600 neurons and 561 outputs. All initial weights of the PNN are given in the same way as for the MLPs. One learning cycle is defined as presenting stairs A, B, and C to the MLP or PNN sequentially. MLP-I is trained on the three kinds of stairs for 10000 learning cycles, after which the sum of squared errors is well below 10^-7. MLP-II is trained on stairs D for 1000 learning cycles after being trained on the three kinds of stairs. As shown below, although the number of learning cycles for stairs D is small, MLP-II forgets stairs A, B, and C. The training condition for the main NN of the PNN is the same as for MLP-I. Incremental learning of the PNN for stairs D means that the two sub NNs are trained on stairs D while the main NN is kept constant; the sub NNs are trained for 700 learning cycles. The learning rate of each NN is optimized through preliminary experiments, because performance depends heavily on it: we used the learning rate minimizing the sum of squared errors under a fixed number of learning cycles for each NN. The learning rates used for MLP-I and the sub NNs of the PNN are 0.30 and 0.12, respectively.
4.2 Experimental Results
In Fig. 9, an x mark denotes a condition under which the robot successfully climbs five stairs, and a circle denotes a condition given as representative data. Figs. 9(a)–(c) show that the successful conditions spread around the representative stairs to some degree: as already known, the CPG has a certain generalization capability by itself. However, the generalization capability of the CPG is not large enough to cover stairs A, B, and C simultaneously. On the other hand, Fig. 9(d) shows that MLP-I covers conditions intermediate between stairs A, B, and C; the effect of virtual learning with NNs is clear. Figs. 9(e) and (f) concern incremental learning. Fig. 9(e) shows that the PNN succeeds in incremental learning; moreover, for conditions near stairs C, the generalization capability of the PNN is larger than that of MLP-I. Fig. 9(f) shows that MLP-II forgot the preceding learning on stairs A, B, and C and fails at incremental learning. It is known that incremental learning of an MLP with BP fails because the connection weights are overwritten by training on a new dataset. In a PNN, the main NN is held constant after training on the initial dataset, so the main NN does not forget it; incremental learning is realized by the sub NNs. The open questions for the sub NNs are the effects of adjusting the connection weights, i.e., whether incremental learning succeeds and whether performance on the initial dataset degrades. The experimental results show that the PNN realized incremental learning and that incremental learning increased its generalization capability.
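The mechanism discussed above, a frozen main network plus an additive correction trained only on the new data, can be illustrated with a deliberately simplified sketch. Linear least-squares models stand in for the main NN and sub NNs, and the gating function and all names are our own illustration, not the authors' implementation:

```python
import numpy as np

# Toy 1-D regression: "representative" data A and "new" data B.
xa = np.linspace(-1.0, 0.0, 20); ya = np.sin(3.0 * xa)
xb = np.linspace(0.5, 1.0, 10);  yb = np.sin(3.0 * xb)

def features(x):
    # Polynomial basis standing in for the features of a trained network.
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# "Main NN": least-squares fit on A only, then frozen.
w_main, *_ = np.linalg.lstsq(features(xa), ya, rcond=None)

# Naive incremental learning: refitting the same weights on B alone
# overwrites what was learned on A (catastrophic forgetting).
w_naive, *_ = np.linalg.lstsq(features(xb), yb, rcond=None)

# "Sub NN": additive correction trained only on the residual error for B,
# gated so it is near zero on A's input region; the main model is untouched.
def gate(x):
    return 1.0 / (1.0 + np.exp(-40.0 * (x - 0.3)))  # ~0 on [-1, 0], ~1 on [0.5, 1]

resid_b = yb - features(xb) @ w_main
w_sub, *_ = np.linalg.lstsq(features(xb) * gate(xb)[:, None], resid_b, rcond=None)

def predict_pnn(x):
    return features(x) @ w_main + (features(x) * gate(x)[:, None]) @ w_sub

err_a_main  = np.mean((features(xa) @ w_main  - ya) ** 2)  # old task, frozen main
err_a_naive = np.mean((features(xa) @ w_naive - ya) ** 2)  # old task after overwrite
err_a_pnn   = np.mean((predict_pnn(xa) - ya) ** 2)         # old task, main + sub
err_b_pnn   = np.mean((predict_pnn(xb) - yb) ** 2)         # new task, main + sub
```

The naive refit destroys performance on the old data, while the frozen-main-plus-correction scheme keeps it essentially intact and still fits the new data.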
[Fig. 9 plots: stair height (4–46 cm) versus depth (40–110 cm) for six cases, with representative stairs A–D marked: (a) CPG without NN for data A; (b) CPG without NN for data B; (c) CPG without NN for data C; (d) trained NN with data A, B, and C; (e) trained proposed NN with data D after the three data; (f) trained conventional NN with data D after the three data.]
Fig. 9. A comparison of generalization capability of CPG and MLP and PNN in stair climbing
5 Conclusions
We proposed a new type of NN for incremental learning in a virtual learning system. The proposed NN adjusts the weights and thresholds of a trained NN externally to slightly change its mapping. This paper demonstrated numerical experiments with a two-dimensional five-link biped robot and a stair-climbing task. We showed that the PNN succeeds in incremental learning and has generalization capability to some degree. This study is limited to verifying our approach. In future work, we will study the PNN with as much data as an actual robot needs and compare it with related work quantitatively.
References
1. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of Honda humanoid robot. In: Proc. IEEE ICRA, vol. 2, pp. 1321–1326 (1998)
2. Mahadevan, S., Connell, J.: Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, 311–365 (1992)
3. Kuniyoshi, Y., Sangawa, S.: Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model. Biological Cybernetics 95, 589–605 (2006)
4. Yamashita, I., Yokoi, H.: Control of a biped robot by using several virtual environments. In: Proceedings of the 22nd Annual Conference of the Robotics Society of Japan, 1K25 (in Japanese) (2004)
5. Kawano, T., Yamashita, I., Yokoi, H.: Control of the bipedal robot generating the target by the simulation in virtual space (in Japanese). IEICE Technical Report 104, 65–69 (2004)
6. Haruno, M., Wolpert, D.M., Kawato, M.: MOSAIC model for sensorimotor learning and control. Neural Computation 13, 2201–2220 (2001)
7. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10, 2047–2084 (1998)
8. Yamauchi, K., Hayami, J.: Incremental learning and model selection for radial basis function network through sleep. IEICE Transactions on Information and Systems E90-D, 722–735 (2007)
9. Matsuoka, K.: Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biological Cybernetics 52, 367–376 (1985)
Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks Peter Tiňo University of Birmingham, Birmingham, UK [email protected]
Abstract. Self-organizing neural networks (SONN) driven by softmax weight renormalization are capable of finding high quality solutions of difficult assignment optimization problems. The renormalization is shaped by a temperature parameter - as the system cools down the assignment weights become increasingly crisp. It has been reported that SONN search process can exhibit complex adaptation patterns as the system cools down. Moreover, there exists a critical temperature setting at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states. To shed light on such observed phenomena, we present a detailed bifurcation study of the renormalization process. As SONN cools down, new renormalization equilibria emerge in a strong structure leading to a complex skeleton of saddle type equilibria surrounding an unstable maximum entropy point, with decision enforcing “one-hot” stable equilibria. This, in synergy with the SONN input driving process, can lead to sensitivity to annealing schedules and adaptation dynamics exhibiting signatures of complex dynamical behavior. We also show that (as hypothesized in earlier studies) the intermittent search by SONN can occur only at temperatures close to the first (symmetry breaking) bifurcation temperature.
1 Introduction
For almost three decades there has been energetic research activity on the application of neural computation techniques to solving difficult combinatorial optimization problems. The self-organizing neural network (SONN) [1] constitutes an example of a successful neural-based methodology for solving 0-1 assignment problems. SONN has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in mobile communications. As in most self-organizing systems, the dynamics of SONN adaptation is driven by a synergy of cooperation and competition. In the competition phase, for each item to be assigned, the best candidate for the assignment is selected and the corresponding assignment weight is increased. In the cooperation phase, the assignment weights of other candidates that were likely to be selected, but were not quite as strong as the selected one, get increased as well, albeit to a lesser
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 405–414, 2008.
© Springer-Verlag Berlin Heidelberg 2008
406
P. Tiňo
degree. The assignment weights need to be positive and sum to 1. Therefore, after each SONN adaptation phase, the assignment weights need to be renormalized back onto the standard simplex, e.g. via the softmax function [2]. When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a temperature parameter T > 0. As the system cools down, the assignments become increasingly crisp. In the original setting, the SONN is annealed so that a single high quality solution to an assignment problem is found. Yet, renormalization onto the standard simplex is a double-edged sword: on the one hand, SONN with assignment weight renormalization has empirically shown sensitivity to annealing schedules; on the other hand, the quality of solutions can be greatly improved [3]. Interestingly enough, it has been reported recently [4] that there exists a critical temperature T∗ at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states of the SONN adaptation dynamics. It is hypothesised that the critical temperature may be closely related to the symmetry breaking bifurcation of equilibria in the autonomous softmax dynamics. At present there is still no theory regarding the dynamics of SONN adaptation driven by the softmax renormalization. Consequently, the processes of crystallising a solution in an annealed version of SONN, or of sampling the solution space in the intermittent search regime, are far from being understood. The first steps towards a theoretical underpinning of SONN adaptation driven by softmax renormalization were taken in [5,4,6]. For example, in [5] SONN is treated as a dynamical system with a bifurcation parameter T. The cooperation phase was not included in the model. The renormalization process was empirically shown to result in complicated bifurcation patterns revealing the complex nature of the search process inside SONN as the system gets annealed.
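The effect of the temperature on the crispness of the renormalized assignment weights can be sketched as follows (a generic illustration with arbitrarily chosen weights, not tied to any particular SONN instance):

```python
import numpy as np

def softmax(a, T):
    """Softmax renormalization with temperature T > 0."""
    e = np.exp(a / T)
    return e / e.sum()

a = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # hypothetical assignment weights
hot  = softmax(a, T=10.0)   # high temperature: nearly uniform assignments
cold = softmax(a, T=0.01)   # low temperature: nearly crisp "one-hot" assignment
```

Both outputs remain on the standard simplex; only the crispness changes as the system is cooled.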
More recently, Kwok and Smith [4] suggested studying the SONN adaptation dynamics by concentrating on the autonomous renormalization process, since it is this process that underpins the search dynamics in the SONN. In [6] we initiated a rigorous study of the equilibria of the autonomous renormalization process. Based on dynamical properties of the autonomous renormalization, we found analytical approximations to the critical temperature T∗ as a function of SONN size. In this paper we complement [6] by reporting a detailed bifurcation study of the renormalization process and give a precise characterization and the stability types of the equilibria as they emerge during the annealing process. An interesting and intricate equilibrium structure emerges as the system cools down, explaining the empirically observed complexity of SONN adaptation during intermediate stages of the annealing process. The analysis also clarifies why the intermittent search by SONN occurs near the first (symmetry breaking) bifurcation temperature of the renormalization step, as was experimentally verified in [4,6]. Due to space limitations, we cannot fully prove the statements presented in this study. Detailed proofs can be found in [7] and will be published elsewhere.
2 Self-organizing Neural Network and Iterative Softmax
First, we briefly introduce the Self-Organizing Neural Network (SONN) endowed with weight renormalization for solving assignment optimization problems (see e.g. [4]). Consider a finite set of input elements (neurons) i ∈ I = {1, 2, ..., M} that need to be assigned to outputs (output neurons) j ∈ J = {1, 2, ..., N}, so that some global cost of an assignment A : I → J is minimized. The partial cost of assigning i ∈ I to j ∈ J is denoted by V(i, j). The "strength" of assigning i to j is represented by the "assignment weight" w_{i,j} ∈ (0, 1). The SONN algorithm can be summarized as follows. The connection weights w_{i,j}, i ∈ I, j ∈ J, are first initialized to small random values. Then, repeatedly, an output item j_c ∈ J is chosen and the partial costs V(i, j_c) incurred by assigning all possible input elements i ∈ I to j_c are calculated, in order to select the "winner" input element (neuron) i(j_c) ∈ I that minimizes V(i, j_c). The "neighborhood" B_L(i(j_c)) of size L of the winner node i(j_c) consists of the L nodes i ≠ i(j_c) that yield the smallest partial costs V(i, j_c). Weights from nodes i ∈ B_L(i(j_c)) to j_c get strengthened:

\[ w_{i,j_c} \leftarrow w_{i,j_c} + \eta(i)\,(1 - w_{i,j_c}), \qquad i \in B_L(i(j_c)), \tag{1} \]
where η(i) is proportional to the quality of the assignment i → j_c, as measured by V(i, j_c). The weight vector w_i = (w_{i,1}, w_{i,2}, ..., w_{i,N}) of each input node i ∈ I is then renormalized using softmax:

\[ w_{i,j} \leftarrow \frac{\exp(w_{i,j}/T)}{\sum_{k=1}^{N} \exp(w_{i,k}/T)}. \tag{2} \]
We will refer to a SONN for solving an (M, N)-assignment problem as an (M, N)-SONN. As mentioned earlier, following [4,6] we strive to understand the search dynamics inside SONN by analyzing the autonomous dynamics of the renormalization update step (2) of the SONN algorithm. The weight vector w_i of each of the M neurons in an (M, N)-SONN lives in the standard (N−1)-simplex

\[ S_{N-1} = \Big\{ w = (w_1, w_2, ..., w_N) \in \mathbb{R}^N \;\Big|\; w_i \ge 0,\ i = 1, 2, ..., N,\ \text{and}\ \sum_{i=1}^{N} w_i = 1 \Big\}. \]
Given a value of the temperature parameter T > 0, the softmax renormalization step in SONN adaptation transforms the weight vector of each unit as follows:

\[ w \mapsto F(w; T) = (F_1(w; T), F_2(w; T), ..., F_N(w; T))^{\top}, \tag{3} \]

where

\[ F_i(w; T) = \frac{\exp(w_i/T)}{Z(w; T)}, \quad i = 1, 2, ..., N, \tag{4} \]

(here ⊤ denotes the transpose operator)
and \( Z(w; T) = \sum_{k=1}^{N} \exp(w_k/T) \) is the normalization factor. Formally, F maps \(\mathbb{R}^N\) to S^0_{N-1}, the interior of S_{N-1}. The linearization of F around w ∈ S^0_{N-1} is given by the Jacobian J(w; T):

\[ J(w; T)_{i,j} = \frac{1}{T}\,\big[\delta_{i,j} F_i(w; T) - F_i(w; T)\,F_j(w; T)\big], \quad i, j = 1, 2, ..., N, \tag{5} \]

where δ_{i,j} = 1 iff i = j and δ_{i,j} = 0 otherwise. The softmax map F induces on S^0_{N-1} a discrete-time dynamics known as Iterative Softmax (ISM):

\[ w(t+1) = F(w(t); T). \tag{6} \]
The renormalization step in an (M, N )-SONN adaptation involves M separate renormalizations of weight vectors of all of the M SONN units. For each temperature setting T , the structure of equilibria in the i-th system, wi (t + 1) = F(wi (t); T ), gets copied in all the other M − 1 systems. Using this symmetry, it is sufficient to concentrate on a single ISM (6). Note that the weights of different units are coupled by the SONN adaptation step (1). We will study systems for N ≥ 2.
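As a quick numerical illustration of the ISM dynamics (6) (our own sketch; N, the temperatures, and the initial vector are arbitrary choices): at high temperature the iteration contracts to the maximum entropy point, while at low temperature the trajectory is drawn towards a near-"one-hot" equilibrium.

```python
import numpy as np

def ism_step(w, T):
    """One ISM iteration w(t+1) = F(w(t); T) of Eqs. (3)-(6)."""
    e = np.exp(w / T)
    return e / e.sum()

N = 5
w_hot = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
w_cold = w_hot.copy()
for _ in range(200):
    w_hot = ism_step(w_hot, T=1.0)     # T > 1/2: unique equilibrium, (1/N) 1_N
    w_cold = ism_step(w_cold, T=0.05)  # low T: trajectory approaches a vertex
```

Each step stays on the simplex by construction; only the attracting equilibrium changes with T.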
3 Equilibria of SONN Renormalization Step
We first introduce basic concepts and notation that will be used throughout the paper. An (r−1)-simplex is the convex hull of a set of r affinely independent points in \(\mathbb{R}^m\), m ≥ r−1. A special case is the standard (N−1)-simplex S_{N-1}. The convex hull of any nonempty subset of n vertices of an (r−1)-simplex Δ, n ≤ r, is called an (n−1)-face of Δ. There are \(\binom{r}{n}\) distinct (n−1)-faces of Δ, and each (n−1)-face is an (n−1)-simplex. Given a set of n vertices w_1, w_2, ..., w_n ∈ \(\mathbb{R}^m\) defining an (n−1)-simplex Δ in \(\mathbb{R}^m\), the central point

\[ \bar{w}(\Delta) = \frac{1}{n} \sum_{i=1}^{n} w_i \tag{7} \]

is called the maximum entropy point of Δ. We will denote the set of all (n−1)-faces of the standard (N−1)-simplex S_{N-1} by P_{N,n}. The set of their maximum entropy points is denoted by Q_{N,n}, i.e.

\[ Q_{N,n} = \{\bar{w}(\Delta) \mid \Delta \in P_{N,n}\}. \tag{8} \]

The n-dimensional column vectors of 1's and 0's are denoted by 1_n and 0_n, respectively. Note that \(\bar{w}_{N,n} = \frac{1}{n}(1_n^\top, 0_{N-n}^\top)^\top \in Q_{N,n}\). In addition, all the other elements of Q_{N,n} can be obtained by simply permuting the coordinates of \bar{w}_{N,n}. Due to this symmetry, we will be able to develop most of the material using \bar{w}_{N,n} only and then transfer the results to permutations of \bar{w}_{N,n}. The maximum entropy point \(\bar{w}_{N,N} = N^{-1} 1_N\) of the standard (N−1)-simplex S_{N-1} will be denoted
simply by \bar{w}. To simplify the notation we will use \bar{w} to denote both the maximum entropy point of S_{N-1} and the vector \bar{w} − 0_N. We showed in [6] that \bar{w} is a fixed point of ISM (6) for any temperature setting T, and that all the other fixed points w = (w_1, w_2, ..., w_N) of ISM have exactly two different coordinate values, w_i ∈ {γ_1, γ_2}, such that N^{-1} < γ_1 < N_1^{-1} and 0 < γ_2 < N^{-1}, where N_1 is the number of coordinates γ_1 larger than N^{-1}. Since w ∈ S^0_{N-1}, we have

\[ \gamma_2 = \frac{1 - N_1 \gamma_1}{N - N_1}. \tag{9} \]

The number of coordinates γ_2 smaller than N^{-1} is denoted by N_2; obviously, N_2 = N − N_1. If \(w = (\gamma_1 1_{N_1}^\top, \gamma_2 1_{N_2}^\top)^\top\) is a fixed point of ISM (6), so are all \(\binom{N}{N_1}\) distinct permutations of it. We collect w and its permutations in the set

\[ E_{N,N_1}(\gamma_1) = \Big\{ v \;\Big|\; v \text{ is a permutation of } \Big(\gamma_1 1_{N_1}^\top,\ \tfrac{1 - N_1\gamma_1}{N - N_1}\, 1_{N-N_1}^\top\Big)^{\!\top} \Big\}. \tag{10} \]

The fixed points in E_{N,N_1}(γ_1) exist if and only if the temperature parameter T is set to [6]
\[ T_{N,N_1}(\gamma_1) = (N\gamma_1 - 1)\left[-(N - N_1)\,\ln\!\left(1 - \frac{N\gamma_1 - 1}{(N - N_1)\,\gamma_1}\right)\right]^{-1}. \tag{11} \]
We will show that as the system cools down, an increasing number of equilibria emerge in a strong structure. Let w, v ∈ S_{N-1} be two points on the standard simplex. The line from w to v is parametrized as

\[ \ell(\tau; w, v) = w + \tau \cdot (v - w), \qquad \tau \in [0, 1]. \tag{12} \]

Theorem 1. All equilibria of ISM (6) lie on the lines connecting the maximum entropy point \bar{w} of S_{N-1} with the maximum entropy points of its faces. In particular, for 0 < N_1 < N and γ_1 ∈ (N^{-1}, N_1^{-1}), all fixed points from E_{N,N_1}(γ_1) lie on lines ℓ(τ; \bar{w}, \tilde{w}), where \tilde{w} ∈ Q_{N,N_1}.

Sketch of the Proof: Consider the maximum entropy point \(\bar{w}_{N,N_1} = \frac{1}{N_1}(1_{N_1}^\top, 0_{N_2}^\top)^\top\) of an (N_1−1)-face of S_{N-1}. Then \(w(\gamma_1) = (\gamma_1 1_{N_1}^\top, \gamma_2 1_{N_2}^\top)^\top\) lies on the line ℓ(τ; \bar{w}, \bar{w}_{N,N_1}) for the parameter setting τ = 1 − Nγ_2. Q.E.D.

The result is illustrated in Fig. 1. As the (M, N)-SONN cools down, the ISM equilibria emerge on lines connecting \bar{w} with the maximum entropy points of faces of S_{N-1} of increasing dimensionality. Moreover, on each such line there can be at most two ISM equilibria.

Theorem 2. For N_1 < N/2, there exists a temperature T_E(N, N_1) > N^{-1} such that for T ∈ (0, T_E(N, N_1)], ISM fixed points in E_{N,N_1}(γ_1) exist for some γ_1 ∈
Fig. 1. Positions of equilibria of SONN renormalization, illustrated for the case of 4-dimensional weight vectors w. The ISM operates on the standard 3-simplex S_3 and its equilibria can only be found on the lines connecting the maximum entropy point \bar{w} (filled circle) with the maximum entropy points of its faces. Triangles, squares and diamonds represent maximum entropy points of 0-faces (vertices), 1-faces (edges) and 2-faces (facets), respectively.
(N^{-1}, N_1^{-1}), and no ISM fixed points in E_{N,N_1}(γ_1), for any γ_1 ∈ (N^{-1}, N_1^{-1}), can exist at temperatures T > T_E(N, N_1). For each temperature T ∈ (N^{-1}, T_E(N, N_1)), there are two coordinate values γ_1^-(T) and γ_1^+(T), N^{-1} < γ_1^-(T) < γ_1^+(T) < N_1^{-1}, such that ISM fixed points in both E_{N,N_1}(γ_1^-(T)) and E_{N,N_1}(γ_1^+(T)) exist at temperature T. Furthermore, as the temperature decreases, γ_1^-(T) decreases towards N^{-1}, while γ_1^+(T) increases towards N_1^{-1}. For temperatures T ∈ (0, N^{-1}], there is exactly one γ_1(T) ∈ (N^{-1}, N_1^{-1}) such that ISM equilibria in E_{N,N_1}(γ_1(T)) exist at temperature T.

Sketch of the Proof: The temperature function T_{N,N_1}(γ_1) (11) is concave and can be continuously extended to [N^{-1}, N_1^{-1}) with T_{N,N_1}(N^{-1}) = N^{-1} and lim_{γ_1 → N_1^{-1}} T_{N,N_1}(γ_1) = 0 < N^{-1}. The slope of T_{N,N_1}(γ_1) at N^{-1} is positive for N_1 < N/2. Q.E.D.

Theorem 3. The bifurcation temperature T_E(N, N_1) is decreasing with increasing number N_1 of equilibrium coordinates larger than N^{-1}.

Sketch of the Proof: It can be shown that, for any feasible value of γ_1 > N^{-1}, if there are two fixed points w ∈ E_{N,N_1}(γ_1) and w′ ∈ E_{N,N_1′}(γ_1) of ISM such that N_1 < N_1′, then w exists at a higher temperature than w′ does. For a given N_1 < N/2, the bifurcation temperature T_E(N, N_1) corresponds to the maximum of T_{N,N_1}(γ_1) on γ_1 ∈ (N^{-1}, N_1^{-1}). It follows that N_1 < N_1′ implies T_E(N, N_1) > T_E(N, N_1′). Q.E.D.
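The fixed-point temperature relation (11) can be checked numerically: picking any feasible γ_1 and setting T by (11), the two-valued vector w is (up to floating-point error) an exact fixed point of F. A sketch with arbitrarily chosen N, N_1, γ_1:

```python
import numpy as np

def F(w, T):
    """Softmax map F(w; T) of Eqs. (3)-(4)."""
    e = np.exp(w / T)
    return e / e.sum()

def T_fixed(N, N1, g1):
    """Temperature T_{N,N1}(gamma_1) from Eq. (11)."""
    return (N * g1 - 1.0) / (-(N - N1) * np.log(1.0 - (N * g1 - 1.0) / ((N - N1) * g1)))

N, N1, g1 = 10, 2, 0.30                    # arbitrary feasible choice, 1/N < g1 < 1/N1
g2 = (1.0 - N1 * g1) / (N - N1)            # Eq. (9)
w = np.array([g1] * N1 + [g2] * (N - N1))  # a point of E_{N,N1}(g1)
T = T_fixed(N, N1, g1)
```

By construction w sums to 1, and F(w; T) reproduces w exactly at this temperature.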
Theorem 4. If N/2 ≤ N_1 < N, then for each temperature T ∈ (0, N^{-1}) there is exactly one coordinate value γ_1(T) ∈ (N^{-1}, N_1^{-1}) such that ISM fixed points in E_{N,N_1}(γ_1(T)) exist at temperature T. No ISM fixed points in E_{N,N_1}(γ_1), for any γ_1 ∈ (N^{-1}, N_1^{-1}), can exist at temperatures T > N^{-1}. As the temperature decreases, γ_1(T) increases towards N_1^{-1}.

Sketch of the Proof: Similar to the proof of Theorem 2, but this time the slope of T_{N,N_1}(γ_1) at N^{-1} is not positive. Q.E.D.
As the temperature continues decreasing, on each line, the single fixed point splits into two fixed points, one moves towards w, the other moves towards in QN,N1 . At temperatures below the corresponding maximum entropy point w N −1 , only the fixed points moving towards the maximum entropy points of faces of SN −1 exist. In the low temperature regime, 0 < T < N −1 , a fixed point occurs on every w ∈ QN,N1 , N1 = N/2 , N/2 + 1, ..., N − 1. Here, x denotes line (τ ; w, w), the smallest integer y, such that y ≥ x. As the temperature decreases, the fixed ∈ QN,N1 points w move towards the corresponding maximum entropy points w of (N1 − 1)-faces of SN −1 . The process of creation of new fixed points and their flow as the temperature cools down is demonstrated in figure 2 for an ISM operating on 9-simplex S9 (N = 10). We plot against each temperature setting T the values of the larger coordinate γ1 > N −1 = 0.1 of the fixed points existing at T . The behavior of ISM in the neighborhood of its equilibrium w is given by the structure of stable and unstable manifolds of the linearized system at w outlined in the next section.
[Fig. 2 plot: bifurcation structure of ISM equilibria (N = 10); the larger fixed-point coordinate γ_1 is plotted against temperature T ∈ [0, 0.25] for branches N_1 = 1, 2, 3, 4, with the bifurcation temperatures T_E(10, 1), ..., T_E(10, 4) marked.]
Fig. 2. Demonstration of the process of creation of new ISM fixed points and their flow as the system temperature cools down. Here N = 10, i.e. the ISM operates on the standard 9-simplex S_9. Against each temperature setting T, the values of the larger coordinate γ_1 > N^{-1} of the fixed points existing at T are plotted. The horizontal bold line corresponds to the maximum entropy point \bar{w} = 10^{-1} 1_{10}.
4 Stability Analysis of Renormalization Equilibria
The maximum entropy point \bar{w} is not only a fixed point of ISM (6); regarded as the vector \bar{w} − 0_N, it is also an eigenvector of the Jacobian J(w; T) at any w ∈ S^0_{N-1}, with eigenvalue λ = 0. This simply reflects the fact that the ISM renormalization acts on the standard simplex S_{N-1}, which is a subset of an (N−1)-dimensional hyperplane with normal vector 1_N. We have already seen that \bar{w} plays a special role in the ISM equilibria structure: all equilibria lie on lines going from \bar{w} towards the maximum entropy points of faces of S_{N-1}. The lines themselves are of special interest, since we will show that they are invariant manifolds of the ISM renormalization and that their directional vectors are eigenvectors of the ISM Jacobians at the fixed points located on them.

Theorem 5. Consider ISM (6) and 1 ≤ N_1 < N. Then for each maximum entropy point \tilde{w} ∈ Q_{N,N_1} of an (N_1−1)-face of S_{N-1}, the line ℓ(τ; \bar{w}, \tilde{w}), τ ∈ [0, 1), connecting the maximum entropy point \bar{w} with \tilde{w}, is an invariant set under the ISM dynamics.

Sketch of the Proof: The result follows from plugging the parametrization ℓ(τ; \bar{w}, \tilde{w}) into (3) and realizing (after some manipulation) that for each τ ∈ [0, 1) there exists a parameter setting τ_1 ∈ [0, 1) such that F(ℓ(τ; \bar{w}, \tilde{w})) = ℓ(τ_1; \bar{w}, \tilde{w}). Q.E.D.
The proofs of the next two theorems are rather involved and we refer the interested reader to [7]. Theorem 6. Let w ∈ EN,N1 (γ1 ) be a fixed point of ISM (6). Then, w∗ = w − w is an eigenvector of the Jacobian J(w; TN,N1 (γ1 )) with the corresponding eigenvalue λ∗ , where 1. if N/2 ≤ N1 ≤ N − 1, then 0 < λ∗ < 1. 2. if 1 ≤ N1 < N/2 and N −1 < γ1 < (2N1 )−1 , then λ∗ > 1. 3. if 1 ≤ N1 < N/2 , then there exists γ¯1 ∈ ((2N1 )−1 , N1−1 ), such that for all ISM fixed points w ∈ EN,N1 (γ1 ) with γ1 ∈ (¯ γ1 , N1−1 ), 0 < λ∗ < 1. We have established that for an ISM equilibrium w, both w and w∗ = w− w are eigenvectors of the ISM Jacobian at w. Stability types of the remaining N − 2 eigendirections are characterized in the next theorem. Theorem 7. Consider an ISM fixed point w ∈ EN,N1 (γ1 ) for some 1 ≤ N1 < N and N −1 < γ1 < N1−1 . Then, there are N − N1 − 1 and N1 − 1 eigenvectors of Jacobian J(w; TN,N1 (γ1 )) of ISM at w with the same associated eigenvalue 0 < λ− < 1 and λ+ > 1, respectively.
5 Discussion – SONN Adaptation Dynamics
In the intermittent search regime of SONN [4], the search is driven by pulling promising solutions temporarily to the vicinity of the 0-1 "one-hot" assignment values, the vertices of S_{N-1} (0-dimensional faces of the standard simplex). The critical temperature for intermittent search should correspond to the case where attractive forces already exist in the form of attractive equilibria near the "one-hot" assignment suggestions (vertices of S_{N-1}), but the convergence rates towards such equilibria are sufficiently weak that the intermittent character of the search is not destroyed. This occurs at temperatures lower than, but close to, the first bifurcation temperature T_E(N, 1) (for more details, see [7]). In [4] it is hypothesised that there is a strong link between the critical temperature for intermittent search by SONN and the bifurcation temperatures of the autonomous ISM. In [6] we hypothesised (in accordance with [4]) that even though there are many potential ISM equilibria, the critical bifurcation points are related only to equilibria near the vertices of S_{N-1}, as only those could be guaranteed to be stable by the theory (stability bounds) of [6], even though that theory did not prevent the other equilibria from being stable. In this study, we have rigorously shown that stable equilibria can in fact exist only near the vertices of S_{N-1}, on the lines connecting \bar{w} with the vertices: only for N_1 = 1 are there no expansive eigendirections of the local Jacobian with λ^+ > 1. As the SONN system cools down, more and more ISM equilibria emerge on the lines connecting the maximum entropy point \bar{w} of the standard simplex S_{N-1} with the maximum entropy points of its faces of increasing dimensionality. With decreasing temperature, the dimensionality of the stable and unstable manifolds
of the linearized ISM at the emerging equilibria decreases and increases, respectively. At lower temperatures, this creates a peculiar pattern of saddle-type equilibria surrounding the unstable maximum entropy point \bar{w}, with decision-enforcing "one-hot" stable equilibria located near the vertices of S_{N-1}. The trajectory towards the solution as the SONN system anneals is shaped by this complex skeleton of saddle-type equilibria with stable/unstable manifolds of varying dimensionalities and can therefore, in synergy with the input driving process, exhibit signatures of very complex dynamical behavior, as reported e.g. in [5]. Once the temperature is sufficiently low, the attraction rates of the stable equilibria near the vertices of S_{N-1} are so high that the found solution is virtually pinned down by the system. Even though the present study clarifies the prominent role of the first (symmetry breaking) bifurcation temperature T_E(N, 1) in obtaining the SONN intermittent search regime and helps to understand the origin of complex SONN adaptation patterns in the annealing regime, many interesting open questions remain. For example, no theory yet exists of the role of the abstract neighborhood B_L(i(j_c)) of the winner node i(j_c) in the cooperative phase of SONN adaptation. We conclude by noting that it may be possible to apply the theory of ISM to other assignment optimization systems that incorporate softmax assignment weight renormalization, e.g. [8,9].
References

1. Smith, K., Palaniswami, M., Krishnamoorthy, M.: Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks 9, 1301–1318 (1998)
2. Guerrero, F., Lozano, S., Smith, K., Canca, D., Kwok, T.: Manufacturing cell formation using a new self-organizing neural network. Computers & Industrial Engineering 42, 377–382 (2002)
3. Kwok, T., Smith, K.: Improving the optimisation properties of a self-organising neural network with weight normalisation. In: Proceedings of the ICSC Symposia on Intelligent Systems and Applications (ISA 2000), Paper No. 1513-285 (2000)
4. Kwok, T., Smith, K.: Optimization via intermittency with a self-organizing neural network. Neural Computation 17, 2454–2481 (2005)
5. Kwok, T., Smith, K.: A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks 15, 84–88 (2004)
6. Tiňo, P.: Equilibria of iterative softmax and critical temperatures for intermittent search in self-organizing neural networks. Neural Computation 19, 1056–1081 (2007)
7. Tiňo, P.: Bifurcation structure of equilibria of adaptation dynamics in self-organizing neural networks. Technical Report CSRP-07-12, University of Birmingham, School of Computer Science (2007), http://www.cs.bham.ac.uk/~pxt/PAPERS/ism.bifurc.tr.pdf
8. Gold, S., Rangarajan, A.: Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks 2, 381–399 (1996)
9. Rangarajan, A.: Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition 33, 635–649 (2000)
Variable Selection for Multivariate Time Series Prediction with Neural Networks

Min Han and Ru Wei

School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]
Abstract. This paper proposes a variable selection algorithm based on neural networks for multivariate time series prediction. Sensitivity analysis of the neural network error function with respect to the inputs is developed to quantify the saliency of each input variable. The input nodes with low sensitivity are then pruned along with their connections, which corresponds to deleting the redundant variables. The proposed algorithm is tested on both computer-generated time series and practical observations. Experimental results show that the proposed algorithm outperforms other variable selection methods, achieving a more significant reduction in the training data size and higher prediction accuracy.

Keywords: Variable selection, neural network pruning, sensitivity, multivariate prediction.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 415–425, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Nonlinear and chaotic time series prediction is a practical technique for studying the characteristics of complicated dynamics from measurements. Usually, multiple variables are required, since the output may depend not only on its own previous values but also on the past values of other variables. However, we cannot be sure that all of the variables are equally important: some of them may be redundant or even irrelevant. If these unnecessary input variables are included in the prediction model, the parameter estimation process becomes more difficult, and the overall results may be poorer than if only the required inputs are used [1]. Variable selection addresses this problem by discarding the redundant variables, which reduces the number of input variables and the complexity of the prediction model. A number of variable selection methods based on statistical or heuristic tools have been proposed, such as Principal Component Analysis (PCA) and Discriminant Analysis. These techniques attempt to reduce the dimensionality of the data by creating new variables that are linear combinations of the original ones. Their major difficulty comes from the separation of the variable selection process and the prediction process. Therefore, variable selection using neural networks is attractive, since one can globally adapt the variable selector together with the predictor. Variable selection with a neural network can be seen as a special case of architecture pruning [2], where the pruning of input nodes is equivalent to removing the corresponding
variables from the original data set. One approach to pruning is to estimate the sensitivity of the output to the exclusion of each unit. There are several ways to perform sensitivity analysis with a neural network. Most of them are weight-based [3], relying on the idea that weights connected to important variables attain large absolute values, while weights connected to unimportant variables attain values near zero. However, smaller weights usually result in smaller inputs to neurons and hence larger sigmoid derivatives, which increases the output's sensitivity to the input. Mozer and Smolensky [4] introduced a method that estimates which units are least important and can be deleted during training. Gevrey et al. [5] compute the partial derivatives of the neural network output with respect to the input neurons and compare the performance of several methods for evaluating the relative contribution of the input variables. This paper concentrates on a neural-network-based variable selection algorithm as the tool to determine which variables are to be discarded. A simple sensitivity criterion of the neural network error function with respect to each input is developed to quantify the saliency of each input variable. The input nodes are then arranged in decreasing sensitivity order, so that the neural network can be pruned efficiently by discarding the last items with low sensitivity. The variable selection algorithm is applied to both computer-generated data and practical observations and is compared with the PCA variable reduction method. The rest of this paper is organized as follows. Section 2 reviews the basic concept of multivariate time series prediction and a statistical variable selection method. Section 3 explains the sensitivity analysis with neural networks in detail. Section 4 presents two simulation results. The work is concluded in Section 5.
2 Modeling Multivariate Chaotic Time Series

The basic idea of chaotic time series analysis is that a complex system can be described by a strange attractor in its phase space. Therefore, the reconstruction of the equivalent state space is usually the first step in chaotic time series prediction.

2.1 Multivariate Phase Space Reconstruction

Phase space reconstruction from observations can be accomplished by choosing a suitable embedding dimension and time delay. Consider an M-dimensional time series {X_i, i = 1, 2, …, M}, where X_i = [x_i(1), x_i(2), …, x_i(N)]^T and N is the length of each scalar time series. As in the case of univariate time series (where M = 1), the reconstructed phase space can be built as [6]:
X(t) = [x_1(t), x_1(t - \tau_1), \ldots, x_1(t - (d_1 - 1)\tau_1), \ldots, x_M(t), x_M(t - \tau_M), \ldots, x_M(t - (d_M - 1)\tau_M)]    (1)

where t = L, L + 1, …, N, with L = max_{1 \le i \le M} (d_i - 1)\tau_i + 1, and \tau_i and d_i (i = 1, 2, …, M) are the time delays and embedding dimensions of each time series, respectively. The delay time \tau_i can be calculated using the mutual information method, and the embedding dimension can be computed with the false nearest neighbor method.
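Assuming the delays τ_i and dimensions d_i have already been estimated (e.g. by mutual information and false nearest neighbors), the reconstruction of Eq. (1) can be sketched as follows; all names are illustrative, not taken from the authors' code:

```python
import numpy as np

def embed_multivariate(series, delays, dims):
    """Reconstruct the multivariate phase space of Eq. (1).

    series : list of M 1-D arrays, each of length N
    delays : list of M time delays tau_i
    dims   : list of M embedding dimensions d_i
    Returns an array whose row at time t is X(t).
    """
    N = len(series[0])
    # first usable index (0-based): L = max_i (d_i - 1) * tau_i
    L = max((d - 1) * tau for d, tau in zip(dims, delays))
    rows = []
    for t in range(L, N):
        row = []
        for x, tau, d in zip(series, delays, dims):
            # x_i(t), x_i(t - tau_i), ..., x_i(t - (d_i - 1) tau_i)
            row.extend(x[t - k * tau] for k in range(d))
        rows.append(row)
    return np.array(rows)

# toy usage: two series, tau = 1 and d = 2 each -> 4-dimensional state vectors
x1 = np.arange(10.0)
x2 = np.arange(10.0) ** 2
X = embed_multivariate([x1, x2], delays=[1, 1], dims=[2, 2])
```

Each row of `X` is one reconstructed state vector; the first `L` samples are lost to the delays, exactly as in the definition of L above.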
According to Takens' embedding theorem, if D = \sum_{i=1}^{M} d_i is large enough, there exists a mapping F: X(t+1) = F{X(t)}. The evolution X(t) → X(t+1) then reflects the evolution of the original dynamical system. The problem is to find an appropriate expression for the nonlinear mapping F. Up to the present, many chaotic time series prediction models have been developed; neural networks have been widely used because of their universal approximation capabilities.

2.2 Neural Network Model
A multilayer perceptron (MLP) with a back-propagation (BP) algorithm is used as a nonlinear predictor for multivariate chaotic time series prediction. MLP is a supervised learning algorithm designed to minimize the mean square error between the computed output of the neural network and the desired output. The network usually consists of three layers: an input layer, one or more hidden layers and an output layer. Consider a three-layer MLP that contains one hidden layer. The D-dimensional delayed time series X(t) is used as the input of the network to generate the network output X(t+1). Then the neural network can be expressed as follows:

o_j = f\left( \sum_{i=0}^{N_I} x_i w_{ij}^{(I)} \right)    (2)

y_k = \sum_{j=1}^{N_H} w_{jk}^{(O)} o_j    (3)
where [x_1, x_2, …, x_{N_I}] = X(t) denotes the input signal, N_I is the number of inputs to the neural network, w_{ij}^{(I)} is the weight connecting the ith input neuron to the jth hidden neuron, o_j is the output of the jth hidden neuron, N_H is the number of neurons in the hidden layer, [y_1, y_2, …, y_{N_O}] = X(t+1) is the output, N_O is the number of output neurons, and w_{jk}^{(O)} is the weight connecting the jth hidden neuron to the kth output neuron. The activation function f(·) is the sigmoid function given by
f(x) = \frac{1}{1 + \exp(-x)}    (4)
The error function of the net is usually defined as the sum of squared errors

E = \sum_{t=1}^{N} \sum_{k=1}^{N_O} [y_k(t) - p_k(t)]^2    (5)

where p_k(t) is the desired output for unit k and N is the length of the training sample.
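Eqs. (2)–(5) can be sketched as a forward pass plus error computation. This is a generic illustration with random weights (the bias unit i = 0 of Eq. (2) is omitted for brevity), not the authors' trained model:

```python
import numpy as np

def sigmoid(x):
    # Eq. (4)
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W_in, W_out):
    """One forward pass of the three-layer MLP.
    x     : input vector of length N_I
    W_in  : (N_I, N_H) input-to-hidden weights, Eq. (2); bias omitted
    W_out : (N_H, N_O) hidden-to-output weights, Eq. (3)
    """
    o = sigmoid(x @ W_in)   # hidden activations o_j
    y = o @ W_out           # linear output units y_k
    return o, y

def sse(Y, P):
    # Eq. (5): sum of squared errors over all samples and outputs
    return float(np.sum((np.asarray(Y) - np.asarray(P)) ** 2))

rng = np.random.default_rng(0)
W_in = rng.normal(size=(27, 20))    # 27 inputs, 20 hidden units
W_out = rng.normal(size=(20, 1))    # one output
o, y = mlp_forward(np.zeros(27), W_in, W_out)
```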
2.3 Statistical Variable Selection Method
For the multivariate time series, the dimension of the reconstructed phase space is usually very high. Moreover, increasing the number of input variables leads to high complexity of the prediction model. Therefore, in many practical applications, variable selection is needed to reduce the dimensionality of the input data. The aim of variable selection in this paper is to select a subset of R inputs that retains most of the important features of the original input set; thus, D − R irrelevant inputs are discarded. Principal Component Analysis (PCA) is a traditional technique for variable selection [7]. PCA reduces the dimensionality by first decomposing the normalized input matrix X with the singular value decomposition (SVD):

X = U \Sigma V^{T}    (6)

where \Sigma = diag[s_1, s_2, …, s_p, 0, …, 0] and s_1 \ge s_2 \ge \cdots \ge s_p are the first p singular values of X arranged in decreasing order; U and V are both orthogonal matrices. The first k singular values are preserved as the principal components, and the reduced input is obtained as

Z = \tilde{U}^{T} X    (7)

where \tilde{U} consists of the first k columns of U. PCA is an efficient method to reduce the input dimension. However, we cannot be sure that the discarded factors have no influence on the prediction output, because the variable selection and prediction processes are carried out separately. A neural network selector is a good choice to combine the selection and prediction processes.
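The reduction of Eqs. (6)–(7) can be sketched with NumPy's SVD; this is a generic PCA sketch, not the authors' exact preprocessing:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the (normalized) input matrix X onto its first k
    left singular vectors: X = U S V^T (Eq. 6), Z = U_k^T X (Eq. 7).
    X is (D, N): D input variables, N samples."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]        # first k columns of U
    Z = U_k.T @ X         # reduced (k, N) representation
    return Z, s

# toy usage: 3 variables, the third a copy of the first -> rank 2
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 2.0, 3.0, 4.0]])
Xc = X - X.mean(axis=1, keepdims=True)   # center each variable
Z, s = pca_reduce(Xc, k=2)
```

Because the third variable duplicates the first, the third singular value is (numerically) zero and two components capture all the variance.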
3 Sensitivity Analysis with Neural Networks

Variable selection with neural networks can be achieved by pruning the input nodes of a neural network model based on some saliency measure, aiming to remove less relevant variables. The significance of a variable can be defined as the error when the unit is removed minus the error when it is left in place:

S_i = E_{\text{without unit } i} - E_{\text{with unit } i} = E(x_i = 0) - E(x_i = x_i)    (8)
where E is the error function defined in Eq. (5). After the neural network has been trained, a brute-force pruning method for every input is to set that input to zero and evaluate the change in the error: if the error increases too much, the input is restored; otherwise it is removed. Theoretically, this could be done by training the network under all possible subsets of the input set. However, this exhaustive search is computationally infeasible and can be very slow for large networks. This paper uses the same idea as Mozer and Smolensky [4] and approximates the sensitivity by introducing a gating term α_i for each unit such that
o_j = f\left( \sum_i w_{ij} \alpha_i o_i \right)    (9)

where o_j is the activity of unit j and w_{ij} is the weight from unit i to unit j.
The gating terms α are shown in Fig. 1, where α_i^I (i = 1, 2, …, N_I) is the gating term of the ith input neuron and α_j^H (j = 1, 2, …, N_H) is the gating term of the jth hidden neuron.
Fig. 1. The gating term for each unit
The gating term α is merely a notational convenience rather than a parameter that must be implemented in the net. If α = 0, the unit has no influence on the network; if α = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

S_i = -\left. \frac{\partial E}{\partial \alpha_i} \right|_{\alpha_i = 1}    (10)
By using the standard error back-propagation algorithm, the derivatives in Eq. (10) can be expressed in terms of the network weights as follows:

S_j^H = -\frac{\partial E}{\partial \alpha_j^H} = -\sum_{t=1}^{N} \sum_{k=1}^{N_O} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial \alpha_j^H} = \sum_{t=1}^{N} \sum_{k=1}^{N_O} \left( p_k(t) - y_k(t) \right) w_{jk}^{(O)} o_j    (11)

S_i^I = -\frac{\partial E}{\partial \alpha_i^I} = -\sum_{t=1}^{N} \sum_{k=1}^{N_O} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial \alpha_i^I} = \sum_{t=1}^{N} \sum_{k=1}^{N_O} \left( p_k(t) - y_k(t) \right) \sum_{j=1}^{N_H} w_{jk}^{(O)} o_j (1 - o_j) w_{ij}^{(I)} x_i(t)    (12)
where S_i^I is the sensitivity of the ith input neuron and S_j^H is the sensitivity of the jth hidden neuron. Thus the algorithm can prune the input nodes as well as the hidden nodes according to the sensitivity over training. However, the sensitivity fluctuates strongly when calculated directly from Eq. (11) and Eq. (12), because of the approximation in Eq. (10); occasionally an input may be deleted incorrectly. In order to reduce the dimensionality of the input vectors reliably, the sensitivity needs to be evaluated over the entire training set. This paper considers several ways to define the overall sensitivity, such as:

(1) The mean square average sensitivity:
S_{i,\text{avg}} = \frac{1}{T} \sum_{t=1}^{T} S_i(t)^2    (13)

where T is the number of data in the training set.
(2) The absolute value average sensitivity:
S_{i,\text{abs}} = \frac{1}{T} \sum_{t=1}^{T} |S_i(t)|    (14)
(3) The maximum absolute sensitivity:
S_{i,\max} = \max_{1 \le t \le T} |S_i(t)|    (15)
Any of the sensitivity measures in Eqs. (13)–(15) can provide a useful criterion to determine which inputs are to be deleted. For succinctness, this paper uses the mean square average sensitivity as an example. An input with a low sensitivity has little or no influence on the prediction accuracy and can therefore be removed. In order to obtain a more efficient criterion for pruning inputs, the sensitivity is normalized. Define the absolute sum of the sensitivity over all input nodes

S = \sum_{i=1}^{N_I} |S_i|    (16)

Then the normalized sensitivity of each unit is defined as

\hat{S}_i = \frac{S_i}{S}    (17)

where the normalized value \hat{S}_i lies in [0, 1]. The input variables are then arranged in decreasing sensitivity order:

\hat{S}_1 \ge \hat{S}_2 \ge \cdots \ge \hat{S}_{N_I}    (18)

Larger values of \hat{S}_i indicate more important variables. Define the sum of the first k terms of the sensitivity, η_k, as

\eta_k = \sum_{j=1}^{k} \hat{S}_j,  k = 1, 2, \ldots, N_I    (19)

Choosing a threshold value 0 < η_0 < 1, if η_k > η_0, the first k inputs are preserved as the principal components and the remaining inputs with low sensitivity are removed. The number of variables retained increases as the threshold η_0 increases.
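Putting Eqs. (13) and (16)–(19) together, the selection step can be sketched as follows; the per-pattern sensitivity array `S` below is illustrative and would in practice come from Eq. (12):

```python
import numpy as np

def select_inputs(S, eta0=0.98):
    """Select input variables from per-pattern sensitivities.

    S    : (T, N_I) array of sensitivities S_i(t), e.g. from Eq. (12)
    eta0 : threshold 0 < eta0 < 1 on the cumulative normalized
           sensitivity, Eq. (19)
    Returns the (sorted) indices of the retained inputs.
    """
    S_avg = np.mean(S ** 2, axis=0)           # Eq. (13)
    S_hat = S_avg / np.sum(np.abs(S_avg))     # Eqs. (16)-(17)
    order = np.argsort(S_hat)[::-1]           # Eq. (18): decreasing order
    eta = np.cumsum(S_hat[order])             # Eq. (19)
    k = int(np.searchsorted(eta, eta0)) + 1   # smallest k with eta_k >= eta0
    return np.sort(order[:k])

# toy example: inputs 0 and 2 dominate, input 1 is nearly irrelevant
S = np.array([[1.0, 0.01, 2.0],
              [1.2, 0.02, 1.8]])
keep = select_inputs(S, eta0=0.98)
```

With these values the cumulative normalized sensitivity exceeds 0.98 after two terms, so input 1 is pruned.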
4 Simulations

In this section, two simulations are carried out on both computer-generated data and practical observed data to demonstrate the performance of the variable selection method proposed in this paper. The simulation results are compared with the PCA method. The prediction performance is evaluated by two error criteria [8]: the root mean square error E_RMSE and the prediction accuracy E_PA:

E_{\text{RMSE}} = \left( \frac{1}{N-1} \sum_{t=1}^{N} [P(t) - O(t)]^2 \right)^{1/2}    (20)

E_{\text{PA}} = \frac{\sum_{t=1}^{N} [(P(t) - P_m)(O(t) - O_m)]}{(N-1) \sigma_P \sigma_O}    (21)
where O(t) is the target value, P(t) is the predicted value, O_m and σ_O are the mean value and standard deviation of O(t), and P_m and σ_P are the mean value and standard deviation of P(t), respectively. E_RMSE reflects the absolute deviation between the predicted and observed values, while E_PA denotes the correlation coefficient between the observed and predicted values. In the ideal situation, with no prediction errors, E_RMSE = 0 and E_PA = 1.

4.1 Prediction of Lorenz Time Series
The first data set is derived from the Lorenz system, given by three differential equations:

dx(t)/dt = a(-x(t) + y(t))
dy(t)/dt = b x(t) - y(t) - x(t) z(t)    (22)
dz(t)/dt = x(t) y(t) - c z(t)
where the typical values of the coefficients are a = 10, b = 28, c = 8/3 and the initial values are x(0) = 12, y(0) = 2, z(0) = 9. 1500 points of x(t), y(t) and z(t), obtained with the fourth-order Runge–Kutta method, are used as the training sample and 500 points as the testing sample. In order to extract the dynamics of this system to predict x(t+1), the parameters for phase-space reconstruction are chosen as τ_x = τ_y = τ_z = 3 and m_x = m_y = m_z = 9. Thus an MLP neural network with 27 input nodes, one hidden layer of 20 neurons and one output node is considered, and a back-propagation training algorithm is used. After the training process of the MLP neural network is stopped, sensitivity analysis is carried out to evaluate the contribution of each input variable to the error function of the neural network. The trajectories of the sensitivity through training for each input are shown in Fig. 2. It can be seen that the sensitivity fluctuates through training and finally converges when the weights and error are steady. The normalized sensitivity measures in Eq. (17) are calculated, and a threshold η_0 = 0.98 is chosen to determine which inputs are discarded. The input dimension of the neural network is thus reduced to 11. The original input matrix is replaced by the reduced input matrix and the structure of the neural network is simplified. The prediction performance over the testing samples with the reduced inputs is shown in Fig. 4.
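The training data of Eq. (22) can be generated with a fourth-order Runge–Kutta integrator; a minimal sketch with the standard chaotic parameter values and an assumed step size:

```python
import numpy as np

def lorenz_rk4(x0, y0, z0, a=10.0, b=28.0, c=8.0 / 3.0, dt=0.01, n=2000):
    """Integrate the Lorenz system of Eq. (22) with 4th-order Runge-Kutta.
    Parameter values and step size are the standard chaotic choices,
    assumed here for illustration."""
    def f(s):
        x, y, z = s
        return np.array([a * (-x + y),
                         b * x - y - x * z,
                         x * y - c * z])

    traj = np.empty((n, 3))
    s = np.array([x0, y0, z0], dtype=float)
    for i in range(n):
        traj[i] = s
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj

data = lorenz_rk4(12.0, 2.0, 9.0, n=2000)   # columns: x(t), y(t), z(t)
```

The first 1500 rows of `data` could then serve as the training sample and the next 500 as the test sample, mirroring the split described above.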
Fig. 2. The trajectories of the input sensitivity through training

Fig. 3. The normalized sensitivity for each input node
Fig. 4. The observed and predicted values of the Lorenz x(t) time series
The solid line in Fig. 4 represents the observed values, while the dashed line represents the predicted values. It can be seen from Fig. 4 that the chaotic behavior of the x(t) time series is well predicted and the errors between the observed and predicted values are small. The prediction performance is reported in Table 1 and compared with the PCA variable reduction method.

Table 1. Prediction performance of the x(t) time series
                     Input Nodes   E_RMSE   E_PA
With All Variables       27        0.1278   0.9998
PCA Selection            11        0.1979   0.9997
NN Selection             11        0.0630   1.0000
The prediction performances in Table 1 are comparable for the neural-network variable selection method and the PCA method, while the algorithm proposed in this paper obtains the best prediction accuracy.
4.2 Prediction of the Rainfall Time Series
Rainfall is an important variable in hydrological systems, and the chaotic character of rainfall time series has been demonstrated in many papers [9]. In this section, the simulation uses the monthly rainfall time series of the city of Dalian, China, over a period of 660 months (from 1951 to 2005). Rainfall may be influenced by many factors, so five other time series are also considered: temperature, air pressure, humidity, wind speed and sunlight. This method also follows Takens' theorem to reconstruct the embedding phase space, with dimensions and delay times m_1 = m_2 = … = m_6 = 9 and τ_1 = τ_2 = … = τ_6 = 3. The input of the neural network then contains L = 660 − (9 − 1) × 3 = 636 data points. In the experiments, this data set is divided into a training set composed of the first 436 points and a testing set containing the remaining 200 points. The neural network used here contains 54 input nodes, 20 hidden nodes and 1 output node. The threshold is again chosen as η_0 = 0.98. The trajectories of the input sensitivity and the normalized sensitivity for every input are shown in Fig. 5 and Fig. 6, respectively. Then 34 input nodes are retained according to the sensitivity values.
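As a quick arithmetical check of the sizes quoted above (variable names are illustrative):

```python
# number of usable input vectors after delay embedding:
# N - max_i (d_i - 1) * tau_i, with all d_i = 9 and tau_i = 3
N, d, tau = 660, 9, 3
L_usable = N - (d - 1) * tau        # 660 - 24 = 636 data points
n_train = 436
n_test = L_usable - n_train         # the remaining points for testing
# input dimensionality: 6 time series, 9 delayed coordinates each
n_inputs = 6 * d                    # 54 input nodes
```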
0.2
Normalized Sensitivity
5 4 3 2
0.12 0.08 0.04
1 0
0.16
0
2000
4000 6000 Epoch
8000
10000
Fig. 5. The trajectories of the input sensitivity through training
0
0
10
20
30 40 Input Nodes
50
60
Fig. 6. The normalized sensitivity for each input node
The observed and predicted values of the rainfall time series are shown in Fig. 7, which demonstrates high prediction accuracy. It can be seen from the figures that the chaotic behavior of the rainfall time series is well predicted and the errors between the observed and predicted values are small. The corresponding values of E_RMSE and E_PA are shown in Table 2. Both the figures and the error criteria indicate that the result for the multivariate chaotic time series using neural-network-based variable selection is much better than the results with all variables or with the PCA method. It can be concluded from the two simulations that the variable selection algorithm using neural networks is able to capture the dynamics of both computer-generated and practical time series accurately and gives high prediction accuracy.
Fig. 7. The observed and predicted values of the rainfall time series

Table 2. Prediction performance of the rainfall time series
                     Input Nodes   E_RMSE    E_PA
With All Variables       54        22.2189   0.9217
PCA Selection            43        21.0756   0.9286
NN Selection             31        18.1435   0.9529
5 Conclusions

This paper studies a variable selection algorithm that uses sensitivity analysis to prune the input nodes of a neural network model. A simple and effective criterion for identifying the input nodes to be removed is derived, which does not require high computational cost and proves to work well in practice. The validity of the method was examined on a multivariate prediction problem, and a comparison study was made with other variable selection methods. Experimental results encourage the application of the proposed method to complex tasks that require identifying significant input variables.

Acknowledgements. This research is supported by the project (60674073) of the National Natural Science Foundation of China, the project (2006CB403405) of the National Basic Research Program of China (973 Program) and the project (2006BAB14B05) of the National Key Technology R&D Program of China. All of these supports are appreciated.
References

[1] Verikas, A., Bacauskiene, M.: Feature selection with neural networks. Pattern Recognition Letters 23, 1323–1335 (2002)
[2] Castellano, G., Fanelli, A.M.: Variable selection using neural network models. Neurocomputing 31, 1–13 (2000)
[3] Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative method for pruning feed-forward neural networks. IEEE Trans. Neural Networks 8(3), 519–531 (1997)
[4] Mozer, M.C., Smolensky, P.: Skeletonization: a technique for trimming the fat from a network via a relevance assessment. NIPS 1, 107–115 (1989)
[5] Gevrey, M., Dimopoulos, I., Lek, S.: Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model. 160, 249–264 (2003)
[6] Cao, L.Y., Mees, A., Judd, K.: Dynamics from multivariate time series. Physica D 121, 75–88 (1998)
[7] Han, M., Fan, M., Xi, J.: Study of Nonlinear Multivariate Time Series Prediction Based on Neural Networks. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 618–623. Springer, Heidelberg (2005)
[8] Chen, J.L., Islam, S., Biswas, P.: Nonlinear dynamics of hourly ozone concentrations: nonparametric short term prediction. Atmospheric Environment 32(11), 1839–1848 (1998)
[9] Liu, D.L., Scott, B.J.: Estimation of solar radiation in Australia from rainfall and temperature observations. Agricultural and Forest Meteorology 106(1), 41–59 (2001)
Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function

Takaaki Aoki¹, Kaiichiro Ota², Koji Kurata³, and Toshio Aoyagi¹,²

¹ CREST, JST, Kyoto 606-8501, Japan
² Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
³ Faculty of Engineering, University of the Ryukyus, Okinawa 903-0213, Japan
[email protected]
Abstract. The Self-Organizing Map (SOM) is an unsupervised learning method based on neural computation, which has recently found wide applications. However, the learning process sometimes has multiple stable states, and the map can be trapped in an undesirable disordered state containing topological defects. These topological defects critically aggravate the performance of the SOM. In order to overcome this problem, we propose to introduce an asymmetric neighborhood function into the SOM algorithm. Compared with the conventional symmetric one, the asymmetric neighborhood function accelerates the ordering process even in the presence of a defect. However, this asymmetry tends to generate a distorted map. This can be suppressed by an improved method of applying the asymmetric neighborhood function. In the case of the one-dimensional SOM, the number of steps required for perfect ordering is numerically shown to be reduced from O(N^3) to O(N^2).

Keywords: Self-Organizing Map, Asymmetric Neighborhood Function, Fast ordering.
1 Introduction
The Self-Organizing Map (SOM) is an unsupervised learning method, a type of nonlinear principal component analysis [1]. Historically, it was proposed as a simplified neural network model with some of the essential properties needed to reproduce the topographic representations observed in the brain [2,3,4,5]. The SOM algorithm can be used to construct an ordered mapping from input stimulus data onto a two-dimensional array of neurons according to the topological relationships between various characteristics of the stimuli. This implies that the SOM algorithm is capable of extracting the essential information from complicated data. From the viewpoint of applied information processing, the SOM algorithm can be regarded as a generalized, nonlinear type of principal component analysis and has proven valuable in the fields of visualization, compression and data mining. Based on a simple, biologically motivated learning rule, this algorithm behaves as an unsupervised

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 426–435, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. A: An example of a topological defect in a two-dimensional array of SOM with a uniform rectangular input space. The triangle point indicates the conflicting point in the feature map. B: Another example of topological defect in a one-dimensional array with scalar input data. The triangle points also indicate the conflicting points.
learning method and provides robust performance without delicate tuning of the learning conditions. However, there is a serious problem of multi-stability or meta-stability in the learning process [6,7,8]. When the learning process is trapped in one of these states, the map appears, in practice, to have converged to its final state. However, some of these states are undesirable solutions of the learning procedure, in which the map typically has topological defects, as shown in Fig. 1A. The map in Fig. 1A is twisted, with a topological defect at the center. In this situation, the two-dimensional array of the SOM should be arranged in the square space, since the input data are taken uniformly from the square space. But this topological defect is a global conflicting point which is difficult to remove by local modulations of the reference vectors of the units. Therefore, an enormous number of learning steps would be required to rectify the topological defect. Thus, the existence of topological defects critically aggravates the performance of the SOM algorithm. To avoid the emergence of topological defects, several conventional and empirical methods have been used. However, it is preferable that the SOM algorithm work well without tuning any model parameters, even when a topological defect has emerged. We therefore consider a simple method that enables an effective ordering procedure for the SOM in the presence of a topological defect: an asymmetric neighborhood function which effectively removes the defect [9]. In the process of removing a topological defect, the conflicting point must be moved out toward the boundary of the array, where it vanishes. Therefore, the motion of the defect is essential for the efficiency of the ordering process. With the original symmetric neighborhood, the movement of the defect is similar to a random-walk stochastic process, which is inefficient.
By introducing asymmetry into the neighborhood function, the movement behaves like a drift, which enables faster ordering. For this reason, in this paper we investigate the effect of an asymmetric neighborhood function on the performance of the SOM algorithm for the cases of one-dimensional and two-dimensional SOMs.
2 Methods

2.1 SOM
The SOM constructs a mapping from the input data space to an array of nodes, called the 'feature map'. To each node i, a parametric 'reference vector' m_i is assigned. Through SOM learning, these reference vectors are rearranged according to the following iterative procedure. An input vector x(t) is presented at each time step t, and the best matching unit, whose reference vector is closest to the given input vector x(t), is chosen. The best matching unit c, called the 'winner', is given by c = arg min_i ||x(t) − m_i||. In other words, the data point x(t) in the input data space is mapped onto the node c associated with the reference vector m_i closest to x(t). In SOM learning, the update rule for the reference vectors is given by

m_i(t + 1) = m_i(t) + \alpha \cdot h(r_{ic}) [x(t) - m_i(t)],  r_{ic} \equiv r_i - r_c    (1)
where α, the learning rate, is some small constant. The function h(r) is called the 'neighborhood function', in which r_ic is the distance from the position r_c of the winner node c to the position r_i of node i on the array of units. A widely used neighborhood function is the Gaussian, h(r_{ic}) = \exp(-r_{ic}^2 / 2\sigma^2). We expect an ordered mapping after iterating the above procedure a sufficient number of times.

2.2 Asymmetric Neighborhood Function
We now introduce a method to transform any given symmetric neighborhood function into an asymmetric one (Fig. 2A). Let us define an asymmetry parameter β (β ≥ 1), representing the degree of asymmetry, and a unit vector k indicating the direction of asymmetry. If a unit i is located in the positive direction of k, the component of the distance from the winner to the unit that is parallel to k is scaled by 1/β. If a unit i is located in the negative direction of k, the parallel component of the distance is scaled by β. Hence, the asymmetric function h_β(r), transformed from its symmetric counterpart h(r), is described by

h_\beta(r_{ic}) = \frac{2}{\beta + \beta^{-1}} h(\tilde{r}_{ic}),  \tilde{r}_{ic} = \begin{cases} \sqrt{(\beta^{-1} r_\parallel)^2 + r_\perp^2} & (r_{ic} \cdot k \ge 0) \\ \sqrt{(\beta r_\parallel)^2 + r_\perp^2} & (r_{ic} \cdot k < 0) \end{cases}    (2)

where \tilde{r}_{ic} is the scaled distance from the winner, r_\parallel is the component of r_{ic} projected onto k, and r_\perp comprises the remaining components perpendicular to k. In addition, in order to single out the effect of asymmetry, the overall area of the neighborhood function, \int_{-\infty}^{\infty} h(r) dr, is preserved under this transformation. In the special case of asymmetry parameter β = 1, h_β(r) equals the original symmetric function h(r). Figure 2B displays an example of an asymmetric Gaussian neighborhood function in a two-dimensional array of SOM.
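For a one-dimensional array (k = ±1, r_⊥ = 0), Eq. (2) reduces to scaling a signed scalar distance. The sketch below implements this together with the update rule of Eq. (1); all parameter values are illustrative:

```python
import numpy as np

def h_asym(r_ic, beta=2.0, k=1.0, sigma=2.0):
    """Asymmetric Gaussian neighborhood of Eq. (2), 1-D case (r_perp = 0).
    Distances on the positive side of k are shrunk by 1/beta, those on the
    negative side stretched by beta; the prefactor 2/(beta + 1/beta)
    keeps the area under the function unchanged."""
    scale = np.where(r_ic * k >= 0, 1.0 / beta, beta)
    r_tilde = np.abs(scale * r_ic)
    return 2.0 / (beta + 1.0 / beta) * np.exp(-r_tilde**2 / (2.0 * sigma**2))

def som_step(m, x, alpha=0.1, beta=2.0, k=1.0):
    """One update of scalar reference vectors m (length-N array)
    following Eq. (1), using the asymmetric neighborhood."""
    c = int(np.argmin(np.abs(m - x)))            # winner unit
    r_ic = np.arange(len(m), dtype=float) - c    # signed array distances
    return m + alpha * h_asym(r_ic, beta, k) * (x - m)

r = np.arange(-5.0, 6.0)
h = h_asym(r, beta=2.0, k=1.0)        # wider on the positive side of k
h_sym = h_asym(r, beta=1.0)           # beta = 1 recovers the symmetric case
m = som_step(np.linspace(0.0, 1.0, 10), x=0.52)
# improved algorithm (Fig. 2C): flip k -> -k every T steps and decrease
# beta linearly towards 1 during learning
```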
Ordering Process of SOMs Improved by Asymmetric Neighborhood Function
Fig. 2. A: Method of generating an asymmetric neighborhood function by scaling the distance r_ic asymmetrically. The degree of asymmetry is parameterized by β. The distance of a node in the positive direction along the asymmetry unit vector k is scaled by 1/β; the distance in the negative direction is scaled by β. Therefore, the asymmetric function is described by h_β(r_ic) = [2/(β + 1/β)] h(r̃_ic), where r̃_ic is the scaled distance of node i from the winner c. B: An example of an asymmetric Gaussian function (β = 3, 2, 1). C: An illustration of the improved algorithm for the asymmetric neighborhood function (inversion of the direction at interval T, and reduction of the degree of asymmetry).
Next, we introduce an improved algorithm for the asymmetric neighborhood function. The asymmetry of the neighborhood function distorts the feature map, so that the density of units no longer represents the probability density of the input data. Therefore, two novel procedures are introduced. The first is an inversion operation on the direction of the asymmetric neighborhood function. As illustrated in Fig. 2C, the direction of asymmetry is reversed after every time interval T, which is expected to average out the distortion in the feature map. Note that the interval T should be set larger than the typical ordering time for the asymmetric neighborhood function. The second is an operation that decreases the degree of asymmetry of the neighborhood function. When β = 1, the neighborhood function equals the original symmetric function. With this operation, β is decreased toward 1 at each time step, as illustrated in Fig. 2C. In our numerical simulations, we adopt a linearly decreasing function.
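The two procedures can be sketched as a schedule (the text states only that the decrease of β is linear; the end step t_end and the concrete functional form below are our assumptions):

```python
def improved_asymmetric_schedule(t, T, beta0, t_end):
    """Direction flips and beta annealing of the improved algorithm.

    t     : current learning step
    T     : interval between inversions of the asymmetry direction
    beta0 : initial degree of asymmetry
    t_end : step at which beta reaches 1 (assumed linear schedule)
    Returns (sign, beta): multiply the asymmetry vector k by sign,
    and use beta as the current degree of asymmetry.
    """
    sign = 1 if (t // T) % 2 == 0 else -1          # flip k every T steps
    beta = max(1.0, beta0 + (1.0 - beta0) * t / t_end)
    return sign, beta
```

After t_end steps the schedule returns β = 1, i.e., the neighborhood function has become symmetric again.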
2.3 Numerical Simulations
In the following sections, we test the SOM learning procedures on sample data to examine the performance of the ordering process. The sample data are
T. Aoki et al.
generated from a random variable with a uniform distribution. In the case of the one-dimensional SOM, the distribution is uniform over the range [0, 1]. Here we use a Gaussian function as the original symmetric neighborhood function. The model parameters in SOM learning were as follows: the total number of units N = 1000, the learning rate α = 0.05 (constant), and the neighborhood radius σ = 50. The asymmetry parameter is β = 1.5, and the asymmetric direction k is set to the positive direction of the array. The interval T for flipping the asymmetric direction is 3000. In the case of the two-dimensional SOM (2D → 2D map), the input data are taken uniformly from the square [0, 1] × [0, 1]. The model parameters are the same as in the one-dimensional SOM, except that the total number of units is N = 900 (30 × 30) and σ = 5. The asymmetric direction k is taken along (1, 0); it can be chosen arbitrarily. In the following numerical simulations, we also confirmed that the results hold for other model parameters and other forms of the neighborhood function.
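The parameter sets above can be collected as plain configuration (values transcribed from this section; the dictionary layout itself is merely our convenience):

```python
# 1-D SOM (1D -> 1D map), uniform input on [0, 1]
PARAMS_1D = dict(N=1000, alpha=0.05, sigma=50, beta=1.5, T=3000,
                 k=+1)                  # asymmetry along the positive direction
# 2-D SOM (2D -> 2D map), uniform input on [0, 1] x [0, 1]
PARAMS_2D = dict(N=900,                 # 30 x 30 units
                 alpha=0.05, sigma=5, beta=1.5, T=3000,
                 k=(1.0, 0.0))          # this direction may be chosen arbitrarily
```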
2.4 Topological Order and Distortion of the Feature Map
To examine the ordering process of the SOM, let us consider two measures that characterize the feature map. One is the 'topological order' η, which quantifies the order of the reference vectors in the SOM array. The units of the SOM array should be arranged according to their reference vectors m_i. In the presence of a topological defect, most of the units still satisfy the local ordering; however, the defect violates the global ordering, and the feature map is divided into fragments of ordered domains within which the units satisfy the order condition. Therefore, the topological order η can be defined as the ratio of the maximum domain size to the total number of units N,

η ≡ max_l N_l / N,   (3)

where N_l is the size of domain l. In the case of the one-dimensional SOM, the order condition for the reference vectors is m_{i−1} ≤ m_i ≤ m_{i+1} or m_{i−1} ≥ m_i ≥ m_{i+1}. In the case of the two-dimensional SOM referred to in the previous section, the order condition is also defined explicitly, with the vector product a_{(i,j)} ≡ (m_{(i+1,j)} − m_{(i,j)}) × (m_{(i,j+1)} − m_{(i,j)}). Within an ordered domain, the vector products a_{(i,j)} of the units have the same sign, because the reference vectors are arranged in the sample space in the order of the unit positions. The other measure is the 'distortion' χ, which measures the distortion of the feature map. The asymmetry of the neighborhood function tends to distort the distribution of reference vectors away from the correct probability density of the input vectors. For example, when the probability density of the input vectors is uniform, an asymmetric neighborhood function produces a non-uniform distribution of reference vectors. Hence, to measure the non-uniformity of the distribution of reference vectors, let us define the distortion χ as the coefficient of variation of the size distribution of the units' Voronoi cells,

χ = √(Var(Δ_i)) / E(Δ_i),   (4)
where Δ_i is the size of the Voronoi cell of unit i. To eliminate the boundary effect of the SOM algorithm, the Voronoi cells at the edges of the array are excluded. When the reference vectors are distributed uniformly, the distortion χ converges to 0. It should be noted that the evaluation of Voronoi cells in the two-dimensional SOM is time-consuming, so we approximate the size of a Voronoi cell by the magnitude of the vector product a_{(i,j)}. If the feature map is formed uniformly, this approximate value also converges to 0.
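For the one-dimensional case, the two measures can be sketched as follows (our own reading of the definitions: Δ_i is taken as the 1-D Voronoi cell length (m_{i+1} − m_{i−1})/2 with edge cells excluded, and ordered domains are counted as maximal monotone runs):

```python
import numpy as np

def topological_order(m):
    """eta of Eq. (3) for a 1-D map: largest ordered domain / N."""
    d = np.sign(np.diff(m))            # +1 / -1 between neighboring units
    best, run = 1, 1
    for i in range(1, len(d)):
        # extend the run while consecutive differences keep the same sign
        run = run + 1 if d[i] == d[i - 1] and d[i] != 0 else 1
        best = max(best, run)
    return (best + 1) / len(m)         # a run of k steps spans k+1 units

def distortion(m):
    """chi of Eq. (4): coefficient of variation of Voronoi cell sizes.
    In 1-D the Voronoi cell of an interior unit i has length
    (m_{i+1} - m_{i-1}) / 2; edge cells are excluded, as in the paper."""
    delta = np.abs(m[2:] - m[:-2]) / 2.0
    return np.std(delta) / np.mean(delta)
```

A perfectly ordered, uniformly spaced map gives η = 1 and χ ≈ 0; a map with a single central defect gives η ≈ 0.5.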
3 Results

3.1 One-Dimensional Case
In this section, we investigate the ordering process of SOM learning in the presence of a topological defect for the symmetric, asymmetric, and improved asymmetric neighborhood functions. For this purpose, we use an initial condition in which a single topological defect appears at the center of the array. Because the density of input vectors is uniform, the desirable feature map is a linear arrangement of the SOM nodes. Figure 3A shows a typical time development of the reference vectors m_i. In the case of the symmetric neighborhood function, a single defect remains around the center of the array even after 10000 steps. In contrast, in the asymmetric case, this defect moves out to the right, so that the reference vectors are ordered within 3000 steps. This phenomenon can also be confirmed in Fig. 3B, which shows the time dependence of the topological order η. With the asymmetric neighborhood function, η rapidly converges to 1 (the completely ordered state) within 3000 steps, whereas eliminating the last defect takes a long time (∼18000 steps) with the symmetric one. On the other hand, a problem arises in the feature map obtained with the asymmetric neighborhood function: after 10000 steps, the distribution of the reference vectors develops an unusual bias (Fig. 3A). Figure 3C shows the time dependence of the distortion χ during learning. In the case of the symmetric neighborhood function, χ eventually converges to almost 0, which indicates that the resulting feature map has an almost uniform size distribution of Voronoi cells. In contrast, in the asymmetric case, χ converges to a finite value (≠ 0). Thus, although the asymmetric neighborhood function accelerates the ordering process of SOM learning, the resulting map becomes distorted, which makes it unusable for applications. This is why the improved asymmetric method described in the Methods section is introduced.
Using this improved algorithm, χ converges to almost 0, as in the symmetric case (Fig. 3C). Furthermore, as shown in Fig. 3B, the improved algorithm preserves the fast ordering. Therefore, by utilizing the improved algorithm with the asymmetric neighborhood function, we obtain the full benefit of both fast ordering and an undistorted feature map. To quantify the performance of the ordering process, let us define the 'ordering time' as the time at which η reaches 1. Figure 4A shows the ordering
Fig. 3. The asymmetric neighborhood function enhances the ordering process of the SOM. A: A typical time development of the reference vectors m_i for the symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. The standard deviations are denoted by error bars, which are not visible because they are smaller than the plotted symbols. C: The time dependence of the distortion χ.
time as a function of the total number of units N for both the improved asymmetric and the symmetric neighborhood functions. The ordering time scales roughly as N³ and N² for the symmetric and the improved asymmetric neighborhood functions, respectively. For a detailed discussion of this reduction of the ordering time, refer to Aoki and Aoyagi (2007) [9]. Figure 4B shows the dependence of the ordering time on the width of the neighborhood function, which indicates that the ordering time is proportional to (N/σ)^k, with k = 2 for the asymmetric and k = 3 for the symmetric neighborhood function. This result implies that the combined use of the asymmetric method and an annealing method for the width of the neighborhood function is even more effective.
3.2 Two-Dimensional Case
In this section, we investigate the effect of the asymmetric neighborhood function in the two-dimensional SOM (2D → 2D map). Figure 5 shows that a similar fast
[Figure 4 shows log–log plots of the ordering time: versus the total number of units N (fitted slopes N^{2.989±0.002} for the symmetric and N^{1.917±0.005} for the improved asymmetric function, with reference lines N³ and N²), and versus the scaled number of units N/σ for σ = 8, 16, 32, 64.]
Fig. 4. Ordering time as a function of the total number of units N. The fitting function is Const. · N^γ.
Fig. 5. A: A typical time development of the reference vectors in the two-dimensional SOM array for the symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. C: The time dependence of the distortion χ.
Fig. 6. Distribution of ordering times when the initial reference vectors are generated randomly. The white bin at the right of each graph indicates the population of failed trials that did not converge to the perfectly ordered state within 50000 steps.
ordering process can be realized with an asymmetric neighborhood function in the two-dimensional SOM. The initial state has a global topological defect, in which the map is twisted at the center. In this situation, the conventional symmetric neighborhood function has trouble correcting the twisted map: because of its local stability, this topological defect is never corrected, even with a huge number of learning iterations. The asymmetric neighborhood function is also effective in overcoming such a topological defect, as in the one-dimensional SOM. However, the same problem of a distorted map occurs. With the improved asymmetric neighborhood function, the feature map converges to the completely ordered map in much less time and without any distortion. In the previous simulations, we considered the simple situation in which a single defect exists around the center of the feature map as the initial condition, in order to investigate the ordering process with a topological defect. However, when the initial reference vectors are set randomly, the number of topological defects appearing in the map is generally not equal to one. It is therefore necessary to consider the statistical distribution of the ordering time, because the number of topological defects and the convergence process generally depend on the initial conditions. Figure 6 shows the distribution of the ordering time when the initial reference vectors are randomly drawn from the uniform distribution over [0, 1] × [0, 1]. In the case of the symmetric neighborhood function, some trials could not converge to the ordered state, being trapped in undesirable metastable states in which the topological defects are never rectified. Therefore, although a fast ordering process is observed in some successful cases (lucky initial conditions), the map formed with the symmetric function depends strongly on the initial conditions.
In contrast, with the improved asymmetric neighborhood function, the distribution of the ordering time has a single sharp peak, and a successful feature map is constructed stably without any tuning of the initial condition.
4 Conclusion
In this paper, we discussed the learning process of the self-organizing map, especially in the presence of a topological defect. Interestingly, even in the presence
of the defect, we found that the asymmetry of the neighborhood function enables the system to accelerate the learning process. Compared with the conventional symmetric neighborhood function, the convergence time of the learning process is roughly reduced from O(N³) to O(N²) in the one-dimensional SOM (N is the total number of units). Furthermore, this acceleration by the asymmetric neighborhood function is also effective in the two-dimensional SOM (2D → 2D map). In contrast, the conventional symmetric function cannot rectify a twisted feature map even with a huge number of iteration steps, because of its local stability. These results suggest that the proposed method can be effective for more general cases of SOM, which is a subject of future study.
Acknowledgments. This work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan: Grant numbers 18047014, 18019019, and 18300079.
References
1. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
2. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (London) 160(1), 106–154 (1962)
3. Hubel, D.H., Wiesel, T.N.: Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol. 158(3), 267–294 (1974)
4. von der Malsburg, C.: Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14(2), 85–100 (1973)
5. Takeuchi, A., Amari, S.: Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern. 35(2), 63–72 (1979)
6. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: stationary states, metastability and convergence rate. Biol. Cybern. 67(1), 35–45 (1992)
7. Geszti, T., Csabai, I., Czakó, F., Szakács, T., Serneels, R., Vattay, G.: Dynamics of the Kohonen map. In: Statistical Mechanics of Neural Networks: Proceedings of the XIth Sitges Conference, pp. 341–349. Springer, New York (1990)
8. Der, R., Herrmann, M., Villmann, T.: Time behavior of topological ordering in self-organizing feature mapping. Biol. Cybern. 77(6), 419–427 (1997)
9. Aoki, T., Aoyagi, T.: Self-organizing maps with asymmetric neighborhood function. Neural Comput. 19(9), 2515–2535 (2007)
A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer

Azusa Iwata¹, Yoshihisa Shinozawa¹, and Akito Sakurai¹,²

¹ Keio University, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
² CREST, Japan Science and Technology Agency
Abstract. We give a necessary condition for a simple recurrent neural network with two sigmoidal hidden units to implement a recognizer of the formal language {a^n b^n | n > 0}, which is generated by the set of generating rules {S → aSb, S → ab}, and we show that by setting the parameters so as to conform to the condition we obtain a recognizer of the language. The condition implies the instability of the learning process reported in previous studies. The condition also implies, contrary to its success in implementing the recognizer, the difficulty of obtaining a recognizer of more complicated languages.
1 Introduction
Pioneered by Elman [6], much research has been conducted on grammar learning by recurrent neural networks. A grammar is defined by generating (or rewriting) rules such as S → Sa and S → b. S → Sa means that "the letter S should be rewritten to Sa," and S → b means that "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated in this way to which no further rewriting rule applies is called a sentence. The set of all possible sentences is called the language generated by the grammar. Although the word "language" here would better be termed formal language, in contrast to natural language, we follow the custom of the formal language theory field (e.g., [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless, the essence is common. The concept that a language is defined by a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields. A study of grammar learning focuses on restoring a grammar from finite samples of sentences of the language associated with the grammar. In contrast to the ease of generating sample sentences, learning a grammar from sample sentences is a hard problem; in fact, it is impossible except in some very restrictive cases, e.g., when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 436–445, 2008.
© Springer-Verlag Berlin Heidelberg 2008

Since a neural network is a model of the
brain, some researchers think that a neural network could learn a grammar from exemplar sentences of a language. Research on grammar learning by neural networks is characterized as follows: 1. it focuses on the learning of self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable; 2. simple recurrent neural networks (SRNs) are used as the basic mechanism; and 3. languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0}, which are clearly the results of, though not representative of, self-embedding rules, are adopted as target languages. Here a^n is a string of n repetitions of the character a; the language {a^n b^n | n > 0} is generated by the grammar {S → ab, S → aSb}, and {a^n b^n c^n | n > 0} is generated by a context-sensitive grammar. The adoption of simple languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization has been achieved by learning if realistic grammars are used. Although these languages are simple, in many cases what the networks learned was just what they were taught, not a grammar; that is, they did almost rote learning. In other words, their generalization capability was limited. Moreover, the learned results were unstable, in the sense that when the networks were given new training sentences that were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly; the change was more than just a refinement of the learned network. Considering these situations, we may doubt that there really exists a solution, or a correct learned network. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and to explore the possibility of network learning of these languages, but their results were not conclusive and did not give clear conditions that the learned networks satisfy in common.
We will in this paper describe a condition that SRNs with two hidden units that have learned to recognize the language {a^n b^n | n > 0} have in common, that is, a condition that must be met for an SRN to qualify as having successfully learned the language. The condition implies, incidentally, the instability of learning results. Moreover, by utilizing the condition, we realize a language recognizer. In doing so, we found that learning recognizers of languages more complicated than {a^n b^n | n > 0} is hard. An RNN (recurrent neural network) is a type of network that has recurrent connections in addition to feedforward connections. The calculation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (i.e., in addition to the external inputs) to the feedforward part. An RNN is thus a kind of discrete-time system. Starting from an initial state (the initial outputs of the feedforward part, i.e., the outputs without external inputs), the network proceeds to accept the characters of a string given as external inputs, reaches the final state, and produces the final external output from that state. An SRN (simple recurrent network) is a simple type of RNN that has only one layer of hidden units in its feedforward part.
Rodriguez et al. [13] showed that an SRN can learn the languages {a^n b^n | n > 0} (or, more correctly, the subset {a^i b^i | n ≥ i > 0}) and {a^n b^n c^n | n > 0} (or, more correctly, {a^i b^i c^i | n ≥ i > 0}). For {a^n b^n | n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so that it is still an open problem whether it is really possible to realize recognizers of languages such as {a^n b^n | n > 0} or {a^n b^n c^n | n > 0}, and whether an SRN could learn or implement more complicated languages. Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {a^n b^n | n > 0} by SRNs suggests that at least the SRN might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRN cases, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton, or to something less powerful. These results suggest that exact (i.e., infinite-precision) computation has to be considered when we investigate the possibility of computations by RNNs, or specifically by SRNs. Therefore, in the research we conducted and report in this paper, we have adopted RNN models with infinite-precision calculation and with a sigmoidal function (tanh(x)) as the output function of the neurons. From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {a^n b^n | n > 0}; and the fact that the condition is informative enough to guide us in building an SRN that recognizes the language.
2 Preliminaries
An SRN (simple recurrent network) is a simple type of recurrent neural network, and its function is expressed by

s_{n+1} = σ(w_s · s_n + w_x · x_n),   N(s_n) = w_os · s_n + w_oc,

where σ is the standard sigmoid function tanh(x) = (1 − exp(−2x))/(1 + exp(−2x)), applied component-wise. A counter is a device that keeps an integer, allows +1 and −1 operations, and answers yes or no to an inquiry whether the content is 0 (0-test). A stack is a device that allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device as if it were just before the corresponding push operation). Clearly a stack is more general than a counter, so that if a counter is not implementable, neither is a stack. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it a counter. In this paper we mainly take the latter viewpoint. We first explain how a predictor is used as a recognizer, or a counter. When we try to train an SRN to recognize a language, we train it to predict correctly the next character to come in an input string (see, e.g., [6]). As usual, we adopted a "one-hot vector" representation for the network output. A one-hot vector is a vector in which a single element is one and the others
are zero. The network is trained so that the sum of the squared errors (the difference between the actual output and the desired output) is minimized. With this method, if two possible outputs with the same occurrence frequency exist for the same input in the training data, the network learns to output 0.5 for the two corresponding elements of the output, with the others being 0, since this output vector gives the minimum of the sum of squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {a^n b^n | n > 0}, we can regard it as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts that only a can come next (which happens only when the input string up to that time was a^n b^n for some n; this construction has been common practice since Elman's work) and negative otherwise. The modified network then appears to count up on the character a and count down on b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it could output any value when an a is fed before the due number of b's has been fed, that is, when a count-up action is required before the counter has returned to the 0-state. A (discrete-time) dynamical system is represented by the iteration of a function application: s_{i+1} = f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if s = lim_{i→∞} f^{n_i}(x) for some sequence n_i with lim_{i→∞} n_i = ∞.
A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of the function f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D. The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f : R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W_loc^{s,f}(x), W_loc^{u,f}(x), tangent to the eigenspaces E_x^{s,f}, E_x^{u,f} of Df at x and of corresponding dimension. W_loc^{s,f}(x) and W_loc^{u,f}(x) are as smooth as the map f, i.e., of class C^r.

The local stable and unstable manifolds for f are defined as follows:

W_loc^{s,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0},
W_loc^{u,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0},

where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as W^{s,f}(q) = ∪_{i≥0} f^{−i}(W_loc^{s,f}(q)) and W^{u,f}(q) = ∪_{i≥0} f^{i}(W_loc^{u,f}(q)). As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N(s_n) = w_os · s_n + w_oc. We simply
write the former (the dynamical system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {a^n b^n | n > 0}, as described in the Introduction, it is seen as a counter where the input character a stands for a count-up operation (i.e., +1) and b for a count-down operation (i.e., −1). In the following we may write x_+ for a and x_− for b. For abbreviation, in the following, we use f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_−^{−1} is undefined for points outside and on the border of the square (I[−1, 1])², where I[−1, 1] is the closed interval [−1, 1]; in the following, though, we do not mention this, for simplicity. D_0 is the set {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible as a state where the counter content is i. This may seem rather demanding. An alternative would be that a point p stands for counter content c if and only if p = f_−^{p_1} ∘ f_+^{m_1} ∘ … ∘ f_−^{p_i} ∘ f_+^{m_i}(s_0) for a predefined s_0, some m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i} (m_j − p_j) = c. This, unfortunately, has not led to a fruitful result. We also postulate that the D_i are disjoint. Since we defined each D_i as a closed set, the postulate is natural; the point is that we have chosen the D_i to be closed. The postulate requires that we keep a margin between D_0, D_1, and the others.
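As a concrete skeleton of this counter view, the loop below drives the state with f_+ on each a and f_− on each b, and then tests membership in D_0 via the sign of h. The 2×2 weights are placeholders of our own choosing (an SRN would obtain such weights by training); only the control flow is taken from the text:

```python
import numpy as np

# Placeholder parameters (hypothetical values, not from the paper):
w_s = np.array([[1.1, 0.6], [0.6, 1.1]])
wx_plus = np.array([-0.2, -0.2])   # w_x . x_+ : input term for 'a'
wx_minus = np.array([0.1, 0.1])    # w_x . x_- : input term for 'b'
w_os, w_oc = np.array([1.0, 1.0]), -1.0

f_plus = lambda s: np.tanh(w_s @ s + wx_plus)    # count up
f_minus = lambda s: np.tanh(w_s @ s + wx_minus)  # count down
h = lambda s: float(w_os @ s + w_oc)             # h(s) >= 0 <=> s in D_0

def run(string, s0=np.zeros(2)):
    """Drive the SRN state over a string and return h(final state)."""
    s = s0
    for ch in string:
        s = f_plus(s) if ch == 'a' else f_minus(s)
    return h(s)
```

With trained (or constructed) weights, run(w) ≥ 0 would signal that the counter content is 0, i.e., that w is of the form a^n b^n.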
3 A Necessary Condition
We consider only SRNs with two hidden units; i.e., all the vectors concerning s, such as w_s, s_n, and w_os, are two-dimensional.

Definition 2. D_ω is the set of the accumulation points of ∪_{i≥0} D_i, i.e., s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e., s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e., s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained by Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), and Chalup et al. ([5]), it is natural, at least for a first consideration, to postulate that f_+^i(x) and f_−^i(x) are not wandering, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, for simplicity of presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W_loc^{u,f_−}(q) and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional, where their existence is guaranteed by Theorem 1.
Postulate 4. We postulate that f_+(D_i) ⊆ D_{i+1}; that the D_i are disjoint; that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively; and that W_loc^{u,f_−}(q) and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional.

Lemma 5. f_−^{−1} ∘ f_+(D_ω) = D_ω; f_−^{−1}(D_ω ∩ (I(−1,1))²) = D_ω and D_ω ∩ (I(−1,1))² = f_−(D_ω); and f_+(D_ω) ⊆ D_ω. Also P_ω ⊆ D_ω and Q_ω ⊆ D_ω.

Definition 6. W^{u,−1}(q) is the global unstable manifold at q ∈ Q_ω for f_−^{−1}, i.e., W^{u,−1}(q) = W^{u,f_−^{−1}}(q) = W^{s,f_−}(q).

Lemma 7. For any p ∈ D_ω, any accumulation point of {f_−^i(p) | i > 0} is in Q_ω.
Proof. Since p is in D_ω, there exist p_{k_i} ∈ D_{k_i} such that p = lim_{i→∞} p_{k_i}. Suppose q ∈ D_ω is the accumulation point stated in the lemma, i.e., q = lim_{j→∞} f_−^{h_j}(p). For each h_j we take k_i large enough that p_{k_i} lies in any neighborhood of q in which f_−^{h_j}(p) lies. Then q = lim_{j→∞} f_−^{h_j}(p_{k_i}) = lim_{j→∞} f_−^{h_j − k_i}(s_{k_i}), where k_i is a function of h_j with k_i > h_j and s_{k_i} = f_−^{k_i}(p_{k_i}) ∈ D_0. Let s_0 ∈ D_0 be an accumulation point of {s_{k_i}}. Then, since f_−^{−1} is continuous, letting n_j = −h_j + k_i > 0, we get q = lim_{j→∞} f_−^{−n_j}(s_0), i.e., q ∈ Q_ω.
Lemma 8. $D_\omega = \bigcup_{q \in Q_\omega} W^{u,-1}(q)$.

Proof. Let $p$ be any point in $D_\omega$. Since $f_-(D_\omega) \subseteq (I[-1,1])^2$, where $I[-1,1]$ is the interval $[-1,1]$, i.e., $f_-(D_\omega)$ is bounded, and $f_-(D_\omega) \subseteq D_\omega$, $\{f_-^n(p)\}$ has an accumulation point $q$ in $D_\omega$, which is, by Lemma 7, in $Q_\omega$. Then $q$ is expressed as $q = \lim_{j\to\infty} f_-^{n_j}(p)$. Since $Q_\omega$ is a finite set of hyperbolic fixed points, $q = \lim_{n\to\infty} f_-^n(p)$, i.e., $p \in W^{s,f_-}(q) = W^{u,f_-^{-1}}(q) = W^{u,-1}(q)$.
Since $P_\omega \subseteq D_\omega$, the next theorem holds.

Theorem 9. A point in $P_\omega$ is either a point in $Q_\omega$ or in $W^{u,-1}(q)$ for some $q \in Q_\omega$.

Note that since $q \in W^{u,-1}(q)$, the theorem statement is simply: if $p \in P_\omega$ then $p \in W^{u,-1}(q)$ for some $q \in Q_\omega$.
4 An Example of a Recognizer
To construct an SRN recognizer for $\{a^n b^n \mid n > 0\}$, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as:
1. $f_+(D_i) \subseteq D_{i+1}$,
2. the $D_i$'s are disjoint,
3. $P_\omega$ and $Q_\omega$ are finite sets of hyperbolic fixed points for $f_+$ and $f_-$, respectively,
4. $W_{loc}^{u,f_-}(q)$ and $W_{loc}^{s,f_-}(q)$ for $q \in Q_\omega$ are one-dimensional, and
5. if $p \in P_\omega$ then $p \in W^{u,-1}(q)$ for some $q \in Q_\omega$.
442
A. Iwata, Y. Shinozawa, and A. Sakurai
Let us consider as simple a construction as possible, so the first choice is to take a point $p \in P_\omega$ and $q \in Q_\omega$ with $f_+(p) = p$ and $f_-(q) = q$. Since $p$ cannot be the same as $q$ (because $f_-^{-1} \circ f_+(p) = p + w_s^{-1} \cdot w_x \cdot (x_+ - x_-) \ne p$), we have to find a way to let $p \in W^{u,-1}(q)$. Since it is in general very hard to calculate stable or unstable manifolds from a function and its fixed point, we had better try to let $W^{u,-1}(q)$ be a "simple" manifold. There is one more reason to do so: we have to define $D_0 = \{x \mid h(x) \ge 0\}$, but if $W^{u,-1}(q)$ is not simple, a suitable $h$ may not exist.

We have decided to let $W^{u,-1}(q)$ be a line (if possible). Considering the function form $f_-(s) = \sigma(w_s \cdot s + w_x \cdot x_-)$, it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines $y = x$ and $y = -x$). We have chosen the bisector in the first (and third) quadrant, i.e., the line $y = x$. Accordingly, $q$ was chosen to be the origin and $p$ was chosen arbitrarily to be $(0.8, 0.8)$. Item 4 is satisfied by setting one of the two eigenvalues of $Df_-$ at the origin to be greater than one and the other smaller than one. We have chosen $1/0.6$ for one and $1/\mu$ for the other, where $\mu$ is to be set so that Items 1 and 2 are satisfied by considering the eigenvalues of $Df_+$ at $p$.

The design consideration that we have skipped so far is how to design $D_0 = \{x \mid h(x) \ge 0\}$. A simple way is to make the boundary $h(x) = 0$ parallel to $W^{u,-1}(q)$ for our intended $q \in Q_\omega$: if we do so, by setting the largest eigenvalue of $Df_-$ at $q$ equal to the inverse of the eigenvalue of $Df_+$ at $p$ along the normal to $W^{u,-1}$, the points $s \in D_0$, $f_- \circ f_+(s)$, $f_-^2 \circ f_+^2(s)$, $\ldots$, $f_-^i \circ f_+^i(s)$, $\ldots$, which correspond to strings in $\{a^n b^n \mid n > 0\}$, reside at approximately equal distances from $W^{u,-1}$.
Needless to say, the points belonging to, say, $\{a^{n+1} b^n \mid n > 0\}$ also have approximately equal distances from $W^{u,-1}$ among themselves, and this distance is different from that for $\{a^n b^n \mid n > 0\}$.

Let $f_-(x) = \sigma(Ax + B_0)$ and $f_+(x) = \sigma(Ax + B_1)$. We plan to put $Q_\omega = \{(0,0)\}$, $P_\omega = \{(0.8, 0.8)\}$, and $W^{u,-1} = \{(x,y) \mid y = x\}$; the eigenvalues of the tangent map of $f_-^{-1}$ at $(0,0)$ are $1/\lambda = 1/0.6$ and $1/\mu$ (where the eigenvector on $y = x$ is expanding), and the eigenvalues of the tangent map of $f_+$ at $(0.8, 0.8)$ are $1/\mu$ and any value. Then, considering the derivatives at $(0,0)$ and $(0.8, 0.8)$, it is easy to see that
$$A = \rho\!\left(\frac{\pi}{4}\right)\begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix}\rho\!\left(-\frac{\pi}{4}\right), \qquad \frac{1}{\mu} = (1 - 0.8^2)\,\mu,$$
where $\rho(\theta)$ is a rotation by $\theta$. Then
$$A = \frac{1}{2}\begin{pmatrix} \lambda+\mu & \lambda-\mu \\ \lambda-\mu & \lambda+\mu \end{pmatrix}.$$
Next, from $\sigma(B_0) = (0,0)^T$ and $\sigma((0.8\lambda, 0.8\lambda)^T + B_1) = (0.8, 0.8)^T$,
$$B_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad B_1 = \begin{pmatrix} \sigma^{-1}(0.8) - 0.8\lambda \\ \sigma^{-1}(0.8) - 0.8\lambda \end{pmatrix}.$$
These give us $\mu = 5/3$, $\lambda = 0.6$, and $B_1 \approx (1.23722, 1.23722)^T$.
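As a numerical sanity check on the construction above, the matrix $A$ can be built from the stated $\lambda$ and $\mu$ and its eigenstructure verified. The following sketch is illustrative (not from the paper) and assumes $\sigma = \tanh$, consistent with the state range $(-1, 1)$:

```python
import numpy as np

# Design parameters from the text: lambda = 0.6 (eigenvalue of Df_- along y = x,
# so 1/lambda is expanding for the inverse map), mu = 5/3 (eigenvalue normal to y = x).
lam, mu = 0.6, 5.0 / 3.0

# A = rho(pi/4) diag(lam, mu) rho(-pi/4); multiplying out gives the symmetric closed form.
theta = np.pi / 4.0
rho = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
A = rho @ np.diag([lam, mu]) @ rho.T
A_closed = 0.5 * np.array([[lam + mu, lam - mu],
                           [lam - mu, lam + mu]])
print(np.allclose(A, A_closed))                     # the closed form matches

# The bisector direction (1, 1) carries eigenvalue lam (contracting for f_-),
# and the normal direction (1, -1) carries eigenvalue mu (expanding for f_-).
print(A @ np.array([1.0, 1.0]))                     # approximately [0.6, 0.6]
print(A @ np.array([1.0, -1.0]))                    # approximately [1.667, -1.667]

# The constraint linking mu to the slope of sigma = tanh at the fixed-point value 0.8:
print(np.isclose(1.0 / mu, (1.0 - 0.8 ** 2) * mu))  # True, giving mu = 5/3
```

This confirms that the chosen $\mu = 5/3$ is exactly the value making the contraction of $f_+$ at $p$ along the normal cancel the expansion of $f_-$ there.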
Fig. 1. The vector field representation of f+ (left) and f− (right)
Fig. 2. $\{f_-^n \circ f_+^n(p) \mid n \ge 1\}$ (upper), $\{f_-^n \circ f_+^{n+1}(p) \mid n \ge 1\}$ (lower left), and $\{f_-^{n+1} \circ f_+^n(p) \mid n \ge 1\}$ (lower right), where $p = (0.5, 0.95)$
In Fig. 1, the left image shows the vector field of $f_+$, where the arrows starting at $x$ end at $f_+(x)$, and the right image shows the vector field of $f_-$. In Fig. 2, the upper plot shows points corresponding to strings in $\{a^n b^n \mid n > 0\}$, the lower-left plot to $\{a^{n+1} b^n \mid n > 0\}$, and the lower-right plot to $\{a^n b^{n+1} \mid n > 0\}$. The initial point in Fig. 2 was set to $p = (0.5, 0.95)$. All plots are for $n = 1$ to $n = 40$; as $n$ grows the points cluster, so we can say that they stay in narrow stripes, i.e., $D_n$, for every $n$.
5 Discussion
We obtained a necessary condition for an SRN to implement a recognizer for the language $\{a^n b^n \mid n > 0\}$ by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that the $D_i$'s are disjoint, $f_+(D_i) \subseteq D_{i+1}$, and $Q_\omega$ is finite. It suggests the possibility of such an implementation, and in fact we have successfully built a recognizer for the language, thereby showing that the learning problem for the language has at least one solution.

The instability of any solution under learning is suggested (but not derived) to be due to the necessity of $P_\omega$ lying in an unstable manifold $W^{u,-1}(q)$ for $q \in Q_\omega$. Since $P_\omega$ is attractive in the above example, $f_+^n(s_0)$ for $s_0 \in D_0$ comes exponentially close to $P_\omega$ as $n$ grows. Since $f_+^n(s_0)$ is then also close to $W^{u,-1}(q)$, even a small fluctuation of $P_\omega$ strongly disturbs $f_-^n(f_+^n(s_0))$, which should be in $D_0$. This means that even if we are close to a solution, a small fluctuation of $P_\omega$ caused by a new training sample may easily push $f_-^n(f_+^n(s_0))$ out of $D_0$.

Since Rodriguez et al. [14] showed that languages that do not belong to the context-free class could be learned to some degree, we have to study the discrepancies further. The instability of grammar learning by SRNs shown above might not be seen in our natural language learning, which suggests that the SRN might not be an appropriate model of language learning.
References

1. Bodén, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. In: Proc. ICANN 1999, vol. 1, pp. 359–364 (1999)
2. Bodén, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000)
3. Bodén, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special issue on natural language and machine learning 19(1/2), 51–63 (2003)
4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998)
5. Chalup, S.K., Blair, A.D.: Incremental training of first order recurrent neural networks to predict a context-sensitive language. Neural Networks 16(7), 955–972 (2003)
6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge
8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (corr. 5th printing, 1997)
10. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996)
12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998)
13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999)
14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001)
15. Schmidhuber, J., Gers, F., Eck, D.: Learning nonregular languages: a comparison of simple recurrent networks and LSTM. Neural Computation 14(9), 2039–2041 (2002)
16. Siegelmann, H.T.: Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser (1999)
17. Wiles, J., Blair, A.D., Bodén, M.: Representation beyond finite states: alternatives to push-down automata. In: A Field Guide to Dynamical Recurrent Networks
Unbiased Likelihood Backpropagation Learning

Masashi Sekino and Katsumi Nitta

Tokyo Institute of Technology, Japan
Abstract. The error backpropagation is one of the popular methods for training an artificial neural network. When the error backpropagation is used for training an artificial neural network, overfitting occurs in the latter half of the training. This paper provides an explanation, within the model selection framework, of why overfitting occurs. The explanation leads to a new method for training an artificial neural network, Unbiased Likelihood Backpropagation Learning. Several experimental results are shown.
1 Introduction
An artificial neural network is one of the models for function approximation. It can approximate an arbitrary function when the number of basis functions is large. The error backpropagation learning [1], a famous method for training an artificial neural network, is the gradient descent method with the squared error on the learning data as its target function. Therefore, the error backpropagation learning can obtain a local optimum while monotonically decreasing the error. However, although the error on the learning data decreases monotonically, the error on test data increases in the latter half of training. This phenomenon is called overfitting.

Early stopping is one method for preventing overfitting. This method stops the training when an estimator of the generalization error no longer decreases. For example, the technique of stopping the training when the error on hold-out data no longer decreases is often applied. However, early stopping basically minimizes the error on the learning data, so there is no guarantee of obtaining the optimal parameter which minimizes the estimator of the generalization error.

When the parameters of the basis functions (the model parameter) are fixed, an artificial neural network becomes a linear regression model. If a regularization parameter is introduced to assure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The cause of an artificial neural network's tendency to overfit is that maximum likelihood estimation with respect to the model parameter is model selection over regular linear regression models based on the empirical likelihood.

In this paper, we propose the unbiased likelihood backpropagation learning, which is the gradient descent method for modifying the model parameter with the unbiased likelihood (an information criterion) as its target function.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008.
© Springer-Verlag Berlin Heidelberg 2008

It is expected
that the proposed method has better approximation performance because it explicitly minimizes an estimator of the generalization error. In the following, Section 2 explains statistical learning and maximum likelihood estimation, and Section 3 briefly explains information criteria. Next, in Section 4, we describe the artificial neural network, regularized maximum likelihood estimation, and why the error backpropagation learning causes overfitting. The proposed method is then explained in Section 5. We show the effectiveness of the proposed method by applying it to the DELVE data set [4] in Section 6. Finally, we conclude this paper in Section 7.
2 Statistical Learning
Statistical learning aims to construct an optimal approximation $\hat p(x)$ of a true distribution $q(x)$ from a set of hypotheses $M \equiv \{p(x|\theta) \mid \theta \in \Theta\}$ using learning data $D \equiv \{x_n \mid n = 1, \cdots, N\}$ obtained from $q(x)$. $M$ is called the model and the approximation $\hat p(x)$ is called the estimate. When we want to denote clearly that the estimate $\hat p(x)$ is constructed using learning data $D$, we use the notation $\hat p(x|D)$. The Kullback-Leibler divergence
$$D(q\|p) \equiv \int q(x)\log\frac{q(x)}{p(x)}\,dx \qquad (1)$$
is used as the distance from $q(x)$ to $p(x)$. The probability density $p(x)$ is called the likelihood, and in particular the likelihood of the estimate $\hat p(x)$ on the learning data $D$,
$$\hat p(D|D) \equiv \prod_{n=1}^{N}\hat p(x_n|D), \qquad (2)$$
is called the empirical likelihood. The sample mean of the log-likelihood,
$$\frac{1}{N}\log p(D) = \frac{1}{N}\sum_{n=1}^{N}\log p(x_n), \qquad (3)$$
asymptotically converges in probability to the mean log-likelihood,
$$E_{q(x)}\left[\log p(x)\right] \equiv \int q(x)\log p(x)\,dx, \qquad (4)$$
according to the law of large numbers, where $E_{q(x)}$ denotes the expectation under $q(x)$. Because the Kullback-Leibler divergence can be decomposed as
$$D(q\|p) = E_{q(x)}\left[\log q(x)\right] - E_{q(x)}\left[\log p(x)\right], \qquad (5)$$
the maximization of the mean log-likelihood is equal to the minimization of the Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation are based on the likelihood.
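The decomposition (5) is easy to verify numerically for discrete distributions; the following sketch (an illustration added here, with arbitrary example probabilities) computes both sides:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])   # "true" distribution q(x) over three outcomes
p = np.array([0.4, 0.4, 0.2])   # model distribution p(x)

# Kullback-Leibler divergence, eq. (1), in its discrete form
kl = np.sum(q * np.log(q / p))

# Decomposition (5): D(q||p) = E_q[log q(x)] - E_q[log p(x)]
neg_entropy = np.sum(q * np.log(q))   # E_q[log q(x)]
mean_loglik = np.sum(q * np.log(p))   # E_q[log p(x)], the mean log-likelihood
print(kl, neg_entropy - mean_loglik)  # the two values coincide
```

Since the first term of (5) does not depend on $p$, maximizing the mean log-likelihood term is exactly minimizing the divergence.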
448
M. Sekino and K. Nitta
Maximum Likelihood Estimation. The maximum likelihood estimate $\hat p_{ML}(x)$ is the hypothesis $p(x|\hat\theta_{ML})$ given by the maximizer $\hat\theta_{ML}$ of the likelihood $p(D|\theta)$:
$$\hat p_{ML}(x) \equiv p(x|\hat\theta_{ML}), \qquad (6)$$
$$\hat\theta_{ML} \equiv \mathop{\mathrm{argmax}}_{\theta}\, p(D|\theta). \qquad (7)$$

3 Information Criterion and Model Selection

3.1 Information Criterion
Because the sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on the likelihood. However, because the learning data $D$ is a finite set in practice, the empirical likelihood $\hat p(D|D)$ contains a bias. This bias $b(\hat p)$ is defined as
$$b(\hat p) \equiv E_{q(D)}\left[\log \hat p(D)\right] - N \cdot E_{q(x)}\left[\log \hat p(x|D)\right], \qquad (8)$$
where $q(D) \equiv \prod_{n=1}^{N} q(x_n)$. Because of this bias, it is known that the model most overfitted to the learning data is selected when a regular model is chosen from a candidate set of regular models based on the empirical likelihood. To address this problem, many information criteria have been proposed which evaluate learning results by correcting the bias. Generally, an information criterion has the form
$$IC(\hat p, D) \equiv -2\log\hat p(D) + 2\hat b(\hat p), \qquad (9)$$
where $\hat b(\hat p)$ is an estimator of the bias $b(\hat p)$. Corrected AIC (cAIC) [3] estimates and corrects the bias of the empirical log-likelihood exactly as
$$\hat b_{cAIC}(\hat p_{ML}) = \frac{N(M+1)}{N - M - 2}, \qquad (10)$$
under the assumptions that the learning model is a normal linear regression model, the true model is included in the learning model, and the estimate is constructed by maximum likelihood estimation. Here, $M$ is the number of explanatory variables and $\dim\theta = M+1$ (the 1 accounts for the estimator of the variance). cAIC asymptotically equals AIC [2]: $\hat b_{cAIC}(\hat p_{ML}) \to \hat b_{AIC}(\hat p_{ML})$ as $N \to \infty$.
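As a concrete reading of (9) and (10), the bias correction can be written as a small helper function (a sketch; the function names are ours, not from the paper):

```python
def caic_bias(n_samples, n_explanatory):
    """Corrected-AIC bias estimate, eq. (10): N (M + 1) / (N - M - 2).

    Returns infinity when the denominator is not positive, following the
    convention used later in the text.
    """
    denom = n_samples - n_explanatory - 2
    if denom <= 0:
        return float("inf")
    return n_samples * (n_explanatory + 1) / denom

def caic(loglik, n_samples, n_explanatory):
    """Information criterion, eq. (9), with the cAIC bias of eq. (10)."""
    return -2.0 * loglik + 2.0 * caic_bias(n_samples, n_explanatory)

# As N grows, the cAIC bias approaches the AIC bias, dim(theta) = M + 1.
print(caic_bias(10 ** 7, 3))  # close to 4
```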
4 Artificial Neural Network
In a regression problem, we want to estimate the true function using learning data D = {(xn , yn ) | n = 1, · · · , N }, where xn ∈ Rd is an input and yn ∈ R is the corresponding output. An artificial neural network is defined as
$$f(x; \theta_M) = \sum_{i=1}^{M} a_i\, \phi(x; \varphi_i), \qquad (11)$$
where $a_i$ ($i = 1, \cdots, M$) are regression coefficients and $\phi(x; \varphi_i)$ are basis functions parameterized by $\varphi_i$. The model parameter of this neural network is $\theta_M \equiv (\varphi_1^T, \cdots, \varphi_M^T)^T$ ($T$ denotes transpose). The design matrix $X$, coefficient vector $a$, and output vector $y$ are defined as
$$X_{ij} \equiv \phi(x_i; \varphi_j), \qquad (12)$$
$$a \equiv (a_1, \cdots, a_M)^T, \qquad (13)$$
$$y \equiv (y_1, \cdots, y_N)^T. \qquad (14)$$
When the model parameter $\theta_M$ is fixed, the artificial neural network (11) becomes a linear regression model parameterized by $a$. The normal linear regression model
$$p(y|x; \theta_M) \equiv \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y - f(x; \theta_M))^2}{2\sigma^2}\right) \qquad (15)$$
is usually used when the noise included in the output $y$ is assumed to follow a normal distribution. In this paper, we call the parameter $\theta_R \equiv (a^T, \sigma^2)^T$ the regular parameter.

4.1 Regularized Maximum Likelihood Estimation
To assure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating $\theta_R = (a^T, \sigma^2)^T$. Regularized maximum likelihood estimation maximizes the regularized log-likelihood
$$\log p(D) - \exp(\lambda)\,\|a\|^2. \qquad (16)$$
$\lambda \in \mathbb{R}$ is called a regularization parameter. The regularized maximum likelihood estimators of the coefficient vector $a$ and the variance $\sigma^2$ are
$$\hat a = Ly \qquad (17)$$
and
$$\hat\sigma^2 = \frac{1}{N}\,\|y - X\hat a\|^2, \qquad (18)$$
where
$$L \equiv (X^T X + \exp(\lambda)\, I)^{-1} X^T \qquad (19)$$
and $I$ is the identity matrix. Here, the effective number of regression coefficients is
$$M_{eff} = \mathrm{tr}(XL). \qquad (20)$$
Therefore, $M_{eff}$ is used in place of the number of explanatory variables $M$ in (10). (When the denominator of (10) is not positive, we use $\hat b_{cAIC}(\hat p) = \infty$.)
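Equations (17)-(20) can be sketched directly in code. The following illustration (with a synthetic design matrix, not from the paper) also shows how $M_{eff}$ moves between $M$ and 0 as the regularization weight $\exp(\lambda)$ varies:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5
X = rng.standard_normal((N, M))                  # design matrix, eq. (12)
y = X @ rng.standard_normal(M) + 0.1 * rng.standard_normal(N)

def regularized_fit(X, y, lam):
    """Regularized ML estimates, eqs. (17)-(20); lam is the log regularization weight."""
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)  # eq. (19)
    a_hat = L @ y                                                # eq. (17)
    sigma2_hat = np.sum((y - X @ a_hat) ** 2) / N                # eq. (18)
    m_eff = np.trace(X @ L)                                      # eq. (20)
    return a_hat, sigma2_hat, m_eff

# Weak regularization: effective dof approaches M; strong: it shrinks toward 0.
_, _, m_weak = regularized_fit(X, y, lam=-20.0)
_, _, m_strong = regularized_fit(X, y, lam=20.0)
print(m_weak, m_strong)   # roughly 5.0 and nearly 0.0
```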
4.2 Overfitting of the Error Backpropagation Learning
The error backpropagation learning is usually used for training an artificial neural network. This method is equal to the gradient descent method based on the likelihood of the linear regression model (15), because the target function is the squared error on the learning data. In what follows, we assume that the noise follows a normal distribution and that the normal linear regression model (15) is regular for all $\theta_M$. Then an artificial neural network becomes a set of regular linear regression models.

For simplicity, let us think about a set of regular models $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$, where $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$ is a regular model. We can define a new model $M_C \equiv \{\hat p(x; \theta_M) \mid \theta_M \in \Theta_M\}$, where $\hat p(x; \theta_M)$ is the estimate for $M(\theta_M)$. Concerning the model parameter $\theta_M$, statistical learning methods construct the estimate of the model $M_C$ based on the empirical likelihood $\hat p(D; \theta_M)$. For example, maximum likelihood estimation selects $\hat\theta_{M\,ML}$, the maximizer of $\hat p(D; \theta_M)$:
$$\hat p_{ML}(x) = \hat p(x; \hat\theta_{M\,ML}), \qquad (21)$$
$$\hat\theta_{M\,ML} \equiv \mathop{\mathrm{argmax}}_{\theta_M}\, \hat p(D; \theta_M). \qquad (22)$$
Thus, maximum likelihood estimation with respect to the model parameter $\theta_M$ is model selection from $H$ based on the empirical likelihood. Because the error backpropagation is a method which realizes maximum likelihood estimation by gradient descent, the model $M(\theta_M)$ gradually becomes overfitted in the latter half of the training. Therefore, we propose a learning method for the model parameter $\theta_M$ based on the unbiased likelihood, which is the empirical likelihood corrected by an appropriate information criterion.
5 Unbiased Likelihood Backpropagation Learning

5.1 Unbiased Likelihood

Using an information criterion $IC(\hat p, D)$, we define the unbiased likelihood as
$$\hat p_{ub}(D) = \exp\!\left(-\frac{1}{2}\, IC(\hat p, D)\right). \qquad (23)$$
This unbiased likelihood satisfies
$$\frac{1}{N}\, E_{q(D)}\left[\log \hat p_{ub}(D)\right] = E_{q(x)}\left[\log \hat p(x)\right] \qquad (24)$$
when the assumptions of the information criterion are satisfied.
5.2 Regular Hierarchical Model
In this paper, we consider a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definition is as follows.

Regular Hierarchical Model
– $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$
– $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$
– $M(\theta_M)$ is a regular model with respect to $\theta_R$.

An artificial neural network is one of the regular hierarchical models. We also define unbiased maximum likelihood estimation as follows.

Unbiased Maximum Likelihood Estimation. The unbiased maximum likelihood estimate $\hat p_{ubML}(x)$ is the estimate $\hat p(x; \hat\theta_{M\,ubML})$ given by the maximizer $\hat\theta_{M\,ubML}$ of the unbiased likelihood $\hat p_{ub}(D; \theta_M)$:
$$\hat p_{ubML}(x) \equiv \hat p(x; \hat\theta_{M\,ubML}), \qquad (25)$$
$$\hat\theta_{M\,ubML} \equiv \mathop{\mathrm{argmax}}_{\theta_M}\, \hat p_{ub}(D; \theta_M). \qquad (26)$$

5.3 Unbiased Likelihood Backpropagation Learning
The partial derivative of the unbiased likelihood is
$$\frac{\partial}{\partial \theta_M}\log\hat p_{ub}(D; \theta_M) = \frac{\partial}{\partial \theta_M}\log\hat p(D; \theta_M) - \frac{\partial}{\partial \theta_M}\hat b(\hat p; \theta_M). \qquad (27)$$
We define the unbiased likelihood estimation based on the gradient method with this partial derivative as unbiased likelihood backpropagation learning.

5.4 Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network
In this paper, we derive the unbiased likelihood backpropagation learning for an artificial neural network when the bias of the empirical likelihood is estimated by cAIC (10). The partial derivative of the empirical likelihood with respect to $\theta_M$, which is the first term of (27), is
$$\frac{\partial}{\partial \theta_M}\log\hat p(D; \theta_M) = \frac{1}{\hat\sigma^2}\left(\frac{\partial X\hat a}{\partial \theta_M}\right)^{T} (y - X\hat a), \qquad (28)$$
$$\frac{\partial X\hat a}{\partial \theta_M} = \left(\frac{\partial X}{\partial \theta_M}\, L + L^T\, \frac{\partial X^T}{\partial \theta_M}\right) y - 2\, L^T X^T\, \frac{\partial X}{\partial \theta_M}\, L\, y. \qquad (29)$$
The partial derivative of cAIC (10) with respect to $\theta_M$, which is the second term of (27), is
$$\frac{\partial}{\partial \theta_M}\hat b_{cAIC}(\hat p; \theta_M) = \frac{N(N-1)}{(N - M - 2)^2}\,\frac{\partial M_{eff}}{\partial \theta_M}, \qquad (30)$$
$$\frac{\partial M_{eff}}{\partial \theta_M} = 2\,\mathrm{tr}\!\left(\frac{\partial X}{\partial \theta_M}\, L - L^T X^T\, \frac{\partial X}{\partial \theta_M}\, L\right). \qquad (31)$$
We can also obtain the partial derivatives of the unbiased likelihood with respect to $\lambda$ as
$$\frac{\partial}{\partial \lambda}\log\hat p(D; \theta_M) = -\frac{1}{\hat\sigma^2}\, y^T L^T L\, (y - X\hat a)\exp(\lambda) \qquad (32)$$
and
$$\frac{\partial M_{eff}}{\partial \lambda} = -\mathrm{tr}(L^T L)\exp(\lambda). \qquad (33)$$
Now that we have obtained the partial derivatives of the unbiased likelihood with respect to $\theta_M$ and $\lambda$, it is possible to apply the unbiased likelihood backpropagation learning to an artificial neural network.
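The derivative (33) can be checked against a central finite difference; an illustrative verification with a synthetic design matrix (our function names, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 40, 6
X = rng.standard_normal((N, M))

def m_eff(lam):
    # Effective dof, eq. (20), with L from eq. (19)
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)
    return np.trace(X @ L)

def m_eff_grad(lam):
    # Analytic derivative, eq. (33): dM_eff/dlambda = -tr(L^T L) exp(lambda)
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)
    return -np.trace(L.T @ L) * np.exp(lam)

lam, eps = 0.3, 1e-6
fd = (m_eff(lam + eps) - m_eff(lam - eps)) / (2 * eps)
print(fd, m_eff_grad(lam))   # the two values agree
```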
6 Application to Kernel Regression Model
The kernel regression model is one of the artificial neural networks. The kernel regression model using Gaussian kernels has a model parameter of degree one, namely the width of the Gaussian kernels, which makes the behavior of learning methods easy to understand. Therefore, the kernel regression model using Gaussian kernels is used in the following simulations. In the implementation of the gradient descent method, we adopted the quasi-Newton method with the BFGS update for estimating the Hessian matrix, and golden section search for determining the step length.

6.1 Kernel Regression Model

The kernel regression model is
$$f(x; \theta_M) = \sum_{n=1}^{N} a_n\, K(x, x_n; \theta_M). \qquad (34)$$
$K(x, x_n; \theta_M)$ are kernel functions parameterized by the model parameter $\theta_M$. The Gaussian kernel
$$K(x, x_n; c) = \exp\!\left(-\frac{\|x - x_n\|^2}{2c^2}\right) \qquad (35)$$
is used in the following simulations, where $c$ is a parameter which decides the width of a Gaussian kernel. The model parameter is $\theta_M = c$.
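A minimal sketch of the model (34)-(35) fitted with the regularized estimator (17); the data, kernel width $c$, and regularization value here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-3, 3, size=20)
y_train = np.sin(x_train)

def gauss_design(x, centers, c):
    # Design matrix of Gaussian kernels, eq. (35): entry (i, n) = K(x_i, x_n; c)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * c ** 2))

c, lam = 0.7, np.log(1e-6)
X = gauss_design(x_train, x_train, c)
a = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(len(x_train)),
                    X.T @ y_train)                # eq. (17): a = L y

def predict(x):
    return gauss_design(x, x_train, c) @ a        # eq. (34)

print(np.max(np.abs(predict(x_train) - y_train)))  # small training error
```

With such weak regularization the fit nearly interpolates the training data; the unbiased likelihood machinery of Section 5 is what would then choose $c$ and $\lambda$ to avoid overfitting.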
Fig. 1. An example of the transition of the mean log empirical likelihood ("emp"), mean log test likelihood ("true"), and mean log unbiased likelihood ("cAIC") over training iterations: (a) empirical likelihood backpropagation (upper); (b) unbiased likelihood backpropagation (lower)
6.2 Simulations
For the purpose of evaluation, the empirical likelihood backpropagation learning and the unbiased likelihood backpropagation learning are applied to the 8-dimensional-input, 1-dimensional-output regression problems of the "kin family" and "pumadyn family" in the DELVE data set [4]. Each family has 4 combinations of fairly linear (f) or nonlinear (n), and medium noise (m) or high noise (h). We use 128 samples as learning data. 50 templates are chosen randomly and kernel functions are placed on the templates.

Fig. 1 shows an example of the transition of the mean log empirical likelihood, mean log test likelihood, and mean log unbiased likelihood (cAIC). It shows that the test likelihood of the empirical likelihood backpropagation learning (a) decreases in the latter half of the training, i.e., overfitting occurs. On the contrary, the unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC), and overfitting does not occur.

Table 1 shows the means and standard deviations of the mean log test likelihood over 100 experiments. A number in bold face indicates a significantly better result by the t-test at the 1% significance level.

Table 1. Means and standard deviations of the mean log test likelihood over 100 experiments. A number in bold face indicates a significantly better result by the t-test at the 1% significance level.

Data          | Empirical BP    | Unbiased BP
kin-8fm       |  2.323 ± 0.346  |  2.531 ± 0.140
kin-8fh       |  1.108 ± 0.240  |  1.599 ± 0.100
kin-8nm       | −0.394 ± 0.415  |  0.078 ± 0.287
kin-8nh       | −0.435 ± 0.236  |  0.064 ± 0.048
pumadyn-8fm   | −2.235 ± 0.249  | −1.708 ± 0.056
pumadyn-8fh   | −3.168 ± 0.187  | −2.626 ± 0.021
pumadyn-8nm   | −3.012 ± 0.238  | −2.762 ± 0.116
pumadyn-8nh   | −3.287 ± 0.255  | −2.934 ± 0.055

6.3 Discussion
The better results of the unbiased likelihood backpropagation learning in Table 1 are attributed to the fact that the method maximizes the true likelihood on average, because the mean of the log unbiased likelihood is equal to the mean log true likelihood (see (24)). The reason why the standard deviations of the test likelihood of the unbiased likelihood backpropagation learning are smaller than those of the empirical likelihood backpropagation learning is assumed to be that the empirical likelihood prefers models with a larger degree of freedom, whereas the unbiased likelihood prefers models with an appropriate degree of freedom. Therefore, the variance of the estimate of the unbiased likelihood backpropagation learning becomes smaller than that of the empirical likelihood backpropagation learning.
7 Conclusion
In this paper, we provided an explanation, within the model selection framework, of why overfitting occurs. We proposed the unbiased likelihood backpropagation learning, which is the gradient descent method for modifying the model parameter with the unbiased likelihood (an information criterion) as its target function, and we confirmed the effectiveness of the proposed method by applying it to the DELVE data set.
References

1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
3. Sugiura, N.: Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics A7, 13–26 (1978)
4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/~delve/
The Local True Weight Decay Recursive Least Square Algorithm

Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

Department of Electronic Engineering, City University of Hong Kong, Hong Kong
[email protected]
Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an efficient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper first presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The effectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.
1 Introduction
Training multilayered feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has attracted much attention in the literature [1, 2, 3, 4]. This is because RLS algorithms are efficient second-order gradient descent training methods. They lead to faster convergence when compared with first-order methods, such as the backpropagation (BP) algorithm. Moreover, fewer parameters need to be tuned during training. Recently, Leung et al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, its decay effect is not substantial, so its generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm was then proposed [5]. However, the computational complexity of TWDRLS is equal to $O(M^3)$ at each iteration, where $M$ is the number of weights. Therefore, it is necessary to reduce the complexity of TWDRLS so that it can be used for large-scale practical problems. The main goal of this paper is to reduce both the computational complexity and the storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and give some discussion of them. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We summarize our findings in Section 5.
2 TWDRLS Algorithm
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 456–465, 2008.
© Springer-Verlag Berlin Heidelberg 2008

A general MFNN is composed of $L$ layers, indexed by $1, \cdots, L$ from input to output. There are $n_l$ neurons in layer $l$. The output of the i-th neuron in the
l-th layer is denoted by $y_{i,l}$. That means the i-th neuron of the output layer is represented by $y_{i,L}$, while the i-th input of the network is represented by $y_{i,1}$. The connection weight from the j-th neuron of layer $l-1$ to the i-th neuron of layer $l$ is denoted by $w_{i,j,l}$. Biases are implemented as weights and are specified by $w_{i,(n_{l-1}+1),l}$, where $l = 2, \cdots, L$. Hence, the number of weights in a MFNN is given by $M = \sum_{l=2}^{L}(n_{l-1}+1)\,n_l$.

In the standard RLS algorithm, we arrange all weights as an $M$-dimensional vector, given by
$$w = \left(w_{1,1,2}, \cdots, w_{1,(n_1+1),2}, \cdots, w_{n_L,1,L}, \cdots, w_{n_L,(n_{L-1}+1),L}\right)^T. \qquad (1)$$
The energy function up to the t-th training sample is given by
$$E(w) = \sum_{\tau=1}^{t}\left\|d(\tau) - h(w, x(\tau))\right\|^2 + \left[w - \hat w(0)\right]^T P^{-1}(0)\left[w - \hat w(0)\right], \qquad (2)$$
where $x(\tau)$ is an $n_1$-dimensional input vector, $d(\tau)$ is the desired $n_L$-dimensional output, and $h(w, x(\tau))$ is a nonlinear function that describes the function of the network. The matrix $P(0)$ is the error covariance matrix and is usually set to $\delta^{-1} I_{M\times M}$, where $I_{M\times M}$ is an $M\times M$ identity matrix. The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7], given by
$$K(t) = P(t-1)\,H^T(t)\left[I_{n_L\times n_L} + H(t)\,P(t-1)\,H^T(t)\right]^{-1} \qquad (3)$$
$$P(t) = P(t-1) - K(t)\,H(t)\,P(t-1) \qquad (4)$$
$$\hat w(t) = \hat w(t-1) + K(t)\left[d(t) - h(\hat w(t-1), x(t))\right], \qquad (5)$$
where $H(t) = \left.\dfrac{\partial h(w, x(t))}{\partial w}\right|_{w = \hat w(t-1)}^{T}$ is the $n_L \times M$ gradient matrix of
$h(w, x(t))$, and $K(t)$ is the so-called Kalman gain matrix ($M \times n_L$) of classical control theory. The matrix $P(t)$ is the so-called error covariance matrix; it is symmetric positive definite.

As mentioned in [5], the standard RLS algorithm has only a limited weight decay effect, being $\delta/t_o$ per training iteration, where $t_o$ is the number of training iterations. The decay effect decreases as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect there is in the data fitting process. A true weight decay RLS algorithm, namely TWDRLS, was then proposed [5], in which a decay term is added to the original energy function. The new energy function is given by
$$E(w) = \sum_{\tau=1}^{t}\left\|d(\tau) - h(w, x(\tau))\right\|^2 + \alpha\, w^T w + \left[w - \hat w(0)\right]^T P^{-1}(0)\left[w - \hat w(0)\right], \qquad (6)$$
where $\alpha$ is a regularization parameter. The gradient of $E(w)$ is given by
$$\frac{\partial E(w)}{\partial w} \approx P^{-1}(0)\left[w - \hat w(0)\right] + \alpha w - \sum_{\tau=1}^{t} H^T(\tau)\left[d(\tau) - H(\tau)w - \xi(\tau)\right]. \qquad (7)$$
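For a purely linear model, $h(w, x) = H w$ with $H(t)$ the input row, the recursion (3)-(5) is classical recursive least squares and converges to the batch least-squares solution. A minimal single-output sketch (an illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4                      # number of weights
delta = 1e-6
P = np.eye(M) / delta      # P(0) = delta^{-1} I, eq. above (2)
w = np.zeros(M)            # w_hat(0)

w_true = rng.standard_normal(M)
for _ in range(200):
    x = rng.standard_normal(M)
    d = w_true @ x                      # noiseless desired output
    H = x[None, :]                      # 1 x M gradient matrix; h(w, x) = H w
    S = np.eye(1) + H @ P @ H.T
    K = P @ H.T @ np.linalg.inv(S)      # eq. (3)
    P = P - K @ H @ P                   # eq. (4)
    w = w + K @ (d - H @ w)             # eq. (5)

print(np.max(np.abs(w - w_true)))  # near zero after enough samples
```

For an MFNN, $H(t)$ instead holds the backpropagated output gradients at $\hat w(t-1)$, but the update structure is identical.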
458
C.S. Leung, K.-W. Wong, and Y. Xu
In the above, we linearize $h(w, x(\tau))$ around the estimate $\hat w(\tau-1)$. That means $h(w, x(\tau)) = H(\tau)w + \xi(\tau)$, where $\xi(\tau) = h(\hat w(\tau-1), x(\tau)) - H(\tau)\hat w(\tau-1) + \rho(\tau)$. To minimize the energy function, we set the gradient to zero. Hence, we have
$$\hat w(t) = P(t)\,r(t), \qquad (8)$$
where
$$P^{-1}(t) = P^{-1}(t-1) + H^T(t)\,H(t) + \alpha I_{M\times M}, \qquad (9)$$
$$r(t) = r(t-1) + H^T(t)\left[d(t) - \xi(t)\right]. \qquad (10)$$
where
Δ
Furthermore, we define P ∗ (t) = [IM×M + αP (t − 1)]−1 P (t − 1). Hence, we have P ∗ −1 (t) = P −1 (t − 1) + αIM×M . With the matrix inversion lemma [7] in the recursive calculation of P (t), (8) becomes −1
P ∗ (t − 1) = [IM×M + αP (t − 1)] P (t − 1) (11) −1 ∗ T ∗ T K(t) = P (t − 1) H (t) InL ×nL + H(t) P (t − 1)H (t) (12) ∗ ∗ P (t) = P (t − 1) − K(t) H(t) P (t − 1) (13) ˆ ˆ − 1)−αP (t)w(t ˆ − 1) + K(t)[d(t) − h(w(t ˆ − 1), x(t))]. (14) w(t) = w(t Equations (11)-(14) are the general global TWDRLS equations. Those are more compact than the equations presented in [5]. Also, the weight updating equation in [5] ( i.e., (14) in this paper) is more complicated. When the regularization parameter α is set to zero, the decay term αw T w vanishes and (11)-(14) reduce to the standard RLS equations. The decay effect can be easily understood by the decay term αw T w in the energy function given by (6). As mentioned in [5], the decay effect per training iterations is equal to αw T w which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch model weight decay methods. Hence, existing heuristic methods [8,9] for choosing the value of α can be used for the TWDRLS’s case. We can also explain the weight decay effect based on the recursive equations (11)-(14). The main difference between the standard RLS equations and the ˆ − 1) in TWDRLS equations is the introduction of a decay term −αP (t) w(t (14). This term guarantees that the magnitude of the updating weight vector decays an amount proportional to αP (t). Since P (t) is positive definite, the magnitude of the weight vector would not be too large. So the generalization ability of the trained networks would be better [8, 9]. A drawback of TWDRLS is the requirement in computing the inverse of the M -dimensional matrix (IM×M + αP (t − 1)). This complexity is equal to O(M 3 ) which is much larger than that of the standard RLS, O(M 2 ). Hence, the TWDRLS algorithm is computationally prohibitive even for a network with moderate size. In the next Section, a local version of the TWDRLS algorithm will be proposed to solve this large complexity problem.
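The recursions (11)-(14) translate directly into a few lines of linear algebra. The following is a minimal, unoptimized sketch; the function and argument names, and the h/grad_h callable interface, are ours rather than the paper's:

```python
import numpy as np

def twdrls_step(w, P, d, x, h, grad_h, alpha):
    """One global TWDRLS iteration, Eqs. (11)-(14).

    w      : (M,)  current estimate w_hat(t-1)
    P      : (M,M) error covariance P(t-1)
    d, x   : desired output d(t) and input x(t)
    h      : callable, h(w, x) -> (nL,) network output
    grad_h : callable, grad_h(w, x) -> (nL, M) gradient matrix H(t)
    alpha  : weight-decay regularization parameter
    """
    M, H = w.size, grad_h(w, x)
    # Eq. (11): P*(t-1) = (I + alpha*P(t-1))^{-1} P(t-1)
    P_star = np.linalg.solve(np.eye(M) + alpha * P, P)
    # Eq. (12): Kalman gain K(t), an M x nL matrix
    S = np.eye(H.shape[0]) + H @ P_star @ H.T
    K = P_star @ H.T @ np.linalg.inv(S)
    # Eq. (13): covariance update
    P_new = P_star - K @ H @ P_star
    # Eq. (14): weight update with the decay term -alpha*P(t)*w_hat(t-1)
    w_new = w - alpha * (P_new @ w) + K @ (d - h(w, x))
    return w_new, P_new
```

With alpha = 0 the step reduces to the standard RLS update of Eqs. (3)-(5).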
The Local True Weight Decay Recursive Least Square Algorithm
3 Localization of the TWDRLS Algorithm
To localize the TWDRLS algorithm, we first divide the weight vector into several small vectors, where w_{i,l} = [w_{i,1,l}, ..., w_{i,(n_{l−1}+1),l}]^T denotes the weights connecting all the neurons of layer l−1 to the i-th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i-th neuron in layer l, we assume that the other weight vectors are constant. Such a technique is commonly used in numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its own energy function. The energy function of the i-th neuron in layer l is given by

E(w_{i,l}) = Σ_{τ=1}^{t} ||d(τ) − h(w_{i,l}, x(τ))||² + α w_{i,l}^T w_{i,l} + [w_{i,l} − ŵ_{i,l}(0)]^T P_{i,l}^{−1}(0) [w_{i,l} − ŵ_{i,l}(0)].   (15)
Utilizing a derivation process similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (except input neurons) has its own set of TWDRLS equations. For the i-th neuron in layer l, the TWDRLS equations are given by

P*_{i,l}(t−1) = [I_{(n_{l−1}+1)×(n_{l−1}+1)} + αP_{i,l}(t−1)]^{−1} P_{i,l}(t−1)   (16)
K_{i,l}(t) = P*_{i,l}(t−1) H_{i,l}^T(t) [I_{n_L×n_L} + H_{i,l}(t) P*_{i,l}(t−1) H_{i,l}^T(t)]^{−1}   (17)
P_{i,l}(t) = P*_{i,l}(t−1) − K_{i,l}(t) H_{i,l}(t) P*_{i,l}(t−1)   (18)
ŵ_{i,l}(t) = ŵ_{i,l}(t−1) − αP_{i,l}(t)ŵ_{i,l}(t−1) + K_{i,l}(t)[d(t) − h(ŵ_{i,l}(t−1), x(t))],   (19)

where H_{i,l} is the n_L × (n_{l−1}+1) local gradient matrix. In this matrix, only the row associated with the considered neuron is nonzero for the output layer L. K_{i,l}(t) is the (n_{l−1}+1) × n_L local Kalman gain, and P_{i,l}(t) is the (n_{l−1}+1) × (n_{l−1}+1) local error covariance matrix.

The training process of the local TWDRLS algorithm is as follows. There are Σ_{l=2}^{L} n_l neurons (excluding input neurons). Hence, there are Σ_{l=2}^{L} n_l sets of TWDRLS equations. We update the local weight vectors in a descending order of l and then an ascending order of i using (16)-(19). At each training stage, only the concerned local weight vector is updated and all other local weight vectors remain unchanged.

In the global TWDRLS, the complexity mainly comes from computing the inverse of the M-dimensional matrix (I_{M×M} + αP(t−1)). This complexity is O(M³), so the computational complexity is TCC_global = O(M³) = O((Σ_{l=2}^{L} n_l (n_{l−1}+1))³). Since the size of the matrix is M × M, the space complexity (storage requirement) is TCS_global = O(M²) = O((Σ_{l=2}^{L} n_l (n_{l−1}+1))²).

From (16), the computational cost of the local TWDRLS algorithm mainly comes from the inversion of an (n_{l−1}+1) × (n_{l−1}+1) matrix. In this way, the computational complexity of each set of local TWDRLS equations is O((n_{l−1}+1)³) and the corresponding space complexity is O((n_{l−1}+1)²). Hence, the total computational complexity of the local TWDRLS is TCC_local = O(Σ_{l=2}^{L} n_l (n_{l−1}+1)³), and the space complexity (storage requirement) is TCS_local = O(Σ_{l=2}^{L} n_l (n_{l−1}+1)²). These are much smaller than the computational and space complexities of the global case.
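The complexity formulas above are easy to evaluate for a concrete architecture. The helper below (our own sketch) counts the raw terms inside the O(·) expressions:

```python
def twdrls_complexities(layers):
    """Raw counts inside the O(.) complexity expressions above.

    layers: neurons per layer, e.g. (2, 10, 1).
    Returns (TCC_global, TCS_global, TCC_local, TCS_local).
    """
    pairs = list(zip(layers, layers[1:]))          # (n_{l-1}, n_l) for l = 2..L
    M = sum(n * (p + 1) for p, n in pairs)         # total number of weights
    return (M ** 3,                                # global: invert an M x M matrix
            M ** 2,                                # global: store an M x M matrix
            sum(n * (p + 1) ** 3 for p, n in pairs),
            sum(n * (p + 1) ** 2 for p, n in pairs))

# 2-10-1 network of Section 4.1 (M = 41) and 12-8-1 network of Section 4.2:
print(twdrls_complexities((2, 10, 1)))    # (68921, 1681, 1601, 211)
print(twdrls_complexities((12, 8, 1)))    # (1442897, 12769, 18305, 1433)
```

The two printed tuples reproduce the orders quoted in Tables 1 and 2 of the next section for the XOR and sunspot networks.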
4 Simulations
Two problems, the generalized XOR and the sunspot data prediction, are considered. We use three-layer networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of the hidden neurons is the hyperbolic tangent. Since the generalized XOR is a classification problem, its output neurons use the hyperbolic tangent function. For the sunspot data prediction problem, the output neurons use the linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR Problem
The generalized XOR problem is formulated as d = sign(x₁x₂) with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron. As a result, there are 41 weights. The training set and test set, shown in Figure 1, consist of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, training samples from the training set are fed to the network one by one. The decision boundaries obtained from typical networks trained with the global and local TWDRLS and the standard RLS algorithms are plotted in Figure 2.
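The data set is trivial to regenerate; in the sketch below we assume the inputs are drawn uniformly from [−1, 1]², which is our reading of Figure 1 rather than a detail stated in the text:

```python
import numpy as np

def generate_xor(n, rng):
    """d = sign(x1 * x2), with inputs assumed uniform on [-1, 1]^2
    (the sampling distribution is our guess from Figure 1)."""
    x = rng.uniform(-1.0, 1.0, size=(n, 2))
    return x, np.sign(x[:, 0] * x[:, 1])

rng = np.random.default_rng(0)
x_train, d_train = generate_xor(50, rng)      # 50 training samples
x_test, d_test = generate_xor(2000, rng)      # 2,000 test samples
```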
(a) Training samples
(b) Test samples
Fig. 1. Training and test samples for the generalized XOR problem
Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem

Algorithm   Computational complexity   Space complexity
Global      O(6.89 × 10⁴)              O(1.68 × 10³)
Local       O(1.60 × 10³)              O(2.21 × 10²)
(a) Global TWDRLS, α = 0
(b) Local TWDRLS, α = 0
(c) Global TWDRLS, α = 0.00178
(d) Local TWDRLS, α = 0.00178
Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem. Note that when α = 0, the TWDRLS is identical to RLS.
From Figures 1 and 2, the decision boundaries obtained from the networks trained with the TWDRLS algorithm are closer to the ideal ones than those obtained with the standard RLS algorithm. Also, both local and global TWDRLS algorithms produce similarly shaped decision boundaries. Figure 3 summarizes the average test set false rates over the 10 runs. The average test set false rates obtained by the global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters. That means both global and local TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities for the global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that
Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem
the performance of the local TWDRLS is comparable to that of the global one, while its complexities are much smaller. Figure 3 indicates that the average test set false rate first decreases with the regularization parameter α and then increases with it. This shows that a proper selection of α will indeed improve the generalization ability of the network. On the other hand, we observe that the test set false rate becomes very high at large values of α, especially for the networks trained with the global TWDRLS algorithm. This is due to the fact that when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. To further illustrate this, we plot in Figure 4 the decision boundary obtained from a network trained with the global TWDRLS algorithm for α = 0.0178. The figure shows that the network has already converged while the decision boundary is still quite far from the ideal one. That means the regularization parameter α cannot be too large; otherwise the network cannot learn the target function.

4.2 Sunspot Data Prediction
Sunspot Data Prediction
The sunspot data from 1700 to 1979 are normalized to the range [0, 1] and taken as the training and test sets. Following common practice, we divide the data into a training set (1700−1920) and two test sets, namely Test-set 1 (1921−1955) and Test-set 2 (1956−1979). The sunspot series is rather nonstationary, and Test-set 2 is atypical for the series as a whole. In the simulation, we assume that the series is generated from the following auto-regressive model:

d(t) = ϕ(d(t−1), ..., d(t−12)) + ε(t),   (20)

where ε(t) is noise and ϕ(·, ..., ·) is an unknown nonlinear function. A network with 12 input neurons, 8 hidden neurons (with hyperbolic tangent activation
Table 2. Computational and space complexities of the global and local TWDRLS algorithms for solving the sunspot data prediction

Algorithm   Computational complexity   Space complexity
Global      O(1.44 × 10⁶)              O(1.28 × 10⁴)
Local       O(1.83 × 10⁴)              O(1.43 × 10³)
Fig. 4. Decision boundaries of a trained network with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large. Hence, the network cannot form a good decision boundary.
(a) Test-set 1 average RMSE
(b) Test-set 2 average RMSE
Fig. 5. RMSE of networks trained by global and local TWDRLS algorithms. Note that when α = 0, the TWDRLS is identical to RLS.
function), and one output neuron (with linear activation function) is used for approximating ϕ(·, ..., ·). The total number of training cycles is 200. As this is a time series problem, the training samples are fed to the network sequentially in each cycle. The criterion used to evaluate the model performance
is the root mean squared error (RMSE) on the test set. The experiment is repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE over the 10 runs. The computational and space complexities for the global and local algorithms are listed in Table 2. We observe from Figure 5 that over a wide range of the regularization parameter α, both global and local TWDRLS algorithms greatly improve the generalization ability of the trained networks, especially on Test-set 2, which is quite different from the training set. However, the test RMSE becomes very large at large values of α. The reasons are similar to those stated in the last subsection: at a large value of α, the weight decay effect is too strong and the network cannot learn the target function. In most cases, the performance of the local training is found to be comparable to that of the global one. Also, Table 2 shows that the complexities of the local training are much smaller than those of the global one.
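The windowing implied by the auto-regressive model of Eq. (20) and the RMSE criterion can be sketched as follows (the helper names are ours):

```python
import numpy as np

def make_ar_dataset(series, order=12):
    """Build (input, target) pairs for the AR model of Eq. (20):
    d(t) is regressed on d(t-1), ..., d(t-order)."""
    s = np.asarray(series, dtype=float)
    X = np.array([s[t - order:t][::-1] for t in range(order, len(s))])
    y = s[order:]
    return X, y

def rmse(pred, target):
    """Root mean squared error, the test criterion used in the text."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(target)) ** 2)))
```

For the sunspot task one would normalize the annual counts to [0, 1], build the training pairs from 1700-1920, and evaluate the RMSE on the two test intervals.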
5 Conclusion
We have investigated the problem of training the MFNN model using the TWDRLS algorithm. We derived a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably when using the local approach. Computer simulations indicate that both local and global TWDRLS algorithms can improve the generalization ability of MFNNs. The performance of the local TWDRLS algorithm is comparable to that of the global one.
Acknowledgement The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).
References

1. Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks 5, 779–787 (1992)
2. Leung, C.S., Wong, K.W., Sum, J., Chan, L.W.: A pruning method for recursive least square algorithm. Neural Networks 14, 147–174 (2001)
3. Scalero, R., Tepedelenlioglu, N.: Fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing 40, 202–210 (1992)
4. Leung, C.S., Sum, J., Young, G., Kan, W.K.: On the Kalman filtering method in neural networks training and pruning. IEEE Trans. Neural Networks 10, 161–165 (1999)
5. Leung, C.S., Tsoi, A.H., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks 12, 1314–1332 (2001)
6. Mosca, E.: Optimal, Predictive, and Adaptive Control. Prentice-Hall, Englewood Cliffs, NJ (1995)
7. Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ (1991)
8. MacKay, D.: Bayesian interpolation. Neural Computation 4, 415–447 (1992)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992)
10. Hager, W.W.: Applied Numerical Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ (1989)
Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift Keisuke Yamazaki and Sumio Watanabe Precision and Intelligence Laboratory, Tokyo Institute of Technology R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan {k-yam,swatanab}@pi.titech.ac.jp
Abstract. In the standard setting of statistical learning theory, we assume that the training and test data are generated from the same distribution. However, this assumption does not hold in many practical cases, e.g., brain-computer interfacing, bioinformatics, etc. In particular, a change of the input distribution in the regression problem often occurs and is known as the covariate shift. There are many studies on adapting to this change, since ordinary machine learning methods do not work properly under the shift. The asymptotic theory has also been developed in the Bayesian inference. Although many effective results have been reported for statistically regular models, non-regular models have not been well studied. This paper focuses on the behavior of non-regular models under the covariate shift. In a former study [1], we formally revealed the factors changing the generalization error and established its upper bound. Here we report that experimental results support the theoretical findings. Moreover, it is observed that the basis function in the model plays an important role in some cases.
1 Introduction
The task of the regression problem is to estimate the input-output relation q(y|x) from sample data, where x and y are the input and output data, respectively. Then, we generally assume that a single input distribution q(x) generates both the training and test data. However, this assumption is not satisfied in practical situations, e.g., brain-computer interfacing [2], bioinformatics [3], etc. The change of the input distribution from the training q0(x) to the test q1(x) is referred to as the covariate shift [4]. It is known that, under the covariate shift, the standard techniques in machine learning cannot work properly, and many efficient methods to tackle this issue have been proposed [4,5,6,7]. In the Bayes estimation, Shimodaira [4] revealed the generalization error improved by the importance weight in regular cases. We formally clarified the behavior of the error in non-regular cases [1]. The result shows that the generalization error is determined by lower-order terms, which are ignored in the situation without the covariate shift. At the same time, it appeared that the calculation of these terms is not straightforward even in a simple regular example. To cope

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 466–476, 2008.
© Springer-Verlag Berlin Heidelberg 2008
with this problem, we also established an upper bound using the generalization error without the covariate shift. However, it is still open how to derive the theoretical generalization error in non-regular models. In this paper, we observe experimental results calculated with a Monte Carlo method and examine the theoretical upper bound on some non-regular models. Comparing the generalization error under the covariate shift to that without the shift, we investigate the effect of a basis function in the learning model and consider the tightness of the bound. In the next section, we define non-regular models and summarize the Bayesian analyses for the models without and with the covariate shift. We show the experimental results in Section 3 and give discussions at the end.
2 Bayesian Generalization Errors with and without Covariate Shift
In this section, we summarize the asymptotic theory on the non-regular Bayes inference. First, we define the non-regular case. Then, mathematical properties of non-regular models without the covariate shift are introduced [8]. Finally, we state the results of the former study [1], which clarified the generalization error under the shift.

2.1 Non-regular Models
Let us define the parametric learning model by p(y|x, w), where x, y and w are the input, output and parameter, respectively. When the true distribution r(y|x) is realized by the learning model, the true parameter w* exists, i.e., p(y|x, w*) = r(y|x). The model is regular if w* is a single point in the parameter space. Otherwise, it is non-regular; the true parameter is not one point but a set of parameters Wt = {w* : p(y|x, w*) = r(y|x)}. For example, three-layer perceptrons are non-regular. Let the true distribution be a zero function with Gaussian noise,

r(y|x) = (1/√(2π)) exp(−y²/2),

and the learning model be a simple three-layer perceptron,

p(y|x, w) = (1/√(2π)) exp(−(y − a tanh(bx))²/2),

where the parameter is w = {a, b}. It is easy to find that the true parameters form the set {a = 0} ∪ {b = 0}, since 0 × tanh(bx) = a × tanh 0 = 0. This non-regularity (also called non-identifiability) means that the conventional statistical methods cannot be applied to such models (we will mention the details in Section 2.2). In spite of this difficulty for the analysis, non-regular models, such as perceptrons, Gaussian mixtures, hidden Markov models, etc., are mainly employed in many information engineering fields.
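The non-identifiability is easy to verify numerically: every parameter on {a = 0} ∪ {b = 0} realizes exactly the same zero regression function, so Wt is a union of two lines rather than a single point. A small sketch:

```python
import numpy as np

def model_mean(a, b, x):
    """Mean of the three-layer perceptron p(y|x, w): f(x; a, b) = a*tanh(b*x)."""
    return a * np.tanh(b * x)

x = np.linspace(-3, 3, 7)
# Every parameter with a = 0 or b = 0 realizes the same zero function,
# so the true parameter set W_t is {a = 0} union {b = 0}, not one point.
for a, b in [(0.0, 1.7), (0.0, -0.2), (0.5, 0.0), (-2.0, 0.0)]:
    assert np.allclose(model_mean(a, b, x), 0.0)
```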
2.2 Properties of the Generalization Error without Covariate Shift
As we mentioned in the previous section, the conventional statistical manner does not work for non-regular models. To cope with this issue, a method was developed based on algebraic geometry. Here, we introduce a summary. Hereafter, we denote the cases without and with the covariate shift by the subscripts 0 and 1, respectively. Some of the functions with suffix 0 will be replaced by those with 1 in the next section.

Let {X^n, Y^n} = {X1, Y1, ..., Xn, Yn} be a set of training samples that are independently and identically generated by the true distribution r(y|x)q0(x). Let p(y|x, w) be a learning machine and ϕ(w) be an a priori distribution of the parameter w. Then the a posteriori distribution (posterior) is defined by

p(w|X^n, Y^n) = (1/Z(X^n, Y^n)) Π_{i=1}^{n} p(Yi|Xi, w) ϕ(w),

where

Z(X^n, Y^n) = ∫ Π_{i=1}^{n} p(Yi|Xi, w) ϕ(w) dw.   (1)

The Bayesian predictive distribution is given by

p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw.

When the number of sample data is sufficiently large (n → ∞), the posterior has its peak(s) at the true parameter(s). The posterior in a regular model is a Gaussian distribution, whose mean is asymptotically the parameter w*. On the other hand, the shape of the posterior in a non-regular model is not Gaussian because of Wt (cf. the right panel of Fig. 10 in Section 3). We evaluate the generalization error by the average Kullback divergence from the true distribution to the predictive distribution:

G0(n) = E^0_{X^n,Y^n}[ ∫∫ r(y|x) q0(x) log( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].

In the standard statistical manner, we can formally calculate the generalization error by integrating the predictive distribution. This integration is viable based on the Gaussian posterior; therefore, this method is applicable only to regular models. The following is one of the solutions for non-regular cases.

The stochastic complexity [9] is defined by

F(X^n, Y^n) = − log Z(X^n, Y^n),   (2)

which can be used for selecting an appropriate model or hyper-parameters. To analyze the behavior of the stochastic complexity, the following functions play important roles:

U0(n) = E^0_{X^n,Y^n}[ F̄(X^n, Y^n) ],   (3)

where E^0_{X^n,Y^n}[·] stands for the expectation value over r(y|x)q0(x) and

F̄(X^n, Y^n) = F(X^n, Y^n) + Σ_{i=1}^{n} log r(Yi|Xi).
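For a one-dimensional parameter, the normalizing integral Z(X^n, Y^n) of Eq. (1), and hence the stochastic complexity of Eq. (2), can be approximated by brute-force grid integration. A sketch with our own helper names, using the line model that appears as Example 1 in Section 3:

```python
import numpy as np

def stochastic_complexity(X, Y, log_p, grid_w):
    """F(X^n, Y^n) = -log Z(X^n, Y^n) for a one-parameter model, with Z
    approximated by grid integration against a standard normal prior phi(w)."""
    phi = np.exp(-0.5 * grid_w ** 2) / np.sqrt(2 * np.pi)
    loglik = np.array([np.sum(log_p(Y, X, w)) for w in grid_w])
    dw = grid_w[1] - grid_w[0]
    Z = np.sum(np.exp(loglik) * phi) * dw       # Eq. (1), Riemann sum
    return -np.log(Z)                           # Eq. (2)

# Line model y = a*x + unit Gaussian noise; data from the zero function.
log_p = lambda y, x, a: -0.5 * (y - a * x) ** 2 - 0.5 * np.log(2 * np.pi)
rng = np.random.default_rng(0)
X, Y = rng.standard_normal(50), rng.standard_normal(50)
F = stochastic_complexity(X, Y, log_p, np.linspace(-5, 5, 2001))
```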
The generalization error and the stochastic complexity are linked by the following equation [10]:

G0(n) = U0(n + 1) − U0(n).   (4)

When the learning machine p(y|x, w) can attain the true distribution r(y|x), the asymptotic expansion of U0(n) is given as follows [8]:

U0(n) = α log n − (β − 1) log log n + O(1).   (5)

The coefficients α and β are determined by integral transforms of U0(n). More precisely, the rational number −α and the natural number β are the largest pole and its order of

J(z) = ∫ H0(w)^z ϕ(w) dw,
H0(w) = ∫∫ r(y|x) q0(x) log( r(y|x) / p(y|x, w) ) dx dy.   (6)

J(z) is obtained by applying the inverse Laplace and Mellin transformations to exp[−U0(n)]. Combining Eqs. (5) and (4) immediately gives

G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),

when G0(n) has an asymptotic form. The coefficients α and β indicate the speed of convergence of the generalization error when the number of training samples is sufficiently large. When the learning machine cannot attain the true distribution (i.e., the model is misspecified), the stochastic complexity has an upper bound of the following asymptotic expression [11]:
(7)
where C is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs.(7) and (4) gives
α β−1 1 G0 (n) ≤ C + − +o , (8) n n log n n log n where C is the bias. 2.3
Properties of the Generalization Error with Covariate Shift
Now, we introduce the results in [1]. Since the test data are distributed according to r(y|x)q1(x), the generalization error under the shift is defined by

G1(n) = E^0_{X^n,Y^n}[ ∫∫ r(y|x) q1(x) log( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].   (9)
We need a function similar to (3), given by

U1(n) = E^1_{Xn,Yn} E_{X^{n−1},Y^{n−1}}[ F̄(X^n, Y^n) ].   (10)

Then, the variant of (4) is obtained as

G1(n) = U1(n + 1) − U0(n).   (11)

When we assume that G1(n) has an asymptotic expansion and converges to a constant, and that U^i(n) has the asymptotic expansion

U^i(n) = a_i n + b_i log n + ... + c_i + d_i/n + o(1/n),

where T_H^i(n) denotes the leading terms a_i n + b_i log n + ... and T_L^i(n) the lower-order terms c_i + d_i/n + o(1/n), it holds that G0(n) and G1(n) are expressed by

G0(n) = a0 + b0/n + o(1/n),
G1(n) = a0 + (c1 − c0) + (b0 + (d1 − d0))/n + o(1/n),   (12)

and that T_H^1(n) = T_H^0(n). Note that b0 = α. The factors c1 − c0 and d1 − d0 determine the difference of the errors. We have also obtained that the generalization error G1(n) has an upper bound

G1(n) ≤ M G0(n),   (13)

if the following condition is satisfied:

M ≡ max_{x∼q0(x)} q1(x)/q0(x) < ∞.   (14)

3 Experimental Generalization Errors in Some Toy Models
Even though we know the factors causing the difference between G1(n) and G0(n) according to Eq. (12), it is not straightforward to calculate the lower-order terms in T_L^i(n). More precisely, the models in which the constant and decreasing factors c_i, d_i of the increasing function U^i(n) can be solved for are restricted. This implies that revealing the analytic expression of G1(n) is still an open issue in non-regular models. Here, we calculate G1(n) experimentally and observe its behavior. A non-regular model requires sampling from the non-Gaussian posterior in the Bayes inference. We use the Markov chain Monte Carlo (MCMC) method to execute this task [12]. In the following examples, we use the common notations: the true distribution is defined by

r(y|x) = (1/√(2π)) exp(−(y − g(x))²/2),
Figs. 1–9. The training and test input distributions: Fig. 1 (μ1, σ1) = (0, 1); Fig. 2 (0, 0.5); Fig. 3 (0, 2); Fig. 4 (2, 1); Fig. 5 (2, 0.5); Fig. 6 (2, 2); Fig. 7 (10, 1); Fig. 8 (10, 0.5); Fig. 9 (10, 2)
the learning model is given by

p(y|x, w) = (1/√(2π)) exp(−(y − f(x, w))²/2),

the prior ϕ(w) is a standard normal distribution, and the input distribution is of the form

q_i(x) = (1/√(2πσ_i²)) exp(−(x − μ_i)²/(2σ_i²))   (i = 0, 1).

The training input distribution has (μ0, σ0) = (0, 1), and there are nine test distributions, where each one has the mean and variance given by a combination of μ1 ∈ {0, 2, 10} and σ1 ∈ {1, 0.5, 2} (cf. Figs. 1–9). Note that the case in Fig. 1 corresponds to q0(x).

As for the experimental setting, the number of training samples is n, the number of test samples is n_test, the number of parameter samples distributed from the posterior with the MCMC method is n_p, and the number of samples used to take the expectation E_{X^n,Y^n}[·] is n_D. In mathematical expressions,

p(y|x, X^n, Y^n) ≈ (1/n_p) Σ_{j=1}^{n_p} p(y|x, w_j) [ Π_{i=1}^{n} p(Yi|Xi, w_j) ϕ(w_j) ] / [ (1/n_p) Σ_{k=1}^{n_p} Π_{i=1}^{n} p(Yi|Xi, w_k) ϕ(w_k) ],

G1(n) ≈ (1/n_D) Σ_{i=1}^{n_D} (1/n_test) Σ_{j=1}^{n_test} log( r(y_j|x_j) / p(y_j|x_j, D_i) ),
Fig. 10. The sampling from the posteriors. The left-upper panel shows the histogram of a for the first model. The left-middle one is the histogram of a3 for the third model. The left-lower one is the point diagram of (a, b) for the second model, and the right one is its histogram.
where D_i = {X_{i1}, Y_{i1}, ..., X_{in}, Y_{in}} stands for the i-th set of training data, and D_i and (x_j, y_j) in G1(n) are taken from q0(x)r(y|x) and q1(x)r(y|x), respectively. The experimental parameters were as follows: n = 500, n_test = 1000, n_p = 10000, n_D = 100.

Example 1 (Lines with Various Parameterizations)

g(x) = 0, f1(x, a) = ax, f2(x, a, b) = abx, f3(x, a) = a³x,

where the true function is the zero function and the learning functions are lines with gradient a, ab and a³, respectively. In this example, all learning functions belong to the same function class, though the second model is non-regular (Wt = {a = 0} ∪ {b = 0}) and the third one has a non-Gaussian posterior. The gradient parameters are taken from the posteriors depicted in Fig. 10. Table 1 summarizes the results. The first row indicates the pairs (μ1, σ1), and the rest give the experimental average generalization errors. G1[f_i] stands for the error of the model with f_i. MG0[f_i] is the upper bound in each case according to Eq. (13). Note that there are some blanks in those rows because of the condition Eq. (14). To compare G1[f3] with G1[f1], the last row shows the values 3 × G1[f3] for each change. Since it is regular, the first model has the theoretical results

G0(n) = 1/(2n) + o(1/(n log n)),  G1(n) = R/(2n) + o(1/(n log n)),  R = (μ1² + σ1²)/(μ0² + σ0²).
1 1 R 1 μ2 + σ12 G0 (n) = +o , G1 (n) = +o , R = 12 . 2n n log n 2n n log n μ0 + σ02 ‘th G1 [f1 ]’ in Table 1 is this theoretical result.
Table 1. Average generalization errors in Example 1

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)   (10,2)
th G1[f1]   0.001     0.00025   0.004     0.005     0.00425   0.008     0.101     0.10025    0.104
G1[f1]      0.001055  0.000239  0.004356  0.006162  0.005107  0.009532  0.109341  0.106619   0.108042
G1[f2]      0.000874  0.000170  0.003466  0.004523  0.003670  0.006669  0.079280  0.078802   0.080180
G1[f3]      0.000394  0.000059  0.001475  0.002374  0.001912  0.003736  0.040287  0.038840   0.039682
MG0[f1]     —         0.002000  —         —         0.028784  —         —         1.79×10²⁶  —
MG0[f2]     —         0.001356  —         —         0.019521  —         —         1.22×10²⁶  —
MG0[f3]     —         0.000667  —         —         0.009595  —         —         5.98×10²⁵  —
3×G1[f3]    0.001182  0.000177  0.004425  0.007122  0.005736  0.011208  0.120861  0.116520   0.119046
Table 2. Average generalization errors in Example 2

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)   (10,2)
G1[f4]      0.000688  0.000260  0.002312  0.002977  0.003094  0.003423  0.013640  0.012769   0.012204
G1[f5]      0.000251  0.000103  0.000934  0.001116  0.001291  0.001318  0.004350  0.003729   0.003743
G1[f6]      0.000146  0.000062  0.000626  0.000705  0.000875  0.000918  0.002896  0.002357   0.002489
MG0[f4]     —         0.001356  —         —         0.019521  —         —         1.22×10²⁶  —
MG0[f5]     —         0.000667  —         —         0.009595  —         —         5.98×10²⁵  —
MG0[f6]     —         0.000400  —         —         0.005757  —         —         3.59×10²⁵  —
3×G1[f5]    0.000753  0.000309  0.002802  0.003348  0.003873  0.003954  0.013050  0.011187   0.011229
5×G1[f6]    0.000730  0.000310  0.003130  0.003525  0.004375  0.004590  0.014480  0.011785   0.012445
We can find that the values of 'th G1[f1]' are very close to G1[f1], in spite of the fact that the theoretical values are established for asymptotic cases. Based on this fact, the accuracy of the experiments can be evaluated by comparing them. As for f2 and f3, they do not have any comparable theoretical value except for the upper bound. We can confirm that every value of G1[f2] and G1[f3] is actually smaller than the bound.

Example 2 (Simple Neural Networks). Let us assume that the true function is the zero function and the learning models are three-layer perceptrons:

g(x) = 0, f4(x, a, b) = a tanh(bx), f5(x, a, b) = a³ tanh(bx), f6(x, a, b) = a⁵ tanh(bx).

Table 2 shows the results. In this example, we can also confirm that the bound works. Combining this with the results of the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound also holds in small-sample cases, i.e., the number of training data n does not have to be sufficiently large. Though we omit the results for lack of space, the bound is always larger than the experimental results for n = 100, 200, ..., 400. The property of the bound will be discussed in the next section.
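The experiments above rely on sampling the non-Gaussian posterior with MCMC [12]. The paper does not specify the sampler, so the random-walk Metropolis kernel below is our own illustrative choice:

```python
import numpy as np

def metropolis_posterior(X, Y, log_model, n_samples, step=0.1, seed=0):
    """Random-walk Metropolis sampling from the posterior
    p(w | X^n, Y^n) proportional to prod_i p(Y_i | X_i, w) * phi(w),
    with a standard normal prior phi(w) on w = (a, b)."""
    rng = np.random.default_rng(seed)

    def log_post(w):
        # log-likelihood of all data plus log-prior (up to constants)
        return np.sum(log_model(Y, X, w)) - 0.5 * np.dot(w, w)

    w = np.zeros(2)
    lp = log_post(w)
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        w_prop = w + step * rng.standard_normal(2)
        lp_prop = log_post(w_prop)
        # accept with probability min(1, posterior ratio)
        if np.log(rng.random()) < lp_prop - lp:
            w, lp = w_prop, lp_prop
        samples[i] = w
    return samples

# Model f4: y = a*tanh(b*x) + unit Gaussian noise; true function g(x) = 0,
# so the data carry no signal and the posterior concentrates around W_t.
log_f4 = lambda y, x, w: -0.5 * (y - w[0] * np.tanh(w[1] * x)) ** 2
rng = np.random.default_rng(1)
X, Y = rng.standard_normal(100), rng.standard_normal(100)
samples = metropolis_posterior(X, Y, log_f4, n_samples=5000)
```

The resulting cloud of (a, b) samples is the kind of non-Gaussian, cross-shaped posterior shown in Fig. 10.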
4 Discussions
First, let us confirm whether the sampling from the posterior was successfully done by the MCMC method. Based on the algebraic geometrical method, the coefficients of G0(n) can be derived for the models (cf. Table 3). As we mentioned, f2, f4 and f3, f5 have the same theoretical error, respectively.

Table 3. The coefficients of the generalization error without the covariate shift

      f1    f2, f4   f3, f5   f6
α     1/2   1/2      1/6      1/10
β     1     2        1        1

According to the examples in the previous section, we can compare the theoretical values to the experimental ones:

G0(n)[f1] = 0.001 ≈ 0.001055,  G0(n)[f2, f4] = 0.000678 ≈ 0.000874, 0.000688,
G0(n)[f3, f5] = 0.000333 ≈ 0.000394, 0.000251,  G0(n)[f6] = 0.0002 ≈ 0.000146.

In the sense of the generalization error, the MCMC method worked well, though there is some fluctuation in the results. Note that it is still open how to evaluate the method. Here we measured it by the generalization error, since the theoretical value of G0(n) is known. However, this index is just a necessary condition. To develop an evaluation of the selected samples is our future study.

Next, we consider the behavior of G1(n). In the examples, the true function was commonly the zero function g(x) = 0. It is an important case to learn the zero function, because we often prepare a sufficiently rich model in practice. Then, the learning function will be of the form f(x, w) = Σ_{k=1}^{K} t(w_{1k}) h(x, w_{2k}), where h is the basis function, t is the parameterization of its weight, and w = {w_{11}, w_{21}, w_{12}, w_{22}, ..., w_{1K}, w_{2K}}. Note that many practical models are included in this expression. Due to the redundancy of the function, some of the h(x, w_{2k}) learn the zero function. Our examples provided the simplest such situations and highlighted the effect of non-regularity in the learning models. The errors G0(n) and G1(n) are generally expressed as
G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),
G1(n) = R1 α/n − R2 (β − 1)/(n log n) + o(1/(n log n)),

where R1 and R2 depend on f, g, q0, and q1. In this expression, R1 and R2 cause the difference between G0 and G1. In Eq. (12), the coefficient of 1/n is given by b0 + (d1 − d0), so

R1 = (b0 + (d1 − d0))/b0 = 1 + (d1 − d0)/α.

Let us write A → B for "A is the only factor determining the value of B". As mentioned above, f, g, q0, q1 → R1, R2. Though f, g → α, β (cf. around Eqs. (5)-(6)), we should emphasize that α, β, q0, and q1 alone do not determine R1 and R2.
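As a numerical sanity check, the leading terms of G0(n) reproduce the theoretical values quoted above for the coefficients in Table 3. This is our own sketch; the sample size n = 500 is an assumption inferred from the reported figures, not stated in this excerpt:

```python
import math

def g0_theory(alpha, beta, n):
    # Leading terms: G0(n) = alpha/n - (beta - 1)/(n log n) + o(1/(n log n))
    return alpha / n - (beta - 1) / (n * math.log(n))

n = 500  # assumed number of training data (inferred from the quoted values)
coeffs = {"f1": (1/2, 1), "f2,f4": (1/2, 2), "f3,f5": (1/6, 1), "f6": (1/10, 1)}
for name, (alpha, beta) in coeffs.items():
    print(f"G0(n)[{name}] = {g0_theory(alpha, beta, n):.6f}")
```

With n = 500 this gives 0.001, 0.000678, 0.000333, and 0.0002, matching the theoretical values listed above.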
Experimental Bayesian Generalization Error of Non-regular Models
This fact is easily confirmed by comparing f2 to f4 (also f3 to f5). It holds that G1(n)[f2] ≠ G1(n)[f4] for all q1, although they have the same α and β (and G0(n)[f2] = G0(n)[f4]). Thus α and β are not informative enough to describe R1 and R2. Comparing the values in G1[f2] to those in G1[f4], the basis function (x in f2 and tanh(bx) in f4) seems to play an important role. To clarify the effect of basis functions, let us fix the function class. Examples 1 and 2 correspond to h(x, w2) = x and h(x, w2) = tanh(bx), respectively. The values of G1[f1] and 3 × G1[f3] (also 3 × G1[f5] and 5 × G1[f6]) can be regarded as the same under any covariate shift. This implies h, g, q0, q1 → R1, i.e., the parameterization t(w1) will not affect R1. Instead, it affects the non-regularity, or the multiplicity, and decides α and β. Though it is an unclear factor, the influence of R2 does not seem as large as that of R1. Last, let us analyze properties of the upper bound M G0(n). According to the above discussion, the ratio R = G1/G0 satisfies R ≤ M. The ratio G1/G0 basically depends on g, h, q0 and q1. However, M is determined only by the training and test input distributions: q0, q1 → M. Therefore this bound gives the worst-case evaluation over any g and h. Considering the tightness of the bound, it may still be improved based on the relation between the true and learning functions.
5
Conclusions
In a former study, we obtained the theoretical generalization error and its upper bound under covariate shift. This paper showed that the theoretical value is supported by experiments, in spite of the fact that it is established in an asymptotic case. We observed the tightness of the bound and discussed the effect of basis functions in the learning models. In this paper, the non-regular models were simple lines and neural networks; it is an interesting issue to investigate more general models. Though we mainly considered the value of G1(n), the computational cost of the MCMC method is strongly connected to the form of the learning function f. Taking this cost into account in the evaluation is left for future study.
Acknowledgements. The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert Müller for fruitful discussions. The software to calculate the MCMC method and technical comments were provided by Kenji Nagata. This research was partly supported by the Alexander von Humboldt Foundation and MEXT 18079007.
References
1. Yamazaki, K., Kawanabe, M., Watanabe, S., Sugiyama, M., Müller, K.R.: Asymptotic Bayesian generalization error when training and test distributions are different. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1079–1086 (2007)
2. Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M.: Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6), 767–791 (2002)
3. Baldi, P., Brunak, S., Stolovitzky, G.A.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge (1998)
4. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244 (2000)
5. Sugiyama, M., Müller, K.R.: Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions 23(4), 249–279 (2005)
6. Sugiyama, M., Krauledat, M., Müller, K.R.: Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (2007)
7. Huang, J., Smola, A., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA (2007)
8. Watanabe, S.: Algebraic analysis for non-identifiable learning machines. Neural Computation 13(4), 899–933 (2001)
9. Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics 14, 1080–1100 (1986)
10. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori, T. (eds.) ALT 1999. LNCS (LNAI), vol. 1720, pp. 39–50. Springer, Heidelberg (1999)
11. Watanabe, S.: Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems 14, 329–336 (2001)
12. Ogata, Y.: A Monte Carlo method for an objective Bayesian procedure. Ann. Inst. Statist. Math.
42(3), 403–433 (1990)
Using Image Stimuli to Drive fMRI Analysis David R. Hardoon1, Janaina Mourão-Miranda2, Michael Brammer2, and John Shawe-Taylor1 1
The Centre for Computational Statistics and Machine Learning Department of Computer Science University College London Gower St., London WC1E 6BT {D.Hardoon,jst}@cs.ucl.ac.uk 2 Brain Image Analysis Unit Centre for Neuroimaging Sciences (PO 89) Institute of Psychiatry, De Crespigny Park London SE5 8AF {Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk
Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis (KCCA), which differs from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple specific categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved first training each algorithm (SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classification accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA is able, in part, to extract many of the brain regions that SVM identifies as the most important in task discrimination, blind to the categorical task labels. Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classifiers, Functional magnetic resonance imaging data analysis.
1
Introduction
Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8, 9,10]. In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g. type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008. c Springer-Verlag Berlin Heidelberg 2008
The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen self-organising maps, fuzzy clustering [12], and nonparametric estimation methods of the hemodynamic response function, such as the general method described in [13]. [14] have reported an interesting study which showed that the discriminability of PCA basis representations of images of multiple object categories is significantly correlated with the discriminability of PCA basis representations of the fMRI volumes based on category labels. The current study differs from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations of the projections of the variables onto corresponding basis vectors are maximised. KCCA first projects the data into a higher-dimensional feature space before performing CCA in the new feature space. CCA [15,16] and KCCA [17] have been used in previous fMRI analyses, but only with conventional categorical stimulus descriptions, without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the stimuli were designed to evoke different types of emotional responses, pleasant or unpleasant. The pleasant images consisted of women in swimsuits while the unpleasant images were a collection of images of skin diseases.
Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. In the current study, we present a feasibility study of generating new activity maps using the actual stimuli that generated the fMRI volumes. We show that KCCA is able to extract brain regions identified by supervised methods such as SVM in task discrimination and to achieve similar levels of accuracy, and we discuss some of the challenges in interpreting the results given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 reviews the fMRI data acquisition as well as the experimental design and the pre-processing, followed by a brief description of the scale invariant feature transformation in Section 2.1. The SVM and the KCCA methodology are described in Section 2.2. Our results are presented in Section 3. We conclude with a discussion in Section 4.
2
Materials and Methods
Due to the lack of space we refer the reader to [10] for a detailed account of the subject, the data acquisition and pre-processing applied to the data, as well as the experimental design.
2.1
Scale Invariant Feature Transformation
Scale Invariant Feature Transformation (SIFT) was introduced by [18] and shown to be superior to other descriptors [20]. This is due to the SIFT descriptors being designed to be invariant to small shifts in the position of salient (i.e. prominent) regions. Calculation of the SIFT vector begins with a scale-space search in which local minima and maxima are identified in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key voxels in a number of orientations. A reference key vector is then computed over all images and the data for each image are represented in terms of distance from this reference. Image Processing. Let f_i^l be the SIFT feature vector for image i, where l is the number of features. Each image i has a different number of SIFT features l, making it difficult to compare two images directly. To overcome this problem we apply K-means clustering to the SIFT features to obtain a uniform frame, finding K classes and their respective centres o_j, j = 1, . . . , K. The feature vector x_i of an image stimulus i is K-dimensional, with j-th component x_{i,j} computed as a Gaussian measure of the minimal distance between the SIFT features f_i^l and the centre o_j. This can be represented as

x_{i,j} = exp( − min_{v ∈ f_i^l} d(v, o_j)^2 )                    (1)
where d(·, ·) is the Euclidean distance. The number of centres is set to the smallest number of SIFT features computed over the images (found to be 300). Therefore, after processing, each image is represented by a 300-dimensional feature vector giving its relative distances from the cluster centres.
2.2
Methods
Support Vector Machines. Support vector machines [21] are kernel-based methods that find functions of the data that facilitate classification. They are derived from statistical learning theory [22] and have emerged as powerful tools for statistical pattern recognition [23]. In the linear formulation an SVM finds,
during the training phase, the hyperplane that separates the examples in the input space according to their class labels. The SVM classifier is trained by providing examples of the form (x, y), where x represents an input and y its class label. Once the decision function has been learned from the training data it can be used to predict the class of a new test example. We used a linear-kernel SVM that allows direct extraction of the weight vector as an image. A parameter C, which controls the trade-off between training errors and smoothness, was fixed at C = 1 for all cases (default value).1 Kernel Canonical Correlation Analysis. Proposed by Hotelling in 1936, Canonical Correlation Analysis (CCA) is a technique for finding pairs of basis vectors that maximise the correlation between the projections of paired variables onto their corresponding basis vectors. Correlation is dependent on the chosen coordinate system; therefore, even if there is a very strong linear relationship between two sets of multidimensional variables, this relationship may not be visible as a correlation. CCA seeks a pair of linear transformations, one for each of the paired variables, such that when the variables are transformed the corresponding coordinates are maximally correlated. Let x and y be two zero-mean random variables from a multi-dimensional distribution, and consider the linear combinations x̃ = wa'x and ỹ = wb'y. Maximising the correlation between x̃ and ỹ corresponds to solving max_{wa,wb} ρ = wa' Cab wb subject to wa' Caa wa = wb' Cbb wb = 1, where Caa and Cbb are the non-singular within-set covariance matrices and Cab is the between-sets covariance matrix. We suggest using the kernel variant of CCA [24], since due to the linearity of CCA useful descriptors may not be extracted from the data; the correlation may exist in some nonlinear relationship.
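The constrained maximisation above can be solved as a generalised eigenvalue problem. The following is a minimal numerical sketch (our implementation, not the authors'; the small ridge term `reg` is our addition to keep the covariance matrices invertible):

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, reg=1e-6):
    """Solve max_{wa,wb} wa' Cab wb  s.t.  wa' Caa wa = wb' Cbb wb = 1
    via the symmetric generalised eigenproblem
        [0   Cab] [wa]         [Caa  0 ] [wa]
        [Cba  0 ] [wb] = rho * [0   Cbb] [wb]."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, da = X.shape
    db = Y.shape[1]
    Caa = X.T @ X / n + reg * np.eye(da)
    Cbb = Y.T @ Y / n + reg * np.eye(db)
    Cab = X.T @ Y / n
    A = np.zeros((da + db, da + db))
    B = np.zeros_like(A)
    A[:da, da:] = Cab
    A[da:, :da] = Cab.T
    B[:da, :da] = Caa
    B[da:, da:] = Cbb
    vals, vecs = eigh(A, B)  # eigenvalues ascending; the top one is rho
    return vals[-1], vecs[:da, -1], vecs[da:, -1]
```

On synthetic data where Y is (nearly) a linear map of X, the leading canonical correlation approaches 1, while for independent X and Y it stays close to 0.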
The kernelising of CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space φ : x = (x1, . . . , xn) → φ(x) = (φ1(x), . . . , φN(x)) (N ≥ n) before performing CCA in the new feature space. Given the kernel functions κa and κb, let Ka = Xa Xa' and Kb = Xb Xb' be the kernel matrices corresponding to the two representations of the data, where Xa is the matrix whose rows are the vectors φa(xi), i = 1, . . . , n, from the first representation, while Xb is the matrix with rows φb(xi) from the second representation. The weights wa and wb can be expressed as linear combinations of the training examples, wa = Xa' α and wb = Xb' β. Substituting into the primal CCA equation gives the optimisation max_{α,β} ρ = α' Ka Kb β subject to α' Ka² α = β' Kb² β = 1. This is the dual form of the primal CCA optimisation problem given above, which can be cast as a generalised eigenvalue problem and for which the first k generalised eigenvectors can be found efficiently. Both CCA and KCCA can thus be formulated as eigenproblems. The theoretical analysis shown in [25,26] suggests the need to regularise kernel CCA, as it shows that the quality of the generalisation of the associated pattern function is controlled by the sum of the squares of the weight vector norms. We
The LibSVM toolbox for Matlab was used to perform the classifications: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
refer the reader to [25, 26] for a detailed analysis and the regularised form of KCCA. Although the advantages of kernel CCA have been demonstrated in various experiments in the literature, we must clarify that in this particular work, as we use a linear kernel in both views, regularised CCA is the same as regularised linear KCCA. Using KCCA with a linear kernel nevertheless has advantages over plain CCA, the most important of which is, in our case, speed, together with the regularisation.2 Using linear kernels so as to allow the direct extraction of the weights, KCCA performs the analysis by projecting the fMRI volumes into the found semantic space defined by the eigenvector corresponding to the largest correlation value (output from the eigenproblem). We classify a new fMRI volume as follows. Let αi be the eigenvector corresponding to the largest eigenvalue, and let φ(x̂) be the new volume. We project the fMRI data into the semantic space w = Xa' αi (these are the training weights, similar to those of the SVM), and using the weights we classify the new example as ŵ = φ(x̂)w, where ŵ is a weighted value (score) for the new volume. The score can be thresholded to allocate a category to each test example. To avoid the complications of finding a threshold, we zero-mean the outputs and threshold the scores at zero, where ŵ < 0 is associated with unpleasant (a label of −1) and ŵ ≥ 0 with pleasant (a label of 1). We hypothesise that KCCA is able to derive additional activities that may exist a priori, but possibly previously unknown, in the experiment, by projecting the fMRI volumes into the semantic space using the remaining eigenvectors corresponding to lower correlation values.
We have attempted to corroborate this hypothesis on the existing data but found that the additional semantic features that cut across pleasant and unpleasant images did not share visible attributes. We have therefore confined our discussion here to the first eigenvector.
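The projection-and-threshold rule described above can be sketched as follows (our naming and shapes, not code from the paper: `X_train` holds training fMRI volumes as rows, `alpha` is the leading dual eigenvector):

```python
import numpy as np

def kcca_classify(X_train, alpha, X_test):
    """Project test volumes onto the primal weight vector w = X_train' alpha,
    zero-mean the scores, and threshold at zero:
    score < 0 -> unpleasant (-1), score >= 0 -> pleasant (+1)."""
    w = X_train.T @ alpha            # primal weights from the dual solution
    scores = X_test @ w
    scores = scores - scores.mean()  # zero-mean the outputs
    return np.where(scores >= 0, 1, -1)
```

For example, with `X_train = np.eye(2)` and `alpha = [1, 0]`, a test volume with a large first component scores above the mean and is labelled +1, while one with a small first component is labelled −1.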
3
Results
Experiments were run on a leave-one-out basis where in each repeat a block of positive and a block of negative fMRI volumes were withheld for testing. Data from the 16 subjects were combined. This amounted, per run, to 1330 training and 14 testing fMRI volumes, each set evenly split into positive and negative volumes (these pos/neg splits were not known to KCCA but simply ensured an equal number of images of both types of emotional salience). The analyses were repeated 96 times. Similarly, we ran a further experiment on a leave-subject-out basis, where 15 subjects were combined for training and one was left out for testing. This gave a total of 1260 training and 84 testing fMRI volumes. The analyses were repeated 16 times. The KCCA regularisation parameter was found using 2-fold cross validation on the training data. Initially we describe the fMRI activity analysis. After training the SVM we are able to extract and display the SVM weights as a representation of the brain
The KCCA toolbox used was from http://homepage.mac.com/davidrh/Code.html
regions important in the pleasant/unpleasant discrimination. A thorough analysis is presented in [10]. The results are shown in Figures 1 and 2, where in both figures the weights are not thresholded and show the contrast between viewing pleasant vs. unpleasant stimuli. The weight value of each voxel indicates the importance of the voxel in differentiating between the two brain states. In Figure 1 the unthresholded SVM weight maps are given. Similarly with KCCA, once the semantic representation has been learnt, we are able to project the fMRI data into the learnt semantic feature space, producing the primal weights. These weights, like those generated from the SVM approach, can be considered a representation of the fMRI activity. Figure 2 displays the KCCA weights. In Figure 3 the unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli (i.e. applied to the SIFT features prior to analysis) are displayed. The hemodynamic response function is the impulse response function used to model the delay and dispersion of hemodynamic responses to neuronal activation [27]. Applying the hemodynamic function to the images' SIFT features reweights the image features according to the computed delay and dispersion model. We compute the hemodynamic function with the SPM2 toolbox with default parameter settings. As the KCCA weights are not driven by simple categorical image descriptors (pleasant/unpleasant) but by complex image feature vectors, it is of great interest that many regions, especially in the visual cortex, found by SVM are also highlighted by KCCA. We interpret this similarity as indicating that many important components of the SIFT feature vector are associated with pleasant/unpleasant discrimination.
Other features in the frontal cortex are much less reproducible between SVM and KCCA, indicating that many brain regions detect image differences not rooted in the major emotional salience of the images. In order to validate the activity patterns found in Figure 2, we show that the learnt semantic space can be used to correctly discriminate withheld (testing) fMRI volumes. We also give the 2-norm error to provide an indication as to
Fig. 1. The unthresholded weight values for the SVM approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed with labels (+1/ − 1).
Fig. 2. The unthresholded weight values for the KCCA approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed without labels. The class discrimination is automatically extracted from the analysis.
Fig. 3. The unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant).
the quality of the patterns found between the fMRI volumes and image stimuli from the testing set, measured by ‖Ka α − Kb β‖₂ (normalised over the number of volumes and analysis repeats). The latter is especially important when the hemodynamic function has been applied to the image stimuli, as straightforward discrimination is then no longer available for comparison. Table 1 shows the average and median performance of SVM and KCCA on the testing of pleasant and unpleasant fMRI blocks for the leave-two-block-out experiment. Our proposed unsupervised approach achieved an average accuracy of 87.28%, slightly less than the 91.52% of the SVM, although both methods had the same median accuracy of 92.86%. The results of the leave-subject-out experiment are given in Table 2, where KCCA achieved an average accuracy of 79.24%, roughly 5% less than the supervised SVM method. In both tables the hemodynamic function is abbreviated as HF. We observe in both tables that the quality of the patterns is better than random. The results demonstrate that the activity analysis is meaningful. To further confirm the validity of the methodology we repeat the experiments with the
Table 1. KCCA & SVM results on the leave-two-block-out experiment. Average and median performance over 96 repeats. The value represents accuracy, hence higher is better. For the 2-norm error, lower is better.

Method               Average  Median  Average ‖·‖₂ error  Median ‖·‖₂ error
KCCA                  87.28    92.86       0.0048             0.0048
SVM                   91.52    92.86         –                  –
Random KCCA           49.78    50.00       0.0103             0.0093
Random SVM            52.68    50.00         –                  –
KCCA with HF            –        –         0.0032             0.0031
Random KCCA with HF     –        –         1.1049             0.9492
Table 2. KCCA & SVM results on the leave-one-subject-out experiment. Average and median performance over 16 repeats. The value represents accuracy, hence higher is better. For the 2-norm error, lower is better.

Method               Average  Median  Average ‖·‖₂ error  Median ‖·‖₂ error
KCCA                  79.24    79.76       0.0025             0.0024
SVM                   84.60    86.90         –                  –
Random KCCA           48.51    47.62       0.0052             0.0044
Random SVM            48.88    48.21         –                  –
KCCA with HF            –        –         0.0016             0.0015
Random KCCA with HF     –        –         0.5869             0.0210
image stimuli randomised, hence breaking the relationship between fMRI volumes and stimuli. In Tables 1 and 2, KCCA and SVM then both show performance equivalent to that of a random classifier. It is also interesting to observe that, when the hemodynamic function is applied, the random-KCCA error is substantially different from, and worse than, that of the non-random KCCA, implying that spurious correlations are found.
4
Discussion
In this paper we present a novel unsupervised methodology for fMRI activity analysis in which a simple categorical description of a stimulus type is replaced by a more informative vector of stimulus (SIFT) features. We use kernel canonical correlation analysis using an implicit representation of a complex state label to make use of the stimulus characteristics. The most interesting aspect of KCCA is its ability to extract visual regions very similar to those found to be important in categorical image classification using supervised SVM. KCCA “finds” areas in the brain that are correlated with the features in the SIFT vector regardless of the stimulus category. Because many features of the stimuli were associated with the pleasant/unpleasant categories we were able to use the KCCA results to classify the fMRI images between these categories. In the current study it is difficult to address the issue of modular versus distributed neural coding as the complexity of the stimuli (and consequently of the SIFT vector) is very high.
A further interesting possible application of KCCA relates to the detection of "inhomogeneities" in stimuli of a particular type (e.g. happy/sad/disgusting emotional stimuli). If KCCA analysis revealed brain regions strongly associated with substructure within a single stimulus category, this could be valuable in testing whether a certain type of image was being consistently processed by the brain, and in designing stimuli for particular experiments. There are many open-ended questions that have not been explored in our current research, which has primarily been focused on fMRI analysis and discrimination capacity. KCCA is a bi-directional technique, and we are therefore also able to compute a weight map for the stimuli from the learned semantic space. This capacity has the potential of greatly improving our understanding of the link between fMRI analysis and stimuli by potentially telling us which image features were important. Acknowledgments. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. David R. Hardoon is supported by the EPSRC project Le Strum, EP-D063612-1. This publication only reflects the authors' views. We would like to thank Karl Friston for the constructive suggestions.
References
1. Cox, D.D., Savoy, R.L.: Functional magnetic resonance imaging (fmri) 'brain reading': detecting and classifying distributed patterns of fmri activity in human visual cortex. Neuroimage 19, 261–270 (2003)
2. Carlson, T.A., Schrater, P., He, S.: Patterns of activity in the categorical representations of objects. Journal of Cognitive Neuroscience 15, 704–717 (2003)
3. Wang, X., Hutchinson, R., Mitchell, T.M.: Training fmri classifiers to detect cognitive states across multiple human subjects. In: Proceedings of the 2003 Conference on Neural Information Processing Systems (2003)
4. Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., Newman, S.: Learning to decode cognitive states from brain images. Machine Learning 1-2, 145–175 (2004)
5. LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X.: Support vector machines for temporal classification of block design fmri data. NeuroImage 26, 317–329 (2005)
6. Mourao-Miranda, J., Bokde, A.L.W., Born, C., Hampel, H., Stetter, S.: Classifying brain states and determining the discriminating activation patterns: support vector machine on functional mri data. NeuroImage 28, 980–995 (2005)
7. Haynes, J.D., Rees, G.: Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience 8, 686–691 (2005)
8. Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D.: Classifying spatial patterns of brain activity with machine learning methods: Application to lie detection. NeuroImage 28, 663–668 (2005)
9. Kriegeskorte, N., Goebel, R., Bandettini, P.: Information-based functional brain mapping. PNAS 103, 3863–3868 (2006)
10. Mourao-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M.: The impact of temporal compression and space selection on svm analysis of single-subject and multi-subject fmri data. NeuroImage (accepted, 2006)
11. Hardoon, D.R., Saunders, C., Szedmak, S., Shawe-Taylor, J.: A correlation approach for automatic image annotation. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 681–692. Springer, Heidelberg (2006)
12. Wismuller, A., Meyer-Base, A., Lange, O., Auer, D., Reiser, M.F., Sumners, D.: Model-free functional mri analysis based on unsupervised clustering. Journal of Biomedical Informatics 37, 10–18 (2004)
13. Ciuciu, P., Poline, J., Marrelec, G., Idier, J., Pallier, C., Benali, H.: Unsupervised robust non-parametric estimation of the hemodynamic response function for any fmri experiment. IEEE TMI 22, 1235–1251 (2003)
14. O'Toole, A.J., Jiang, F., Abdi, H., Haxby, J.V.: Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience 17(4), 580–590 (2005)
15. Friman, O., Borga, M., Lundberg, P., Knutsson, H.: Adaptive analysis of fMRI data. NeuroImage 19, 837–845 (2003)
16. Friman, O., Carlsson, J., Lundberg, P., Borga, M., Knutsson, H.: Detection of neural activity in functional MRI using canonical correlation analysis. Magnetic Resonance in Medicine 45(2), 323–330 (2001)
17. Hardoon, D.R., Shawe-Taylor, J., Friman, O.: KCCA for fMRI Analysis. In: Proceedings of Medical Image Understanding and Analysis, London, UK (2004)
18. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, pp. 1150–1157 (1999)
19. Hardoon, D.R., Mourao-Miranda, J., Brammer, M., Shawe-Taylor, J.: Unsupervised analysis of fmri data using kernel canonical correlation. NeuroImage (in press, 2007)
20. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points.
In: International Conference on Computer Vision and Pattern Recognition, pp. 257–263 (2003) 21. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000) 22. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 23. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: D. Proc. Fifth Ann. Workshop on Computational Learning Theory, pp. 144–152. ACM, New York (1992) 24. Fyfe, C., Lai, P.L.: Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10, 365–377 (2001) 25. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16, 2639–2664 (2004) 26. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004) 27. Stephan, K.E., Harrison, L.M., Penny, W.D., Friston, K.J.: Biophysical models of fmri responses. Current Opinion in Neurobiology 14, 629–635 (2004)
Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin

Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima

Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
[email protected]
Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key to our discussion is the nonlinearity of the optimal solution even when the task family is linear; we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm that calculates, for all tasks at once, results equivalent to Q-learning on each task, it suffers from explosion of set sizes. We introduce adaptive margins to overcome this difficulty.
1 Introduction
Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria; it is linear in the sense that the reward is linear with respect to the weight parameters. For instance, criteria of network routing include end-to-end delay, loss of packets, and the power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and requires trial and error. In addition, appropriate weights may change over time. Parallel RL for all possible weight values is desirable in such cases. The key to our discussion is the nonlinearity of the optimal solution; it is in fact piecewise-linear rather than linear. This fact implies that we cannot obtain the best policy by the following naive approach:

1. Find the value function for each criterion.
2. Calculate the weighted sum of these functions to obtain the total value function.
3. Construct a policy on the basis of the total value function.

A typical example is presented in section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5]. The latter aims at fast adaptation under time-varying environments. The former is our previous report, in which we tried to obtain the optimal solutions for various weight values all together. Though we have developed an algorithm that gives an exactly equivalent solution to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008.
© Springer-Verlag Berlin Heidelberg 2008
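The failure of the naive approach can be seen in a tiny example. The following sketch uses a hypothetical two-step task (not taken from the paper): each per-criterion optimal value function assumes its own optimal future behavior, so their weighted sum overestimates deferred choices.

```python
# Two-step task, gamma = 1. At state 0: action 0 ("go") -> state 1, reward (0, 0);
# action 1 ("take") -> terminal, reward (0.7, 0.7). At state 1: action 0 ->
# terminal, reward (1, 0); action 1 -> terminal, reward (0, 1).
r0 = {0: (0.0, 0.0), 1: (0.7, 0.7)}   # rewards at state 0 per action
r1 = {0: (1.0, 0.0), 1: (0.0, 1.0)}   # rewards at state 1 per action

def q_star(reward_idx):
    """Optimal Q at the start state for a single criterion."""
    v1 = max(r1[a][reward_idx] for a in (0, 1))      # optimal value at the branch
    return {0: v1, 1: r0[1][reward_idx]}             # "go" defers, "take" ends

beta = (0.5, 0.5)

# Naive: weighted sum of the per-criterion optimal Q values.
naive_q = {a: sum(b * q_star(i)[a] for i, b in enumerate(beta)) for a in (0, 1)}
naive_action = max(naive_q, key=naive_q.get)

# True optimum for the weighted reward: at the branch only one of the two
# one-hot rewards can be collected, so "go" is worth max(beta) = 0.5 < 0.7.
true_q = {0: max(sum(b * r for b, r in zip(beta, r1[a])) for a in (0, 1)),
          1: sum(b * r for b, r in zip(beta, r0[1]))}
true_action = max(true_q, key=true_q.get)

print(naive_action, true_action)  # prints: 0 1  (naive "go", optimal "take")
```

The naive total value of "go" is 1 (each criterion imagines its own favorite branch action), while the true weighted value is only 0.5.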
We first introduced a simple approximation with a 'margin' into the decision of convexity [6], and then improved it so that we obtain an interval estimate and can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of the margins. In the margin-based approach, we have to manage large sets of vectors in the first stage of learning, and the peak set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method alleviates this trade-off: by changing the margins appropriately through the learning steps, we can enjoy small set sizes in the first stage with large margins and an accurate result in the final stage with small margins. The weighted criterion model is defined in section 2, and parallel RL for it is described in section 3. The difficulty with set size is then pointed out and margins are introduced in section 4, where adaptive adjustment of the margins is also proposed. Its behavior is verified with experiments in section 5. Finally, a conclusion is given in section 6.
2 Weighted Criterion Model
An "orthodox" RL setting is assumed for states and actions as follows.

- The time step is discrete ($t = 0, 1, 2, 3, \dots$).
- The state set $S$ and the action set $A$ are finite and known.
- The state transition rule $P$ is unknown.
- The state $s_t$ is observable.
- The task is a Markov decision process (MDP).
The reward $r_{t+1}$ is given as a weighted sum of partial rewards $r_{t+1}^1, \dots, r_{t+1}^M$:
$$r_{t+1}(\beta) = \sum_{i=1}^{M} \beta_i r_{t+1}^i = \beta \cdot \boldsymbol{r}_{t+1}, \qquad (1)$$
$$\text{weight vector } \beta \equiv (\beta_1, \dots, \beta_M) \in \mathbb{R}^M, \qquad (2)$$
$$\text{reward vector } \boldsymbol{r}_{t+1} \equiv (r_{t+1}^1, \dots, r_{t+1}^M) \in \mathbb{R}^M. \qquad (3)$$
We assume that the partial rewards $r_{t+1}^1, \dots, r_{t+1}^M$ are also observable, whereas their reward rules $R^{(1)}, \dots, R^{(M)}$ are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We hope to find the optimal policy $\pi^*_\beta$ for each weight $\beta$ that maximizes the expected cumulative reward with a given discount factor $0 < \gamma < 1$,
$$\pi^*_\beta = \operatorname*{argmax}_{\pi} E^{\pi}\!\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau+1}(\beta)\right], \qquad (4)$$
where $E^{\pi}[\cdot]$ denotes the expectation under a policy $\pi$. To be exact, $\pi^*_\beta$ is defined as a policy that attains $Q_\beta^{\pi^*_\beta}(s,a;\gamma) = Q^*_\beta(s,a;\gamma) \equiv \max_\pi Q^\pi_\beta(s,a;\gamma)$ for all state-action pairs $(s,a)$, where the action-value function $Q^\pi_\beta$ is defined as
$$Q^\pi_\beta(s,a;\gamma) \equiv E^{\pi}\!\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau+1}(\beta) \,\middle|\, s_0 = s,\ a_0 = a\right]. \qquad (5)$$
It is well known that an MDP has a deterministic policy $\pi^*_\beta$ that satisfies the above condition; such a $\pi^*_\beta$ is obtained from the optimal value function [2],
$$\pi^*_\beta : S \to A : s \mapsto \operatorname*{argmax}_{a \in A} Q^*_\beta(s,a;\gamma). \qquad (6)$$
Thus we concentrate on estimation of $Q^*_\beta$. Note that $Q^*_\beta$ is nonlinear with respect to $\beta$. A typical example is presented in section 5. Basic properties of the action-value function $Q$ are described briefly in the rest of this section [4,5,6]. The discount factor $\gamma$ is fixed throughout this paper, and it is omitted below.

Proposition 1. $Q^\pi_\beta(s,a)$ is linear with respect to $\beta$ for a fixed policy $\pi$.

Proof. Neither $P$ nor $\pi$ depends on $\beta$ by assumption. Hence, the joint distribution of $(s_0,a_0), (s_1,a_1), (s_2,a_2), \dots$ is independent of $\beta$, which implies linearity.

Definition 1. If $f : \mathbb{R}^M \to \mathbb{R}$ can be written as $f(\beta) = \max_{q \in \Omega}(q \cdot \beta)$ with a nonempty finite set $\Omega \subset \mathbb{R}^M$, we call $f$ Finite-Max-Linear (FML) and write it as $f = \mathrm{FML}_\Omega$. It is trivial that $f$ is convex and piecewise-linear if $f$ is FML.

Proposition 2. The optimal action-value function is FML as a function of the weight $\beta$. Namely, there exists a nonempty finite set $\Omega^*(s,a) \subset \mathbb{R}^M$ for each state-action pair $(s,a)$, and $Q^*_\beta$ is written as
$$Q^*_\beta(s,a) = \max_{q \in \Omega^*(s,a)} q \cdot \beta. \qquad (7)$$
Proof. We have assumed an MDP. It is well known that $Q^*_\beta$ can be written as $Q^*_\beta(s,a) = \max_{\pi \in \Pi} Q^\pi_\beta(s,a)$ for the set $\Pi$ of all deterministic policies. $\Pi$ is finite, and $Q^\pi_\beta$ is linear with respect to $\beta$ from proposition 1. Hence, $Q^*_\beta$ is FML.
Proposition 3. Assume that an estimated action-value function $Q_\beta$ is FML as a function of the weight $\beta$. If we apply Q-learning, the updated
$$Q^{\mathrm{new}}_\beta(s_t,a_t) = (1-\alpha)\,Q_\beta(s_t,a_t) + \alpha\left(\beta \cdot \boldsymbol{r}_{t+1} + \gamma \max_{a \in A} Q_\beta(s_{t+1},a)\right) \qquad (8)$$
is still FML as a function of $\beta$, where $\alpha > 0$ is the learning rate.

Proof. There exists a nonempty finite set $\Omega(s,a) \subset \mathbb{R}^M$ such that $Q_\beta(s,a) = \max_{q \in \Omega(s,a)}(q \cdot \beta)$ for each $(s,a)$. Then (8) implies $Q^{\mathrm{new}}_\beta(s_t,a_t) = \max_{\tilde q \in \tilde\Omega} \tilde q \cdot \beta$, where
$$\tilde\Omega \equiv \left\{ (1-\alpha)q + \alpha(\boldsymbol{r}_{t+1} + \gamma q') \,\middle|\, a \in A,\ q \in \Omega(s_t,a_t),\ q' \in \Omega(s_{t+1},a) \right\}, \qquad (9)$$
because $\max_x f(x) + \max_y g(y) = \max_{x,y}(f(x)+g(y))$ holds in general. The set $\tilde\Omega$ is finite, and $Q^{\mathrm{new}}_\beta$ is FML.

These propositions imply that (1) the true $Q^*_\beta$ is FML, and (2) its estimate $Q_\beta$ is also FML as long as the initial estimate is FML.
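Proposition 3 can be checked numerically. The sketch below (M = 2, with made-up sets) applies the set-level construction of (9) and verifies that evaluating the resulting FML function agrees with the scalar Q-learning update (8) for sampled weights:

```python
import random

def fml(omega, beta):
    """Evaluate an FML function: max over q in omega of q . beta."""
    return max(q[0] * beta[0] + q[1] * beta[1] for q in omega)

def update_omega(omega_sa, omegas_next, r, alpha, gamma):
    """Build the set of (9) from Omega(s_t, a_t) and Omega(s_{t+1}, a), a in A."""
    return {tuple((1 - alpha) * qi + alpha * (ri + gamma * pi)
                  for qi, ri, pi in zip(q, r, p))
            for q in omega_sa
            for omega_a in omegas_next for p in omega_a}

omega_sa = [(1.0, 0.0), (0.2, 0.5)]
omegas_next = [[(0.5, 0.5)], [(0.0, 1.0), (0.3, 0.2)]]  # one set per action
r, alpha, gamma = (1.0, -0.5), 0.5, 0.9
new_omega = update_omega(omega_sa, omegas_next, r, alpha, gamma)

random.seed(0)
for _ in range(100):
    beta = (random.uniform(-2, 2), random.uniform(-2, 2))
    rhs = ((1 - alpha) * fml(omega_sa, beta)
           + alpha * (r[0] * beta[0] + r[1] * beta[1]
                      + gamma * max(fml(om, beta) for om in omegas_next)))
    assert abs(fml(new_omega, beta) - rhs) < 1e-9
print(len(new_omega))  # prints: 6 candidate vectors before redundancy removal
```

The agreement holds for every beta because the max in (9) separates over the two groups of terms, exactly as the proof argues; the growth from 2 to 6 vectors previews the set-size explosion discussed in section 4.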
3 Parallel Q-Learning for All Weights
A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimates $Q_\beta$ for all $\beta \in \mathbb{R}^M$ are updated all together in parallel Q-learning. In this method, $Q_\beta(s,a)$ for each $(s,a)$ is treated in an FML expression:
$$Q_\beta(s,a) = \max_{q \in \Omega(s,a)} q \cdot \beta = \mathrm{FML}_{\Omega(s,a)}(\beta) \qquad (10)$$
with a certain set $\Omega(s,a) \subset \mathbb{R}^M$. We store and update $\Omega(s,a)$ instead of $Q_\beta(s,a)$ on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm.

Definition 2. An element $c \in \Omega$ is redundant if $\mathrm{FML}_{\Omega - \{c\}} = \mathrm{FML}_{\Omega}$.

Definition 3. We use $\Omega^\dagger$ to represent the non-redundant elements in $\Omega$. Note that $\mathrm{FML}_{\Omega^\dagger} = \mathrm{FML}_{\Omega}$ [5].

Definition 4. We define the following operations:
$$c\Omega \equiv \{cq \mid q \in \Omega\}, \quad c + \Omega \equiv \{c + q \mid q \in \Omega\}, \quad \Omega \oplus \Omega' \equiv \{q + q' \mid q \in \Omega,\ q' \in \Omega'\}, \qquad (11)$$
$$\Omega \uplus \Omega' \equiv (\Omega \cup \Omega')^\dagger, \quad \biguplus_{k=1}^{K} \Omega_k \equiv \left(\bigcup_{k=1}^{K} \Omega_k\right)^{\!\dagger}, \qquad (12)$$
$$\Omega \boxplus \Omega' \equiv (\Omega \oplus \Omega')^\dagger. \qquad (13)$$
With these operations, the updating rule of $\Omega$ is described as follows [6]:
$$\Omega^{\mathrm{new}}(s_t,a_t) = (1-\alpha)\,\Omega(s_t,a_t) \boxplus \alpha\left(\boldsymbol{r}_{t+1} + \gamma \biguplus_{a \in A} \Omega(s_{t+1},a)\right). \qquad (14)$$
The initial value of $\Omega$ at $t = 0$ is $\Omega(s,a) = \{\boldsymbol{o}\} \subset \mathbb{R}^M$ for all $(s,a) \in S \times A$. It corresponds to a constant initial function $Q_\beta(s,a) = 0$.

Proposition 4. When (10) holds for all states $s \in S$ and actions $a \in A$, $Q^{\mathrm{new}}_\beta(s_t,a_t)$ in (8) is equal to $\mathrm{FML}_{\Omega^{\mathrm{new}}(s_t,a_t)}(\beta)$ for (14). Namely, parallel Q-learning is equivalent to Q-learning for each $\beta$: the diagram
$$\{Q_\beta(s,a)\} \xrightarrow{\ \text{update}\ } Q^{\mathrm{new}}_\beta(s_t,a_t), \qquad \{\Omega(s,a)\} \xrightarrow{\ \text{update}\ } \Omega^{\mathrm{new}}(s_t,a_t) \qquad (15)$$
commutes with the FML expression.
[Figure 1 illustrates the construction for two-dimensional convex polygons: (1) set directions of edges; (2) merge and sort the edges according to their arguments; (3) connect the edges to generate a polygon; (4) shift the origin so that the coordinate maxima of $\Omega$ and $\Omega'$ add up to those of $\Omega \boxplus \Omega'$.]

Fig. 1. Calculation of $\Omega \boxplus \Omega'$ in (14) for two-dimensional convex polygons. Vertices of the polygons correspond to $\Omega$, $\Omega'$ and $\Omega \boxplus \Omega'$.
Proof. We have introduced a set $\tilde\Omega$ in (9) to prove proposition 3. With the above operations, (9) is written as
$$\tilde\Omega = (1-\alpha)\,\Omega(s_t,a_t) \oplus \alpha\left(\boldsymbol{r}_{t+1} + \gamma \biguplus_{a \in A} \Omega(s_{t+1},a)\right).$$
Then $(\tilde\Omega)^\dagger = \Omega^{\mathrm{new}}(s_t,a_t)$ is obtained, and $\mathrm{FML}_{\Omega^{\mathrm{new}}(s_t,a_t)}(\beta) = \mathrm{FML}_{\tilde\Omega}(\beta) = Q^{\mathrm{new}}_\beta(s_t,a_t)$ is implied.

It is well known that $\Omega^\dagger$ is equal to the set of vertices of the convex hull of $\Omega$ [6]. Efficient convex-hull algorithms have been developed in computational geometry [8]. Using them, we can calculate the merged set $\Omega \uplus \Omega' = (\Omega \cup \Omega')^\dagger$. The sum set $\Omega \boxplus \Omega'$ has also been studied in the form of Minkowski sum algorithms [9,10,11]. Its calculation is particularly easy for two-dimensional convex polygons (Fig. 1). Before closing the present section, we note an FML version of the Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know $P$ and $R$, though we must take care of numerical error in practice.

Proposition 5. The FML expression $Q^*_\beta = \mathrm{FML}_{\Omega^*}(\beta)$ satisfies
$$\Omega^*(s,a) = \left(R^a_s + \gamma \operatorname*{\boxplus}_{s' \in S} P^a_{ss'} \biguplus_{a' \in A} \Omega^*(s',a')\right)^{\!\dagger}, \qquad (16)$$
where
$$R^a_s = (R^a_s(1), \dots, R^a_s(M)), \qquad R^a_s(i) = E[r^i_{t+1} \mid s_t = s,\ a_t = a], \qquad (17)$$
$$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s,\ a_t = a), \qquad (18)$$
$$\operatorname*{\boxplus}_{s' \in \{s_1,\dots,s_k\}} X_{s'} = X_{s_1} \boxplus X_{s_2} \boxplus \cdots \boxplus X_{s_k}. \qquad (19)$$
In particular, the next equation holds if the state transition is deterministic:
$$\Omega^*(s,a) = \left(R^a_s + \gamma \biguplus_{a' \in A} \Omega^*(s',a')\right)^{\!\dagger}, \qquad (20)$$
where $s'$ is the next state for the action $a$ at the current state $s$.

Proof. Substituting (7) and $R^a_{s,\beta} \equiv E[\boldsymbol{r}_{t+1}(\beta) \mid s_t = s,\ a_t = a] = R^a_s \cdot \beta$ into the Bellman equation
$$Q^*_\beta(s,a) = R^a_{s,\beta} + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a' \in A} Q^*_\beta(s',a'), \qquad (21)$$
we obtain
$$\max_{q \in \Omega^*(s,a)} q \cdot \beta = \max_{q' \in \Omega'(s,a)} q' \cdot \beta, \qquad \Omega'(s,a) = \left\{ R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, q_{s'} \,\middle|\, q_{s'} \in \biguplus_{a' \in A} \Omega^*(s',a') \right\} \qquad (22)$$
in the same way as (9). Hence, $\Omega^*$ is equal to $\Omega'$ except for redundancy.
4 Interval Operations
Under regularity conditions, Q-learning has been proved to converge to $Q^*$ [1]. That result implies pointwise convergence of parallel Q-learning to $Q^*_\beta$ for each $\beta$ because of proposition 3. From proposition 2, $Q^*_\beta(s,a)$ is expressed with a finite $\Omega^*(s,a)$. However, as we can see in Fig. 1, the number of elements in the set $\Omega(s,a)$ increases monotonically and it never 'converges' to $\Omega^*(s,a)$. This is not a paradox; the following assertions can be true at the same time.

1. The numbers of vertices of polygons $P_1, P_2, \dots$ monotonically increase.
2. $P_t$ converges to a polygon $P^*$ in the sense that the volume of the symmetric difference $P_t \,\triangle\, P^* = (P_t \cup P^*) - (P_t \cap P^*)$ converges to 0.
2'. The function $\mathrm{FML}_{P_t}(\cdot)$ converges pointwise to $\mathrm{FML}_{P^*}(\cdot)$.

In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that this is not a problem of the algorithm. It is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight $\beta$. To overcome this difficulty, we first tried a simple approximation with a small 'margin' [6]. Then we introduced interval operations to monitor the approximation error [7]. A pair of sets $\Omega^L(s,a)$ and $\Omega^U(s,a)$ is updated instead of the original $\Omega(s,a)$ so that $\mathrm{CH}\,\Omega^L(s,a) \subset \mathrm{CH}\,\Omega(s,a) \subset \mathrm{CH}\,\Omega^U(s,a)$ holds, where $\mathrm{CH}\,Z$ represents the convex hull of $Z$. This relation implies lower and upper bounds $Q^L_\beta(s,a) \le Q_\beta(s,a) \le Q^U_\beta(s,a)$, where $Q^X_\beta(s,a) = \mathrm{FML}_{\Omega^X(s,a)}(\beta)$ for $X = L, U$. When the difference between $Q^L$ and $Q^U$ is sufficiently small, it is guaranteed that the effect of the approximation can be ignored. The updating rules of $\Omega^L$ and $\Omega^U$ are the same as those of $\Omega$, except for the following approximations after every calculation of $\uplus$ and $\boxplus$. We assume $M = 2$ here.

Lower approximation for $\Omega^L$: a vertex is removed if the change of the area of $\mathrm{CH}\,\Omega^L(s,a)$ is smaller than a threshold $\epsilon^L/2$ (Fig. 2, left).
[Figure 2 illustrates the two approximations on a convex hull with vertices a, b, c, d, e. Left (lower approximation): if the area of the triangle spanned by a vertex and its two neighbors is small, the vertex (c) is removed. Right (upper approximation): if the area change is small, an edge is removed by replacing its endpoints (b, c) with the intersection (z) of the adjacent edges.]

Fig. 2. Lower approximation (left) and upper approximation (right)
Upper approximation for $\Omega^U$: an edge is removed if the change of the area of $\mathrm{CH}\,\Omega^U(s,a)$ is smaller than a threshold $\epsilon^U/2$ (Fig. 2, right).

In this paper, we propose an automatic adjustment of the margins $\epsilon^L, \epsilon^U$. The procedures below are performed at every step $t$ after the updating of $\Omega^L, \Omega^U$. The symbol $X$ represents $L$ or $U$ here; $\xi_s, \xi_w \ge 1$ and $\theta_Q, \theta_\Omega \ge 0$ are constants.

1. Check the changes of the set sizes and the interval width compared with the previous ones. Namely, check the values
$$\Delta^X_\Omega = \left|\Omega^{X,\mathrm{new}}(s_t,a_t)\right| - \left|\Omega^{X}(s_t,a_t)\right|, \qquad (23)$$
$$\Delta_Q = \left(Q^{U,\mathrm{new}}_{\bar\beta}(s_t,a_t) - Q^{L,\mathrm{new}}_{\bar\beta}(s_t,a_t)\right) - \left(Q^{U}_{\bar\beta}(s_t,a_t) - Q^{L}_{\bar\beta}(s_t,a_t)\right), \qquad (24)$$
where $|Z|$ is the number of elements in $Z$, and $\bar\beta$ is selected beforehand.
2. An increase of the set size suggests a need for thinning, whereas an increase of the interval width suggests a need for more accurate calculation. Modify the margins as
$$\epsilon^{X,\mathrm{new}} = \begin{cases} \tilde\epsilon^X & (\Delta_Q \le \theta_Q) \\ \tilde\epsilon^X / \xi_w & (\Delta_Q > \theta_Q) \end{cases}, \qquad \text{where } \tilde\epsilon^X = \begin{cases} \epsilon^X & (\Delta^X_\Omega \le \theta_\Omega) \\ \xi_s\, \epsilon^X & (\Delta^X_\Omega > \theta_\Omega) \end{cases}. \qquad (25)$$
To avoid underflow, we set $\epsilon^{X,\mathrm{new}} = \epsilon_{\min}$ if $\epsilon^{X,\mathrm{new}}$ is smaller than a constant $\epsilon_{\min}$.
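Rule (25), together with the underflow clipping, fits in a small function; treat this as a sketch, with parameter defaults borrowed from the values used in section 5:

```python
def adapt_margin(eps, delta_omega, delta_q,
                 xi_s=1.7, xi_w=1.015, theta_omega=2, theta_q=0.0, eps_min=1e-14):
    """One application of (25): grow eps when the vector set grew too much
    (need for thinning), shrink it when the interval width grew (need for
    accuracy); clip at eps_min to avoid underflow."""
    eps_tilde = xi_s * eps if delta_omega > theta_omega else eps
    eps_new = eps_tilde / xi_w if delta_q > theta_q else eps_tilde
    return max(eps_new, eps_min)

assert abs(adapt_margin(1e-4, delta_omega=5, delta_q=0.0) - 1.7e-4) < 1e-15
assert abs(adapt_margin(1e-4, delta_omega=0, delta_q=0.1) - 1e-4 / 1.015) < 1e-15
assert adapt_margin(1e-15, delta_omega=0, delta_q=0.0) == 1e-14
```

Because $\xi_s$ is applied multiplicatively and $\xi_w$ divides, repeated growth in set size raises the margin geometrically, while repeated growth in interval width lowers it geometrically.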
5 Experiments with a Basic Task of Weighted Criterion
We have verified the behavior of the proposed method on the task in Fig. 3 [6], with $S = \{\mathrm{S, G, A, B, X, Y}\}$, $A = \{\mathrm{Up, Down, Left, Right}\}$, $s_0 = \mathrm{S}$, and $\gamma = 0.8$. Each action causes a deterministic state transition in the corresponding direction, except at G, where the agent is moved to S regardless of its action. Rewards $1$, $4b$, $b$ are offered at $s_t = \mathrm{G, X, Y}$, respectively. If $a_t$ is an action toward the outside wall, the state is unchanged and a negative reward $(-1)$ is added. This is a weighted criterion model with $M = 2$, because the reward can be written in the form $r_{t+1}(\beta) = \beta \cdot \boldsymbol{r}_{t+1}$ for $\boldsymbol{r}_{t+1} = (r^1_{t+1}, r^2_{t+1})$ and $\beta = (b, 1)$. The optimal policy changes depending on the weight $b$; hence the optimal value function is
[Figure 3 shows a 2×3 grid: top row S, X (4b), G (1); bottom row A, Y (b), B; outside the grid is a wall (−1).]

Fig. 3. Task for experiments. Numbers in parentheses are reward values.

Table 1. Optimal state-value functions and optimal policies

Range of weight            Optimal $V^*_b(\mathrm{S})$    Optimal state transition
$b < -16/25$               $0$                            S → A → S → ···
$-16/25 \le b < -225/1796$ $(2000b + 1280)/2101$          S → A → Y → B → G → S → ···
$-225/1796 \le b < 15/47$  $(400b + 80)/61$               S → X → G → S → ···
$15/47 \le b < 3/4$        $32b/3$                        S → X → Y → X → ···
$3/4 \le b$                $16b - 4$                      S → X → X → ···
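Table 1 can be cross-checked by plain scalar value iteration on a reconstruction of the task. This is a sketch: the reward model (reward attached to the state being left, plus a −1 penalty for bumping into the outside wall) is inferred from the table's closed forms rather than stated explicitly in the text.

```python
gamma = 0.8
pos = {'S': (0, 0), 'X': (0, 1), 'G': (0, 2), 'A': (1, 0), 'Y': (1, 1), 'B': (1, 2)}
cell = {v: k for k, v in pos.items()}
moves = {'Up': (-1, 0), 'Down': (1, 0), 'Left': (0, -1), 'Right': (0, 1)}

def step(s, a, b):
    """Deterministic transition; returns (next state, reward r_{t+1})."""
    r = {'X': 4 * b, 'Y': b, 'G': 1.0}.get(s, 0.0)
    if s == 'G':
        return 'S', r                    # from G the agent is reset to S
    i, j = pos[s]
    di, dj = moves[a]
    if (i + di, j + dj) in cell:
        return cell[(i + di, j + dj)], r
    return s, r - 1.0                    # wall bump: stay put, extra -1

def v_star_S(b, sweeps=300):
    v = {s: 0.0 for s in pos}
    for _ in range(sweeps):
        v = {s: max(step(s, a, b)[1] + gamma * v[step(s, a, b)[0]]
                    for a in moves) for s in pos}
    return v['S']

# The computed optima agree with the closed forms in Table 1:
assert abs(v_star_S(1.0) - (16 * 1.0 - 4)) < 1e-6    # S -> X -> X -> ...
assert abs(v_star_S(0.5) - 32 * 0.5 / 3) < 1e-6      # S -> X -> Y -> X -> ...
assert abs(v_star_S(0.0) - 80 / 61) < 1e-6           # S -> X -> G -> S -> ...
assert abs(v_star_S(-1.0)) < 1e-6                    # S -> A -> S -> ...
```

The switch of the optimal cycle as b crosses the table's breakpoints is precisely the piecewise-linearity of $V^*_b(\mathrm{S})$ in b.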
[Figures 4 and 5 plot results over learning steps t for initial margins 10^-13, 10^-10, 10^-7, 10^-4, 0.1.]

Fig. 4. Transition of margins $\epsilon^L$ and $\epsilon^U$ from various initial margins

Fig. 5. Total number of elements $\sum_{s,a} |\Omega^X(s,a)|$ (left: $X = L$, right: $X = U$)
Fig. 6. Interval width $Q^U_{(0.2,1)}(\mathrm{A}, \mathrm{Up}) - Q^L_{(0.2,1)}(\mathrm{A}, \mathrm{Up})$

Fig. 7. Fixed-margin algorithm ($\epsilon^U = \epsilon^L = 10^{-2}$ and $\epsilon^U = \epsilon^L = 10^{-9}$). Left: total number of elements $\sum_{s,a} |\Omega^X(s,a)|$ for $X = U, L$. Right: interval width.

Fig. 8. Average of 100 trials with inappropriate factors $\xi_s = 1.5$, $\xi_w = 1.015$ for $\gamma = 0.5$. Left: total number of elements in upper approximation. Right: interval width.
nonlinear with respect to $b$ (Table 1). Note that the second pattern (S → A → Y) in Table 1 cannot appear under the naive approach of section 1. The proposed algorithm is applied to this task with random actions $a_t$ and parameters $\alpha = 0.7$, $(\xi_s, \xi_w) = (1.7, 1.015)$, $(\theta_Q, \theta_\Omega) = (0, 2)$, $\bar\beta = (0.2, 1)$, $\epsilon_{\min} = 10^{-14}$. The initial margins $\epsilon^L = \epsilon^U$ at $t = 0$ are one of $10^{-1}$, $10^{-4}$, $10^{-7}$, $10^{-10}$, $10^{-13}$. On this task, we can replace convex hulls with upper convex hulls in our algorithm because $\beta$ is restricted to the upper half plane [6]. We also assume $|b| \le 10 \equiv b_{\max}$, and for the lower approximation we can safely remove the edges at both ends in Fig. 2 if the absolute value of their slope is greater than $b_{\max}$. Averages of 100 trials are shown in Figs. 4-6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and a small interval width at the same time; these requirements are in a trade-off in the conventional fixed-margin algorithm [7] (Fig. 7). A problem of the proposed algorithm is sensitivity to the factors $\xi_s, \xi_w$: when they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width $Q^U - Q^L$ compared with the fixed-margin algorithm.
6 Conclusion
A parallel RL method with adaptive margins is proposed for the weighted criterion model, and its behavior is verified experimentally on a basic task. Adaptive margins realize reduced set sizes and accurate results. One problem of the adaptive margins is instability for inappropriate parameters: though the method is robust to the initial margins, it needs tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimates. These points must be studied further.
References

1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)
Convergence Behavior of Competitive Repetition-Suppression Clustering

Davide Bacciu¹,² and Antonina Starita²

¹ IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy
[email protected]
² Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
[email protected]
Abstract. Competitive Repetition-Suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In a previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, a global minimum criterion for learning vector quantization in kernel space is proposed and used to assess the correct location property of the CoRe algorithm.
1 Introduction
CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, that has been used for studying the convergence behavior of the rival penalization scheme. In this paper, we present a convergence analysis for CoRe clustering that builds on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division, and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is strengthened with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. To this end, we introduce a kernel interpretation of the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 497–506, 2008.
© Springer-Verlag Berlin Heidelberg 2008
2 A Kernel Based Loss Function for CoRe Clustering
A CoRe clustering network consists of cluster detector units. Each unit $u_i$ is characterized by a prototype $c_i$, which identifies its preferred stimulus and represents the learned cluster centroid, and by an activation function $\varphi_i(x_k, \lambda_i)$, defined in terms of a set of parameters $\lambda_i$, that determines the firing strength of the unit in response to the presentation of an input pattern $x_k \in \chi$. Such an activation function measures the similarity between the prototype $c_i$ and the inputs, determining whether the pattern $x_k$ belongs to the $i$-th cluster. In the remainder of the paper we will use an activation function that is a Gaussian centered at $c_i$ with spread $\sigma_i$, i.e.
$$\varphi_i(x_k \mid \{c_i, \sigma_i\}) = \exp\!\left(-\frac{\|x_k - c_i\|^2}{2\sigma_i^2}\right).$$
CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. The competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input $x_k$ as the set of units $u_i$ that fire more than $\theta_{win}$, or the single unit that is maximally active for the pattern, that is
$$win_k = \{ i \mid \varphi_i(x_k, \{c_i, \sigma_i\}) \ge \theta_{win} \} \cup \{ i \mid i = \operatorname*{argmax}_{j \in U} \varphi_j(x_k \mid \{c_j, \sigma_j\}) \}, \qquad (1)$$
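The Gaussian activation and the winners pool of (1) can be sketched directly; the unit parameters below are made up for illustration.

```python
import math

def activation(x, c, sigma):
    """Gaussian firing strength of a unit with prototype c and spread sigma."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def winners_pool(x, units, theta_win):
    """units: dict unit-id -> (prototype, sigma). Returns win_k of (1)."""
    acts = {i: activation(x, c, s) for i, (c, s) in units.items()}
    win = {i for i, a in acts.items() if a >= theta_win}
    win.add(max(acts, key=acts.get))        # the second union term: never empty
    return win

units = {0: ((0.0, 0.0), 1.0), 1: ((3.0, 0.0), 1.0)}
x = (0.1, 0.0)
assert winners_pool(x, units, theta_win=0.5) == {0}
assert winners_pool(x, units, theta_win=0.999) == {0}   # argmax keeps it non-empty
assert set(units) - winners_pool(x, units, theta_win=0.5) == {1}  # losers pool
```

Even with a threshold no unit can reach, the argmax term guarantees a single winner, so the losers pool is always the complement of a non-empty set.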
where the second term of the union ensures that wink is non-empty. Conversely, the losers pool for xk is losek = U \ wink , that is the complement of wink with respect to the neuron set U . The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern xk , at time t, is regulated by the repetition suppression RSkt ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as ϕˆti (xk ) = ϕi (xk , {ci , σi })(1 − RSkt ). This reference signal forces the losers to reduce their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as E ti,k =
1 t 1 (ϕˆi (xk ) − ϕi (xk , {ci , σi }))2 = (−ϕi (xk , {ci , σi })RSkt )2 . 2 2
(2)
Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons ui (i ∈ wink ) to M , that is the maximum of the activation function ϕi (·). The error, in this case, can be written as t
E i,k = (M − ϕi (xk , {ci , σi })).
(3)
To analyze the CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) for a given epoch $e$: summing over all CoRe units in $U$ and the dataset $\chi = (x_1, \dots, x_k, \dots, x_K)$ yields
$$J_e(\chi, U) = \sum_{i=1}^{I} \sum_{k=1}^{K} \delta_{ik}\left(1 - \varphi_i(x_k)\right) + \sum_{i=1}^{I} \sum_{k=1}^{K} (1 - \delta_{ik})\left(\varphi_i(x_k)\, RS_k^{(e|\chi|+k)}\right)^2, \qquad (4)$$
where $\delta_{ik}$ is the indicator function of the set $win_k$ and $\{c_i, \sigma_i\}$ has been omitted from $\varphi_i$ to ease the notation. Note that in (4) we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize $J_e(\chi, U)$ with respect to the parameters $\{c_i, \sigma_i\}$ [2]. Hence, the prototype increment for the $e$-th epoch can be calculated as
K (e|χ|+k) 2 e ⎣δik ϕi (xk ) (xk − ci ) − (1 − δik ) ϕi (xk )RSk cei = αc (xk − cei )⎦ (σie )2 σie k=1
(5) where αc is a suitable learning rate ensuring that Je decreases with e. Similarly, the spread update can be calculated as σie = ασ
K k=1
δik ϕi (xk )
e 2 xk − cei 2 (e|χ|+k) 2 xk − ci − (1 − δ )(ϕ (x )RS ) . (6) i ik k k (σie )3 (σie )3
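A single-unit, one-dimensional sketch of one epoch of the updates follows. Caveat: the exact powers of sigma in the reconstructed (5) and (6) are taken from the gradient of (4) and should be treated as an assumption, and the sign conventions of the spread update follow the reconstructed (6) literally.

```python
import math

def phi(x, c, sigma):
    return math.exp(-0.5 * (x - c) ** 2 / sigma ** 2)

def core_epoch(c, sigma, data, winner_flags, rs, alpha_c=0.1, alpha_s=0.01):
    """winner_flags[k] plays delta_ik; rs[k] is the repetition suppression RS_k."""
    dc, ds = 0.0, 0.0
    for x, win, r in zip(data, winner_flags, rs):
        f = phi(x, c, sigma)
        if win:                               # attraction towards the pattern
            dc += f * (x - c) / sigma ** 2
            ds += f * (x - c) ** 2 / sigma ** 3
        else:                                 # RS-weighted repulsion
            dc -= (f * r) ** 2 * (x - c) / sigma ** 2
            ds -= (f * r) ** 2 * (x - c) ** 2 / sigma ** 3
    return c + alpha_c * dc, sigma + alpha_s * ds

c_w, _ = core_epoch(0.0, 1.0, [1.0], [True], [0.0])
c_l, _ = core_epoch(0.0, 1.0, [1.0], [False], [1.0])
assert c_w > 0.0      # winner prototype is attracted by the pattern
assert c_l < 0.0      # loser prototype is repelled from the pattern
```

The attraction/repulsion directions of the prototype match the qualitative description in the text regardless of the uncertain sigma powers.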
As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and are repelled by dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses. The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space $F$. Kernel methods are algorithms that exploit a nonlinear mapping $\Phi : \chi \to F$ to project the data from the input space $\chi$ onto a convenient, implicit feature space $F$. The kernel trick is used to express all operations on $\Phi(x_1), \Phi(x_2) \in F$ in terms of the inner product $\langle \Phi(x_1), \Phi(x_2) \rangle$. Such an inner product can be calculated without explicitly using the mapping $\Phi$, by means of the kernel $\kappa(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle$. To derive the kernel interpretation of the CoRe loss in (4), consider first the distance $d_{F_\kappa}$ between two vectors $x_1, x_2 \in \chi$ in the feature space $F_\kappa$ induced by the kernel $\kappa$ and described by the mapping $\Phi : \chi \to F_\kappa$, that is,
$$d_{F_\kappa}(x_1, x_2) = \|\Phi(x_1) - \Phi(x_2)\|^2_{F_\kappa} = \kappa(x_1, x_1) - 2\kappa(x_1, x_2) + \kappa(x_2, x_2),$$
where the kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel $\kappa$ calculated in the data space. If $\kappa$ is chosen to be a Gaussian kernel, then $\kappa(x, x) = 1$, and $d_{F_\kappa}$ can be rewritten as $d_{F_\kappa} = \|\Phi(x_1) - \Phi(x_2)\|^2_{F_\kappa} = 2 - 2\kappa(x_1, x_2)$. Now, if we take $x_1$ to be an element of the input dataset, e.g. $x_k \in \chi$, and $x_2$ to be the prototype $c_i$ of the $i$-th CoRe unit, we can rewrite $d_{F_\kappa}$ in such a way that it depends on the activation function $\varphi_i$. Applying the substitution $\kappa(x_k, c_i) = \varphi_i(x_k, \{c_i, \sigma_i\})$, we obtain $\varphi_i(x_k, \{c_i, \sigma_i\}) = 1 - \frac{1}{2}\|\Phi(x_k) - \Phi(c_i)\|^2_{F_\kappa}$.
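The identity used above is easy to verify numerically for a Gaussian kernel (the point and prototype values below are arbitrary):

```python
import math

def gauss_kernel(x1, x2, sigma):
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-0.5 * d2 / sigma ** 2)

x, c, sigma = (1.0, 2.0), (0.5, 1.0), 1.5
kappa = gauss_kernel(x, c, sigma)
# Feature-space distance kappa(x,x) - 2*kappa(x,c) + kappa(c,c) = 2 - 2*kappa:
d_feat = gauss_kernel(x, x, sigma) - 2 * kappa + gauss_kernel(c, c, sigma)
assert abs(d_feat - (2 - 2 * kappa)) < 1e-12
# ... and the activation satisfies phi = 1 - d_feat / 2:
assert abs((1 - 0.5 * d_feat) - kappa) < 1e-12
```

The first assertion relies only on $\kappa(x, x) = 1$, which holds for any Gaussian kernel; the second is the substitution used to rewrite the CoRe loss in feature-space form.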
Now, if we substitute this result in the formulation of the CoRe loss in (4), we obtain
$$J_e(\chi, U) = \frac{1}{2} \sum_{i=1}^{I} \sum_{k=1}^{K} \delta_{ik}\, \|\Phi(x_k) - \Phi(c_i)\|^2_{F_\kappa} + \sum_{i=1}^{I} \sum_{k=1}^{K} (1 - \delta_{ik}) \left(RS_k^{(e|\chi|+k)}\right)^2 \left(1 - \frac{1}{2}\|\Phi(x_k) - \Phi(c_i)\|^2_{F_\kappa}\right)^2. \qquad (7)$$
Equation (7) states that CoRe minimizes the feature space distance between the prototype ci and those xk that are close in the kernel space induced by the activation functions ϕi , while it maximizes the feature space distance between the prototypes and those xk that are far from ci in the kernel space.
3 Separation Nature
To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere $G$ containing all the sample data, after sufficiently many iterations of the algorithm each cluster prototype will finally either fall into $G$ or remain outside it and never get into $G$. In particular, the prototypes remaining outside the hypersphere will be driven far away from the samples by the $RS$ repulsion. We consider a prototype $c_i$ to be far away from the data if, for a given epoch $e$, it is in the losers pool for every $x_k \in \chi$. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype $c_i$ is far away from the data at a given epoch $e$, it will always be a loser for every $x_k \in \chi$ and will be driven away from the data samples.

Proof. The definition of far away implies that, given $c^e_i$, $i \in lose^e_k$ for all $x_k \in \chi$, where the $e$ in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment $\Delta c^e_i$ at epoch $e$ as
$$\Delta c^e_i = -\alpha_c \sum_{k=1}^{K} \frac{\left(\varphi_i(x_k)\, RS_k^{(e|\chi|+k)}\right)^2}{(\sigma^e_i)^2}\,(x_k - c^e_i). \qquad (8)$$
As a result of (8), the prototype $c^{e+1}_i$ is driven further from the data. On the other hand, by definition (1), for each of the data samples there exists at least one winner unit at every epoch $e$, so that its prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of $J_e(\chi, U)$ (see (4)) grow and, consequently, the whole $J_e(\chi, U)$ would diverge, since the loser error term in (4) is lower bounded. This would contradict the fact that $J_e(\chi, U)$ decreases with $e$, because CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype $c^e_l$ that remains close to the samples at epoch $e$. On the other hand, $c^e_i$ is already far away from the samples and, by (8), $c^{e+1}_i$ will be further from the data and will not be a winner for any $x_k \in \chi$. To prove this, consider the definition of $win_k$
in (1): for c_i^{e+1} to be a winner, either (i) ϕ_i(x_k) ≥ θ_win or (ii) i = arg max_{j∈U} ϕ_j(x_k, λ_j) must hold. The former does not hold because the receptive field area where the firing strength of the i-th unit is above the threshold θ_win does not contain any sample at epoch e. Consequently, it cannot contain any sample at epoch e + 1, since its center c_i^{e+1} has been deflected further from the data. The latter does not hold since there exists at least one prototype, namely c_l, that remains close to the data, generating higher activations than unit u_i. As a consequence, a far away prototype c_i will be deflected from the data until it reaches a stable point where the corresponding firing strength ϕ_i is negligible.

Now we can proceed to demonstrate the following theorem.

Theorem 1. For a CoRe process there exists a hypersphere G surrounding the sample data χ such that, after sufficient iterations, each prototype c_i will either (i) fall into G or (ii) stay outside G and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on J_e(χ, U); hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, since J_e(χ, U) is always positive, the GD process converges to a minimum J*. The sequences of prototype vectors {c_i^e} will converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit u_i has a sufficiently long subsequence of prototypes {c_i^e} diverging from the dataset then, at a certain time, it will no longer be a winner for any sample and, by Lemma 1, will converge to a point far away from the data. The attractors for the sequences {c_i^e} of the diverging units lie at a certain distance r from the samples, determined by those points x where the gaussian unit centered in x produces a negligible activation in response to any pattern x_k ∈ χ. Hence, G can be chosen as any hypersphere surrounding the samples with radius smaller than r.
On the other hand, since J_e(χ, U) decreases to J*, there must exist at least one prototype that is not far away from the data (otherwise the first term of J_e(χ, U) in (4) would diverge). In this case, the sequences {c_i^e} must have accumulation points close to the samples. Therefore any hypersphere G enclosing all the samples will also surround the accumulation points of {c_i^e} and, after a certain epoch E, the sequence will always lie within such a hypersphere. In summary, Theorem 1 states that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far away prototypes will always be losers and will never head back to the data. Conversely, the remaining prototypes will converge to the samples, heading to a saddle point of the loss J_e(χ, U) by means of a gradient descent process.
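The loser dynamics of (8) can be illustrated numerically. The NumPy sketch below applies the update to a prototype starting outside the data; taking RS_k as a constant positive repulsion weight and σ_i^e = 1 are purely illustrative assumptions, not the paper's actual RS schedule:

```python
import numpy as np

def loser_update(c, X, alpha=0.5, sigma=1.0, rs=1.0):
    """One application of the loser update (8): the prototype moves *against*
    the data, weighted by its Gaussian firing strength.  RS_k is treated as a
    constant positive weight here (an illustrative assumption)."""
    phi = np.exp(-((X - c) ** 2).sum(1) / (2 * sigma ** 2))  # firing strengths
    return c - alpha * ((phi * rs)[:, None] * (X - c)).sum(0) / sigma ** 2

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])  # sample data
c = np.array([3.0, 3.0])                            # prototype outside the data
d0 = np.linalg.norm(c - X.mean(0))
for _ in range(20):
    c = loser_update(c, X)
assert np.linalg.norm(c - X.mean(0)) > d0           # driven further away
```

As Lemma 1 predicts, the distance from the data grows while the firing strength, and hence the step size, decays towards a stable point.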
4 Correct Division and Location
Following the convergence analysis in [4] we now turn our attention to the issues of correct division and location of the weight vectors. This means that the
D. Bacciu and A. Starita
number of prototypes falling into G will be n_c, i.e. the number of actual clusters in the sample data, and they will finally converge to the centers of the clusters. At this point, we go beyond the intuitive study presented for DSRPCL [4], introducing a sound analysis of the properties of the saddle points identified by CoRe and giving a necessary and sufficient condition for identifying the global minimum of a vector quantization loss in feature space.

4.1 A Global Minimum Condition for Vector Quantization in Kernel Space
The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook V = {v_1, …, v_N} minimizing the total distortion, calculated by Euclidean norms, resulting from the approximation of the inputs x_k ∈ χ by the code vectors v_i. Here, we focus on a more general problem, that is, vector quantization in feature space. Given the nonlinear mapping Φ and the induced feature space norm ‖·‖_{F_κ} introduced in the previous sections, we aim at optimizing the distortion

min D(χ, Φ_V) = (1/K) ∑_{i=1}^{N} ∑_{k=1}^{K} δ_ik ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ}   (9)
where Φ_V = {Φ_{v_1}, …, Φ_{v_N}} represents the codebook in the kernel space and δ_ik equals 1 if the i-th cluster is the closest to the k-th pattern in the feature space F_κ, and 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion requires the code vectors to be selected as the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the more general case of a kernel induced distance metric. Therefore we introduce the following lemma.

Lemma 2. Let κ be a kernel and Φ : χ → F_κ a map into the corresponding feature space F_κ. Given a dataset χ = {x_1, …, x_K} partitioned into N subsets C_i, define the feature space mean Φ_χ = (1/K) ∑_{k=1}^{K} Φ(x_k) and the i-th partition centroid Φ_{v_i} = (1/|C_i|) ∑_{k∈C_i} Φ(x_k); then we have

∑_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{F_κ} = ∑_{i=1}^{N} ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + ∑_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{F_κ}.   (10)
Proof. Given a generic feature vector Φ_1, consider the identity Φ(x_k) − Φ_1 = (Φ(x_k) − Φ_{v_i}) + (Φ_{v_i} − Φ_1); its squared norm in feature space is

‖Φ(x_k) − Φ_1‖²_{F_κ} = ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + ‖Φ_{v_i} − Φ_1‖²_{F_κ} + 2(Φ(x_k) − Φ_{v_i})^T (Φ_{v_i} − Φ_1).
Summing over all the elements in the i-th partition we obtain

∑_{k∈C_i} ‖Φ(x_k) − Φ_1‖²_{F_κ} = ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + ∑_{k∈C_i} ‖Φ_{v_i} − Φ_1‖²_{F_κ} + 2 ∑_{k∈C_i} (Φ(x_k) − Φ_{v_i})^T (Φ_{v_i} − Φ_1)
= ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + |C_i| ‖Φ_{v_i} − Φ_1‖²_{F_κ}.   (11)
The last term in (11) vanishes since ∑_{k∈C_i} (Φ(x_k) − Φ_{v_i}) = 0 by definition of Φ_{v_i}. Now, applying the substitution Φ_1 = Φ_χ and summing up over all the N partitions yields

∑_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{F_κ} = ∑_{i=1}^{N} ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + ∑_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{F_κ}   (12)

where the left-hand side holds since ∪_{i=1}^{N} C_i = χ and ∩_{i=1}^{N} C_i = ∅.
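For the linear kernel, Φ is the identity map and (10) reduces to the classical total/within/between scatter decomposition, which is easy to check numerically. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def scatter_decomposition(X, labels):
    """Check the Lemma-2 identity for the linear kernel (Phi = identity):
    sum_k ||x_k - mean||^2 = sum_i sum_{k in C_i} ||x_k - c_i||^2
                             + sum_i |C_i| ||c_i - mean||^2."""
    mean = X.mean(axis=0)                              # feature space mean Phi_chi
    total = ((X - mean) ** 2).sum()
    within = between = 0.0
    for c in np.unique(labels):
        Ci = X[labels == c]
        ci = Ci.mean(axis=0)                           # partition centroid Phi_{v_i}
        within += ((Ci - ci) ** 2).sum()               # first term of (10)
        between += len(Ci) * ((ci - mean) ** 2).sum()  # second term of (10)
    return total, within + between

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
labels = rng.integers(0, 4, size=30)
lhs, rhs = scatter_decomposition(X, labels)
assert abs(lhs - rhs) < 1e-9
```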
Using the results of Lemma 2 we can proceed with the formulation of the global minimum criterion by generalizing the results of Proposition 1 in [5] to vector quantization in feature space.

Proposition 1. Let {Φ_{v_1}^g, …, Φ_{v_N}^g} be a global minimum solution to the problem in (9); then we have

∑_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{F_κ} ≥ ∑_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{F_κ}   (13)

for any locally optimal solution {Φ_{v_1}, …, Φ_{v_N}} to (9), where {C_1^g, …, C_N^g} and {C_1, …, C_N} are the partitions of χ corresponding to the centroids Φ_{v_i}^g = (1/|C_i^g|) ∑_{k∈C_i^g} Φ(x_k) and Φ_{v_i} = (1/|C_i|) ∑_{k∈C_i} Φ(x_k) respectively, and where Φ_χ is the dataset mean (see the definition in Lemma 2).

Proof. Since {Φ_{v_1}^g, …, Φ_{v_N}^g} is a global minimum for (9) we have

∑_{i=1}^{N} ∑_{k∈C_i^g} ‖Φ(x_k) − Φ_{v_i}^g‖²_{F_κ} ≤ ∑_{i=1}^{N} ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ}   (14)

for any local minimum {Φ_{v_1}, …, Φ_{v_N}}. From Lemma 2 we have that

∑_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{F_κ} = ∑_{i=1}^{N} ∑_{k∈C_i^g} ‖Φ(x_k) − Φ_{v_i}^g‖²_{F_κ} + ∑_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{F_κ}   (15)

∑_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{F_κ} = ∑_{i=1}^{N} ∑_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{F_κ} + ∑_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{F_κ}.   (16)

Since (14) holds, we obtain

∑_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{F_κ} ≥ ∑_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{F_κ}.
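Since every centroid in (13) is a feature-space mean, both the distortion (9) and its terms can be evaluated through kernel evaluations alone. A minimal NumPy sketch of the distortion under hard assignments, assuming a gaussian kernel (function names are ours, not the paper's):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    """Gaussian kernel matrix K[s, t] = exp(-||X[s] - Y[t]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def feature_space_distortion(X, labels, sigma=1.0):
    """Distortion (9) for hard assignments: (1/K) sum_i sum_{k in C_i}
    ||Phi(x_k) - Phi_{v_i}||^2, with Phi_{v_i} the feature-space mean of C_i.
    Expanding the squared norm gives pure kernel evaluations:
    k(x,x) - (2/|C_i|) sum_j k(x, x_j) + (1/|C_i|^2) sum_{j,l} k(x_j, x_l)."""
    total = 0.0
    for c in np.unique(labels):
        G = rbf(X[labels == c], X[labels == c], sigma)  # cluster Gram matrix
        for k in range(G.shape[0]):
            total += G[k, k] - 2.0 * G[k].mean() + G.mean()
    return total / len(X)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
tight = feature_space_distortion(X, np.array([0, 0, 1, 1]))  # true partition
loose = feature_space_distortion(X, np.array([0, 1, 0, 1]))  # mixed partition
assert tight < loose
```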
4.2 Correct Division and Location for CoRe Clustering
To evaluate the correct division and location properties we first analyze the case in which the number of units N equals the true cluster number n_c. Consider the loss in (4) as decomposed into a winner-dependent and a loser-dependent term, i.e. J_e(χ, U) = J_e^{win}(χ, U) + J_e^{lose}(χ, U). By definition, J_e^{win}(χ, U) = ∑_{i=1}^{n_c} ∑_{k=1}^{K} δ_ik (1 − ϕ_i(x_k)) must have at least one minimum point. Applying the necessary condition ∂J_e^{win}(χ, U)/∂c_i = 0 we obtain an estimate of the prototypes by means of fixed point iteration, that is

c_i^e = ∑_{k=1}^{K} δ_ik ϕ_i(x_k) x_k / ∑_{k=1}^{K} δ_ik ϕ_i(x_k).   (17)

When the number of prototypes equals the number of clusters, the fixed point iteration in (17) converges by positioning each unit's weight vector close to a true cluster centroid. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To prove this, consider the CoRe loss formulation in kernel space (7): we have J_e^{win}(χ, U) = (1/2) ∑_{i=1}^{n_c} ∑_{k=1}^{K} δ_ik ‖Φ(x_k) − Φ(c_i)‖²_{F_κ}, where c_i is estimated by (17).

Now, consider the VQ loss in (9): a necessary condition for its minimization requires the computation of the cluster centroids as Φ_{v_i} = (1/|C_i|) ∑_{k∈C_i} Φ(x_k). The exact calculation of Φ_{v_i} requires knowledge of the form of the implicit nonlinear mapping Φ in order to solve the so-called pre-image problem [10], that is, determining z such that Φ(z) = Φ_{v_i}. Unfortunately, such a problem is insolvable in the general case [10]. However, instead of calculating the exact pre-image we can search for an approximation by seeking the z minimizing ρ(z) = ‖Φ_{v_i} − Φ(z)‖²_{F_κ}, that is, the feature space distance between the centroid in kernel space and the mapping of its approximated pre-image. Rather than optimizing ρ(z), it is easier to minimize the distance between Φ_{v_i} and its orthogonal projection onto the span of Φ(z). Due to space limitations, we omit the technicalities of this calculation (see [10] for further details).
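For a gaussian kernel this approximate pre-image search admits the simple fixed-point iteration z ← ∑_{k∈C_i} κ(x_k, z) x_k / ∑_{k∈C_i} κ(x_k, z). A minimal NumPy sketch (initialising z at the input-space cluster mean is our choice, not prescribed by the text):

```python
import numpy as np

def preimage_fixed_point(Ci, sigma=1.0, iters=100):
    """Approximate pre-image of the feature-space centroid of cluster C_i for a
    Gaussian kernel, via the fixed-point scheme
    z <- sum_{k in C_i} k(x_k, z) x_k / sum_{k in C_i} k(x_k, z)."""
    z = Ci.mean(axis=0)                                   # initialisation (our choice)
    for _ in range(iters):
        w = np.exp(-((Ci - z) ** 2).sum(1) / (2 * sigma ** 2))
        z = (w[:, None] * Ci).sum(0) / w.sum()
    return z

# a cluster symmetric around (1, 0): the iteration stays at the centroid
Ci = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
z = preimage_fixed_point(Ci)
assert np.allclose(z, [1.0, 0.0])
```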
It turns out that the minimization of ρ(z) reduces to the evaluation of the gradient of ρ̃(z) = ⟨Φ_{v_i}, Φ(z)⟩. Indeed, substituting the definition of Φ_{v_i} and applying the kernel trick we obtain

ρ(z) = (1/|C_i|²) ∑_{k,j∈C_i} κ(x_k, x_j) + κ(z, z) − (2/|C_i|) ∑_{k∈C_i} κ(x_k, z),

where κ(z, z) = 1 since we are using a gaussian kernel, so only the last term depends on z. Differentiating with respect to z and solving by fixed point iteration yields

z^e = ∑_{k∈C_i} κ(x_k, z^{e−1}) x_k / ∑_{k∈C_i} κ(x_k, z^{e−1})   (18)

which is the same as the prototype estimate obtained in (17) for gaussian kernels centered in z^e. The indicator function δ_ik in (17) is non-null only for those points x_k for which unit u_i was in the winner set. This does not ensure the partition conditions over χ, since, by the definition of win_k, some points can be associated with
two or more winners. However, by (6) we know that the variance of the winners tends to shrink as learning proceeds. Therefore, using the same arguments as Gersho [8], it can be demonstrated that, after a certain epoch E, the CoRe winner competition becomes a WTA process in which δ_ik ensures the partition conditions over χ. Summarizing, the minimization of the CoRe winner error J_e^{win}(χ, U) generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes c_i become a local solution satisfying the conditions of Proposition 1. Hence, substituting the definition of Φ_χ into the results of Proposition 1 we obtain that {c_1, …, c_{n_c}} is an approximated global minimum for (9) if and only if

∑_{i=1}^{n_c} ∑_{k=1}^{K} (|C_i|/K) ‖Φ(c_i) − Φ(x_k)‖²_{F_κ} ≥ ∑_{i=1}^{n_c} ∑_{k=1}^{K} (|C̃_i|/K) ‖Φ(c̃_i) − Φ(x_k)‖²_{F_κ}   (19)

holds for every {c̃_1, …, c̃_{n_c}} that are approximated pre-images of a local minimum of (9). In summary, a global optimum of (9) should minimize the feature space distance between the prototypes and the samples belonging to their cluster while maximizing the weight vector distance from the sample mean or, equivalently, the distance from all the samples in the dataset χ. The loser component J_e^{lose}(χ, U) in the kernel CoRe loss (7) depends on the term (1 − (1/2)‖Φ(c_i) − Φ(x_k)‖²_{F_κ}), which maximizes the distance between the prototypes c_i and those x_k that do not fall in the respective Voronoi sets C_i. Hence, J_e^{lose}(χ, U) produces a distortion in the estimate of c_i that pursues the global optimality criterion, except that it discounts the repulsive effect of the x_k ∈ C_i. In fact, (19) suggests that c_i has to be repelled by all the x_k ∈ χ. On the other hand, the estimate c_i is a linear combination of the x_k ∈ C_i: applying the repulsive effect in (19) would subtract their contribution, either canceling the attractive effect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the final direction. Hence, the CoRe loss makes a reasonable assumption in discarding the repulsive effect of the x_k ∈ C_i when calculating the estimate of c_i. Summarizing, CoRe locates the prototypes close to the centroids of the n_c clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1. Finally, we need to study the behavior of J_e(χ, U) as the number of units N varies with respect to the true cluster number n_c. Using the same motivations as in [4], we see that the winner-dependent loss J_e^{win} tends to decrease as the number of units increases. However, if the number of units falling into G is larger than n_c, a number of clusters will be erroneously split.
Therefore, the samples from these clusters will tend to produce an increased level of error in J_e^{lose}, counteracting the reduction of J_e^{win}. On the other hand, J_e^{lose} will tend to decrease when the number of units inside G is lower than n_c. This, however, will produce increased levels of J_e^{win}, since the prototype allocation will not match the underlying sample distribution. Hence, the CoRe error attains its minimum when the number of units inside G approximates n_c.
5 Conclusion
The paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature and correct division and location [4]. As the loss decreases to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they can be extended to stochastic (online) gradient descent. Moreover, we plan to further investigate the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions beyond gaussians, i.e. normalized kernels.
References

1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences 10(1), 14–23 (2006)
2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007)
3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks 4(4) (1993)
4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (RPCL). IEEE Trans. on Systems, Man, and Cybernetics 36(4), 722–737 (2006)
5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Processing Letters 15(3), 261–273 (2002)
6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006)
7. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing 40(2), 294–309 (1992)
9. Spath, H.: Cluster Analysis Algorithms. Ellis Horwood (1980)
10. Schölkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Müller, K.R., Rätsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 10(5), 1000–1017 (1999)
11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class

Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

Kyushu Institute of Technology, Faculty of Engineering, 1-1 Sensui-cho Tobata-ku, Kitakyushu, 804-8550, Japan
[email protected]
Abstract. The Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map in which each computational unit defines a linear subspace. Recently, a modified version, the Adaptive Manifold Self-Organizing Map (AMSOM), where each unit defines a linear manifold instead of a linear subspace, has been proposed. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, the AMSOM represents the latent commonality of data as linear structures. In numerous cases, however, these are not enough to describe the latent commonality of data because of their linearity. In this paper, nonlinear varieties are considered in order to represent the diversity of data in a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.
1 Introduction
The subspace method is popular in pattern recognition, feature extraction, compression, classification and signal processing [1]. Unlike other techniques where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, as compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal. It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space. The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, which is an evolution

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 507–516, 2008. © Springer-Verlag Berlin Heidelberg 2008
of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging some computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in [2] and [3] have illustrated that different feature filters can self-organize to different low-dimensional subspaces and that a wavelet-type representation emerges in the learning. Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in [5] have shown that AMSOM outperforms the linear PCA-based method and ASSOM in a face recognition problem. In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms.
On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of each subspace should be as low as possible in order to achieve successful performance. It is, however, not enough to describe the variation in a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinearly extended version of the AMSOM with reproducing kernels. The proposed method can be expected to construct nonlinear varieties such that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.
2 Adaptive Manifold Self-Organizing Map (AMSOM)
In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, containing n and M units respectively. Let i ∈ {1, …, M} index the computational units in the competitive layer; the dimensionality of the input vector is n. The i-th computational unit constructs an affine subspace, which is composed of a mean vector μ^{(i)} and a subspace spanned by H basis
Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM)

Fig. 2. Affine subspace in a computational unit
vectors b_h^{(i)}, h ∈ {1, …, H}. First of all, we define the orthogonal projection of an input vector x onto the affine subspace of the i-th unit as

x̂^{(i)} = μ^{(i)} + ∑_{h=1}^{H} (φ^{(i)T} b_h^{(i)}) b_h^{(i)},   (1)

where φ^{(i)} = x − μ^{(i)}. Therefore the projection error is represented as

x̃^{(i)} = φ^{(i)} − ∑_{h=1}^{H} (φ^{(i)T} b_h^{(i)}) b_h^{(i)}.   (2)
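Equations (1) and (2) are plain linear algebra once the basis vectors are collected into a matrix. A minimal NumPy sketch, assuming the b_h are orthonormal and stored as the rows of B (the function name is ours):

```python
import numpy as np

def project_affine(x, mu, B):
    """Orthogonal projection (1) of x onto the affine subspace {mu + span(B)}
    and the projection error (2).  B holds orthonormal basis vectors as rows."""
    phi = x - mu                                  # relative input vector
    x_hat = mu + B.T @ (B @ phi)                  # Eq. (1)
    x_tilde = phi - B.T @ (B @ phi)               # Eq. (2), equals x - x_hat
    return x_hat, x_tilde

mu = np.array([1.0, 1.0, 0.0])
B = np.array([[1.0, 0.0, 0.0]])                   # H = 1 basis vector
x = np.array([3.0, 2.0, 2.0])
x_hat, x_tilde = project_affine(x, mu, B)
assert np.allclose(x_hat, [3.0, 1.0, 0.0])
assert np.allclose(x_tilde, x - x_hat)
```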
Figure 2 shows a schematic interpretation of the orthogonal projection and the projection error of an input vector onto the affine subspace defined in a unit i. The AMSOM is a more general strategy than the ASSOM, where each computational unit solely defines a subspace. To illustrate why this is so, let us consider a very simple case: suppose we are given the two clusters shown in Fig. 3(a). It
Fig. 3. (a) Clusters in 2-dimensional space: an example of a case which cannot be separated without a mean value. (b) Two 1-dimensional affine subspaces to approximate and classify the clusters.
is not possible to use one-dimensional subspaces, that is, lines intersecting the origin O, to approximate the clusters. This is true even if the global mean is removed, so that the origin O is translated to the centroid of the two clusters. However, two one-dimensional affine subspaces can easily approximate the clusters as shown in Fig. 3(b), since the basis vectors are aligned in the direction that minimizes the projection error. In the AMSOM, the input vectors are grouped into episodes in order to present them to the network as input sets. For pattern classification, an episode input is defined as a subset of training data belonging to the same category. Assume that the number of input vectors in the subset is E; then an episode input ω_q in the class q is denoted as ω_q = {x_1, x_2, …, x_E}, ω_q ⊆ Ω_q, where Ω_q is the set of training patterns belonging to the class q. The set of input vectors of an episode has to be recognized as one class, such that any member of this set, and even an arbitrary linear combination of them, should have the same winning unit. The training process in AMSOM has the following steps:

(a) Winner lookup. The unit that gives the minimum projection error for an episode is selected. This unit is denoted the winner, with index c. The decision criterion for the winner c is

c = arg min_i ∑_{e=1}^{E} ‖x̃_e^{(i)}‖²,   (3)

where i ∈ {1, …, M}.
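The winner lookup (3) accumulates the squared projection error of every episode member over each unit. A minimal sketch, with each unit represented as a pair (μ, B) of mean vector and orthonormal basis rows (our encoding, not the paper's):

```python
import numpy as np

def winner(episode, units):
    """Winner lookup (3): index of the unit minimizing the summed squared
    projection error over all episode members x_e.  `units` is a list of
    (mu, B) pairs with orthonormal basis vectors as the rows of B."""
    errors = []
    for mu, B in units:
        err = 0.0
        for x in episode:
            phi = x - mu
            err += ((phi - B.T @ (B @ phi)) ** 2).sum()   # ||x_tilde||^2, Eq. (2)
        errors.append(err)
    return int(np.argmin(errors))

units = [(np.zeros(2), np.array([[1.0, 0.0]])),           # line through the origin
         (np.array([5.0, 5.0]), np.array([[0.0, 1.0]]))]  # vertical line at x1 = 5
episode = [np.array([2.0, 0.1]), np.array([-1.0, -0.1])]  # near the first manifold
assert winner(episode, units) == 0
```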
(b) Learning. For each unit i and for each x_e, update μ^{(i)}:

μ^{(i)}(t + 1) = μ^{(i)}(t) + λ_m(t) h_ci(t) (x_e − μ^{(i)}(t)),   (4)

where λ_m(t) is the learning rate for μ^{(i)} at learning epoch t and h_ci(t) is the neighborhood function at learning epoch t with respect to the winner c. Both λ_m(t) and h_ci(t) are monotonically decreasing functions of t. In this paper, λ_m(t) = λ_m^{ini}(λ_m^{fin}/λ_m^{ini})^{t/t_max} and h_ci(t) = exp(−|c − i|/γ(t)), with γ(t) = γ^{ini}(γ^{fin}/γ^{ini})^{t/t_max}, are used. Then the basis vectors are updated as

b_h^{(i)}(t + 1) = b_h^{(i)}(t) + λ_b(t) h_ci(t) [φ_e^{(i)}(t)^T b_h^{(i)}(t) / (‖φ_e^{(i)}(t)‖ ‖φ̂_e^{(i)}(t)‖)] φ_e^{(i)}(t),   (5)

where φ_e^{(i)}(t) = x_e − μ^{(i)}(t + 1) is the relative input vector in the manifold i after updating the mean vector, φ̂_e^{(i)}(t) = ∑_{h=1}^{H} (φ_e^{(i)}(t)^T b_h^{(i)}(t)) b_h^{(i)}(t) is the orthogonal projection of the relative input vector, and λ_b(t) is the learning rate for the basis vectors, which is also a monotonically decreasing function of t. In this paper, λ_b(t) = λ_b^{ini}(λ_b^{fin}/λ_b^{ini})^{t/t_max} is used.

After the learning phase, a categorization phase determines the class association of each unit. Each unit is labeled by the class for which it is selected as the winner most frequently when the input data used for learning are applied to the AMSOM again.
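The decay schedules and the neighborhood function above are simple to reproduce. A minimal sketch of λ(t) = λ^{ini}(λ^{fin}/λ^{ini})^{t/t_max} and h_ci(t) = exp(−|c − i|/γ(t)); the default γ values follow the experimental settings reported later in the paper:

```python
import numpy as np

def schedule(ini, fin, t, t_max):
    """Exponential decay lambda(t) = ini * (fin / ini)^(t / t_max), the form
    used for lambda_m(t), lambda_b(t) and gamma(t)."""
    return ini * (fin / ini) ** (t / t_max)

def neighborhood(c, i, t, t_max, g_ini=3.0, g_fin=0.03):
    """Neighborhood function h_ci(t) = exp(-|c - i| / gamma(t))."""
    return np.exp(-abs(c - i) / schedule(g_ini, g_fin, t, t_max))

assert schedule(0.1, 0.01, 0, 100) == 0.1                 # starts at lambda_ini
assert abs(schedule(0.1, 0.01, 100, 100) - 0.01) < 1e-12  # ends at lambda_fin
# the neighborhood shrinks over time, approaching winner-take-all updates
assert neighborhood(0, 2, 90, 100) < neighborhood(0, 2, 10, 100)
```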
3 Self-Organizing Clustering with Nonlinear Varieties

3.1 Reproducing Kernels
Reproducing kernels are functions k : X² → R which, for all pattern sets

{x_1, …, x_l} ⊂ X,   (6)

give rise to positive matrices K_ij := k(x_i, x_j). Here, X is some compact set in which the data reside, typically a subset of R^n. In the field of Support Vector Machines (SVM), reproducing kernels are often referred to as Mercer kernels. They provide an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in some feature space F nonlinearly related to the input space: using k instead of a dot product in R^n corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map Φ : R^n → F, and taking the dot product there, i.e. [11]

k(x, y) = (Φ(x), Φ(y)).   (7)

By virtue of this property, we shall call Φ a feature map associated with k. Any linear algorithm which can be carried out in terms of dot products can be made
nonlinear by substituting an a priori chosen kernel. Examples of such algorithms include the potential function method [12], SVM [7][8] and kernel PCA [9]. The price that one has to pay for this elegance, however, is that the solutions are only obtained as expansions in terms of input patterns mapped into feature space. For instance, the normal vector of an SV hyperplane is expanded in terms of support vectors, just as the kernel PCA feature extractors are expressed in terms of training examples:

Ψ = ∑_{i=1}^{l} α_i Φ(x_i).   (8)
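The identity (7) can be made concrete for the homogeneous polynomial kernel of degree 2, whose feature map in R² can be written out explicitly. A minimal sketch verifying that the explicit map and the kernel agree (the function name is ours):

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in
    R^2: Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), so Phi(x).Phi(y) = (x.y)^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# the dot product in feature space equals the kernel value, Eq. (7)
assert np.isclose(poly2_features(x) @ poly2_features(y), (x @ y) ** 2)
```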
3.2 AMSOM in the Feature Space
The AMSOM in the high-dimensional feature space F is now considered. In the proposed method, the variety defined by a computational unit i in the competitive layer takes the nonlinear form

M_i = {Φ(x) | Φ(x) = Φ(μ^{(i)}) + ∑_{h=1}^{H} ξ Φ(b_h^{(i)})},   (9)

where ξ ∈ R. Given a training data set {x_1, …, x_N}, the mean vector and the basis vectors in a unit i are represented in the following form:

Φ(μ^{(i)}) = ∑_{l=1}^{N} α_l^{(i)} Φ(x_l),   (10)

Φ(b_h^{(i)}) = ∑_{l=1}^{N} β_hl^{(i)} Φ(x_l),   (11)

respectively. The α_l^{(i)} in Eq. (10) and β_hl^{(i)} in Eq. (11) are the parameters adjusted by learning. The derivation of the training procedure in the proposed method is given as follows:

(a) Winner lookup. The norm of the orthogonal projection error onto the i-th affine subspace with respect to the present input x_p is calculated as follows:
‖Φ(x̃_p)^{(i)}‖² = k(x_p, x_p) + ∑_{h=1}^{H} (P_h^{(i)})² − 2 ∑_{l=1}^{N} α_l^{(i)} k(x_p, x_l) + ∑_{l1=1}^{N} ∑_{l2=1}^{N} α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2}) + 2 ∑_{h=1}^{H} ∑_{l1=1}^{N} ∑_{l2=1}^{N} P_h^{(i)} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}) − 2 ∑_{h=1}^{H} ∑_{l=1}^{N} P_h^{(i)} β_hl^{(i)} k(x_p, x_l),   (12)
where P_h^{(i)} denotes the orthogonal projection component of the present input x_p onto the basis Φ(b_h^{(i)}) and is calculated by

P_h^{(i)} = ∑_{l=1}^{N} β_hl^{(i)} k(x_p, x_l) − ∑_{l1=1}^{N} ∑_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (13)
The reproducing kernels generally used are as follows:

k(x_s, x_t) = (x_s^T x_t)^d, d ∈ N,   (14)
k(x_s, x_t) = (x_s^T x_t + 1)^d, d ∈ N,   (15)
k(x_s, x_t) = exp(−‖x_s − x_t‖²/(2σ²)), σ ∈ R,   (16)

where N and R are the set of natural numbers and the set of reals, respectively. Eqs. (14), (15) and (16) are referred to as homogeneous polynomial kernels, non-homogeneous polynomial kernels and gaussian kernels, respectively. The winner for an episode input ω_q = {Φ(x_1), …, Φ(x_E)} is decided in the same manner as in the AMSOM:

c = arg min_i ∑_{e=1}^{E} ‖Φ(x̃_e)^{(i)}‖²,  i ∈ {1, …, M}.   (17)
(b) Learning. The learning rules for α_l^{(i)} and β_hl^{(i)} are as follows:

Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_ci(t)   for l ≠ e,
Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_ci(t) + λ_m(t) h_ci(t)   for l = e,   (18)

Δβ_hl^{(i)} = −α_l^{(i)}(t + 1) λ_b(t) h_ci(t) T_h^{(i)}(t)   for l ≠ e,
Δβ_hl^{(i)} = −α_l^{(i)}(t + 1) λ_b(t) h_ci(t) T_h^{(i)}(t) + λ_b(t) h_ci(t) T_h^{(i)}(t)   for l = e,   (19)

where
T_h^{(i)}(t) = Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) / (‖Φ(φ_e^{(i)}(t))‖ ‖Φ(φ̂_e^{(i)}(t))‖),   (20)

‖Φ(φ̂_e^{(i)}(t))‖ = [∑_{h=1}^{H} (∑_{l=1}^{N} β_hl^{(i)} k(x_e, x_l) − ∑_{l1=1}^{N} ∑_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}))²]^{1/2},   (21)

‖Φ(φ_e^{(i)}(t))‖ = [k(x_e, x_e) − 2 ∑_{l=1}^{N} α_l^{(i)} k(x_e, x_l) + ∑_{l1=1}^{N} ∑_{l2=1}^{N} α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2})]^{1/2},   (22)

Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) = ∑_{l=1}^{N} β_hl^{(i)} k(x_e, x_l) − ∑_{l1=1}^{N} ∑_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (23)
In Eqs. (18) and (19), λ_m(t), λ_b(t) and h_ci(t) are the same parameters as in the AMSOM training process. After the learning phase, a categorization phase determines the class association of each unit. The categorization procedure is the same as described in the previous section.
4 Experimental Results
Data Distribution 1. To analyze the effect of reducing the dimensionality of the intersections of subspaces by the proposed method, the data shown in Fig. 4(a) are used. For these data, although the set of each class can be approximated by a 1-dimensional linear manifold in the input space R², intersections of subspaces could occur between class 1 and class 2, and between class 2 and class 3, even if the optimal linear manifold for each class were obtained. However, a linear manifold in the high-dimensional space, that is, a nonlinear manifold in the input space, can be expected to classify the given data through the effect of reducing the dimensionality of the intersections of subspaces. As a result of the simulation, the given input data are classified correctly by the proposed method. Figures 4(b) and (c) show the decision regions constructed by AMSOM and the proposed method, respectively. In this experiment, 3 units were used in the competitive layer. The class associations are class 1 to unit 1, class 2 to unit 2 and class 3 to unit 3. In this simulation, the experimental parameters are assigned as follows: the total number of epochs t_max = 100, λ_m^{ini} = 0.1, λ_m^{fin} = 0.01, λ_b^{ini} = 0.1, λ_b^{fin} = 0.01, γ^{ini} = 3, γ^{fin} = 0.03, common to both AMSOM and the proposed method; H = 1 in AMSOM; and H = 2 with k(x, y) = (x^T y + 1)³ in the proposed method. The experiment showed that the proposed method has the ability to reduce the dimensionality of intersections of subspaces.

Data Distribution 2. To verify that the proposed method has the ability to describe nonlinear manifolds, the data shown in Fig. 5(a) are used. In this case it is impossible to describe each class by a linear manifold, that is, by a 1-dimensional affine subspace. As a result of the simulation, the given input data are classified correctly by the proposed method.
Figures 5(b) and (c) show the decision regions constructed by AMSOM and by the proposed method, respectively. In this experiment, 3 units were used in the competitive layer; class 1 is associated with unit 3, class 2 with unit 2, and class 3 with unit 1. The experimental parameters were assigned as follows:
Fig. 4. (a) Training data used in the first experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.
Fig. 5. (a) Training data used in the second experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.

total number of epochs tmax = 100, λm^ini = 0.1, λm^fin = 0.01, λb^ini = 0.1, λb^fin = 0.01, γ^ini = 3, γ^fin = 0.03 for both AMSOM and the proposed method; H = 1 in AMSOM; and H = 2, k(x, y) = (x^T y)² in the proposed method. The experiment showed that the proposed method can extract suitable nonlinear manifolds efficiently.
5 Conclusions
A clustering method with a map of nonlinear varieties was proposed as a new pattern classification method. The method extends AMSOM to a nonlinear method in a straightforward way by applying the kernel method. Its effectiveness was verified by the experiments. The proposed algorithm promises to extend ASSOM-style approaches to a wide range of practical problems.
H. Kawano, H. Maeda, and N. Ikoma
An Automatic Speaker Recognition System

P. Chakraborty1, F. Ahmed1, Md. Monirul Kabir2, Md. Shahjahan1, and Kazuyuki Murase2,3

1 Department of Electrical & Electronic Engineering, Khulna University of Engineering and Technology, Khulna-920300, Bangladesh
2 Department of Human and Artificial Intelligence Systems, Graduate School of Engineering,
3 Research and Education Program for Life Science, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan
[email protected], [email protected]
Abstract. Speaker recognition is the process of identifying a speaker by analyzing the spectral shape of the voice signal. This is done by extracting and matching features of the voice signal. Mel-Frequency Cepstrum Coefficients (MFCC) form the feature extraction technique: the cepstrum coefficients obtained are the extracted features, and they are taken as the input of the vector quantization process. Vector Quantization (VQ) is a typical feature matching technique in which a VQ codebook is generated by providing predefined spectral vectors for each speaker to cluster the training vectors in a training session. Finally, test data are matched against the trained data by searching for the nearest neighbor. The system correctly recognizes the speakers on music and speech data (in both English and Bengali formats); correct recognition is almost ninety percent, which compares favorably with the Hidden Markov Model (HMM) and Artificial Neural Network (ANN) approaches.

Keywords: MFCC: Mel-Frequency Cepstrum Coefficient; DCT: Discrete Cosine Transform; IIR: Infinite Impulse Response; FIR: Finite Impulse Response; FFT: Fast Fourier Transform; VQ: Vector Quantization.
1 Introduction

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This paper deals with an automatic speaker recognition system using vector quantization. Other techniques exist for speaker recognition, such as the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN); we have used VQ because of its lower computational complexity [1]. There are two main modules, feature extraction and feature matching, in any speaker recognition system [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor. A set of mel-frequency cepstrum coefficients is found; these are called acoustic vectors [3] and constitute the extracted features of the speakers. These

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 517–526, 2008. © Springer-Verlag Berlin Heidelberg 2008
acoustic vectors are used in feature matching via the vector quantization technique, in which a VQ codebook is generated using training data. Finally, test data are matched against the trained data by searching for the nearest neighbor. The speakers are recognized correctly on music and speech data (in both English and Bengali formats). This work uses about 70 spectral data; correct recognition is almost ninety percent, which is better than the Hidden Markov Model (HMM) and Artificial Neural Network (ANN), whose correct recognition is below ninety percent. Future work is to generate a VQ codebook with many predefined spectral vectors, making it possible to add many training data to the codebook in a training session; the main problem is that the network size and training time become prohibitively large with increasing data size. To overcome these limitations, a time-alignment technique can be applied, so that a continuous speaker recognition system becomes possible. There are several implementations for feature matching and identification. Lawrence Rabiner and B. H. Juang proposed the Mel-Frequency Cepstrum Coefficient (MFCC) method to extract features, and vector quantization as the feature matching technique [1]. Lawrence Rabiner and R. W. Schafer discussed the performance of the MFCC processor, following several theoretical concepts [2]. S. B. Davis and P. Mermelstein described the characteristics of acoustic speech [3]. Y. Linde, A. Buzo and R. Gray proposed the LBG algorithm to generate a VQ codebook by a splitting technique [4]. S. Furui described speaker-independent word recognition using dynamic features of the speech spectrum [5], and also gave an overview of speaker recognition technology using the MFCC and VQ methods [6].
2 Methodology

A general model of a speaker recognition system for several people is shown in Fig. 1. The model consists of four building blocks. The first is data extraction, which converts wave data stored in an audio wave format into a form suitable for further computer processing and analysis. The second is pre-processing, which involves filtering,
Fig. 1. Block diagram of speaker recognition system
removing pauses, silences and weak unvoiced sound signals, and detecting the valid speech signal. The third block is feature extraction, where speech features are extracted from the speech signal; the selected features carry enough information to recognize a speaker. A class label is then assigned to each word uttered by each speaker by examining the extracted features and comparing them with classes learnt during the training phase. Vector quantization is used as the identifier.
3 Pre-processing

A digital filter is a mathematical algorithm, implemented in hardware and/or software, that operates on a digital input signal to produce a digital output signal in order to achieve a filtering objective. Digital filters often operate on digitized analog signals, or simply on numbers representing some variable stored in computer memory. Digital filters are broadly divided into two classes, namely infinite impulse response (IIR) and finite impulse response (FIR) filters. We chose an FIR filter because FIR filters can have an exactly linear phase response, which means no phase distortion is introduced into the signal by the filter. This is an important requirement in many applications, for example data transmission, biomedicine, digital audio and image processing; the phase responses of IIR filters, by contrast, are non-linear, especially at the band edges. When a machine is continuously listening to speech, a difficulty arises when it tries to figure out where a word starts and stops. We solved this problem by examining the magnitude of several consecutive samples of sound: if the magnitude of these samples is great enough, those samples are kept and examined later [1].
Fig. 2. Sample Speech Signal
Fig. 3. Example of speech signal after it has been cleaned
One thing that is clearly noticeable in the example speech signal is that there is a lot of empty space where nothing is being said, so we simply remove it. An example speech signal is shown before cleaning in Figure 2, and after in Figure 3. Once the empty space is removed, the signal is much shorter: in this case it had about 13,000 samples before cleaning and only 2,600 samples afterwards. This has several advantages. The time required to perform calculations on 13,000 samples is much larger than that required for 2,600 samples, while the cleaned sample still contains all the important data needed to perform the analysis of the speech. The sample produced by the cleaning process is then fed into the other parts of the ASR system.
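The cleaning step described above — keeping only runs of consecutive samples whose magnitude is large enough — can be sketched as follows. This is a minimal NumPy illustration; the function name, frame length and threshold are our own illustrative choices, not values given in the paper.

```python
import numpy as np

def remove_silence(signal, frame_len=80, threshold=0.03):
    """Drop low-magnitude frames, keeping only active speech.

    frame_len and threshold are hypothetical tuning parameters; they
    should be adapted to the recording level and sampling rate.
    """
    signal = np.asarray(signal, dtype=float)
    # Pad so the signal splits evenly into frames.
    pad = (-len(signal)) % frame_len
    padded = np.concatenate([signal, np.zeros(pad)])
    frames = padded.reshape(-1, frame_len)
    # Keep frames whose mean absolute magnitude exceeds the threshold.
    keep = np.mean(np.abs(frames), axis=1) > threshold
    return frames[keep].reshape(-1)
```

Raising the threshold removes more background noise but risks clipping weak unvoiced sounds, which is the trade-off the text alludes to.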
4 Feature Extraction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, namely:

• Linear Prediction Coding (LPC)
• Mel-Frequency Cepstrum Coefficients (MFCC)
• Linear Predictive Cepstral Coefficients (LPCC)
• Perceptual Linear Prediction (PLP)
• Neural Predictive Coding (NPC)
Among these we used MFCC, because it is the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [1, 2].

4.1 Mel-Frequency Cepstrum Processor

A diagram of the structure of an MFCC processor is given in Figure 4. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion; the sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the mentioned variations than the speech waveforms themselves [5, 6].
Fig. 4. Block diagram of the MFCC processor
4.1.1 Frame Blocking In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by
N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 and M = 100.

4.1.2 Windowing
The next processing step is windowing, by means of which the signal discontinuities at the beginning and end of each frame are minimized. The idea is to reduce the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples, then the result of windowing is the signal
y_l(n) = x_l(n) w(n),  0 ≤ n ≤ N − 1    (1)
The following are the common types of window:

▪ Hamming
▪ Rectangular
▪ Bartlett (Triangular)
▪ Hanning
▪ Kaiser
▪ Lanczos
▪ Blackman-Harris
Of all the above, we mostly used the Hamming window, for ease of mathematical computation; it is described as:
w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    (2)

Besides this, we have also used the Hanning window and the Blackman-Harris window. As an example, a Hamming window with 256 samples is shown here.
Fig. 5. Hamming window with 256 speech samples
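The frame blocking of Section 4.1.1 and the Hamming window of Eq. (2) can be sketched together. N = 256 and M = 100 follow the typical values quoted above; the function itself is our own illustrative implementation, not code from the paper.

```python
import numpy as np

def frame_and_window(signal, N=256, M=100):
    """Block the signal into overlapping frames of N samples shifted by
    M samples (adjacent frames overlap by N - M), then apply the
    Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + (len(signal) - N) // M   # trailing partial frame dropped
    n = np.arange(N)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    frames = np.stack([signal[i * M : i * M + N] for i in range(num_frames)])
    return frames * window
```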
4.1.3 Fast Fourier Transform (FFT)
The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:
X_n = Σ_{k=0}^{N−1} x_k e^{−2πjkn/N},  n = 0, 1, 2, ..., N − 1    (3)
We use j here to denote the imaginary unit, i.e. j = √(−1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < F_s/2 correspond to 1 ≤ n ≤ N/2 − 1, and negative frequencies −F_s/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1, where F_s denotes the sampling frequency.

4.1.4 Mel-Frequency Wrapping
The subjective pitch of a sound is measured on the mel scale, which can be approximated for a given frequency f in Hz by:
mel(f) = 2595 · log10(1 + f / 700)    (4)
One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale. Each filter has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters with S(ω) as input. The number of mel spectrum coefficients, K, is typically chosen as 20 [3].

4.1.5 Cepstrum
In this step the log mel-spectrum is converted back to the time domain. The cepstral representation of the speech spectrum provides a good description of the local spectral properties of the signal for the given frame analysis. Because the mel-spectrum coefficients (and so their logarithm) are real numbers, we convert them to the time domain using the Discrete Cosine Transform (DCT). In this step we obtain the Mel-Frequency Cepstrum Coefficients (MFCC). This set of coefficients is called an acoustic vector; each input utterance is thus transformed into a sequence of acoustic vectors.
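The whole per-frame chain (FFT power spectrum → triangular mel filter bank → log → DCT) can be sketched as below. The sampling rate of 10000 Hz and K = 20 filters follow the text; the exact filter-bank construction and the choice of 12 kept coefficients are common textbook conventions, not prescriptions from the paper.

```python
import numpy as np

def mfcc_frame(frame, fs=10000, n_filters=20, n_ceps=12):
    """Compute MFCCs for one windowed frame: FFT -> mel filter bank ->
    log -> DCT.  Filter-bank layout is an illustrative assumption."""
    N = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum

    # Filter centre frequencies spaced uniformly on the mel scale (Eq. 4).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, mel(fs / 2.0), n_filters + 2)
    bins = np.floor((N + 1) * imel(mel_pts) / fs).astype(int)

    # Triangular band-pass filters applied to the power spectrum.
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        up = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        down = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
        fb = np.concatenate([up, down])
        energies[i] = np.dot(spectrum[lo:hi], fb) + 1e-10  # avoid log(0)

    log_e = np.log(energies)
    # DCT-II of the log filter-bank energies gives the cepstrum.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return dct @ log_e                                    # acoustic vector
```

Applying `mfcc_frame` to every output of the frame-blocking/windowing stage yields the sequence of acoustic vectors described above.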
5 Speech Feature Matching

There are several methods of feature matching:

• Template models
  − Vector Quantization (VQ)
  − Dynamic Time Warping (DTW)
• Stochastic models
  − Gaussian Mixture Models (GMM)
  − Hidden Markov Modeling (HMM)
• Neural Networks (NNs)
• Support Vector Machines (SVMs) [5]

We used the VQ approach due to its ease of implementation and high accuracy.
5.1 Vector Quantization
The objective of VQ is the representation of a set of feature vectors x ∈ X ⊆ ℜ^k by a set Y = {y_1, ..., y_{N_C}} of N_C reference vectors in ℜ^k. Y is called a codebook and its elements codewords. The vectors of X are also called input patterns or input vectors. A VQ can thus be represented as a function q: X → Y. Knowledge of q permits us to obtain a partition S of X constituted by the N_C subsets S_i (called cells):

S_i = {x ∈ X : q(x) = y_i},  i = 1, ..., N_C    (5)
In brief, VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword; the collection of all codewords is called a codebook. Figure 6 illustrates this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles are the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 6 as black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.
Fig. 6. Diagram illustrating vector quantization codebook formation
One speaker can be discriminated from another based on the locations of the centroids. After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training
vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The LBG VQ design algorithm is an iterative algorithm which alternately solves the two optimality criteria above. The algorithm requires an initial codebook c(0), which is obtained by the splitting method: an initial codevector is set to the average of the entire training sequence and is then split into two; the iterative algorithm is run with these two vectors as the initial codebook; the final two codevectors are split into four; and the process is repeated until the desired number of codevectors is obtained. The algorithm is summarized here [4].

1. Design a 1-vector codebook: this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codevector y_n according to the rule

   y_n+ = y_n(1 + ∈)    (6)
   y_n− = y_n(1 − ∈)    (7)

   where n varies from 1 to the current size of the codebook and ∈ is a splitting parameter (we choose ∈ = 0.01).
3. Nearest-neighbor search: for each training vector, find the codeword in the current codebook that is closest (in terms of a similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed [1, 2].
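The LBG splitting procedure described above can be sketched as follows. This is an illustrative NumPy implementation, assuming the target codebook size M is a power of two; the function name and stopping tolerance are our own choices, not from the paper.

```python
import numpy as np

def lbg_codebook(train, M=8, eps=0.01, tol=1e-4):
    """LBG codebook design by repeated splitting.
    train: (L, d) array of acoustic vectors; returns an (M, d) codebook.
    M is assumed to be a power of two (each pass doubles the codebook)."""
    codebook = train.mean(axis=0, keepdims=True)           # step 1
    while len(codebook) < M:
        # Step 2: split each codevector into y(1+eps) and y(1-eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Step 3: nearest-neighbour assignment.
            d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            avg_dist = d[np.arange(len(train)), nearest].mean()
            # Step 4: centroid update (empty cells left unchanged).
            for i in range(len(codebook)):
                members = train[nearest == i]
                if len(members):
                    codebook[i] = members.mean(axis=0)
            # Step 5: stop when the average distortion settles.
            if prev - avg_dist < tol * max(avg_dist, 1e-12):
                break
            prev = avg_dist
    return codebook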
5.2 The Testing Procedure

The recognition algorithm can be summarized by the following steps.

Step 1: The unknown speaker's speech is recorded first.
Step 2: The starting point and endpoint are detected, and the speech goes through the filtering process.
Step 3: Speech features are extracted from the speech signal and used to create the testing vector (acoustic vector) for that utterance.
Step 4: The testing vector is then fed into the vector quantizer.
Step 5: The predefined knowledge is used by the vector quantizer to calculate the spectral distortion (distance) for each utterance, and the smallest distance value is selected.
Step 6: The smallest distance value is compared with a threshold value, and a decision is made on whether the unknown speaker is recognized or not [6].
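Steps 4–6 of the testing procedure can be sketched as below. The function name and the optional rejection threshold are our own illustrative choices; the paper does not specify a threshold value.

```python
import numpy as np

def identify_speaker(test_vectors, codebooks, threshold=None):
    """Quantize the test vectors with every speaker's codebook, pick the
    smallest average distortion (step 5), and optionally reject unknown
    speakers against a threshold (step 6, hypothetical parameter)."""
    distortions = []
    for cb in codebooks:
        # Squared distance from each test vector to its nearest codeword.
        d = ((test_vectors[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        distortions.append(d.min(axis=1).mean())   # total VQ distortion
    best = int(np.argmin(distortions))
    if threshold is not None and distortions[best] > threshold:
        return -1   # unknown speaker rejected
    return best
```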
6 Result

In this work, the utterances of several speakers were taken, and the data were divided into music (rock and melody) and sample voice data (Bengali and English). Each sample was used to train the vector quantizer, and then all the utterances were used for recognition (testing). The input of the VQ is obtained by frequency analysis of the given input utterances; the detail of the VQ is specified by representing the input in the form of a matrix. We took about 70 data, for which the results are shown in percentages below.

Data Type        | Correct Recognition | False Inclusion | False Rejection
Music            | 86 %                | 6 %             | 8 %
Speech (English) | 92 %                | 3 %             | 5 %
Speech (Bengali) | 91 %                | 4 %             | 5 %
7 Conclusion

This paper dealt with an automatic speaker recognition system using vector quantization, comprising two main modules: feature extraction and feature matching. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor; the resulting mel-frequency cepstrum coefficients, called acoustic vectors, are the extracted features of the speakers. These acoustic vectors are used in feature matching via the vector quantization technique. Other feature matching techniques exist, such as the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN); we used VQ as its computational complexity is lower. Vector quantization is the typical feature matching technique in which a VQ codebook is generated from training data; test data are then matched against the trained data by searching for the nearest neighbor. The speakers are correctly recognized on music and speech data (in both English and Bengali formats), with correct recognition of almost ninety percent, which is better than HMM and ANN, whose correct recognition is below ninety percent. Future work is to generate a VQ codebook with many predefined spectral vectors, making it possible to add many training data to the codebook in a training session; the main problem is that the network size and training time become prohibitively large with increasing data size. To overcome these limitations, a time-alignment technique can be applied, so that a continuous speaker recognition system becomes possible.

Acknowledgement. This work was supported by grants to Shahjahan from KUET, and to KM from the Japanese Society for the Promotion of Science, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.
References
1. Rabiner, L., Juang, B.H.: Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs (1993)
2. Rabiner, L., Schafer, R.W.: Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs (1978)
3. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-28(4) (August 1980)
4. Linde, Y., Buzo, A., Gray, R.: An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications 28, 84–95 (1980)
5. Furui, S.: Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34(1), 52–59 (1986)
6. Furui, S.: An Overview of Speaker Recognition Technology. In: ESCA Workshop on Automatic Speaker Recognition, Identification & Verification, pp. 1–9 (1994)
Modified Modulated Hebb-Oja Learning Rule: A Method for Biologically Plausible Principal Component Analysis

Marko Jankovic1, Pablo Martinez1, Zhe Chen2, and Andrzej Cichocki1

1 Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Wako-shi, Saitama, 351-0198, Japan {elmarkoni,cia}@brain.riken.jp
2 Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA [email protected]
Abstract. This paper presents the Modified Modulated Hebb-Oja (MMHO) method, which performs principal component analysis. The method is based on applying the Time-Oriented Hierarchical Method to the recently proposed principal subspace analysis rule called the Modulated Hebb-Oja learning rule. Compared with some other well-known methods for principal component analysis, the proposed method has one feature that can be seen as desirable from the biological point of view: the synaptic efficacy learning rule does not need explicit information about the values of the other efficacies to make an individual efficacy modification. The simplicity of the "neural circuits" that perform global computations, and the fact that their number does not depend on the number of input and output neurons, are further good features of the proposed method; the number of necessary global calculation circuits is one. Some similarity to a part of the frog retinal circuit is suggested as well.

Keywords: Principal component analysis, time-oriented hierarchy, Stiefel submanifold, local learning rules.
1 Introduction

Neural networks provide a way for parallel on-line computation of principal component analysis (PCA) or principal subspace analysis (PSA). Due to their parallelism and adaptivity to input data, such algorithms and their implementations in neural networks are potentially useful in feature extraction and data compression. It is well known that in the first step of any pattern recognition scheme, which is the representation of the objects from a usually large amount of raw data, some preprocessing and data compression are essential, and a minimal loss of information is then a central objective. Principal subspace analysis (PSA) and principal component analysis (PCA) are powerful and popular tools used in statistical signal processing and data compression; their objective is to reduce the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 527–536, 2008. © Springer-Verlag Berlin Heidelberg 2008
complexity of the input data while keeping as much as possible of the information they carry. In recent years various PCA and PSA learning algorithms have been proposed and mathematically investigated (see [1-5], [8-15], [17-28]). Most of the proposed algorithms are based on local Hebbian learning; due to this locality it has been argued that they are biologically plausible. In [11], biologically inspired PSA methods named the Modulated Hebbian (MH) and Modulated Hebb-Oja (MHO) learning rules were introduced. The major objectives of their derivation were:

– to obtain a network whose learning rule for an individual synaptic efficacy requires the least possible amount of explicit information about the other synaptic efficacies, especially those related to other neurons;
– to minimize the neural hardware necessary for implementation of the proposed learning rule.

In this paper a modification of the MHO algorithm, the MMHO algorithm, is analyzed. The MMHO algorithm performs PCA. The new algorithm is obtained by applying the time-oriented hierarchical method (TOHM) [14] to the MHO method. TOHM is a general method that transforms PSA methods into PCA methods; it is based on learning on an approximate principal Grassmann/Stiefel submanifold [14]. Section 2 reviews the basic theory of the modulated Hebb (MH) and modulated Hebb-Oja (MHO) learning rules and suggests a computational circuit in the context of retinal processing. Section 3 is devoted to the introduction of the new MMHO learning rule. Section 4 contains some simulation results. Section 5 suggests a speculative role of the proposed method in early visual processing in part of the frog retinal circuit. Section 6 gives some conclusions.
2 Modulated Hebb Learning Rule and Computational Circuit

2.1 Modulated Hebbian Rule: The Case N = K

Let x ∈ ℜ^K denote the input random variables with zero mean, and let y = W^T x ∈ ℜ^N denote the output, where W ∈ ℜ^{K×N} denotes the synaptic weight matrix. It is assumed N = K. In scalar form, for the kth output, we have y_k = w_k^T x (k = 1, ..., N), where w_k ∈ ℜ^K denotes the kth column vector of W. The Modulated Hebbian (MH) learning rule was introduced in [11] and analyzed in [13][15]. Here we give a slightly different interpretation: the MH rule can be derived as a gradient descent learning rule for the minimization of the following cost function
J(W) = E(‖x − Wy‖⁴) = E((‖x − W W^T x‖²)²)
     = E((tr(C) − 2 tr(W^T C W) + tr(W^T C W W^T W))²),
under the assumption W^T W = I,    (1)

or in compact notation

J_NLoPCA(W) = E{(Σ_{i=1}^{N} x_i² − Σ_{i=1}^{N} y_i²)²}.    (2)
It leads to the following gradient descent algorithm:

W(i+1) = W(i) + γ(i) (Σ_{k=1}^{N} x_k²(i) − Σ_{k=1}^{N} y_k²(i)) x(i) y^T(i).    (3)
It may be noticed that only an implicit constraint W^T W = I is used; there is no explicit constraint. As shown in [13], this algorithm, named Modulated Hebbian learning (MH), leads toward a solution which ensures W^T W = I, although no additional terms that would keep the weight vectors orthonormal are added. It is also interesting that the algorithm can be seen as a variant of a moving-threshold algorithm, of a form similar to the BCM learning rule [3], where the moving threshold equals the power contained in the input signals. From equation (3) we can see that the original Hebb learning rule, modulated by a single signal, is applied. The network structure of interest is depicted in Fig. 1. The proposed learning method possesses locality and homogeneity, which are desirable properties for implementing artificial neural networks in parallel hardware [21]. Compared with some other approaches (e.g. [19] and [26]), here the update rule for an individual synaptic efficacy does not require information about the explicit values of the other synaptic efficacies. Also, the number of circuits that perform global calculations (in this case summators) is two, regardless of the dimensions of the input and output vectors.

2.2 Modulated Hebb-Oja Rule: Case N < K
The MH learning rule is of low practical interest since the output has the same dimensionality as the input (for instance, it can be used for statistical orthogonalization of a matrix). In the case when the number of output neurons is lower than the number of input neurons, i.e. N < K, the learning rule is modified as follows:
(
)(
(
))
W (i + 1) = W (i ) + γ (i) x (i ) T x (i ) − y (i ) T y (i) x (i) y (i ) T − W (i )diag y (i ) y (i )T .
(4)
It is called the Modulated Hebb-Oja (MHO) learning rule. This learning rule generates a solution W whose range is equal to the subspace spanned by the eigenvectors corresponding to the N largest eigenvalues of the input signal covariance matrix C = E(xx^T). The structure of the neural network of interest can also be represented by Fig. 1.

530 M. Jankovic et al.

Fig. 1. MH(O) principal subspace network (small empty circle at the end of an arrow denotes the square function)

The only difference is that some additional calculations at the output are necessary. In [15] it was shown that the matrix W(i) has bounded elements W_kn(i).
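As an illustration, one update step of the MHO rule (4) can be sketched in NumPy as follows. This is our own sketch, not the authors' code; the function and variable names are ours, and W is taken as a K x N matrix whose columns are the synaptic vectors.

```python
import numpy as np

def mho_step(W, x, gamma):
    """One step of the Modulated Hebb-Oja (MHO) rule, Eq. (4).

    W : (K, N) weight matrix, x : (K,) input sample, gamma : learning rate.
    The Hebbian term x y^T is modulated by the single scalar signal
    x^T x - y^T y and stabilized by the Oja-like term W diag(y y^T).
    """
    y = W.T @ x                      # (N,) output vector
    modulation = x @ x - y @ y       # scalar: input power minus output power
    # broadcasting W * (y*y) multiplies column n of W by y_n^2, i.e. W diag(y y^T)
    dW = modulation * (np.outer(x, y) - W * (y * y))
    return W + gamma * dW
```

Note that the update is local and homogeneous: each synapse needs only its own value, its pre- and post-synaptic activities, and the two globally summed powers.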
3 Introduction of the New PCA Learning Rule

Now it will be explained how the MHO algorithm can be transformed into a PCA algorithm. First we recall the definitions of the Grassmann and Stiefel manifolds, which can be found in [7]. The Grassmann manifold is defined as the space of matrices W ∈ O^{KxN} ⊂ R^{KxN} (N ≤ K) such that W^T W = I, together with a homogeneous function J: O^{KxN} → R such that J(W) = J(WQ) for any NxN orthogonal matrix Q. The Stiefel manifold is defined as the space of matrices W ∈ O^{KxN} ⊂ R^{KxN} (N ≤ K) together with a function J: O^{KxN} → R. Some learning algorithms can be seen as learning on the Stiefel manifold if the weight matrix is kept automatically orthonormal at each update step. Such algorithms are known as strongly orthonormally constrained (SOC) algorithms (e.g., [9]).

Here we use the recently proposed time-oriented hierarchical method (TOHM) (or, more generally, GTOHM [14]) in order to transform the PSA algorithm MHO into a PCA method. The method is based on the following idea: each neuron first tries to do what is best for its family, and then what is best for itself. We shall call this idea "the family principle". In other words, the algorithm consists of two parts: the first part is responsible for learning the family-desirable feature, and the second part is responsible for learning the individual-neuron-desirable feature. The second part is taken with a weight coefficient which is, by absolute value, smaller than 1. This means that a time-oriented hierarchy in the realization of the family and individual parts of the learning rule is established. In order to realize "the family principle", we propose the following class of learning rules, which can be used for parallel extraction of principal components:

ΔW_PCA = ΔW_PSA + D(i) IP_{SPCA or SMCA},   (5)

where ΔW_PCA represents the modification of the weights, ΔW_PSA defines the family part of the learning rule (that is, PSA), IP_{SPCA or SMCA} represents the individual part of the learning rule (a single-unit PCA or MCA algorithm), and D(i) is a diagonal matrix whose diagonal elements can be functions of time (in the case of a homogeneous algorithm, D(i, i) = α). In the most general case all individual parts can be implemented as different learning rules. It is interesting to note that the individual part can pursue a minor component while the whole algorithm pursues principal components. The intuition is as follows: we use two time scales, so if α is sufficiently small, the term multiplied by α does not affect the PSA learning direction. When the algorithm reaches the principal subspace, the part of the algorithm multiplied by α performs learning on (approximately) the Grassmann/Stiefel principal submanifold (definitions are given in [14]) and rotates the weight vectors toward the principal components. By implementation of the proposed principle, the new learning rule can be written in the form
w_k(i+1) = w_k(i) + γ(i) ( x(i) y_k(i) − w_k(i) y_k^2(i) ) ( ||x(i)||^2 − ||y(i)||^2 )
         + α γ(i) ( x(i) y_k(i) − w_k(i) y_k^2(i) ) ( ||x(i)||^2 − Σ_{j=1}^{k} y_j^2(i) ),   k = 1, ..., N.   (6)
We can see that we actually have a system of equations that share the same family part of the learning rule, while all the individual parts are different. By applying the same method as proposed in [15], it can be shown that the synaptic vectors have bounded norms under some reasonable assumptions. It can be proved that the stable points of the algorithm are the principal eigenvectors of the input signal covariance matrix. However, the proof is lengthy and is omitted here. The sketch of the proof is as follows:

– the difference equations (6) can be related to their differential counterparts by stochastic approximation [16], [20];
– it is possible to show that all equations actually represent eigenvector equations for a set of matrices whose eigenvectors are equal to the eigenvectors of the input signal covariance matrix C = E(xx^T);
– if the weight matrix is of full rank for all t, then the columns of the synaptic matrix converge toward some of the eigenvectors of the input covariance matrix;
– then, using the method proposed in [15], it is possible to prove that those eigenvectors correspond to the principal eigenvectors of the input covariance matrix.
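The system of per-neuron equations (6) can be sketched in vectorized NumPy as below. Again this is our own illustration under our own naming: the shared family modulation is applied to all columns at once, while the individual modulation of neuron k uses the cumulative partial power Σ_{j≤k} y_j^2.

```python
import numpy as np

def mmho_step(W, x, gamma, alpha):
    """One step of the modified MHO (MMHO) rule, Eq. (6).

    Family part: MHO modulation by ||x||^2 - ||y||^2 (shared by all neurons).
    Individual part (weighted by alpha): neuron k is modulated by
    ||x||^2 - sum_{j<=k} y_j^2, rotating the weights toward individual
    principal eigenvectors once the principal subspace is reached.
    """
    y = W.T @ x
    hebb_oja = np.outer(x, y) - W * (y * y)   # column k: x*y_k - w_k*y_k^2
    family = (x @ x - y @ y) * hebb_oja
    partial_power = np.cumsum(y * y)          # k-th entry: sum_{j=1}^{k} y_j^2
    individual = hebb_oja * (x @ x - partial_power)
    return W + gamma * (family + alpha * individual)
```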
4 Simulation Results

We now examine small-scale numerical simulation results. The number of inputs was K = 5 and the number of output neurons was N = 3. Artificial zero-mean vectors with uncorrelated elements were generated by the following equations:

x(1, i) = .45 sin(i/2);
x(2, i) = .45 ((rem(i,23) − 11)/9).^5;
x(3, i) = .35 sin(i/17.8);
x(4, i) = .145 ((rand(1,1) < .5)*2 − 1) .* log(rand(1,1) + .5);
x(5, i) = .18 randn(1,1).

The input signal is constructed as s = 0.47*mix*x, where the mixing matrix mix is defined as mix = −.5 + rand(K). In Fig. 2, the cosines of the angles between the column vectors of the matrix W and the numerically calculated eigenvectors of the input signal covariance matrix are used to illustrate the efficiency of the algorithm. The learning rate was kept constant at γ(i) = 3.45 for the first 15000 iterations, and was then set to γ(i) = 0.115. The weight of the individual part was taken as α = 0.5. The initial value of the matrix W was selected as W_init = −.5 + rand(5,3). Fig. 3 shows that the proposed method can be used efficiently for the extraction of deterministic input signals (under the assumption that the mixing matrix is orthogonal). The first column in Fig. 3 represents the original signals, the second column represents the input signals obtained from the original signals by multiplication with the orthonormalized matrix mix (mix ← mix (mix^T mix)^{−1/2}), and the last column represents the output, or reconstructed, signals obtained by the MMHO algorithm.
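The MATLAB-style generators above can be transcribed into NumPy as follows (our own sketch; `rem` is the integer remainder, `rand`/`randn` are uniform/standard-normal draws, and `make_sources` is a name we introduce):

```python
import numpy as np

def make_sources(n_samples, rng):
    """Generate the five zero-mean source signals of Section 4
    (MATLAB rem/rand/randn transcribed to NumPy); i is 1-based."""
    x = np.zeros((5, n_samples))
    for t in range(n_samples):
        i = t + 1
        x[0, t] = 0.45 * np.sin(i / 2)
        x[1, t] = 0.45 * ((np.remainder(i, 23) - 11) / 9) ** 5
        x[2, t] = 0.35 * np.sin(i / 17.8)
        x[3, t] = 0.145 * ((rng.random() < 0.5) * 2 - 1) * np.log(rng.random() + 0.5)
        x[4, t] = 0.18 * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
K = 5
mix = -0.5 + rng.random((K, K))            # mixing matrix, as in the text
s = 0.47 * mix @ make_sources(1000, rng)   # input signal s = 0.47*mix*x
```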
Fig. 2. Cosine of angles between column vectors of W and three principal eigenvectors of input covariance matrix versus the number of iterations
Fig. 3. Blind signal extraction of deterministic components (mixing matrix is orthonormal)
5 Comparison to a Part of the Frog Retinal Wiring

Equation (6) can be rewritten as

w_k(i+1) = w_k(i) + γ(i) ( x(i) y_k(i) − w_k(i) y_k^2(i) ) ( (1 + α) ||x(i)||^2 − ||y(i)||^2 )
         − α γ(i) ( x(i) y_k(i) − w_k(i) y_k^2(i) ) ( Σ_{j=1}^{k−1} y_j^2(i) + y_k^2(i) ),   k = 1, ..., N.   (7)
The proposed learning rule can be implemented by the network structure shown in Fig. 4. By analyzing the proposed structure, we can see that some degree of similarity with part of the frog's retinal wiring exists [6]. We can make the following analogies:

• the inputs x_1, ..., x_K can be viewed as photoreceptors;
• the outputs y_1, ..., y_N correspond to bipolar cells (BC);
• the circuit for the calculation of the input energy, marked as Σ, can be viewed as a horizontal cell;
• the units labeled A_k, which calculate the partial output powers

A_k(i) = Σ_{j=1}^{k−1} y_j^2(i) + y_k^2(i),

could be related to the amacrine cells (AC);
• the IP unit can be related to the interplexiform cell (output power = IP = A_N).
Fig. 4. Structure of the computational circuit that can be used for realization of MMHO algorithm (small empty circle at the end of an arrow denotes the square function)
• we can notice the presence of backward inhibitory projections from the amacrine cells to the bipolar cells;
• the A_k cells are serially connected, which could possibly explain the serial synapsing between amacrine cells in the frog's retina.

Although there is no physiological evidence supporting the similarity with part of the frog's retinal processing, a high degree of structural similarity can be a sign of possible similarity in the physiological sense. Although some further refinement of the circuit structure is necessary in order to match known results (such as the reciprocal connections between ACs, or the fact that the back projections of an amacrine cell terminate on BC terminals rather than on their dendrites), we suggest here that species whose IP cells are damaged will have significantly deteriorated selectivity abilities.
6 Conclusion

In this paper a biologically plausible method that performs PCA has been proposed. In the proposed method, the learning rule for an individual synaptic efficacy does not require explicit information about the other synaptic efficacies, especially those related to other neurons. The simplicity of the "neural circuits" that perform global computations, and the fact that their number does not depend on the number of input and output neurons, can be seen as good features of the proposed method. Compared with some other recursive PCA methods that use local feedback connections in order to maintain stability, the MMHO method uses global and local feedback connections. Some degree of similarity between the circuit that can be used for the realization of the proposed learning rule and part of the wiring of the frog's retina has been suggested.
It is suggested that the role of the interplexiform cells in overall retinal processing could be significant, although it is usually assumed that their role is not important.
References

1. Abed-Meraim, K., Attallah, S., Chkeif, A., Hua, Y.: Orthogonal Oja algorithm. IEEE Signal Processing Letters 7, 116–119 (2000)
2. Baldi, P., Hornik, K.: Learning in linear neural networks: A survey. IEEE Trans. Neural Networks 6, 837–858 (1995)
3. Bienenstock, E., Cooper, L.N., Munro, P.W.: Theory of the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 32–48 (1982)
4. Chen, T.-P., Amari, S.-I.: Unified stabilization approach to principal and minor components. Neural Networks 14, 1377–1387 (2001)
5. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing – Learning Algorithms and Applications. John Wiley and Sons, Chichester (2003)
6. Dowling, J.: The Retina: An Approachable Part of the Brain. The Belknap Press of Harvard University Press (1987)
7. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20, 303–353 (1998)
8. Douglas, S.C., Kung, S.Y., Amari, S.-I.: A self-stabilized minor subspace rule. IEEE Signal Processing Letters 5, 328–330 (1998)
9. Fiori, S.: A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation 13, 1625–1647 (2001)
10. Földiák, P.: Adaptive network for optimal linear feature extraction. In: IJCNN 1989, Washington, D.C., USA, pp. 401–405 (1989)
11. Jankovic, M.: A new modulated Hebbian learning rule – Method for local computation of a principal subspace. In: ICONIP 2001, Shanghai, China, vol. 1, pp. 470–475 (2001)
12. Jankovic, M.: A new simple ∞OH neuron model as a biologically plausible principal component analyzer. IEEE Trans. on Neural Networks 14, 853–859 (2003)
13. Jankovic, M., Ogawa, H.: A new modulated Hebb learning rule – Biologically plausible method for local computation of principal subspace. Int. J. Neural Systems 13, 215–224 (2003)
14. Jankovic, M., Reljin, B.: Neural learning on Grassman/Stiefel principal/minor submanifold. In: EUROCON 2005, Serbia, pp. 249–252 (2005)
15. Jankovic, M., Ogawa, H.: Modulated Hebb-Oja learning rule – A method for principal subspace analysis. IEEE Trans. on Neural Networks 17, 345–356 (2006)
16. Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr. 22, 551–575 (1977)
17. Möller, R., Konies, A.: Coupled principal component analysis. IEEE Trans. on Neural Networks 15, 214–222 (2004)
18. Oja, E.: A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267–273 (1982)
19. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley, Letchworth (1983)
20. Oja, E., Karhunen, J.: On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl. 106, 69–84 (1985)
21. Oja, E., Ogawa, H., Wangviwattana, J.: Principal component analysis by homogeneous neural networks, Part I: The weighted subspace criterion. IEICE Trans. Inf. & Syst. E75-D, 366–375 (1992)
22. Peltonen, J., Kaski, S.: Discriminative components of data. IEEE Trans. on Neural Networks 16, 68–83 (2005)
23. Plumbley, M.D., Oja, E.: A "nonnegative PCA" algorithm for independent component analysis. IEEE Trans. on Neural Networks 15, 66–76 (2004)
24. Tanaka, T.: Generalized weighted rules for principal components tracking. IEEE Trans. on Signal Processing 53, 1243–1253 (2005)
25. Waheed, K., Salem, F.M.: Blind information-theoretic multiuser detection algorithms for DS-CDMA and WCDMA downlink systems. IEEE Trans. on Neural Networks 16, 937–948 (2005)
26. Xu, L.: Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks 6, 627–648 (1993)
27. Yang, B.: Projection approximation subspace tracking. IEEE Trans. on Signal Processing 43, 95–107 (1995)
28. Zhang, Y., Weng, J., Hwang, W.-S.: Auditory learning: a developmental method. IEEE Trans. on Neural Networks 16, 601–616 (2005)
Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise

Katsuyuki Hagiwara
Faculty of Education, Mie University, 1577 Kurima-Machiya-cho, Tsu 514-8507, Japan
[email protected]
Abstract. In this article, for regression problems, we propose shrinkage methods for training a machine in terms of a regularized cost function. The machine considered in this article is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently the number of adjustable weights, is identical to the number of training data. This setting can be viewed as a nonparametric regression method in statistics. In the regularized cost function employed in this article, the error function is defined by the sum of squared errors and the regularization term is defined by the quadratic form of the weight vector. Assuming i.i.d. Gaussian noise, we propose three thresholding methods for the orthogonal components obtained by eigendecomposition of the Gram matrix of the vectors of basis function outputs. The final weight values are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators. The proposed methods are quite simple and automatic: it suffices to fix the regularization parameter at a small constant value. Simple numerical experiments showed that, compared with the leave-one-out cross-validation method, the computational costs of the proposed methods are much lower, while the generalization capabilities of the trained machines are comparable when the number of data is relatively large.
1 Introduction

This article considers a regression method using learning machines that are defined by a linear combination of fixed basis functions. In particular, we focus on machines in which the number of basis functions, or equivalently the number of adjustable weights, is identical to the number of training data. This problem is viewed as a nonparametric regression method in statistics. In machine learning, support vector regression belongs to this type of regression, in which the kernel trick together with the representer theorem yields a linear combination of kernel functions (e.g., [4]). In recent works, variations of support vector machines have also been proposed, such as least squares support vector machines [8], or equivalently, regularized least squares [6]. In training the above stated machines, regularization methods are naturally introduced for the following two reasons. The first reason is to stabilize

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 537–546, 2008.
© Springer-Verlag Berlin Heidelberg 2008
538
K. Hagiwara
the training process. Since the number of basis functions is large, the basis functions can be nearly linearly dependent, depending on the choice of basis functions. In this case, for example, the least squares estimates of the weights are numerically unstable. The second reason is to avoid over-fitting. For example, the outputs of the above stated machine with the least squares estimates are identical to the output data. Obviously, such outputs may not generalize well to unseen data due to over-fitting. The generalization capability of the trained machine is generally sensitive to the choice of the regularization parameter value, which is often chosen by a cross-validation method in applications. In this article, we consider other methods for improving the generalization performance, while the regularizer is retained for stabilizing the training process.

In this article, we propose shrinkage methods for training the above stated machine in terms of a regularized cost function. In the regularized cost function, the error function is defined by the sum of squared errors and the regularizer is defined by the quadratic form of the weight vector. Therefore, the formulation in this article is almost equivalent to least squares support vector machines/regularized least squares. However, it is slightly more general, since the basis functions are not assumed to be Mercer kernels. Eigendecomposition is first applied to the Gram matrix of the vectors of basis function outputs; it can be viewed as orthogonalization of the vectors of basis function outputs. This procedure has also been applied for reducing computational costs in training kernel machines [9]. We then obtain the coefficients of the orthogonalized basis outputs by a linear transformation of the regularized estimates of the weights. These coefficients are referred to as orthogonal components here. Assuming i.i.d. Gaussian noise, we consider keeping or removing the orthogonal components by a hard thresholding method, in which the orthogonal components to be removed are set to zero. By introducing the idea of the universal threshold level used in wavelet denoising [5], we propose three hard thresholding methods for the orthogonal components. The final weights are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators. Although our methods include a regularization parameter, it can be fixed at a small value. This is because the machines trained by our methods have compact representations in an orthogonal domain, so the contribution of the regularizer to improving the generalization performance is not significant. Therefore, we need the regularizer only for ensuring the numerical stability of training, and it suffices to fix it at a small value.

There are several works related to our methods. The lasso implements the l1 regularizer, which is known to act as a thresholding method [7]. The thresholding operation of the lasso is applied directly to the weights of the basis functions, while our methods are applied to the orthogonal components. In the lasso, the threshold level is determined by a regularization parameter, while in our methods the levels are given with a theoretical implication. The regularization parameter of the lasso may be chosen by cross-validation, which is time consuming compared to our methods, since in our methods it can simply be fixed. Recently, a Gram-Schmidt type orthogonalization procedure has been introduced to least squares support

Orthogonal Shrinkage Methods for Nonparametric Regression

539

vector machines (e.g., [2]), where the orthogonal components are chosen in a greedy manner according to the residual error. [2] further proposed a method in which regularization parameters are assigned individually to the orthogonal components and are updated based on the Bayesian log evidence. This method includes some parameters, such as a criterion for determining the number of orthogonal components and the number of updates, which are determined heuristically. Our methods have an advantage at this point, since they do not include parameters that need to be empirically adjusted.

This article is organized as follows. In Section 2, we formulate a learning machine and its training procedure using a regularization method. We also introduce eigendecomposition, or orthogonalization, of the basis function outputs. In Section 3, we give the shrinkage methods. Some numerical experiments for the proposed methods, including comparisons with the leave-one-out cross-validation method, are shown in Section 4. Conclusions and future works are given in Section 5.
2 Formulation

2.1 Regression Problem and Regularized Cost

Let {(x_i, y_i) : x_i ∈ R^d, y_i ∈ R, i = 1, ..., n} be a set of input-output training data. We assume that the output data are generated according to the following rule:

y_i = h(x_i) + e_i,   i = 1, ..., n,   (1)

where h is a fixed target function on R^d and e_1, ..., e_n are i.i.d. noise samples drawn from a Gaussian distribution N(0, σ^2). x_1, ..., x_n are independent samples from a probability distribution on R^d and are independent of e_1, ..., e_n. We consider curve-fitting using a machine

f_w(x) = Σ_{j=1}^{n} w_j g_j(x),   x ∈ R^d,   (2)
where w = (w_1, ..., w_n) is a coefficient or weight vector. To simplify the subsequent discussion, we define y = (y_1, ..., y_n)^T, where ^T denotes the transpose. We also treat w as a column vector, i.e., w = (w_1, ..., w_n)^T. Let G be the n × n matrix whose (i, j) element is g_j(x_i). We define g_j = (g_j(x_1), ..., g_j(x_n))^T, the jth column vector of G. We assume that g_1, ..., g_n are linearly independent. We train the machine in terms of the regularized cost function defined by

C(w) = E(w) + λ R(w),   λ ∈ (0, ∞),   (3)

where E is an error function, R is a regularization term, and λ is a regularization parameter. In this article, we employ

E(w) = ||y − Gw||^2,   (4)

R(w) = ||w||^2,   (5)

where ||·|| denotes the Euclidean norm.
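The cost (3) with the choices (4) and (5) is the familiar ridge-regression criterion. A minimal sketch (our own helper name, not from the article):

```python
import numpy as np

def regularized_cost(w, G, y, lam):
    """Regularized cost (3) with the squared-error term (4) and the
    quadratic regularizer (5): C(w) = ||y - G w||^2 + lam * ||w||^2."""
    residual = y - G @ w
    return residual @ residual + lam * (w @ w)
```

Since C is strictly convex in w for λ > 0, it has the unique minimizer (G^T G + λ I)^{-1} G^T y, which is the estimate introduced in the next subsection.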
2.2 Training Via Eigendecomposition of Gram Matrix
We define F_λ = G^T G + λ I_n, where I_n is the n × n identity matrix. Under (3), (4) and (5), the estimate of the weight vector is given by

ŵ_λ = (ŵ_{λ,1}, ..., ŵ_{λ,n})^T = F_λ^{−1} G^T y   (6)

and the minimum of the regularized cost function can be written as

C(ŵ_λ) = y^T y − ŵ_λ^T F_λ ŵ_λ.   (7)

Since the column vectors of G are linearly independent, G^T G is positive definite, where G^T G is the Gram matrix of (g_1, ..., g_n). Then there exists an orthonormal matrix Q such that Q^T (G^T G) Q = Γ, where Γ = diag(γ_1, ..., γ_n) with γ_k > 0 for every k. Here diag(a_1, ..., a_n) denotes a diagonal matrix whose (k, k) element is a_k. Note that Q Γ Q^T = G^T G holds since Q is orthonormal. γ_k is the kth eigenvalue and the kth column vector of Q is the corresponding eigenvector. We define

Γ_λ = diag(γ_1 + λ, ..., γ_n + λ).   (8)

Then Q^T F_λ Q = Γ_λ and Q Γ_λ Q^T = F_λ hold. By using Q and Γ_λ, (6) and (7) can be written as

ŵ_λ = Q Γ_λ^{−1} Q^T G^T y,   (9)

C(ŵ_λ) = y^T y − v̂_λ^T v̂_λ,   (10)

where

v̂_λ = (v̂_{λ,1}, ..., v̂_{λ,n})^T = Γ_λ^{−1/2} Q^T G^T y.   (11)

By (9) and (11), ŵ_λ and v̂_λ are linked by the equations

ŵ_λ = Q Γ_λ^{−1/2} v̂_λ   (12)

or

v̂_λ = Γ_λ^{1/2} Q^T ŵ_λ.   (13)

ŵ_λ is a coordinate vector relative to G, while v̂_λ is one relative to Q, whose column vectors are orthonormal. We refer to v̂_{λ,1}, ..., v̂_{λ,n} as orthogonal components.
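The computations (8)-(12) can be sketched in NumPy as follows (our own function name and variable names; the eigendecomposition routine returns eigenvalues in ascending order):

```python
import numpy as np

def train_via_eigendecomposition(G, y, lam):
    """Regularized weights (6) computed through the eigendecomposition of
    the Gram matrix G^T G: returns the weights w of Eq. (12), the orthogonal
    components v of Eq. (11), the orthonormal matrix Q, and the eigenvalues."""
    gram = G.T @ G
    gamma, Q = np.linalg.eigh(gram)              # eigenvalues (ascending), eigenvectors
    gamma_lam = gamma + lam                      # diagonal of Gamma_lambda, Eq. (8)
    v = (Q.T @ (G.T @ y)) / np.sqrt(gamma_lam)   # Eq. (11)
    w = Q @ (v / np.sqrt(gamma_lam))             # Eq. (12)
    return w, v, Q, gamma
```

The returned w coincides with the direct solution (G^T G + λI)^{-1} G^T y of (6), and the minimum cost (7) equals y^T y − v^T v as in (10).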
3 Shrinkage Methods

We define w* = (G^T G)^{−1} G^T h, where h = (h(x_1), ..., h(x_n))^T. We also define

v*_λ = (v*_{λ,1}, ..., v*_{λ,n})^T = Γ_λ^{−1/2} Γ Q^T w*,   (14)

Γ̃_λ = diag( γ_1/(γ_1 + λ), ..., γ_n/(γ_n + λ) ).   (15)

Then the conditional distribution of v̂_λ = v̂_λ(x^n) given x^n = (x_1, ..., x_n) is shown to be v̂_λ | x^n ∼ N(v*_λ, σ^2 Γ̃_λ). The components with v*_{λ,i} = 0 can be viewed
as purely noise components, which we need to remove. We now consider keeping or removing the orthogonal components by a hard thresholding method. Let T_{n,i}(v̂_{λ,i}) be the ith component after thresholding; i.e., T_{n,i}(v̂_{λ,i}) = v̂_{λ,i} if it is kept and T_{n,i}(v̂_{λ,i}) = 0 if it is removed. We define T_n(v̂_λ) = (T_{n,1}(v̂_{λ,1}), ..., T_{n,n}(v̂_{λ,n}))^T. Then training by using a thresholding procedure is as follows:

1. Fix a small positive value for λ.
2. By eigendecomposition of G^T G, obtain the eigenvalues (γ_1, ..., γ_n) and an orthonormal matrix Q whose column vectors are the corresponding eigenvectors. By using (11) with (8), obtain v̂_λ.
3. Calculate T_{n,i}(v̂_{λ,i}), i = 1, ..., n.
4. By (12), obtain w̃_λ = (w̃_{λ,1}, ..., w̃_{λ,n})^T = Q Γ_λ^{−1/2} T_n(v̂_λ) as the trained weight vector after thresholding. The output of the trained machine is then given by f_{w̃_λ}(x) = Σ_{j=1}^{n} w̃_{λ,j} g_j(x) for any x ∈ R^d.

By the definition of T_{n,i}, i = 1, ..., n, it is easy to see that the resulting w̃_λ is a shrinkage estimator of ŵ_λ. Typically, if we set threshold levels {θ_{n,1}, ..., θ_{n,n}} for each of the orthogonal components, then the thresholded components are calculated by

T_{n,i}(v̂_{λ,i}) = v̂_{λ,i} if v̂_{λ,i}^2 > θ_{n,i};  T_{n,i}(v̂_{λ,i}) = 0 if v̂_{λ,i}^2 ≤ θ_{n,i},   i = 1, ..., n.   (16)

We propose three methods for T_n.

Component-wise hard thresholding (CHT). We first derive appropriate threshold levels. To do this, we state a basic result on i.i.d. Gaussian random variables. Let W_1, ..., W_n be i.i.d. random variables having N(0, σ^2). We define C_{n,ε} = 2 log n + (−1 + ε) log log n, where ε is a constant. Then it can be shown that

lim_{n→∞} P( max_{1≤i≤n} |W_i|^2 > C_{n,ε} ) = 0,   for ε > 0,   (17)

lim_{n→∞} P( max_{1≤i≤n} |W_i|^2 ≤ C_{n,ε} ) = 0,   for ε < 0.   (18)
The strong convergence result with C_{n,1} is known, and C_{n,1} is used as the universal threshold level for removing irrelevant wavelet coefficients in wavelet denoising [5]. The universal threshold level is shown to be asymptotically equivalent to the minimax one [5]. (17) and (18) are weaker results, but they evaluate the O(log log n) term. We return to our problem and consider the case v*_λ = 0_n, i.e., all of the v̂_{λ,i} are noise components. Here 0_n denotes the n-dimensional zero vector. We define ũ = (ũ_1, ..., ũ_n)^T satisfying ũ ∼ N(0_n, σ^2 Γ̃_λ). We define σ̃_i^2 = σ^2 γ_i/(γ_i + λ), i = 1, ..., n, where σ̃_i ≠ 0 for any i. We also define u = (u_1, ..., u_n)^T, where u_i = ũ_i/σ̃_i. Then u ∼ N(0_n, I_n). By (17) and the definition of u, we have
P( ∪_{i=1}^{n} { ũ_i^2 > σ̃_i^2 C_{n,ε} } ) = P( max_{1≤i≤n} u_i^2 > C_{n,ε} ) → 0   (n → ∞),   (19)
if ε > 0. On the other hand, by using (18) and the definition of u, we have

P( ∩_{i=1}^{n} { ũ_i^2 ≤ σ̃_i^2 C_{n,ε} } ) = P( max_{1≤i≤n} u_i^2 ≤ C_{n,ε} ) → 0   (n → ∞),   (20)
if ε < 0. (19) tells us that, for any i, ũ_i^2 cannot exceed σ̃_i^2 C_{n,ε} with high probability when n is large and ε > 0. Therefore, if we employ σ̃_i^2 C_{n,ε} with ε > 0 as component-wise threshold levels, (19) implies that they remove a noise component when it is indeed noise. On the other hand, (20) tells us that there are some noise components which satisfy ũ_i^2 > σ̃_i^2 C_{n,ε} with high probability when n is large and ε < 0. Therefore, the ith component would be recognized as a noise component if ũ_i^2 ≤ σ̃_i^2 C_{n,ε}, even when it is not; in other words, we cannot distinguish nonzero-mean components from zero-mean components in this case. Hence σ̃_i^2 C_{n,0}, i = 1, ..., n, are the critical levels for identifying noise components. We note that these results are still valid when v*_{λ,i} is not zero for some i, provided the number of such components is very small. This is the case when assuming sparseness of the representation of h in the orthogonal domain. As a conclusion, we propose a hard thresholding method obtained by putting θ_{n,i} = σ̃_i^2 C_{n,0}, i = 1, ..., n, in (16), where we set ε = 0. We refer to this method as component-wise hard thresholding (CHT). In practical applications of the method, we need an estimate of the noise variance σ^2. Fortunately, in nonparametric regression methods, [1] suggested applying

σ̂^2 = y^T (I − H_λ)^2 y / trace[(I − H_λ)^2],   (21)

where H_λ is defined by H_λ = G F_λ^{−1} G^T = G Q Γ_λ^{−1} Q^T G^T. Although the method includes a regularization parameter, it can be fixed at a small value. This is because the thresholding method keeps the trained machine compact in the orthogonal domain, by which the contribution of the regularizer may not be significant in improving the generalization performance. Therefore, it is needed only for ensuring the numerical stability of the matrix calculations. Since the basis functions can be nearly linearly dependent in practical applications, small eigenvalues are less reliable. We therefore ignore the orthogonal components whose eigenvalues are less than a predetermined small value, say 10^{−16}. Although the run time of the eigendecomposition is O(n^3), the subsequent procedures of CHT, such as the calculations of σ̂^2 and w̃_λ, are achieved with low computational costs given the eigendecomposition.

Hard thresholding with the universal threshold level (HTU). Basically, eigendecomposition of G^T G corresponds to the principal component analysis of g_1, ..., g_n. Therefore, for nearly linearly dependent basis functions, only several eigenvectors contribute substantially. On the other hand, the components with small eigenvalues are strongly affected by numerical errors. Therefore, it is natural to take care of only the components with large eigenvalues. For a component
with a large eigenvalue, γ_i ≫ λ holds, since we can choose a small value for λ. Thus σ̃_i^2 ≈ σ^2 holds by the definition of σ̃_i^2 in CHT. We then consider applying a single threshold level σ^2 C_{n,0} instead of σ̃_i^2 C_{n,0}, i = 1, ..., n, in CHT; i.e., we set θ_{n,i} = σ^2 C_{n,0} in (16). This is a direct application of the universal threshold level in wavelet denoising [5]. This method is referred to as hard thresholding with the universal threshold level (HTU).

Backward hard thresholding (BHT). On the other hand, since the threshold level derived here is a worst-case evaluation of the noise level, CHT and HTU may yield a bias between f_{w̃_λ} and h through unexpected removal of contributing components. Since a component with a large eigenvalue is composed of a linear sum of many basis functions, it may be a smooth component; removal of such components may therefore yield a large bias. Actually, in wavelet denoising, the fast/detail components are the target of the thresholding method, while the slow/approximation components are left unharmed by it [5], which may be a device for reducing the bias. For our method, we also introduce this idea and consider the following procedure. We assume that γ_1 ≤ γ_2 ≤ ... ≤ γ_n. The method is that, increasing j from 1 to n, we find the first j = ĵ for which v̂_{λ,j}^2 > σ^2 C_{n,0} occurs. Then thresholding is made by T_{n,j}(v̂_{λ,j}) = v̂_{λ,j} if j ≥ ĵ and T_{n,j}(v̂_{λ,j}) = 0 if j < ĵ. This keeps the components with large eigenvalues and possibly reduces the bias. We refer to this method as backward hard thresholding (BHT). BHT can be viewed as a stopping criterion for choosing contributing components among the orthogonal components enumerated in order of the magnitudes of the eigenvalues.
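The three thresholding rules can be sketched together as below. This is our own illustration (function and argument names are ours); it assumes, as in BHT, that the eigenvalues are sorted in ascending order, and it uses C_{n,0} = 2 log n − log log n.

```python
import numpy as np

def threshold_components(v, gamma, lam, sigma2, method="CHT"):
    """Hard-threshold orthogonal components v (eigenvalues gamma ascending).

    CHT: component-wise levels sigma_i^2 * C_n, sigma_i^2 = sigma^2*gamma_i/(gamma_i+lam).
    HTU: single universal level sigma^2 * C_n.
    BHT: scan j = 1..n and keep every component from the first j whose
         squared value exceeds sigma^2 * C_n onward.
    """
    n = len(v)
    c_n = 2 * np.log(n) - np.log(np.log(n))          # C_{n,0}
    out = np.zeros_like(v)
    if method == "CHT":
        keep = v ** 2 > sigma2 * gamma / (gamma + lam) * c_n
    elif method == "HTU":
        keep = v ** 2 > sigma2 * c_n
    else:                                            # BHT
        exceed = np.nonzero(v ** 2 > sigma2 * c_n)[0]
        keep = np.zeros(n, dtype=bool)
        if exceed.size:
            keep[exceed[0]:] = True
    out[keep] = v[keep]
    return out
```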
4 Numerical Experiments

4.1 Choice of Regularization Parameter
CHT, HTU and BHT include no parameters to be adjusted except the regularization parameter. Since thresholding of orthogonal components yields a simple representation of a machine, the regularization parameter may not be significant in improving the generalization performance. To demonstrate this property of our methods, we examine, through a simple numerical experiment, the relationship between the generalization performance of trained machines and the regularization parameter value. The target function is h(x) = 5 sinc(8x) for x ∈ R. x_1, ..., x_n are randomly drawn from the interval [−1, 1]. We assume i.i.d. Gaussian noise with mean zero and variance σ² = 1. The basis functions are Gaussian basis functions defined by g_j(x) = exp{−(x − x_j)²/(2τ²)}, j = 1, ..., n, where we set τ² = 0.05. In this experiment, under a fixed value of the regularization parameter, we trained machines on 1000 sets of training data of size n. At each trial, the test error is measured by the mean squared error between the target function and the trained machine, calculated on 1000 equally spaced input points in [−1, 1]. Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively, which show the relationship between the averaged test errors of trained machines
544
K. Hagiwara
Fig. 1. The dependence of test errors on regularization parameters. (a) n = 200 and (b) n = 400. The filled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively.
and regularization parameter values. The filled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively, where the raw estimate is obtained by (6) at each fixed value of the regularization parameter. We do not show the result for HTU since it is almost the same as that for CHT. In these figures, we can see that the averaged test errors of our methods are almost unchanged for small values of the regularization parameter, while those of the raw estimates are sensitive to the regularization parameter value. We can also see that BHT is entirely superior to the raw estimate and CHT, while CHT is worse than the raw estimate around λ = 10^1 for both n = 200 and 400. In practical applications, the regularization parameter of the raw estimate should be determined based on training data; performance comparisons with the leave-one-out cross-validation method are shown below.

4.2 Comparison with LOOCV
We compare the performance of the proposed methods to that of the leave-one-out cross-validation (LOOCV) choice of the regularization parameter value. We examine not only generalization performance but also the computational time of each method. For the regularized estimator considered in this article, it is known that the LOOCV error can be calculated without training on validation sets [3,6]. We assume the same conditions as in the previous experiment. The CPU time is measured only for the estimation procedure. The experiments are conducted using Matlab on a computer with a 2.13 GHz Core2 CPU and 1 GByte of memory. Table 1 (a) and (b) show the averaged test errors and averaged CPU times of LOOCV and our methods respectively, in which the standard deviations (divided by 2) are also appended. The examined values for the regularization parameter in LOOCV are {m × 10^j : m = 1, 2, 5, j = −4, −3, ..., 3}. In our methods, the
Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

(a) Test errors

n     LOOCV           CHT             HTU             BHT
100   0.079 ± 0.027   0.101 ± 0.034   0.100 ± 0.034   0.076 ± 0.027
200   0.040 ± 0.013   0.046 ± 0.014   0.045 ± 0.014   0.035 ± 0.011
400   0.021 ± 0.006   0.023 ± 0.007   0.023 ± 0.007   0.017 ± 0.005

(b) CPU times

n     LOOCV           CHT             HTU             BHT
100   0.079 ± 0.002   0.013 ± 0.002   0.014 ± 0.002   0.014 ± 0.002
200   0.533 ± 0.003   0.080 ± 0.002   0.080 ± 0.002   0.080 ± 0.002
400   3.657 ± 0.003   0.523 ± 0.004   0.523 ± 0.004   0.523 ± 0.004
regularization parameter is fixed at 1 × 10^{-4}, the smallest value among the candidate values for LOOCV. Based on Table 1 (a), we first discuss the generalization performance of the methods. CHT and HTU are almost comparable. This implies that only the components corresponding to large eigenvalues contribute. CHT and HTU are entirely worse than LOOCV on average, although the differences are within the standard deviations for n = 200 and n = 400. BHT entirely outperforms LOOCV, CHT and HTU on average, although the difference between the averaged test errors of LOOCV and BHT is almost within the standard deviations. As pointed out previously, CHT and HTU may accidentally remove smooth components, since the threshold levels were determined based on a worst-case evaluation of the dispersion of the noise. The better generalization performance of BHT compared with CHT and HTU in Table 1 (a) is explained by this fact. On the other hand, as shown in Table 1 (b), our methods completely outperform LOOCV in terms of CPU time.
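The shortcut mentioned above — computing the LOOCV error without retraining on each leave-one-out split — can be sketched for a generic ridge-type estimator. The identity is standard; the plain ridge form and the variable names below are our assumptions, not the paper's exact estimator:

```python
import numpy as np

def loocv_error(G, y, lam):
    """Closed-form leave-one-out CV error for a regularized linear smoother.

    For the ridge-type estimator f = G w with w = (G'G + lam*I)^{-1} G' y,
    the leave-one-out residual at point i equals (y_i - f_i) / (1 - H_ii),
    where H = G (G'G + lam*I)^{-1} G' is the hat (smoother) matrix, so no
    retraining on the n leave-one-out splits is needed.
    """
    H = G @ np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T)
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)

def choose_lambda(G, y, grid):
    """Pick the grid value minimizing the closed-form LOOCV error."""
    return min(grid, key=lambda lam: loocv_error(G, y, lam))
```

A grid such as {m × 10^j : m = 1, 2, 5, j = −4, ..., 3} can be passed to `choose_lambda`; each candidate still costs one n×n solve, which is what makes LOOCV slower than the fixed-λ thresholding methods in Table 1 (b).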
5
Conclusions and Future Work
In this article, we proposed shrinkage methods for training a machine under a regularization method. The machine is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently the number of weights, is identical to the number of training data. In the regularized cost function, the error function is defined by the sum of squared errors and the regularization term is defined by a quadratic form of the weight vector. In the proposed shrinkage procedures, basis functions are orthogonalized by eigendecomposition of the Gram matrix of the vectors of basis function outputs. Then, the orthogonal components are kept or removed according to the proposed thresholding methods. The proposed methods are based on the statistical properties of regularized estimators of weights, which are derived by assuming i.i.d. Gaussian noise. The final weights are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators of the weights. We
proposed three versions of thresholding: component-wise hard thresholding, hard thresholding with the universal threshold level, and backward hard thresholding. Since the regularization parameter can be fixed at a small value in our methods, our methods are automatic. Additionally, since eigendecomposition algorithms are included in many software packages and the thresholding methods are simple, our methods are quite easy to implement. The numerical experiments showed that our methods achieve relatively good generalization capability in strictly less computational time compared with the LOOCV method. In particular, the backward hard thresholding method outperformed the LOOCV method on average in terms of generalization performance. As future work, we need to investigate the performance of our methods on real-world problems. Furthermore, we need to evaluate the generalization error when applying the proposed shrinkage methods.
References

1. Carter, C.K., Eagleson, G.K.: A comparison of variance estimators in nonparametric regression. J. R. Statist. Soc. B 54, 773–780 (1992)
2. Chen, S.: Local regularization assisted orthogonal least squares regression. Neurocomputing 69, 559–585 (2006)
3. Craven, P., Wahba, G.: Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403 (1979)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994)
6. Rifkin, R.: Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. thesis, MIT (2002)
7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996)
8. Suykens, J.A.K., Brabanter, J.D., Lukas, L., Vandewalle, J.: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48, 85–105 (2002)
9. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 682–688 (2001)
A Subspace Method Based on Data Generation Model with Class Information

Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
[email protected], [email protected], [email protected]
Abstract. Subspace methods have been used widely to reduce memory requirements and system complexity and to increase classification performance in pattern recognition and signal processing. We propose a new subspace method based on a data generation model with an intra-class factor and an extra-class factor. The extra-class factor is associated with the distribution across classes and is important for discriminating classes. The intra-class factor is associated with the distribution within a class and needs to be diminished to obtain high class separability. In the proposed method, we first estimate the intra-class factors and remove them from the original data. We then extract the extra-class factors by PCA. To verify the proposed method, we conducted computational experiments on real facial data and show that it gives better performance than conventional methods.
1
Introduction
Subspace methods aim at finding a low dimensional subspace which presents some meaningful information of the input data. They are widely used for high dimensional pattern classification, such as of image data, for two main reasons. First, by applying a subspace method, we can reduce memory requirements and system complexity. Second, we can expect to increase classification performance by eliminating useless information and emphasizing information essential for classification. The most popular subspace methods are PCA (Principal Component Analysis) [10,11,8] and FA (Factor Analysis) [6,14], which are based on data generation models. PCA finds a subspace of uncorrelated linear combinations (principal components) that retains as much of the information in the original variables as possible. However, PCA is an unsupervised method which does not use class information. This may cause some loss of information critical for classification. In contrast, LDA (Linear Discriminant Analysis) [1,4,5] is a supervised learning method which uses the target labels of the data set. LDA attempts to find basis vectors of a subspace maximizing the linear class separability. It is generally known that LDA can give better classification performance than PCA by using class information. However, LDA gives
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 547–555, 2008. © Springer-Verlag Berlin Heidelberg 2008
at most k−1 basis vectors of the subspace for k classes, and cannot extract features stably for data sets with a limited number of samples in each class. Another subspace method for classification is the intra-person space method [9], which was developed for face recognition. The intra-person space is defined by differences between two facial images of the same person. For dimension reduction, a low dimensional eigenspace is obtained by applying PCA to the intra-person space. In classification tasks, raw input data are projected onto the intra-personal eigenspace to get low dimensional features. The intra-person method showed better performance than PCA and LDA on the FERET data [9]. However, it is not based on a data generation model and cannot give a sound theoretical reason why the intra-person space provides good information for classification. On the other hand, a data generation model with class information has recently been developed [12]. It is a variant of the factor analysis model with two types of factors: a class factor (what we call the extra-class factor) and an environment factor (what we call the intra-class factor). Based on the data generation model, the intra-class factor is estimated by using difference vectors between two data points in the same class. The estimated probability distribution of the intra-class factor is applied to measuring the similarity of data for classification. Though this method takes an approach similar to the intra-person method in the sense that it uses difference vectors to get the intra-class information and a similarity measure, it is based on a data generation model which can explain the developed similarity measure. Still, it does not include a dimension reduction process, and another subspace method is necessary for high-dimensional data. In this paper, we propose an appropriate subspace method for the data generation model developed in [12].
The proposed method finds a subspace which reduces the effect of the intra-class factors and enlarges the effect of the extra-class factors based on the data generation model. In Section 2, the model is explained in detail.
2
Data Generation Model
Before defining the data generation model, consider obtaining several images (i.e., data) from different persons (i.e., classes). We know that pictures of different persons are obviously different. We also know that pictures of the same person are not exactly the same, due to environmental conditions such as illumination. Therefore, it is natural to assume that a data point consists of an intra-class factor, which represents within-class variations such as the variation among pictures of the same person, and an extra-class factor, which represents between-class variations such as the differences between two persons. Under this condition, a random variable x for observed data can be written as a function of two distinct random variables ξ and η of the form x = f(ξ, η),
(1)
where ξ represents an extra-class factor, which keeps some unique information in each class, and η represents an intra-class factor which represents environmental
variation in the same class. In [12], it was assumed that η keeps the information of any variation within a class and that its distribution is common to all classes. In order to define the generation model explicitly, a linear additive factor model was applied, such as x_i = W ξ_i + η.
(2)
This means that a random sample x_i in class C_i is generated by the summation of a linearly transformed class prototype ξ_i and a random class-independent variation η. In this paper, as an extension of the model, we assume that a low dimensional intra-class factor is also linearly transformed to generate an observed data point x. Therefore, the function f is defined as x_i = W ξ_i + V η_i.
(3)
This model implies that a data point of a specific class is generated by the extra-class factor ξ, which gives some discriminative information among classes, and the intra-class factor η, which represents some variation within a class. In this equation, W and V are the transformation matrices of the corresponding factors. We call W the extra-class factor loading and V the intra-class factor loading. Figure 1 illustrates this data generation model. To find a good subspace for classification, we try to find V and W using the class information given with the input data. In Section 3, we explain how to find the subspace based on the data generation model.
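To make the generative model concrete, the following sketch draws samples from equation (3). All dimensions, loadings and factor distributions here are illustrative assumptions of ours, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q_extra, q_intra, K, N = 10, 3, 2, 4, 50   # illustrative sizes

W = rng.normal(size=(d, q_extra))   # extra-class factor loading
V = rng.normal(size=(d, q_intra))   # intra-class factor loading

X, labels = [], []
for k in range(K):
    xi_k = rng.normal(size=q_extra)                 # extra-class factor: one per class
    for _ in range(N):
        eta = rng.normal(scale=0.5, size=q_intra)   # intra-class factor: one per sample
        X.append(W @ xi_k + V @ eta)                # eq. (3): x_i = W xi_i + V eta_i
        labels.append(k)
X, labels = np.asarray(X), np.asarray(labels)
```

Samples of the same class share the term W ξ_k and differ only through V η, which is exactly the structure the subspace method of Section 3 exploits.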
Fig. 1. The proposed data generation model
3
Factor Analysis Based on Data Generation Model
For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method may appear similar to the traditional LDA method. LDA finds a projection matrix which simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. In contrast, the proposed method first estimates the intra-class information from the set of difference vectors within each class, and then excludes the intra-class information from the original data to keep the extra-class information. Therefore, compared to LDA, the proposed method does
not need to compute an inverse of the within-class scatter matrix, and the number of basis vectors of the subspace does not depend on the number of classes. In addition, the subspaces are simple to obtain, and many variants of the proposed method can be developed by exploiting various data generation models. In this section, we describe in detail how to obtain the subspace based on the simple linear generative model.

3.1 Intra-class Factor Loading
We first find the projection matrix Λ which represents the intra-class factor, instead of the intra-class factor loading matrix V, to obtain the intra-class information. From the given data set {x}, we calculate the difference vector δ between two data points from the same class, which can be written as δ_{kij} = x_{ki} − x_{kj} = (W ξ_{ki} − W ξ_{kj}) + (V η_{ki} − V η_{kj}),
(4)
where x_{ki} and x_{kj} are data from a class C_k (k = 1, ..., K). Because x_{ki} and x_{kj} come from the same class, we can assume that the extra-class factor does not make much difference, and we ignore the first term W ξ_{ki} − W ξ_{kj}. Then we obtain a new approximate relationship δ_{kij} ≈ V (η_{ki} − η_{kj}).
(5)
Based on this relationship, we try to find the factor loading matrix V. For the obtained data set Δ = {δ_{kij}}_{k=1,...,K, i=1,...,N, j=1,...,N},
(6)
where K is the number of classes and N is the number of data points in each class, we apply PCA to obtain the principal components of Δ. The obtained matrix Λ of principal components of Δ gives the subspace which maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y^intra, such that Y^intra = XΛ.
(7)
Note that Y^intra is a low dimensional data set that contains the intra-class information of X. Using Y^intra, we reconstruct X^intra in the original dimension by the calculation X^intra = Y^intra Λ^T (ΛΛ^T)^{−1}.
(8)
Note that X^intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X^intra from the original data set X. As a result, we get a new data set X̃ such that X̃ = X − X^intra.
(9)
3.2 Extra-Class Factor Loading
Using the newly obtained data set X̃, we try to find the projection matrix Λ̃ which represents the extra-class factor, instead of the extra-class factor loading matrix W, to preserve the extra-class information as much as possible. To solve this problem, let us consider the data set X^intra. Noting that a data point x^intra in the data set X^intra is a reconstruction from the intra-class factor, we can write the approximate relationship x^intra ≈ V η.
(10)
By combining it with equation (3), the newly obtained data point x̃ in X̃ can be rewritten as x̃ ≈ x − V η = W ξ.
(11)
From this, we can say that the new data set X̃ mainly carries the extra-class information, and thus we need to preserve this information as much as possible. From these considerations, we apply PCA to the new data set X̃ and obtain the principal component matrix Λ̃ of X̃. By projecting X̃ onto the basis vectors such that X̃ Λ̃ = Z̃,
(12)
we can obtain the data set Z̃, which has small intra-class variance and large extra-class variance. The obtained data set is used for classification.
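The two-stage procedure of Sections 3.1 and 3.2 can be sketched end to end. This is a minimal NumPy illustration under our own simplifying assumptions (data stored row-wise, PCA via plain eigendecomposition, orthonormal bases so that the pseudo-inverse in eq. (8) reduces to a transpose, and subspace dimensions chosen by hand):

```python
import numpy as np

def pca_basis(X, m):
    """Top-m principal directions (orthonormal columns) of row-wise data X."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # ascending eigenvalues
    return eigvecs[:, ::-1][:, :m]

def fit_subspaces(X, labels, m_intra, m_extra):
    """Two-stage subspace method of Sections 3.1-3.2, sketched.

    1. PCA on within-class difference vectors -> intra-class basis Lambda (eqs. 4-6).
    2. Project, reconstruct and subtract the intra-class part (eqs. 7-9).
    3. PCA on the cleaned data -> extra-class basis Lambda_tilde (eq. 12).
    """
    labels = np.asarray(labels)
    deltas = []
    for k in np.unique(labels):
        Xk = X[labels == k]
        for i in range(len(Xk)):
            for j in range(i + 1, len(Xk)):
                deltas.append(Xk[i] - Xk[j])        # delta_kij = x_ki - x_kj
    Lam = pca_basis(np.asarray(deltas), m_intra)    # intra-class factor subspace
    # for an orthonormal Lam, the pseudo-inverse in eq. (8) is just Lam.T
    X_tilde = X - (X @ Lam) @ Lam.T                 # eq. (9): remove intra-class part
    Lam_tilde = pca_basis(X_tilde, m_extra)         # extra-class factor subspace
    return Lam, Lam_tilde

def transform(X, Lam, Lam_tilde):
    """Low dimensional features Z_tilde used for classification (eq. 12)."""
    return (X - (X @ Lam) @ Lam.T) @ Lam_tilde
```

New test points are mapped with `transform` using the fitted bases and then classified, e.g., by the minimum distance rule used in the experiments below.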
4
Experimental Results
To verify the proposed method, we performed comparison experiments on facial data sets against conventional methods: PCA, LDA and intra-person. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the dimension of the subspace with respect to the classification rate for each data set.

4.1 Face Recognition
We first conducted a facial recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database (http://www.itl.nist.gov/iad/humanid/feret/). Figure 2 shows some samples of the data set. We used 450 images from 50 subjects; each subject has 9 images of different poses, in 15 degree steps to the left and right. The left, right (±60 degrees) and frontal images are used for training and the remaining 300 images are used for testing. The size of each image is 50 × 70, thus the dimension of the raw input is 3500. For the LDA method, we first applied PCA to solve the small sample set problem and obtained 52 dimensional features. We then applied LDA and obtained 9 dimensional features. Similarly, in the proposed method, we first
Fig. 2. Examples of the human face images with different viewpoints

Table 1. Result on face image data

Method        Dimension  Classification Rate
PCA           117        97
LDA           9          99.66
Intra-Person  92         92.33
Proposed      8          100
find an 83 dimensional subspace for the intra-class factor and an 8 dimensional subspace for the extra-class factor. In this case, there are 50 classes and the number of data points in each class is very limited: 3 per class. The experimental results are shown in Table 1. The performance of the proposed method is perfect, and the other methods also give generally good results. Despite the 50 classes and the limited number of training data, the good results may be due to the intrinsically high between-class variation.

4.2 Pose Recognition
We conducted a pose recognition task with the same data set used in Section 4.1. We thus used 450 images from 9 viewpoint classes, from −60° to 60° in 15° intervals; each class consists of 50 images from different persons. 225 images, 25 per class, are used for training and the remaining 225 images are used for testing. For the LDA method, we first applied PCA and obtained 167 dimensional features. We then applied LDA
Table 2. Result on the human pose image data

Method        Dimension  Classification Rate
PCA           65         36.44
LDA           6          57.78
Intra-Person  51         38.67
Proposed      21         58.22
and obtained 6 dimensional features. For the proposed method, we first find a 128 dimensional subspace for the intra-class factor and a 21 dimensional subspace for the extra-class factor. In this case, there are 9 classes and 225 training images, 25 from each class. The results are shown in Table 2. The performance is generally low, but the proposed method and LDA give much better performance than PCA and the intra-person method. From the low performance, we can conjecture that the between-class variance is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.

4.3 Facial Expression Recognition
We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling) (http://pics.psych.stir.ac.uk/). Figure 3 shows sample facial expression images. We obtained 276 images from 69 persons; each person has 4 images of different expressions. 80 images, 20 per expression, are used for training and the remaining 196 images are used for testing. The size of each image is 80 × 90, thus the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59 dimensional features. We then applied LDA and obtained 3 dimensional features. For the proposed method, we first find a 48 dimensional subspace for the intra-class factor and a 14 dimensional subspace for the extra-class factor. In this case, there are 4 classes and 20 training images in each class. Although the performances of all methods are generally low, the proposed method performed much better than PCA and the intra-person method. As in pose recognition, it also seems that the between-class variance

Table 3. Result on facial expression image data

Method        Dimension  Classification Rate
PCA           65         35.71
LDA           3          65.31
Intra-Person  76         42.35
Proposed      14         66.33
Fig. 3. Examples of the human facial expression images
is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.
5
Conclusions and Discussions
In this paper, we proposed a new subspace method based on a data generation model with class information, represented by intra-class and extra-class factors. By removing the intra-class information from the original data and keeping the extra-class information using PCA, we obtain low dimensional features that preserve essential information for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As further study, it may be possible to find a more sophisticated dimension reduction than PCA which can enlarge the extra-class information. Also, kernel methods could be applied to overcome non-linearity problems.
Acknowledgements This work was supported by the Korea Research Foundation Grant funded by the Korean Government(MOEHRD) (KRF-2006-311-D00807).
References

1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
2. Alpaydin, E.: Machine Learning. MIT Press, Cambridge (2004)
3. Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems, 487–493 (1998)
4. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8, 376–386 (1938)
5. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
6. Hinton, G.E., Zemel, R.S.: Autoencoders, Minimum Description Length and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6, 3–10 (1994)
7. Hinton, G.E., Ghahramani, Z.: Generative Models for Discovering Sparse Distributed Representations. Philosophical Transactions of the Royal Society B 352, 1177–1190 (1997)
8. Lee, O., Park, H., Choi, S.: PCA vs. ICA for Face Recognition. In: The 2000 International Technical Conference on Circuits/Systems, Computers, and Communications, pp. 873–876 (2000)
9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Modeling of Facial Similarity. Advances in Neural Information Processing Systems, 910–916 (1998)
10. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
11. Martinez, A., Kak, A.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
12. Park, H., Cho, M.: Classification of Bio-data with Small Data Set Using Additive Factor Model and SVM. In: Hoffmann, A., Kang, B.-h., Richards, D., Tsumoto, S. (eds.) PKAW 2006. LNCS (LNAI), vol. 4303, pp. 770–779. Springer, Heidelberg (2006)
13. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, pp. 539–546 (2005)
14. Ghahramani, Z.: Factorial Learning and the EM Algorithm. In: Advances in Neural Information Processing Systems, vol. 7, pp. 617–624 (1995)
Hierarchical Feature Extraction for Compact Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen

Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de
Abstract. Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that solely depend on global properties of the overall dataset. We here present a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering. A hierarchy of features extracted from a dataset thereby constitutes a compact representation of the set that, on the one hand, can be used to characterize and understand the data and, on the other hand, serves as a basis to classify or cluster a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording.
1
Introduction
The vast majority of feature extraction methods does not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that solely depend on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, might certainly help to characterize and understand the information contained in the data. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal “state” changes) and the variability of features make data analysis difficult. Hierarchical feature extraction is so far mainly related to 2-D pattern analysis. In these approaches, pioneered by Fukushima’s work on the Neocognitron [3], the hierarchical structure is typically a priori hard-wired in the architecture and the methods primarily apply to a 2-D grid structure. There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that derive also the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 556–565, 2008. c Springer-Verlag Berlin Heidelberg 2008
their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method. In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which thus might hamper both approaches, in particular if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which obviously is limited to 2-D or 3-D projections, but rather on extracting the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classifying sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers. The paper is organized as follows. In section 2, we describe the hierarchical feature extraction method.
In section 3, we show how feature hierarchies can be used for classification and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.
2 Hierarchical Feature Extraction
We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.

2.1 Hierarchical Clustering
There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to
M. Schubert and J. Kohlmorgen
determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result, and it implicitly encodes what a “good” cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage. They differ simply in that they use different dissimilarity functions [12]. We here use Ward’s method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in the total sum-of-squared-errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in square-error caused by merging two clusters, Di and Dj, is given by

d(Di, Dj) = (ni nj / (ni + nj)) ‖mi − mj‖²,  (1)

where ni and nj are the number of points in each cluster, and mi and mj are the means of the points in each cluster [12]. Ward’s method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward’s criterion because it is based on a global fitness criterion (SSE) and because it is reported in [11] that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well.

The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it defines a partition of the given n samples into k clusters. The leaf node level consists of n nodes describing a partition into n clusters, where each cluster/node contains exactly one sample. Each hierarchy level further up contains one node with edges to the two child nodes that correspond to the clusters that have been merged.
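Eq. (1) is straightforward to evaluate directly. A minimal sketch in Python (`ward_increase` is our illustrative name, not from the paper):

```python
import numpy as np

def ward_increase(Di, Dj):
    """Increase in total SSE caused by merging clusters Di and Dj, Eq. (1)."""
    ni, nj = len(Di), len(Dj)
    mi, mj = Di.mean(axis=0), Dj.mean(axis=0)
    return ni * nj / (ni + nj) * np.sum((mi - mj) ** 2)

# Merging distant clusters costs more than merging nearby ones.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.5, 1.0], [0.5, 2.0]])
C = np.array([[10.0, 10.0], [11.0, 10.0]])
assert ward_increase(A, C) > ward_increase(A, B)
```

An agglomerative pass then simply merges, at each step, the pair of current clusters with the smallest value of this function.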
The tree can be depicted graphically as a dendrogram, which aligns the leaf nodes along the horizontal axis and connects them by lines to the higher-level nodes along the vertical axis. The position of the nodes along the vertical axis could in principle correspond linearly to the hierarchy level k. This, however, would reveal almost nothing of the structure in the data. Most of the structural information is actually contained in the dissimilarity values. One therefore usually positions the node at level k vertically with respect to the dissimilarity value of its two corresponding child clusters, Di and Dj,

δ(k) = d(Di, Dj).  (2)
For k = n, there are no child clusters, and therefore δ(n) = 0 [11]. The function δ can be regarded as a within-cluster dissimilarity. By using δ as the vertical scale in a dendrogram, a large gap between two levels, for example k and k + 1, means that two very dissimilar clusters have been merged at level k.

2.2 Extracting a Tree of Significant Clusters
As we have seen in the previous subsection, a hierarchical clustering algorithm always generates a tree containing n − 1 non-singleton clusters. This does not
necessarily mean that any of these clusters is clearly separated from the rest of the data or that there is any structure in the data at all. The identification of clearly separated clusters is usually done by visual inspection of the dendrogram, i.e., by identifying large gaps. For an automatic detection of significant clusters, we use the following straightforward criterion:

δ(parent(k)) / δ(k) > α,  for 1 < k < n,  (3)
where parent(k) is the parent cluster level of the cluster obtained at level k and α is a significance threshold. If a cluster at level k is merged into a cluster that has a within-cluster dissimilarity more than α times higher than that of cluster k, we call cluster k a significant cluster. That means that cluster k is significantly more compact than its merger (in the sense of the dissimilarity function). Note that this does not necessarily mean that the sibling of cluster k is also a significant cluster, as it might have a higher dissimilarity value than cluster k. The criterion directly corresponds to the relative increase of the dissimilarity value in a dendrogram from one merger level to the next.

For small clusters that contain only a few points, the relative increase in dissimilarity can be large just because of the small sample size. To avoid that such clusters are detected as significant, we require a minimum cluster size M for significant clusters. After having identified the significant clusters in the binary cluster tree, we can extract the tree of significant clusters simply by linking each significant cluster node to the next highest significant node in the tree, or, if there is none, to the root node (which is just for the convenience of getting a tree and not a forest). The tree of significant clusters is generally much smaller than the original tree and is not necessarily a binary tree anymore. Also note that there might be data points that are not in any significant cluster, e.g., outliers.

The criterion in (3) is somewhat related to the criterion in [14], which is used to take out clusters from the merging process in order to obtain a plain, non-hierarchical clustering. The criterion in [14] accounts for the relative change of the absolute dissimilarity increments, which seems to be somewhat less intuitive and unnecessarily complicated. That criterion might also be overly sensitive to small variations in the dissimilarities.
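The criterion (3), together with the minimum cluster size M, can be applied to any precomputed linkage. A sketch using SciPy (`significant_clusters` is our illustrative helper; note that SciPy's 'ward' merge heights are on a square-root scale relative to Eq. (1), so the threshold is not directly comparable to the α used in the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def significant_clusters(Z, n, alpha=3.0, min_size=5):
    """Internal nodes of a linkage Z whose parent's merge height exceeds
    alpha times their own height (criterion (3)) and that contain at
    least min_size points."""
    parent_height = np.zeros(2 * n - 1)          # merge height of each node's parent
    for a, b, h, _ in Z:
        parent_height[int(a)] = parent_height[int(b)] = h
    size = np.ones(2 * n - 1)
    size[n:] = Z[:, 3]                           # cluster sizes of internal nodes
    sig = []
    for node in range(n, 2 * n - 2):             # internal nodes, root excluded
        if size[node] >= min_size and parent_height[node] / Z[node - n, 2] > alpha:
            sig.append(node)
    return sig

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
Z = linkage(X, method="ward")
sig = significant_clusters(Z, len(X))
assert int(Z[-1][0]) in sig and int(Z[-1][1]) in sig   # the two blobs are found
```

On the two well-separated blobs, the two children of the root (one per blob) satisfy the criterion, since the final merge is far more expensive than any within-blob merge.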
2.3 Obtaining a Tree of Features
To obtain a representation of the original dataset in terms of a tree of features, we can now apply any standard feature extraction method to the data points in each significant cluster in the tree and then replace the data points in the cluster by their corresponding features. For PCA, for example, the data points in each significant cluster are replaced by their mean vector and the desired number of principal components, i.e., the eigenvectors and eigenvalues of the covariance matrix of the data points. The obtained hierarchy of features thus constitutes
a compact representation of the dataset that does not contain the individual data points anymore, which can save a considerable amount of memory. This representation is also independent of the size of the dataset. The hierarchy can on the one hand be used to analyze and understand the structure of the data, on the other hand – as we will further explain in the next section – it can be used to perform classification or clustering in cases where the individual input quantity to be classified (or clustered) is an entire dataset and not, as usual, a single data point.
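With PCA as the core method, replacing a cluster by its mean and leading principal components amounts to a small eigendecomposition per cluster. A minimal sketch (`cluster_features` is our name for this step):

```python
import numpy as np

def cluster_features(points, n_components=2):
    """Replace a cluster by its mean and leading principal components
    (eigenvectors/eigenvalues of the cluster's covariance matrix)."""
    mean = points.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(points, rowvar=False))
    order = np.argsort(evals)[::-1][:n_components]   # largest variance first
    return mean, evals[order], evecs[:, order]

rng = np.random.default_rng(1)
pts = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])  # anisotropic cluster
mean, lam, W = cluster_features(pts)
assert lam[0] >= lam[1] and W.shape == (3, 2)
```

Applying this to every significant cluster, and storing the results at the corresponding tree nodes, yields the tree of features.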
3 Classification of Feature Trees
The classification problem that we address here is not the well-known problem of classifying individual data points or vectors. Instead, it relates to the classification of objects that are sets of data points, for example, time series. Given a “training set” of such objects, i.e., a number of datasets, each one attached with a certain class label, the problem consists in assigning one class label to each new, unlabeled dataset. This can be accomplished by transforming each individual dataset into a tree of features and by defining a suitable distance function to compare each pair of trees. For example, trees of principal components can be regarded as (hierarchical) mixtures of Gaussians, since the principal components of each node in the tree (the eigenvectors and eigenvalues) describe a normal distribution, which is an approximation to the true distribution of the underlying data points in the corresponding significant cluster. Two mixtures (sums) of Gaussians, f and g, corresponding to two trees of principal components (of two datasets), can be compared, e.g., by using the squared L2-norm as distance function, which is also called the integrated squared error (ISE),

ISE(f, g) = ∫ (f − g)² dx.  (4)

The ISE has the advantage that the integral is analytically tractable for mixtures of Gaussians. Note that the computation of a tree of principal components, as described in the previous section, is in itself an interesting way to obtain a mixture-of-Gaussians representation of a dataset: without the need to specify the number of components in advance and without the need to run a maximum likelihood (gradient ascent) algorithm like, for example, expectation–maximization [15], which is prone to get stuck in local optima.

Having obtained a distance function on feature trees, the next step is to choose a classification method that only requires pairwise distances to classify the trees (and their corresponding datasets).
A particularly simple method is first-nearest-neighbor (1-NN) classification. For 1-NN classification, the tree of a test dataset is assigned the label of the nearest tree of a collection of trees that were generated from a labeled “training set” of datasets. If the generated trees are sufficiently different among the classes, first- (or k-) nearest-neighbor
classification can already be sufficient to obtain a good classification result, as we demonstrate in the next section. In addition to classification, the distance function on feature trees can also be used to cluster a collection of datasets by clustering their corresponding trees. Any clustering algorithm that uses pairwise distances can be used for this purpose [11, 12]. In this way it is possible to identify homogeneous groups of datasets.
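For Gaussian mixtures, the ISE of Eq. (4) can be evaluated in closed form, since the product of two Gaussian densities integrates to a Gaussian density evaluated at the difference of the means. A one-dimensional sketch (the paper's trees yield multivariate mixtures; `ise` and the triple-list format are our illustrative choices):

```python
import numpy as np

def gauss_overlap(m1, v1, m2, v2):
    """Closed form of the integral of N(x; m1, v1) N(x; m2, v2) over x,
    which equals N(m1; m2, v1 + v2), in 1-D."""
    v = v1 + v2
    return np.exp(-0.5 * (m1 - m2) ** 2 / v) / np.sqrt(2 * np.pi * v)

def ise(f, g):
    """Integrated squared error (Eq. 4) between two 1-D Gaussian mixtures,
    each given as a list of (weight, mean, variance) triples."""
    def cross(p, q):
        return sum(wp * wq * gauss_overlap(mp, vp, mq, vq)
                   for wp, mp, vp in p for wq, mq, vq in q)
    return cross(f, f) + cross(g, g) - 2.0 * cross(f, g)

f = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
assert ise(f, f) < 1e-12                 # identical mixtures: zero distance
assert ise(f, [(1.0, 5.0, 1.0)]) > 0.0   # distant mixture: positive distance
```

The multivariate case follows the same pattern with the covariance matrices summed inside the Gaussian overlap term; 1-NN classification then only needs these pairwise values.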
4 Applications

4.1 Mixtures of Gaussians
As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness. From three classes of Gaussian mixture distributions, which are exemplarily shown in Fig. 1(a)-(c), we generated 10 training samples for each class, which constitute the training set, and a total of 100 test samples constituting the test set. Each sample contains 540 data points. The mixture distribution of each test sample was chosen with equal probability from one of the three classes. Next, we generated the binary cluster tree from each sample using Ward’s criterion. Examples of the corresponding dendrograms for each class are shown in Fig. 1(d)-(f) (in gray). We then determined the significant clusters in each tree, using the significance factor α = 3 and the minimum cluster size M = 40. In Fig. 1(d)-(f), the significant clusters are depicted as black dots and the extracted trees of significant clusters are shown by means of thick black lines. The cluster of each node in a tree of significant clusters was then replaced by the principal components obtained from the data in the cluster, which turns the tree of clusters into a tree of features. In Fig. 1(g)-(i), the PCA components of all significant clusters are shown for the three example datasets from Fig. 1(a)-(c).

Finally, we classified the feature trees obtained from the test samples, using the integrated squared error (Eq. (4)) and first-nearest-neighbor classification. We obtained a nearly perfect accuracy of 98% correct classifications (i.e., only two misclassifications), which can largely be attributed to the circumstance that the structural differences between the classes were correctly exposed in the tree structures. This result demonstrates that an appropriate representation of the data can make the classification problem very simple.

4.2 Clinical EEG
To demonstrate the applicability of our approach to real-world data, we used a clinical recording of human EEG. The recording was carried out in order to screen for pathological features, in particular the disposedness to epilepsy. The subject went through a number of experimental conditions: eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and, finally, a stimulation with stroboscopic light of increasing frequency (PO: photic on).
Fig. 1. (a)-(c) Example datasets for the three types of mixture distributions used in the application. (d)-(f) The corresponding dendrograms for each example dataset (gray) and the extracted trees of significant clusters (black). Note that the extracted tree structure exactly corresponds to the structure in the data. (g)-(i) The PCA components of all significant clusters. The components are contained in the tree of features.
During the photic phase, the subject kept the eyes closed, while the rate of light flashes was increased every four seconds in steps of 1 Hz, from 5 Hz to 25 Hz. The obtained recording was subdivided into 507 epochs of fixed length (1s). For each epoch, we extracted four features that correspond to the power in
[Fig. 2 shows a dendrogram (vertical axis: dissimilarity, ranging from 0 to 25) with six labeled sub-clusters: 82% (EC), 69% (PHV), 92% (EO), 88% (PO), 76% (HV), 90% (HV).]
Fig. 2. The tree of significant clusters (black), obtained from the underlying dendrogram (gray) for the EEG data. The data in each significant sub-cluster largely corresponds to one of the experimental conditions (indicated in %): eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and ‘photic on’ (PO).
specific frequency bands of particular EEG electrodes.¹ The resulting set of four-dimensional feature vectors was then analyzed by our method. For the hierarchical clustering, we used Ward’s method and found the significant clusters depicted in Fig. 2. The extracted tree of significant clusters consists of a two-level hierarchy. As expected, the majority of feature vectors in each sub-cluster corresponds to one of the experimental conditions. By applying PCA to each sub-cluster and replacing the data of each node with its principal components, we obtain a tree of features, which constitutes a compact representation of the original dataset. It can then be used for comparison with trees that arise from normal or various kinds of pathological EEG, as outlined in Section 3.
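Band-power features of this kind can be computed, for instance, from a plain periodogram per epoch. A generic sketch (the paper does not specify the estimator; `band_power` and the sampling rate are our assumptions):

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Periodogram power of the 1-D signal x in the band [f_lo, f_hi] Hz."""
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[band].sum()

fs = 256                                    # assumed sampling rate
t = np.arange(fs) / fs                      # one 1-second epoch
x = np.sin(2 * np.pi * 10 * t)              # pure 10 Hz "alpha" oscillation
assert band_power(x, fs, 8, 12) > band_power(x, fs, 25, 80)
```

Computing such a value for each of the four band/electrode combinations of the footnote, for every 1 s epoch, yields the four-dimensional feature vectors analyzed above.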
5 Discussion

We proposed a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering. The feasibility of this approach was demonstrated with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. In this paper we focused on PCA as the core feature extraction method. Other types of feature extraction, e.g., ICA, are also conceivable; they should then be complemented with an appropriate distance function on the feature trees (if used for classification or clustering).

The basis of the proposed approach is hierarchical clustering. The quality of the resulting feature hierarchies thus depends on the quality of the clustering. Ward’s criterion tends to find compact, hyperspherical clusters, which may not always be the optimal choice for a given problem. One should therefore consider adjusting the clustering criterion to the problem at hand. Our future work will focus on the application of this method to classifying normal and pathological EEG. By comparing the different tree structures, we hope to gain a better understanding of the pathological cases.

Acknowledgements. This work was funded by the German BMBF under grant 01GQ0415 and supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

¹ In detail: (I.) the power of the α-band (8–12 Hz) at the electrode positions O1 and O2 (according to the international 10–20 system), (II.) the power of 5 Hz and its harmonics (except 50 Hz) at electrode F4, (III.) the power of 6 Hz and its harmonics at electrode F8, and (IV.) the power of the 25–80 Hz band at F7.
References

[1] Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)
[2] Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, Chichester (2001)
[3] Fukushima, K.: Neural network model for a mechanism of pattern recognition unaffected by shift in position — neocognitron. Transactions IECE 62-A(10), 658–665 (1979)
[4] Bregler, C., Omohundro, S.: Surface learning with applications to lipreading. In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6, pp. 43–50. Morgan Kaufmann Publishers, San Mateo (1994)
[5] Bach, F., Jordan, M.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003)
[6] Bishop, C., Tipping, M.: A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293 (1998)
[7] Wang, Y., Luo, L., Freedman, M., Kung, S.: Probabilistic principal component subspaces: A hierarchical finite mixture model for data visualization. IEEE Transactions on Neural Networks 11(3), 625–636 (2000)
[8] Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B 61(3), 611–622 (1999)
[9] Kondor, R., Jebara, T.: A kernel between sets of vectors. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the ICML, pp. 361–368. AAAI Press, Menlo Park (2003)
[10] Desobry, F., Davy, M., Fitzgerald, W.: A class of kernels for sets of vectors. In: Proceedings of the ESANN, pp. 461–466 (2005)
[11] Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
[12] Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley–Interscience, Chichester (2000)
[13] Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244 (1963)
[14] Fred, A., Leitao, J.: Clustering under a hypothesis of smooth dissimilarity increments. In: Proceedings of the ICPR, vol. 2, pp. 190–194 (2000)
[15] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1–38 (1977)
Principal Component Analysis for Sparse High-Dimensional Data Tapani Raiko, Alexander Ilin, and Juha Karhunen Adaptive Informatics Research Center, Helsinki Univ. of Technology P.O. Box 5400, FI-02015 TKK, Finland {Tapani.Raiko,Alexander.Ilin,Juha.Karhunen}@tkk.fi http://www.cis.hut.fi/projects/bayes/
Abstract. Principal component analysis (PCA) is a widely used technique for data analysis and dimensionality reduction. Eigenvalue decomposition is the standard algorithm for solving PCA, but a number of other algorithms have been proposed. For instance, the EM algorithm is much more efficient in the case of high dimensionality and a small number of principal components. We study a case where the data are high-dimensional and a majority of the values are missing. In this case, both of these algorithms turn out to be inadequate. We propose using a gradient descent algorithm inspired by Oja’s rule, and speeding it up by an approximate Newton’s method. The computational complexity of the proposed method is linear with respect to the number of observed values in the data and to the number of principal components. In the experiments with Netflix data, the proposed algorithm is about ten times faster than any of the four comparison methods.
1 Introduction
Principal component analysis (PCA) [1,2,3,4,5,6] is a classic technique in data analysis. It can be used for compressing higher-dimensional data sets to lower-dimensional ones for data analysis, visualization, feature extraction, or data compression. PCA can be derived from a number of starting points and optimization criteria [2,3,4]. The most important of these are minimization of the mean-square error in data compression, finding mutually orthogonal directions in the data having maximal variances, and decorrelation of the data using orthogonal transformations [5].

While standard PCA is a very well-established linear statistical technique based on second-order statistics (covariances), it has recently been extended in various directions and considered from novel viewpoints. For example, various adaptive algorithms for PCA have been considered and reviewed in [4,6]. Fairly recently, PCA was shown to emerge as a maximum likelihood solution from a probabilistic latent variable model independently by several authors; see [3] for a discussion and references.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 566–575, 2008. © Springer-Verlag Berlin Heidelberg 2008

In this paper, we study PCA in the case where most of the data values are missing (or unknown). Common algorithms for solving PCA prove to be
inadequate in this case, and we thus propose a new algorithm. The problem of overfitting and possible solutions are also outlined.
2 Algorithms for Principal Component Analysis
Principal subspace and components. Assume that we have n d-dimensional data vectors x_1, x_2, ..., x_n, which form the d × n data matrix X = [x_1, x_2, ..., x_n]. The matrix X is decomposed into

X ≈ AS,  (1)

where A is a d × c matrix, S is a c × n matrix, and c ≤ d ≤ n. Principal subspace methods [6,4] find A and S such that the reconstruction error

C = ‖X − AS‖²_F = Σ_{i=1}^{d} Σ_{j=1}^{n} ( x_{ij} − Σ_{k=1}^{c} a_{ik} s_{kj} )²  (2)
is minimized. Here ‖·‖_F denotes the Frobenius norm, and x_{ij}, a_{ik}, and s_{kj} are elements of the matrices X, A, and S, respectively. Typically the row-wise mean is removed from X as a preprocessing step. Without any further constraints, there exist infinitely many ways to perform such a decomposition. However, the subspace spanned by the column vectors of the matrix A, called the principal subspace, is unique. In PCA, these vectors are mutually orthogonal and have unit length. Further, for each k = 1, ..., c, the first k vectors form the k-dimensional principal subspace. This makes the solution practically unique; see [4,2,5] for details.

There are many ways to determine the principal subspace and components [6,4,2]. We will discuss three common methods that can be adapted for the case of missing values.

Singular Value Decomposition. PCA can be determined by using the singular value decomposition (SVD) [5]

X = U Σ V^T,  (3)
where U is a d × d orthogonal matrix, V is an n × n orthogonal matrix, and Σ is a d × n pseudodiagonal matrix (diagonal if d = n) with the singular values on the main diagonal [5]. The PCA solution is obtained by selecting the c largest singular values from Σ, by forming A from the corresponding c columns of U, and S from the corresponding c rows of ΣV^T. Note that PCA can equivalently be defined using the eigendecomposition of the d × d covariance matrix C of the column vectors of the data matrix X:

C = (1/n) X X^T = U D U^T.  (4)
Here, the diagonal matrix D contains the eigenvalues of C, and the columns of the matrix U contain the unit-length eigenvectors of C in the same order
[6,4,2,5]. Again, the columns of U corresponding to the largest eigenvalues are taken as A, and S is computed as A^T X. This approach can be more efficient for cases where d ≪ n, since it avoids forming the n × n matrix.

EM Algorithm. The EM algorithm for solving PCA [7] iterates updating A and S alternately.¹ When either of these matrices is fixed, the other one can be obtained from an ordinary least-squares problem. The algorithm alternates between the updates

S ← (A^T A)^{−1} A^T X,   A ← X S^T (S S^T)^{−1}.  (5)
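The alternating updates of Eq. (5) are easy to state in code. A minimal sketch for complete data (`em_pca` is our illustrative name):

```python
import numpy as np

def em_pca(X, c, n_iter=20):
    """Alternating least-squares iteration of Eq. (5), complete data."""
    rng = np.random.default_rng(0)
    A = rng.normal(size=(X.shape[0], c))
    for _ in range(n_iter):
        S = np.linalg.solve(A.T @ A, A.T @ X)    # S <- (A^T A)^-1 A^T X
        A = X @ S.T @ np.linalg.inv(S @ S.T)     # A <- X S^T (S S^T)^-1
    return A, S

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3)) @ rng.normal(size=(3, 500))   # exactly rank 3
A, S = em_pca(X, c=3)
assert np.linalg.norm(X - A @ S) < 1e-6   # the 3-D subspace is recovered
```

On exactly rank-c data the iteration reaches an exact fit after very few sweeps, since the updated A always lies in the column space of X.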
This iteration is especially efficient when only a few principal components are needed, that is, c ≪ d [7].

Subspace Learning Algorithm. It is also possible to minimize the reconstruction error (2) by any optimization algorithm. Applying the gradient descent algorithm yields rules for simultaneous updates

A ← A + γ (X − AS) S^T,   S ← S + γ A^T (X − AS),  (6)

where γ > 0 is called the learning rate. The Oja–Karhunen learning algorithm [8,9,6,4] is an online learning method that uses the EM formula for computing S and the gradient for updating A, a single data vector at a time. A possible speed-up to the subspace learning algorithm is to use the natural gradient [10] for the space of matrices. This yields the update rules

A ← A + γ (X − AS) S^T A^T A,   S ← S + γ S S^T A^T (X − AS).  (7)

If needed, the end result of subspace analysis can be transformed into the PCA solution, for instance, by computing the eigenvalue decomposition S S^T = U_S D_S U_S^T and the singular value decomposition A U_S D_S^{1/2} = U_A Σ_A V_A^T. The transformed A is formed from the first c columns of U_A and the transformed S from the first c rows of Σ_A V_A^T D_S^{−1/2} U_S^T S. Note that the required decompositions are computationally lighter than the ones applied to the data matrix directly.
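The simultaneous gradient updates of Eq. (6) can be sketched as follows (the step size is illustrative and not tuned; `subspace_gd` is our name):

```python
import numpy as np

def subspace_gd(X, c, gamma=0.001, n_iter=1000):
    """Simultaneous gradient updates of Eq. (6) on the reconstruction error."""
    rng = np.random.default_rng(0)
    d, n = X.shape
    A = 0.1 * rng.normal(size=(d, c))
    S = 0.1 * rng.normal(size=(c, n))
    for _ in range(n_iter):
        E = X - A @ S                     # current reconstruction error matrix
        A, S = A + gamma * E @ S.T, S + gamma * A.T @ E
    return A, S

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 50))     # rank-2 data
A, S = subspace_gd(X, c=2)
assert np.linalg.norm(X - A @ S) < np.linalg.norm(X)       # error reduced
```

In contrast to the SVD and EM routes, this formulation carries over directly to missing values, which is exploited in the next section.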
3 Principal Component Analysis with Missing Values
Let us consider the same problem when the data matrix has missing entries². In the following there are N = 9 observed values and 6 missing values marked with a question mark (?):

X = ⎡ −1  +1   0   0   ? ⎤
    ⎢ −1  +1   ?   ?   0 ⎥ .  (8)
    ⎣  ?   ?  −1  +1   ? ⎦

¹ The procedure studied in [7] can be seen as the zero-noise limit of the EM algorithm for a probabilistic PCA model.
² We make the typical assumption that values are missing at random, that is, the missingness does not depend on the unobserved data. An example where the assumption does not hold is when out-of-scale measurements are marked missing.
We would like to find A and S such that X ≈ AS for the observed data values. The rest of the product AS represents the reconstruction of missing values.

Adapting SVD. One can use the SVD approach (4) in order to find an approximate solution to the PCA problem. However, estimating the covariance matrix C becomes very difficult when there are lots of missing values. If we estimate C leaving out terms with missing values from the average, we get the estimate

C = (1/n) X X^T = ⎡ 0.5    1     0 ⎤
                  ⎢  1   0.667   ? ⎥ .  (9)
                  ⎣  0     ?     1 ⎦

There are at least two problems. First, the estimated covariance 1 between the first and second components is larger than their estimated variances 0.5 and 0.667. This is clearly wrong, and leads to the situation where the covariance matrix is not positive (semi)definite and some of its eigenvalues are negative. Secondly, the covariance between the second and the third component could not be estimated at all³. Both problems appeared in practice with the data set considered in Section 5.

Another option is to complete the data matrix by iteratively imputing the missing values (see, e.g., [2]). Initially, the missing values can be replaced by zeroes. The covariance matrix of the complete data can be estimated without the problems mentioned above. Now, the product AS can be used as a better estimate for the missing values, and this process can be iterated until convergence. This approach requires the use of the complete data matrix, and therefore it is computationally very expensive if a large part of the data matrix is missing. The time complexity of computing the sample covariance matrix explicitly is O(nd²). We will further refer to this approach as the imputation algorithm. Note that after convergence, the missing values do not contribute to the reconstruction error (2). This means that the imputation algorithm leads to the solution which minimizes the reconstruction error of observed values only.

Adapting the EM Algorithm.
Grung and Manne [11] studied the EM algorithm in the case of missing values. Experiments showed a faster convergence compared to the iterative imputation algorithm. The computational complexity is O(Nc² + nc³) per iteration, where N is the number of observed values, assuming naïve matrix multiplications and inversions but exploiting sparsity. This is quite a bit heavier than EM with complete data, whose complexity is O(ndc) per iteration [7].

Adapting the Subspace Learning Algorithm. The subspace learning algorithm works in a straightforward manner also in the presence of missing values.

³ It could be filled in by finding a value that maximizes the determinant of the covariance matrix (and thus the entropy of the underlying Gaussian distribution).
We just take the sum over only those indices i and j for which the data entry x_{ij} (the ij-th element of X) is observed, in short (i, j) ∈ O. The cost function is

C = Σ_{(i,j)∈O} e_{ij}²,  with  e_{ij} = x_{ij} − Σ_{k=1}^{c} a_{ik} s_{kj},  (10)

and its partial derivatives are

∂C/∂a_{il} = −2 Σ_{j|(i,j)∈O} e_{ij} s_{lj},   ∂C/∂s_{lj} = −2 Σ_{i|(i,j)∈O} e_{ij} a_{il}.  (11)

The update rules for gradient descent are

A ← A − γ ∂C/∂A,   S ← S − γ ∂C/∂S,  (12)

and the update rules for natural gradient descent are

A ← A − γ (∂C/∂A) A^T A,   S ← S − γ S S^T (∂C/∂S).  (13)

We propose a novel speed-up to the original simple gradient descent algorithm. In Newton’s method for optimization, the gradient is multiplied by the inverse of the Hessian matrix. Newton’s method is known to converge fast especially in the vicinity of the optimum, but using the full Hessian is computationally too demanding in truly high-dimensional problems. Here we use only the diagonal part of the Hessian matrix. We also include a control parameter α that allows the learning algorithm to interpolate between the standard gradient descent (α = 0) and the diagonal Newton’s method (α = 1), much like the Levenberg–Marquardt algorithm. The learning rules then take the form

a_{il} ← a_{il} − γ (∂²C/∂a_{il}²)^{−α} ∂C/∂a_{il} = a_{il} + γ ( Σ_{j|(i,j)∈O} e_{ij} s_{lj} ) / ( Σ_{j|(i,j)∈O} s_{lj}² )^{α},  (14)

s_{lj} ← s_{lj} − γ (∂²C/∂s_{lj}²)^{−α} ∂C/∂s_{lj} = s_{lj} + γ ( Σ_{i|(i,j)∈O} e_{ij} a_{il} ) / ( Σ_{i|(i,j)∈O} a_{il}² )^{α}.  (15)
The computational complexity is O(N c + nc) per iteration.
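A dense-matrix sketch of the updates (14)-(15), using a 0/1 mask for the observed entries (the paper's implementation exploits sparsity to reach the stated O(Nc + nc) cost; `update` and all sizes below are our illustrative choices):

```python
import numpy as np

def update(A, S, X, M, gamma=0.5, alpha=1.0):
    """One pass of the updates (14)-(15) on all elements at once.
    M is a 0/1 mask of observed entries; alpha=0 gives plain gradient
    descent, alpha=1 the diagonal Newton's method. This dense sketch
    costs O(ndc) rather than the paper's sparse O(Nc + nc)."""
    E = M * (X - A @ S)                          # e_ij on observed entries
    A = A + gamma * (E @ S.T) / (M @ (S ** 2).T) ** alpha
    E = M * (X - A @ S)                          # refresh residuals
    S = S + gamma * (A.T @ E) / ((A ** 2).T @ M) ** alpha
    return A, S

rng = np.random.default_rng(0)
d, n, c = 10, 80, 2
X = rng.normal(size=(d, c)) @ rng.normal(size=(c, n))   # rank-c ground truth
M = (rng.random((d, n)) < 0.7).astype(float)            # roughly 30% missing
A, S = rng.normal(size=(d, c)), rng.normal(size=(c, n))
for _ in range(200):
    A, S = update(A, S, X, M)
err = np.linalg.norm(M * (X - A @ S)) / np.linalg.norm(M * X)
assert err < 0.5    # the observed entries are fit well
```

The denominators are the per-element diagonal Hessian terms from (14)-(15); they require every row and column to have at least one observed entry, which holds here at this observation density.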
4 Overfitting
A trained PCA model can be used for reconstructing missing values:

x̂_{ij} = Σ_{k=1}^{c} a_{ik} s_{kj},  (i, j) ∉ O.  (16)
Although PCA performs a linear transformation of data, overfitting is a serious problem for large-scale problems with lots of missing values. This happens when the value of the cost function C in Eq. (10) is small for training data, but the quality of prediction (16) is poor for new data. For further details, see [12].
Regularization. A popular way to regularize ill-posed problems is penalizing the use of large parameter values by adding a proper penalty term into the cost function; see for example [3]. In our case, one can modify the cost function in Eq. (2) as follows:

C_λ = Σ_{(i,j)∈O} e_{ij}² + λ (‖A‖²_F + ‖S‖²_F).  (17)

This has the effect that the parameters that do not have significant evidence will decay towards zero. A more general penalization would use different regularization parameters λ for different parts of A and S. For example, one can use a λ_k parameter of its own for each of the column vectors a_k of A and the row vectors s_k of S. Note that since the columns of A can be scaled arbitrarily by rescaling the rows of S accordingly, one can fix the regularization term for a_k, for instance, to unity.
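The penalized cost of Eq. (17) restricted to observed entries can be written directly (`regularized_cost` and the mask-matrix convention are our illustrative choices):

```python
import numpy as np

def regularized_cost(A, S, X, M, lam):
    """Eq. (17): squared error on observed entries (mask M) plus a
    Frobenius-norm penalty on both factor matrices."""
    E = M * (X - A @ S)
    return np.sum(E ** 2) + lam * (np.sum(A ** 2) + np.sum(S ** 2))

# Tiny hand-checkable example: fully observed 2x2 matrix, rank-1 factors.
A = np.array([[1.0], [0.0]])
S = np.array([[1.0, 2.0]])
X = np.array([[1.0, 0.0], [0.0, 0.0]])
M = np.ones((2, 2))
print(regularized_cost(A, S, X, M, lam=0.1))  # 4.0 error + 0.6 penalty = 4.6
```

The gradient of the penalty simply adds −2λA and −2λS to the updates of the previous section, shrinking poorly evidenced parameters towards zero.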
This has the effect that the parameters that do not have significant evidence will decay towards zero. A more general penalization would use different regularization parameters λ for different parts of A and S. For example, one can use a λk parameter of its own for each of the column vectors ak of A and the row vectors sk of S. Note that since the columns of A can be scaled arbitrarily by rescaling the rows of S accordingly, one can fix the regularization term for ak , for instance, to unity. An equivalent optimization problem can be obtained using a probabilistic formulation with (independent) Gaussian priors and a Gaussian noise model:
p(x_ij | A, S) = N(x_ij; Σ_{k=1}^{c} a_ik s_kj, v_x),   (18)

p(a_ik) = N(a_ik; 0, 1),   p(s_kj) = N(s_kj; 0, v_sk),   (19)
where N(x; m, v) denotes the random variable x having a Gaussian distribution with mean m and variance v. The regularization parameter λ_k = v_sk/v_x is the ratio of the prior variances v_sk and v_x. Then, the cost function (ignoring constants) is the negative logarithm of the posterior for A and S:

C_BR = Σ_{(i,j)∈O} (e_ij²/v_x + ln v_x) + Σ_{i=1}^{d} Σ_{k=1}^{c} a_ik² + Σ_{k=1}^{c} Σ_{j=1}^{n} (s_kj²/v_sk + ln v_sk).   (20)
An attractive property of the Bayesian formulation is that it provides a natural way to choose the regularization constants. This can be done using the evidence framework (see, e.g., [3]) or simply by minimizing C_BR by setting v_x and v_sk to the means of e_ij² and s_kj², respectively. We will use the latter approach and refer to it as regularized PCA. Note that in the case of joint optimization of C_BR w.r.t. a_ik, s_kj, v_sk, and v_x, the cost function (20) has a trivial minimum with s_kj = 0, v_sk → 0. We try to avoid this minimum by using an orthogonalized solution provided by unregularized PCA from the learning rules (14) and (15) for initialization. Note also that setting v_sk to small values for some components k is equivalent to removal of irrelevant components from the model. This allows for automatic determination of the proper dimensionality c instead of discrete model comparison (see, e.g., [13]). This justifies using separate v_sk in the model in (19). Variational Bayesian Learning. Variational Bayesian (VB) learning provides even stronger tools against overfitting. The VB version of PCA by [13] approximates
572
T. Raiko, A. Ilin, and J. Karhunen
the joint posterior of the unknown quantities using a simple multivariate distribution. Each model parameter is described a posteriori using independent Gaussian distributions. The means can then be used as point estimates of the parameters, while the variances give at least a crude estimate of the reliability of these point estimates. The method in [13] does not extend to missing values easily, but the subspace learning algorithm (Section 3) can be extended to VB. The derivation is somewhat lengthy, and it is omitted here together with the variational Bayesian learning rules because of space limitations; see [12] for details. The computational complexity of this method is still O(N c + nc) per iteration, but the VB version is in practice about 2–3 times slower than the original subspace learning algorithm.
5 Experiments
Collaborative filtering is the task of predicting preferences (or producing personal recommendations) by using other people's preferences. The Netflix problem [14] is such a task. It consists of movie ratings given by n = 480189 customers to d = 17770 movies. There are N = 100480507 ratings from 1 to 5 given, from which 1408395 ratings are reserved for validation (or probing). Note that 98.8% of the values are thus missing. We tried to find c = 15 principal components from the data using a number of methods.4 We subtracted the mean rating for each movie, assuming 22 extra ratings of 3 for each movie as a Dirichlet prior. Computational Performance. In the first set of experiments we compared the computational performance of different algorithms on PCA with missing values. The root mean square (rms) error is measured on the training data, E_O = sqrt( (1/|O|) Σ_{(i,j)∈O} e_ij² ). All experiments were run on a dual-CPU AMD Opteron SE 2220 using Matlab. First, we tested the imputation algorithm. The first iteration, in which the missing values are replaced with zeros, was completed in 17 minutes and led to E_O = 0.8527. This iteration was still tolerably fast because the complete data matrix was sparse. After that, it takes about 30 hours per iteration. After three iterations, E_O was still 0.8513. Using the EM algorithm by [11], the E-step (updating S) takes 7 hours and the M-step (updating A) takes 18 hours. (There is some room for optimization since we used a straightforward Matlab implementation.) Each iteration gives a much larger improvement compared to the imputation algorithm, but starting from a random initialization, EM could not reach a good solution in reasonable time. We also tested the subspace learning algorithm described in Section 3 with and without the proposed speed-up. Each run of the algorithm with different values of the speed-up parameter α was initialized at the same starting point (generated randomly from a normal distribution).
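The per-movie mean subtraction with the Dirichlet-style prior described above amounts to shrinking each movie's mean toward 3; a minimal sketch with toy numbers:

```python
import numpy as np

# Shrinkage mean: act as if 22 extra ratings of 3 were observed per movie.
ratings = np.array([5.0, 4.0, 5.0, 4.0])   # toy: one movie with 4 ratings
prior_count, prior_value = 22, 3.0
smoothed_mean = (ratings.sum() + prior_count * prior_value) \
                / (len(ratings) + prior_count)
# (18 + 66) / 26 ≈ 3.23: with few ratings, the mean is pulled toward 3
centered = ratings - smoothed_mean
```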
4 The PCA approach has been considered by other Netflix contestants as well (see, e.g., [15,16]).

The learning rate γ was adapted such that
Fig. 1. Left: Learning curves for unregularized PCA (Section 3) applied to the Netflix data: Root mean-square error on the training data EO is plotted against computation time in hours. Right: The root mean square error on the validation data EV from the Netflix problem during runs of several algorithms: basic PCA (Section 3), regularized PCA (Section 4) and VB (Section 4). VB1 has some parameters fixed (see [12]) while VB2 updates all the parameters. The time scales are linear below 1 and logarithmic above 1.
if an update decreased the cost function, γ was multiplied by 1.1. Each time an update would have increased the cost, the update was canceled and γ was divided by 2. Figure 1 (left) shows the learning curves for basic gradient descent, natural gradient descent, and the proposed speed-up with the best found parameter value α = 0.625. The proposed speed-up gave about a tenfold speed-up compared to the gradient descent algorithm even though each iteration took longer. Natural gradient was slower than the basic gradient. Table 1 gives a summary of the computational complexities.

Overfitting. We compared PCA (Section 3), regularized PCA (Section 4) and VB-PCA (Section 4) by computing the rms reconstruction error for the validation set V, that is, testing how the models generalize to new data: E_V = sqrt( (1/|V|) Σ_{(i,j)∈V} e_ij² ). We tested VB-PCA by first fixing some of the parameter values (this run is marked as VB1 in Fig. 1; see [12] for details) and second by

Table 1. Summary of the computational performance of different methods on the Netflix problem. Computational complexities (per iteration) assume naïve computation of products and inverses of matrices and ignore the computation of SVD in the imputation algorithm. While the proposed speed-up makes each iteration slower than the basic gradient update, the time to reach the error level 0.85 is greatly diminished.

Method         | Complexity      | Seconds/Iter | Hours to E_O = 0.85
Gradient       | O(Nc + nc)      | 58           | 1.9
Speed-up       | O(Nc + nc)      | 110          | 0.22
Natural Grad.  | O(Nc + nc²)     | 75           | 3.5
Imputation     | O(nd²)          | 110000       | 64
EM             | O(Nc² + nc³)    | 45000        | 58
adapting them (marked as VB2). We initialized regularized PCA and VB1 using normal PCA learned with α = 0.625 and orthogonalized A, and VB2 using VB1. The parameter α was set to 2/3. Fig. 1 (right) shows the results. The performance of basic PCA starts to degrade during learning, especially when using the proposed speed-up. Natural gradient diminishes this phenomenon, known as overlearning, but it is even more effective to use regularization. The best results were obtained using VB2: the final validation error E_V was 0.9180 and the training rms error E_O was 0.7826, which is naturally larger than the unregularized E_O = 0.7657.
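The learning-rate adaptation used in the runs above (grow γ by 1.1 after a successful update, cancel the update and halve γ otherwise) can be sketched generically; `step` and `cost` below are placeholders of our own for one update of the form (14)-(15) and the cost being minimized:

```python
# Generic sketch of the gamma adaptation; function names are ours.
def adaptive_descent(params, step, cost, gamma=0.1, iters=100):
    c = cost(params)
    for _ in range(iters):
        candidate = step(params, gamma)
        c_new = cost(candidate)
        if c_new < c:                      # accept and speed up
            params, c, gamma = candidate, c_new, gamma * 1.1
        else:                              # cancel the update, slow down
            gamma = gamma / 2.0
    return params, gamma

# Toy usage: minimize f(x) = (x - 3)^2 with its gradient step.
f = lambda x: (x - 3.0) ** 2
grad_step = lambda x, g: x - g * 2.0 * (x - 3.0)
x_opt, _ = adaptive_descent(0.0, grad_step, f)   # x_opt approaches 3
```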
6 Discussion
We studied a number of different methods for PCA with sparse data, and it turned out that a simple gradient descent approach worked best due to its minimal computational complexity per iteration. We could also speed it up more than ten times by using an approximated Newton's method. We found empirically that setting the parameter α = 2/3 seems to work well for our problem. It is left for future work to find out whether this generalizes to other problem settings. There are also many other ways to speed up the gradient descent algorithm. The natural gradient did not help here, but we expect that the conjugate gradient method would. The modification to the gradient proposed in this paper could be used together with the conjugate gradient speed-up. This will be another future research topic. There are also other benefits in solving the PCA problem by gradient descent. Algorithms that minimize an explicit cost function are rather easy to extend. The case of variational Bayesian learning applied to PCA was considered in Section 4, but there are many other extensions of PCA, such as using non-Gaussianity, nonlinearity, mixture models, and dynamics. The developed algorithms can prove useful in many applications such as bioinformatics, speech processing, and meteorology, in which large-scale datasets with missing values are very common. The required computational burden is linearly proportional to the number of measured values. Note also that the proposed techniques provide an analogue of confidence regions showing the reliability of estimated quantities. Acknowledgments. This work was supported in part by the Academy of Finland under its Centers for Excellence in Research Program, and the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. We would like to thank Antti Honkela for useful comments.
References

1. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(6), 559–572 (1901)
2. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
3. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
4. Diamantaras, K., Kung, S.: Principal Component Neural Networks - Theory and Application. Wiley, Chichester (1996)
5. Haykin, S.: Modern Filters. Macmillan, Basingstoke (1989)
6. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing - Learning Algorithms and Applications. Wiley, Chichester (2002)
7. Roweis, S.: EM algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998)
8. Karhunen, J., Oja, E.: New methods for stochastic approximation of truncated Karhunen-Loeve expansions. In: Proceedings of the 6th International Conference on Pattern Recognition, pp. 550–553. Springer, Heidelberg (1982)
9. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley (1983)
10. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
11. Grung, B., Manne, R.: Missing values in principal components analysis. Chemometrics and Intelligent Laboratory Systems 42(1), 125–139 (1998)
12. Raiko, T., Ilin, A., Karhunen, J.: Principal component analysis for large scale problems with lots of missing values. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 691–698. Springer, Heidelberg (2007)
13. Bishop, C.: Variational principal components. In: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 1999), pp. 509–514 (1999)
14. Netflix: Netflix prize webpage (2007), http://www.netflixprize.com/
15. Funk, S.: Netflix update: Try this at home (December 2006), http://sifter.org/~simon/journal/20061211.html
16. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the International Conference on Machine Learning (2007)
Hierarchical Bayesian Inference of Brain Activity

Masa-aki Sato¹ and Taku Yoshioka¹,²

¹ ATR Computational Neuroscience Laboratories, [email protected]
² National Institute of Information and Communication Technology
Abstract. Magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurement. Therefore, prior information on the source currents is essential to solve the inverse problem. We have proposed a new hierarchical Bayesian method to combine several sources of information. In our method, the variance of the source current at each source location is considered an unknown parameter and is estimated from the observed MEG data and prior information by using the variational Bayes method. The fMRI information can be imposed as a prior distribution on the variance rather than as the variance itself, so that it gives a soft constraint on the variance. By evaluating the resolution curve, it is shown that the hierarchical Bayesian method has better accuracy and spatial resolution than conventional linear inverse methods. The proposed method also demonstrated good spatial and temporal resolution for estimating current activity in early visual areas evoked by a stimulus in a quadrant of the visual field.
1 Introduction
In recent years, there has been rapid progress in noninvasive neuroimaging measurement of the human brain. The functional organization of the human brain has been revealed by PET and functional magnetic resonance imaging (fMRI). However, these methods cannot reveal the detailed dynamics of information processing in the human brain, since they have poor temporal resolution due to slow hemodynamic responses to neural activity (Bandettini, 2000; Ogawa et al., 1990). On the other hand, magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurement (Hamalainen et al., 1993). Therefore, prior information on the source currents is essential to solve the inverse problem. One of the standard methods for the inverse problem is the dipole method (Hari, 1991; Mosher et al., 1992). It assumes that brain activity can be approximated by a small number of current dipoles. Although this method gives good estimates when the number of active areas is small, it cannot describe distributed brain activity underlying higher functions. On the other hand, a number of distributed
source methods have been proposed to estimate distributed activity in the brain, such as the minimum norm method, the minimum L1-norm method, and others (Hamalainen et al., 1993). It has also been proposed to combine fMRI information with MEG data (Dale and Sereno, 1993; Ahlfors et al., 1999; Dale et al., 2000; Phillips et al., 2002). However, there are essential differences between fMRI and MEG due to their temporal resolution. The fMRI activity corresponds to an average of several thousands of MEG time series data points and may not correspond to MEG activity at some time points. We have proposed a new hierarchical Bayesian method to combine several sources of information (Sato et al. 2004). In our method, the variance of the source current at each source location is considered an unknown parameter and is estimated from the observed MEG data and prior information. The fMRI information can be imposed as prior information on the variance distribution rather than on the variance itself, so that it gives a soft constraint on the variance. Therefore, our method is capable of appropriately estimating the source current variance from the MEG data supplemented with the fMRI data, even if the fMRI data convey inaccurate information. Accordingly, our method is robust against inaccurate fMRI information. Because of the hierarchical prior, the estimation problem becomes nonlinear and cannot be solved analytically. Therefore, the approximate posterior distribution is calculated by using the variational Bayesian (VB) method (Attias, 1999; Sato, 2001). The resulting algorithm is an iterative procedure that converges quickly because the VB algorithm is a type of natural gradient method (Amari, 1998) that has an optimal local convergence property. The position and orientation of the cortical surface obtained from structural MRI can also be introduced as a hard constraint. In this article, we explain our hierarchical Bayesian method.
To evaluate the performance of the hierarchical Bayesian method, the resolution curves were calculated by varying the numbers of model dipoles, simultaneously active dipoles, and MEG sensors. The results show the superiority of the hierarchical Bayesian method over conventional linear inverse methods. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. The estimation results are consistent with known physiological findings and show the good spatial and temporal resolution of the hierarchical Bayesian method.
2 MEG Inverse Problem
When neural current activity occurs in the brain, it produces a magnetic field observed by MEG. The relationship between the magnetic field B = {B_m | m = 1 : M} measured by M sensors and the primary source current J = {J_n | n = 1 : N} in the brain is given by

B = G · J,   (1)

where G = {G_{m,n} | m = 1 : M, n = 1 : N} is the lead field matrix. The lead field G_{m,n} represents the magnetic field B_m produced by the n-th unit dipole current. The above equations give the forward model, and the inverse problem
is to estimate the source current J from the observed magnetic field data B. The probabilistic model for the source currents can be constructed assuming Gaussian noise for the MEG sensors. Then, the probability distribution that the magnetic field B is observed for a given current J is given by

P(B | J) ∝ exp( −(1/2) β (B − G · J) · Σ_G · (B − G · J) ),   (2)

where (β Σ_G)^{−1} denotes the covariance matrix of the sensor noise. Σ_G^{−1} is the normalized covariance matrix satisfying Tr(Σ_G^{−1}) = M, and β^{−1} is the average noise variance.
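Equations (1)-(2) can be illustrated with a toy forward model; the lead field below is random Gaussian (a real G comes from a head model) and Σ_G is taken as the identity. All names and numbers are our own:

```python
import numpy as np

# Toy forward model B = G · J with Gaussian sensor noise.
rng = np.random.default_rng(0)
M, N = 10, 50                          # sensors, source positions
G = rng.normal(size=(M, N))            # stand-in lead field matrix
J = np.zeros(N)
J[[7, 23]] = 1.0                       # two active unit dipole currents
beta = 100.0                           # inverse noise variance, Sigma_G = I
B = G @ J + rng.normal(scale=beta ** -0.5, size=M)
# Log of Eq. (2) up to constants, with Sigma_G = I:
loglik = -0.5 * beta * float(np.sum((B - G @ J) ** 2))
```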
3 Hierarchical Bayesian Method
In the hierarchical Bayesian method, the variances of the currents are considered unknown parameters and estimated from the observed MEG data by introducing a hierarchical prior on the current variance. The fMRI information can be imposed as prior information on the variance distribution rather than on the variance itself, so that it gives a soft constraint on the variance. The spatial smoothness constraint, that neurons within a few millimeter radius tend to fire simultaneously due to neural interactions, can also be implemented as a hierarchical prior (Sato et al. 2004).

Hierarchical Prior. Let us suppose a time sequence of MEG data B_{1:T} ≡ {B(t) | t = 1 : T} is observed. The MEG inverse problem in this case is to estimate the primary source current J_{1:T} ≡ {J(t) | t = 1 : T} from the observed MEG data B_{1:T}. We assume a Normal prior for the current:

P_0(J_{1:T} | α) ∝ exp( −(β/2) Σ_{t=1}^{T} J(t) · Σ_α · J(t) ),   (3)

where Σ_α is the diagonal matrix with diagonal elements α = {α_n | n = 1 : N}. We also assume that the current variance α^{−1} does not change over the period T. The current inverse variance parameter α is estimated by introducing an ARD (Automatic Relevance Determination) hierarchical prior (Neal, 1996):

P_0(α) = Π_{n=1}^{N} Γ(α_n | ᾱ_{0n}, γ_{0nα}),   (4)

Γ(α | ᾱ, γ) ≡ α^{−1} (αγ/ᾱ)^γ Γ(γ)^{−1} e^{−αγ/ᾱ},

where Γ(α | ᾱ, γ) represents the Gamma distribution with mean ᾱ and degree of freedom γ, and Γ(γ) ≡ ∫_0^∞ dt t^{γ−1} e^{−t} is the Gamma function. When the fMRI data is not available, we use a non-informative prior for the current inverse variance parameter α_n, i.e., γ_{0nα} = 0 and P_0(α_n) = α_n^{−1}. When the fMRI data is available, fMRI information is imposed as the prior for the inverse variance parameter α_n. The mean of the prior, ᾱ_{0n}, is assumed to be inversely proportional to the fMRI activity. The confidence parameter γ_{0nα} controls the reliability of the fMRI information.
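The Gamma prior (4) is parameterized by its mean ᾱ and degree of freedom γ rather than the usual shape/rate; it corresponds to a standard Gamma density with shape γ and rate γ/ᾱ. A small numerical check of this (function and variable names are ours):

```python
import math

# Gamma density parameterized by mean abar and degree of freedom gdof:
# Gamma(a | abar, gdof) = a^{-1} (a*gdof/abar)^gdof Gamma(gdof)^{-1} e^{-a*gdof/abar},
# i.e. a standard Gamma with shape = gdof and rate = gdof / abar.
def gamma_prior_pdf(a, abar, gdof):
    rate = gdof / abar
    return rate ** gdof / math.gamma(gdof) * a ** (gdof - 1) * math.exp(-rate * a)

# Numerical check that the mean is indeed abar:
abar, gdof, da = 2.0, 3.0, 0.001
mean = sum(i * da * gamma_prior_pdf(i * da, abar, gdof) * da
           for i in range(1, 40000))
```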
Variational Bayesian Method. The objective of the Bayesian estimation is to calculate the posterior probability distribution of J for the observed data B (in the following, B_{1:T} and J_{1:T} are abbreviated as B and J, respectively, for notational simplicity):

P(J | B) = ∫ dα P(J, α | B),
P(J, α | B) = P(J, α, B) / P(B),
P(J, α, B) = P(B | J) P_0(J | α) P_0(α),
P(B) = ∫ dJ dα P(J, α, B).

The calculation of the marginal likelihood P(B) cannot be done analytically. In the VB method, the calculation of the joint posterior P(J, α | B) is reformulated as the maximization problem of the free energy. The free energy for a trial distribution Q(J, α) is defined by

F(Q) = ∫ dJ dα Q(J, α) log [ P(J, α, B) / Q(J, α) ]
     = log P(B) − KL[ Q(J, α) ‖ P(J, α | B) ].   (5)

Equation (5) implies that the maximization of the free energy F(Q) is equivalent to the minimization of the Kullback-Leibler distance (KL-distance) defined by KL[Q(J, α) ‖ P(J, α | B)] ≡ ∫ dJ dα Q(J, α) log( Q(J, α) / P(J, α | B) ). This measures the difference between the true joint posterior P(J, α | B) and the trial distribution Q(J, α). Since the KL-distance reaches its minimum at zero when the two distributions coincide, the joint posterior can be obtained by maximizing the free energy F(Q) with respect to the trial distribution Q. In addition, the maximum free energy gives the log-marginal likelihood log P(B). The optimization problem can be solved using a factorization approximation restricting the solution space (Attias, 1999; Sato, 2001):

Q(J, α) = Q_J(J) Q_α(α).   (6)

Under the factorization assumption (6), the free energy can be written as

F(Q) = ⟨log P(J, α, B)⟩_{Jα} − ⟨log Q_J(J)⟩_J − ⟨log Q_α(α)⟩_α
     = ⟨log P(B | J)⟩_J − KL[ Q_J(J) Q_α(α) ‖ P_0(J | α) P_0(α) ],   (7)

where ⟨·⟩_J and ⟨·⟩_α represent the expectation values with respect to Q_J(J) and Q_α(α), respectively. The first term in the second equation of (7) corresponds to the negative of the expected reconstruction error. The second term (KL-distance) measures the difference between the prior and the posterior
and corresponds to the effective degree of freedom that can be well specified from the observed data. Therefore, the (negative of the) free energy can be considered a regularized error function with a model complexity penalty term. The maximum free energy is obtained by alternately maximizing the free energy with respect to Q_J and Q_α. In the first step (J-step), the free energy F(Q) is maximized with respect to Q_J while Q_α is fixed. The solution is given by

Q_J(J) ∝ exp( ⟨log P(J, α, B)⟩_α ).   (8)
In the second step (α-step), the free energy F (Q) is maximized with respect to Qα while QJ is fixed. The solution is given by
Q_α(α) ∝ exp( ⟨log P(J, α, B)⟩_J ).   (9)

The above J- and α-steps are repeated until the free energy converges.

VB algorithm. The VB algorithm is summarized here. In the J-step, the inverse filter L(Σ_α^{−1}) is calculated using the covariance matrix Σ_α^{−1} estimated in the previous iteration:

L(Σ_α^{−1}) = Σ_α^{−1} · Gᵀ · ( G · Σ_α^{−1} · Gᵀ + Σ_G^{−1} )^{−1}.   (10)

The expectation values of the current J̄ and the noise variance β̄^{−1} with respect to the posterior distribution are estimated using the inverse filter (10):

J̄ = L(Σ_α^{−1}) · B,
γ_β β̄^{−1} = (1/2) [ (B − G · J̄) · Σ_G · (B − G · J̄) + J̄ · Σ_α · J̄ ],   γ_β = NT/2.   (11)

In the α-step, the expectation values of the variance parameters α_n^{−1} with respect to the posterior distribution are estimated as

γ_{nα} ᾱ_n^{−1} = γ_{0nα} ᾱ_{0n}^{−1} + (T/2) ᾱ_n^{−1} [ 1 − ( Σ_α^{−1} · Gᵀ · Σ_B^{−1} · G )_{n,n} ],   (12)

where Σ_B ≡ G · Σ_α^{−1} · Gᵀ + Σ_G^{−1} (cf. Eq. (10)) and γ_{nα} is given by γ_{nα} = γ_{0nα} + T/2.
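A simplified sketch of the J-step/α-step loop: Σ_G = I, β fixed, a non-informative prior (γ_{0nα} = 0), and the α-update replaced by the simple moment estimate ᾱ_n^{−1} ≈ mean_t J̄_n(t)², dropping the posterior-variance correction in (12). Everything here is a toy approximation of ours, not the authors' algorithm:

```python
import numpy as np

# Toy ARD-style iteration of the inverse filter (10) and a simplified
# alpha update; the lead field and data are synthetic.
rng = np.random.default_rng(0)
M, N, T = 12, 40, 50
G = rng.normal(size=(M, N))
J_true = np.zeros((N, T))
J_true[5] = np.sin(np.linspace(0.0, 4.0, T))     # one active source
B = G @ J_true + 0.01 * rng.normal(size=(M, T))

alpha = np.ones(N)                                # inverse variance parameters
for _ in range(50):
    # J-step: inverse filter (10) with Sigma_G = I, then posterior mean (11)
    Sa_inv = np.diag(1.0 / alpha)
    L = Sa_inv @ G.T @ np.linalg.inv(G @ Sa_inv @ G.T + np.eye(M))
    J_hat = L @ B
    # Simplified alpha-step: moment estimate of the current variance;
    # inactive sources get huge alpha and are effectively pruned.
    alpha = 1.0 / (np.mean(J_hat ** 2, axis=1) + 1e-8)
```

Iterating drives α_n large for inactive sources, which prunes them from the model.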
4 Resolution Curve
We evaluated the performance of the hierarchical Bayesian method by calculating the resolution curve and compared it with the minimum norm (MN) method. The inverse filter L of the MN method can be obtained from Eq. (10) if the inverse variance parameters α_n are set to a given constant independent of the position n. Let us define the resolution matrix R by

R = L · G.   (13)
Fig. 1. Resolution curves of the minimum norm method for different numbers of model dipoles (170, 262, 502, and 1242). The number of sensors is 251. The horizontal axis denotes the radius from the source position in m.
The (n, k) component of the resolution matrix, R_{n,k}, represents the n-th estimated current when a unit current dipole is applied at the k-th position without noise, i.e., J_k = 1 and J_l = 0 (l ≠ k). The resolution curve is defined as the averaged estimated current as a function of the distance r from the source position. It can be obtained by summing the estimated currents R_{n,k} at the n-th positions whose distance from the k-th position is in the range from r to r + dr, when the unit current dipole is applied at the k-th position. The averaged resolution curve is obtained by averaging the resolution curve over the k-th positions. If the estimation is perfect, the resolution curve at the origin, which is the estimation gain, should be one. In addition, the resolution curve should be zero elsewhere. However, the estimation gain of a linear inverse method such as the MN method satisfies the constraint Σ_{n=1}^{N} G_n ≤ M, where G_n denotes the estimation gain at the n-th position (Sato et al. 2004). This constraint implies that a linear inverse method cannot perfectly retrieve more current dipoles than the number of sensors M. To see the effect of this constraint, we calculated the resolution curve for the MN method by varying the number of model dipoles while the number of sensors M was fixed at 251 (Fig. 1). We assumed model dipoles placed evenly on a hemisphere. Fig. 1 shows that the MN method gives perfect estimation if the number of dipoles is less than M. On the other hand, the performance degrades as the number of dipoles increases beyond M. Although the above results were obtained using the MN method, similar results can be obtained for a class of linear inverse methods. This limitation is the main cause of the poor spatial
Fig. 2. Resolution curves of the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 sensors. The number of active dipoles is 240 or 400. The horizontal axis denotes the radius from the source position in m.
resolution of the linear inverse methods. When several dipoles are simultaneously active, the estimated currents in the linear inverse methods can be obtained by summation of the estimated currents for each dipole. Therefore, the resolution curve gives a complete description of the spatial resolution of the linear inverse methods. From a theoretical analysis (in preparation), the hierarchical Bayesian method can estimate dipole currents perfectly even when the number of model dipoles is larger than the number of sensors M. This is because the hierarchical Bayesian method effectively eliminates inactive dipoles from the estimation model by adjusting the estimation gain of these dipoles to zero. Nevertheless, the number of active dipoles places a constraint on the performance of the hierarchical Bayesian method. The calculation of the resolution curves for the hierarchical Bayesian method is somewhat complicated, because the Bayesian inverse filters depend on the MEG data. To evaluate the performance in situations where multiple dipoles are active, we generated 240 or 400 active dipoles randomly on the hemisphere and calculated the corresponding MEG data in which the 240 or 400 dipoles were simultaneously active. The Bayesian inverse filters were estimated using these simulated MEG data. Then, the resolution curves were calculated using the estimated Bayesian inverse filters for each active dipole and averaged over all active dipoles. Fig. 2 shows the resolution curves for the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 MEG sensors. When the number of simultaneously active dipoles is less than the number of MEG sensors, almost perfect estimation is obtained
regardless of the number of model dipoles. Therefore, the hierarchical Bayesian method can achieve much better spatial resolution than the conventional linear inverse methods. On the other hand, the performance is degraded if the number of simultaneously active dipoles is larger than the number of MEG sensors. The above results demonstrate the superiority of the hierarchical Bayesian method over the MN method.
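The resolution matrix (13) and the gain constraint Σ_{n=1}^{N} G_n ≤ M for a minimum-norm-style filter can be checked numerically on a toy random lead field. This is a simplification of ours: Eq. (10) with Σ_G = I and constant α_n = a:

```python
import numpy as np

# Resolution matrix R = L · G for an MN-style inverse filter.
rng = np.random.default_rng(0)
M, N = 25, 100                         # sensors, model dipoles
G = rng.normal(size=(M, N))            # toy lead field
a = 1.0                                # constant inverse variance alpha_n
L = (1.0 / a) * G.T @ np.linalg.inv((1.0 / a) * (G @ G.T) + np.eye(M))
R = L @ G                              # resolution matrix, Eq. (13)
gains = np.diag(R)                     # estimation gain at each position
total_gain = float(gains.sum())        # bounded above by M
```

In this toy setup each individual gain stays below one and the summed gain stays below M = 25, so the linear filter cannot perfectly retrieve more dipoles than sensors, consistent with the constraint above.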
5 Visual Experiments
We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. Red and green checkerboards in a pseudo-randomly selected quadrant were presented for 700 ms in each trial. fMRI experiments with the same quadrant stimuli were also conducted, adopting a conventional block design in which the stimuli were presented for 15 seconds in a block. The global field power (sum of MEG signals of all sensors) recorded from subject RH induced by the upper right stimulus is shown in Fig. 3a. A strong peak was observed 93 ms after the stimulus onset. Cortical currents were estimated by applying the hierarchical Bayesian method to the averaged MEG data between 100 ms before and 400 ms after the stimulus onset. The fMRI activity t-values were used as a prior for the inverse
Fig. 3. Estimated current for the quadrant visual stimulus. (a) shows the global field power of the MEG signal. (b) shows the temporal patterns of averaged currents in V1, V2/3, and V4. (c-e) show spatial patterns of the current strength averaged over 20 ms time windows centered at 93 ms, 98 ms, and 134 ms.
variance parameters. As explained in the 'Hierarchical Prior' subsection, the mean of the prior was assumed to be ᾱ_{0n}^{−1} = a_0 · t_f(n), where t_f(n) was the t-value at the n-th position and a_0 was a hyperparameter, set to 500 in this analysis. Estimated spatiotemporal brain activities are illustrated in Fig. 3. We identified three ROIs (Regions Of Interest) in V1, V2/3, and V4, and temporal patterns of the estimated currents were obtained by averaging the current within these ROIs. Fig. 3b shows that V1, V2/3, and V4 are successively activated and attain their peaks around 93 ms, 98 ms, and 134 ms, respectively. Figs. 3c-3e illustrate the spatial pattern of the current strength averaged over 20 ms time windows (centered at 93 ms, 98 ms, and 134 ms) in a flattened map format. The flattened map was made by cutting along the bottom of the calcarine sulcus. We can see strongly active regions in V1, V2/3, and V4 corresponding to their peak activities. The above results are consistent with known physiological findings and show the good spatial and temporal resolution of the hierarchical Bayesian method.
6 Conclusion
In this article, we have explained the hierarchical Bayesian method, which combines MEG and fMRI by using a hierarchical prior. We have shown the superiority of the hierarchical Bayesian method over conventional linear inverse methods by evaluating the resolution curve. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. The estimation results are consistent with known physiological findings and show the good spatial and temporal resolution of the hierarchical Bayesian method. Currently, we are applying the hierarchical Bayesian method to brain-machine interfaces using noninvasive neuroimaging. In our approach, we first estimate the current activity in the brain. Then, the intention or the motion of the subject is estimated by using the current activity. This approach enables us to use physiological knowledge and gives us more insight into the mechanisms of human information processing. Acknowledgement. This research was supported in part by NICT-KARC.
References

Ahlfors, S.P., Simpson, G.V., Dale, A.M., Belliveau, J.W., Liu, A.K., Korvenoja, A., Virtanen, J., Huotilainen, M., Tootell, R.B.H., Aronen, H.J., Ilmoniemi, R.J.: Spatiotemporal activity of a cortical network for processing visual motion revealed by MEG and fMRI. J. Neurophysiol. 82, 2545–2555 (1999)
Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10, 251–276 (1998)
Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21–30 (1999)
Bandettini, P.A.: The temporal resolution of functional MRI. In: Moonen, C.T.W., Bandettini, P.A. (eds.) Functional MRI, pp. 205–220. Springer, Heidelberg (2000)
Hierarchical Bayesian Inference of Brain Activity
585
Dale, A.M., Liu, A.K., Fischl, B.R., Buckner, R.L., Belliveau, J.W., Lewine, J.D., Halgren, E.: Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron 26, 55–67 (2000)
Dale, A.M., Sereno, M.I.: Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: A linear approach. J. Cognit. Neurosci. 5, 162–176 (1993)
Hamalainen, M.S., Hari, R., Ilmoniemi, R.J., Knuutila, J., Lounasmaa, O.V.: Magnetoencephalography – theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Modern Phys. 65, 413–497 (1993)
Hari, R.: On brain's magnetic responses to sensory stimuli. J. Clinic. Neurophysiol. 8, 157–169 (1991)
Mosher, J.C., Lewis, P.S., Leahy, R.M.: Multiple dipole modelling and localization from spatio-temporal MEG data. IEEE Trans. Biomed. Eng. 39, 541–557 (1992)
Neal, R.M.: Bayesian learning for neural networks. Springer, Heidelberg (1996)
Ogawa, S., Lee, T.-M., Kay, A.R., Tank, D.W.: Brain magnetic resonance imaging with contrast dependent on blood oxygenation. In: Proc. Natl. Acad. Sci. USA, vol. 87, pp. 9868–9872 (1990)
Phillips, C., Rugg, M.D., Friston, K.J.: Anatomically informed basis functions for EEG source localization: Combining functional and anatomical constraints. NeuroImage 16, 678–695 (2002)
Sato, M.: On-line model selection based on the variational Bayes. Neural Computation 13, 1649–1681 (2001)
Sato, M., Yoshioka, T., Kajihara, S., Toyama, K., Goda, N., Doya, K., Kawato, M.: Hierarchical Bayesian estimation for MEG inverse problem. NeuroImage 23, 806–826 (2004)
Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models

Byron M. Yu¹,², John P. Cunningham¹, Krishna V. Shenoy¹, and Maneesh Sahani²

¹ Dept. of Electrical Engineering and Neurosciences Program, Stanford University, Stanford, CA, USA
² Gatsby Computational Neuroscience Unit, UCL, London, UK
{byronyu,jcunnin,shenoy}@stanford.edu, [email protected]
Abstract. To date, the neural decoding of time-evolving physical state – for example, the path of a foraging rat or arm movements – has been largely carried out using linear trajectory models, primarily due to their computational efficiency. The possibility of better capturing the statistics of the movements using nonlinear trajectory models, thereby yielding more accurate decoded trajectories, is enticing. However, nonlinear decoding usually carries a higher computational cost, which is an important consideration in real-time settings. In this paper, we present techniques for nonlinear decoding employing modal Gaussian approximations, expectation propagation, and Gaussian quadrature. We compare their decoding accuracy versus computation time tradeoffs based on high-dimensional simulated neural spike counts.

Keywords: Nonlinear dynamical models, nonlinear state estimation, neural decoding, neural prosthetics, expectation-propagation, Gaussian quadrature.
1 Introduction
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 586–595, 2008. © Springer-Verlag Berlin Heidelberg 2008

We consider the problem of decoding time-evolving physical state from neural spike trains. Examples include decoding the path of a foraging rat from hippocampal neurons [1,2] and decoding the arm trajectory from motor cortical neurons [3,4,5,6,7,8]. Advances in this area have enabled the development of neural prosthetic devices, which seek to allow disabled patients to regain motor function through the use of prosthetic limbs or computer cursors controlled by neural activity [9,10,11,12,13,14,15]. Several of these prosthetic decoders, including population vectors [11] and linear filters [10,12,15], linearly map the observed neural activity to the estimate of physical state. Although these direct linear mappings are effective, recursive Bayesian decoders have been shown to provide more accurate trajectory estimates [1,6,7,16]. In addition, recursive Bayesian decoders provide confidence regions on the trajectory estimates and allow for nonlinear relationships between
Neural Decoding Using Nonlinear Trajectory Models
587
the neural activity and the physical state variables. Recursive Bayesian decoders are based on the specification of a probabilistic model comprising 1) a trajectory model, which describes how the physical state variables change from one time step to the next, and 2) an observation model, which describes how the observed neural activity relates to the time-evolving physical state. The function of the trajectory model is to build into the decoder prior knowledge about the form of the trajectories. In the case of decoding arm movements, the trajectory model may reflect 1) the hard, physical constraints of the limb (for example, the elbow cannot bend backward), 2) the soft, control constraints imposed by neural mechanisms (for example, the arm is more likely to move smoothly than in a jerky motion), and 3) the physical surroundings of the person and his/her objectives in that environment. The degree to which the trajectory model captures the statistics of the actual movements directly affects the accuracy with which trajectories can be decoded from neural data [8]. The most commonly-used trajectory models assume linear dynamics perturbed by Gaussian noise, which we refer to collectively as linear-Gaussian models. The family of linear-Gaussian models includes the random walk model [1,2,6], those with a constant [8] or time-varying [17,18] forcing term, those without a forcing term [7,16], those with a time-varying state transition matrix [19], and those with higher-order Markov dependencies [20]. Linear-Gaussian models have been successfully applied to decoding the path of a foraging rat [1,2], as well as arm trajectories in ellipse-tracing [6], pursuit-tracking [7,20,16], “pinball” [7,16], and center-out reach [8] tasks. Linear-Gaussian models are widely used primarily due to their computational efficiency, which is an important consideration for real-time decoding applications. 
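As a concrete illustration of recursive Bayesian decoding in this family, the sketch below implements a Kalman filter under the simplest member, the random-walk trajectory model x_t = x_{t−1} + w_t, paired with a linear-Gaussian observation model standing in for neural activity. This is illustrative only: the paper's observation model is Poisson, not Gaussian, and all parameter values here are arbitrary.

```python
import numpy as np

def kalman_decode(Y, C, R, Q, x0, V0):
    """Recursive Bayesian (Kalman) decoding under a random-walk
    trajectory model x_t = x_{t-1} + w_t, w_t ~ N(0, Q), with a
    linear-Gaussian observation model y_t = C x_t + v_t, v_t ~ N(0, R).
    Returns the filtered means and covariances at each time step."""
    x, V = x0.copy(), V0.copy()
    means, covs = [], []
    for y in Y:
        # predict: a random walk leaves the mean unchanged, inflates covariance
        V_pred = V + Q
        # update: standard Kalman gain
        S = C @ V_pred @ C.T + R
        K = V_pred @ C.T @ np.linalg.inv(S)
        x = x + K @ (y - C @ x)
        V = (np.eye(len(x)) - K @ C) @ V_pred
        means.append(x.copy())
        covs.append(V.copy())
    return np.array(means), np.array(covs)

# generate a toy random-walk trajectory and noisy linear observations
rng = np.random.default_rng(0)
p, q, T = 2, 5, 100
C = rng.standard_normal((q, p))
Q, R = 0.01 * np.eye(p), 0.1 * np.eye(q)
xs = np.cumsum(rng.multivariate_normal(np.zeros(p), Q, T), axis=0)
Y = xs @ C.T + rng.multivariate_normal(np.zeros(q), R, T)
means, covs = kalman_decode(Y, C, R, Q, np.zeros(p), np.eye(p))
```

The closed-form predict/update recursion is what makes this family cheap: each step costs a few small matrix operations regardless of trajectory length.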
However, for particular types of movements, the family of linear-Gaussian models may be too restrictive and unable to capture salient properties of the observed movements [8]. We recently proposed a general approach to constructing trajectory models that can exhibit rather complex dynamical behaviors and whose decoder can be implemented to have the same running time (using a parallel implementation) as simpler trajectory models [8]. In particular, we demonstrated that a probabilistic mixture of linear-Gaussian trajectory models, each accurate within a limited regime of movement, can capture the salient properties of goal-directed reaches to multiple targets. This mixture model, which yielded more accurate decoded trajectories than a single linear-Gaussian model, can be viewed as a discrete approximation to a single, unified trajectory model with nonlinear dynamics. An alternate approach is to decode using this single, unified nonlinear trajectory model without discretization. This makes the decoding problem more difficult since nonlinear transformations of parametric distributions are typically no longer easily parametrized. State estimation in nonlinear dynamical systems is a field of active research that has made substantial progress in recent years, including the application of numerical quadrature techniques to dynamical systems [21,22,23], the development of expectation-propagation (EP) [24] and its application to dynamical systems [25,26,27,28], and the improvement in the
588
B.M. Yu et al.
computational efficiency of Monte Carlo techniques (e.g., [29,30,31]). However, these techniques have not been rigorously tested and compared in the context of neural decoding, which typically involves observations that are high-dimensional vectors of non-negative integers. In particular, the tradeoff between decoding accuracy and computational cost among different neural decoding algorithms has not been studied in detail. Knowing the accuracy-computational cost tradeoff is important for real-time applications, where one may need to select the most accurate algorithm given a computational budget or the least computationally intensive algorithm given a minimal acceptable decoding accuracy. This paper takes a step in this direction by comparing three particular deterministic Gaussian approximations. In Section 2, we first introduce the nonlinear dynamical model for neural spike counts and the decoding problem. Sections 3 and 4 detail the three deterministic Gaussian approximations that we focus on in this report: global Laplace, Gaussian quadrature-EP (GQ-EP), and Laplace propagation (LP). Finally, in Section 5, we compare the decoding accuracy versus computational cost of these three techniques.
2 Nonlinear Dynamical Model and Neural Decoding
In this report, we consider nonlinear dynamical models for neural spike counts of the following form:

x_t | x_{t−1} ∼ N(f(x_{t−1}), Q)  (1a)
y_t^i | x_t ∼ Poisson(λ_i(x_t) · Δ),  (1b)
where x_t ∈ R^{p×1} is a vector containing the physical state variables at time t = 1, . . . , T; y_t^i ∈ {0, 1, 2, . . .} is the corresponding observed spike count for neuron i = 1, . . . , q, taken in a time bin of width Δ; and Q ∈ R^{p×p} is a covariance matrix. The functions f : R^{p×1} → R^{p×1} and λ_i : R^{p×1} → R⁺ are, in general, nonlinear. The initial state x_1 is Gaussian-distributed. For notational compactness, the spike counts for all q simultaneously recorded neurons are assembled into a q × 1 vector y_t, whose ith element is y_t^i. Note that the observations are discrete-valued and that, typically, q ≫ p. Equations (1a) and (1b) are referred to as the trajectory and observation models, respectively. The task of neural decoding involves finding, at each timepoint t, the likely physical states x_t given the neural activity observed up to that time, {y}_1^t. In other words, we seek to compute the filtered state posterior P(x_t | {y}_1^t) at each t. We previously showed how to estimate the filtered state posterior when f is a linear function [8]. Here, we consider how to compute P(x_t | {y}_1^t) when f is nonlinear. The extended Kalman filter (EKF) is a commonly used technique for nonlinear state estimation. Unfortunately, it cannot be directly applied to the current problem because the observation noise in (1b) is not additive Gaussian. Possible alternatives are the unscented Kalman filter (UKF) [21,22] and the closely-related quadrature Kalman filter (QKF) [23], both of which employ quadrature
techniques to approximate Gaussian integrals that are analytically intractable. While the UKF has been shown to outperform the EKF [21,22], the UKF requires making Gaussian approximations in the observation space. This property of the UKF is undesirable from the standpoint of the current problem because the observed spike counts are typically 0 or 1 (due to the use of relatively short bin widths Δ) and, therefore, distinctly non-Gaussian. As a result, the UKF yielded substantially lower decoding accuracy than the techniques presented in Sections 3 and 4 [28], which make Gaussian approximations only in the state space. While we have not yet tested the QKF, the number of quadrature points required grows geometrically with p + q, which quickly becomes impractical even for moderate values of p and q. Thus, we will no longer consider the UKF and QKF in the remainder of this paper. The decoding techniques described in Sections 3 and 4 naturally yield the smoothed state posterior P(x_t | {y}_1^T), rather than the filtered state posterior P(x_t | {y}_1^t). Thus, we will focus on the smoothed state posterior in this work. However, the filtered state posterior at time t can be easily obtained by smoothing using only observations from timepoints 1, . . . , t.
3 Global Laplace
The idea is to estimate the joint state posterior across the entire sequence (i.e., the global state posterior) as a Gaussian matched to the location and curvature of a mode of P({x}_1^T | {y}_1^T), as in Laplace's method [32]. The mode is defined as

{x*}_1^T = argmax_{{x}_1^T} P({x}_1^T | {y}_1^T) = argmax_{{x}_1^T} L({x}_1^T),  (2)

where

L({x}_1^T) = log P({x}_1^T, {y}_1^T) = log P(x_1) + Σ_{t=2}^T log P(x_t | x_{t−1}) + Σ_{t=1}^T Σ_{i=1}^q log P(y_t^i | x_t).  (3)

Using the known distributions (1), the gradients of L({x}_1^T) can be computed exactly and a local mode {x*}_1^T can be found by applying a gradient optimization technique. The global state posterior is then approximated as:

P({x}_1^T | {y}_1^T) ≈ N({x*}_1^T, (−∇²L({x*}_1^T))⁻¹).  (4)
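A minimal sketch of the global Laplace idea on a hypothetical 1-D model (exponential Poisson rates rather than the paper's model): the joint log probability of the whole trajectory is maximized by a gradient optimizer, and BFGS's running inverse-Hessian estimate serves as a rough stand-in for the covariance in (4). All parameter values here are illustrative, and the optimizer and Hessian handling differ from the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_joint(X, Y, a=0.9, qv=0.1, c=1.0, d=1.0, dt=0.1):
    """-log P({x},{y}) (up to constants) for a toy 1-D model:
    x_t = a x_{t-1} + w_t, w_t ~ N(0, qv);  y_t ~ Poisson(exp(c x_t + d) dt).
    An illustrative stand-in for the log joint L in (3)."""
    trans = X[0]**2 / (2 * qv) + np.sum((X[1:] - a * X[:-1])**2) / (2 * qv)
    rate = np.exp(c * X + d) * dt
    obs = np.sum(rate - Y * np.log(rate))   # Poisson NLL up to a constant
    return trans + obs

# simulate a trajectory and spike counts from the toy model
rng = np.random.default_rng(1)
T = 50
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal(0.0, np.sqrt(0.1))
Y = rng.poisson(np.exp(x_true + 1.0) * 0.1)

# mode of the joint = global Laplace mean; BFGS's inverse-Hessian
# estimate approximates the posterior covariance in (4)
res = minimize(neg_log_joint, np.zeros(T), args=(Y,), method="BFGS")
x_map, cov = res.x, res.hess_inv
```

Note that the BFGS `hess_inv` is only a quasi-Newton approximation; computing the exact Hessian of L (which is block-tridiagonal for this model class) would be both more accurate and cheaper at scale.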
4 Expectation Propagation
We briefly summarize here the application of EP [24] to dynamical models [25,26,27,28]. More details can be found in the cited references. The two primary distributions of interest here are the marginal P(x_t | {y}_1^T) and pairwise
joint P(x_{t−1}, x_t | {y}_1^T) state posteriors. These distributions can be expressed in terms of forward α_t and backward β_t messages as follows:

P(x_t | {y}_1^T) = α_t(x_t) β_t(x_t) / P({y}_1^T),  (5)

P(x_{t−1}, x_t | {y}_1^T) = α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T),  (6)
where α_t(x_t) = P(x_t, {y}_1^t) and β_t(x_t) = P({y}_{t+1}^T | x_t). The messages α_t and β_t are typically approximated by an exponential family density; in this paper, we use an unnormalized Gaussian. These approximate messages are iteratively updated by matching the expected sufficient statistics¹ of the marginal posterior (5) with those of the pairwise joint posterior (6). The updates are usually performed sequentially via multiple forward-backward passes. During the forward pass, the α_t are updated while the β_t remain fixed:

P(x_t | {y}_1^T) = ∫ α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T) dx_{t−1}  (7)
≈ ∫ P̂(x_{t−1}, x_t) dx_{t−1}  (8)
α_t(x_t) ∝ ∫ P̂(x_t, x_{t−1}) dx_{t−1} / β_t(x_t),  (9)

where P̂(x_{t−1}, x_t) is an exponential family distribution whose expected sufficient statistics are matched to those of P(x_{t−1}, x_t | {y}_1^T). In this paper, P̂(x_{t−1}, x_t) is assumed to be Gaussian. The backward pass proceeds similarly, where the β_t are updated while the α_t remain fixed. The decoded trajectory is obtained by combining the messages α_t and β_t, as shown in (5), after completing the forward-backward passes. In Section 5, we investigate the accuracy-computational cost tradeoff of using different numbers of forward-backward iterations. Although the expected sufficient statistics (or moments) of P(x_{t−1}, x_t | {y}_1^T) cannot typically be computed analytically for the nonlinear dynamical model (1), they can be approximated using Gaussian quadrature [26,28]. This EP-based decoder is referred to as GQ-EP. By applying the ideas of Laplace propagation (LP) [33], a closely-related decoder has been developed that uses a modal Gaussian approximation of P(x_{t−1}, x_t | {y}_1^T) rather than matching moments [27,28]. This technique, which uses the same message-passing scheme as GQ-EP, is referred to here as LP. In practice, it is possible to encounter invalid message updates.
For example, if the variance of x_t in the numerator is larger than that in the denominator in (9), due to approximation error in the choice of P̂, the update rule would assign α_t(x_t) a negative variance. A way around this problem is to simply skip that message update and hope that the update is no longer invalid during the next
¹ If the approximating distributions are assumed to be Gaussian, this is equivalent to matching the first two moments.
forward-backward iteration [34]. An alternative is to set β_t(x_t) = 1 in (7) and (9), which guarantees a valid update for α_t(x_t). This is referred to as the one-sided update, and its implications for decoding accuracy and computation time are considered in Section 5.
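The moment-matching step at the heart of GQ-EP can be sketched in one dimension: fit a Gaussian to the product of a Gaussian message and a Poisson likelihood by computing its moments with Gauss-Hermite quadrature. This toy version places quadrature nodes under the incoming Gaussian rather than a modal proposal, and assumes an exponential rate function; both are simplifying assumptions, not the paper's implementation (which uses a modal proposal and a custom precision-3 rule in higher dimensions).

```python
import numpy as np

def gh_posterior_moments(m, v, y, dt=0.1, n=20):
    """Moment-match a Gaussian to p(x) ∝ N(x; m, v) · Poisson(y; exp(x)·dt)
    using n-point Gauss-Hermite quadrature (illustrative 1-D version of the
    per-site computation in GQ-EP)."""
    xi, w = np.polynomial.hermite.hermgauss(n)
    x = m + np.sqrt(2.0 * v) * xi          # nodes transformed under the prior
    rate = np.exp(x) * dt
    lik = rate**y * np.exp(-rate)          # Poisson likelihood (up to y!)
    Z = np.sum(w * lik)                    # normalizer (1/sqrt(pi) cancels)
    mean = np.sum(w * lik * x) / Z
    var = np.sum(w * lik * (x - mean)**2) / Z
    return mean, var

# observing 3 spikes in a bin where the prior predicts ~0.1 pulls the mean up
m1, v1 = gh_posterior_moments(0.0, 1.0, y=3)
# observing 0 spikes pulls it down
m0, v0 = gh_posterior_moments(0.0, 1.0, y=0)
```

Because the Poisson likelihood with an exponential rate is log-concave in x, the matched variance is always smaller than the incoming variance, so the update here is never invalid; in the multivariate message-passing setting of (9), division by β_t can still produce the negative variances discussed above.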
5 Results
We evaluated the decoding accuracy versus computational cost of the techniques described in Sections 3 and 4. These performance comparisons were based on the model (1), where

f(x) = (1 − k) x + k · W · erf(x)  (10)
λ_i(x) = log(1 + e^{c_i^⊤ x + d_i})  (11)
with parameters W ∈ R^{p×p}, c_i ∈ R^{p×1}, and d_i ∈ R. The error function (erf) in (10) acts element-by-element on its argument. We have chosen the dynamics (10) of a fully-connected recurrent network due to their nonlinear nature; we make no claims in this work about their suitability for particular decoding applications, such as for rat paths or arm trajectories. Because recurrent networks are often used to directly model neural activity, it is important to emphasize that x is a vector of physical state variables to be decoded, not a vector of neural activity. We generated 50 state trajectories, each with 50 time points, and corresponding spike counts from the model (1), where the model parameters were randomly chosen within a range that provided biologically realistic spike counts (typically, 0 or 1 spike in each bin). The time constant k ∈ R was set to 0.1. To understand how these algorithms scale with different numbers of physical state variables and observed neurons, we considered all pairings (p, q), where p ∈ {3, 10} and q ∈ {20, 100, 500}. For each pairing, we repeated the above procedure three times. For the global Laplace decoder, the modal trajectory was found using Polack-Ribière conjugate gradients with quadratic/cubic line searches and Wolfe-Powell stopping criteria (minimize.m by Carl Rasmussen, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/). To stabilize GQ-EP, we used a modal Gaussian proposal distribution and the custom precision 3 quadrature rule with non-negative quadrature weights, as described in [28]. For both GQ-EP and LP, minimize.m was used to find a mode of P(x_{t−1}, x_t | {y}_1^T).
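The simulation setup can be sketched as follows. The dimensions p, q, T and the time constant k = 0.1 follow one of the paper's pairings; the specific parameter scales (weight normalization, bin width, bias range) are illustrative choices, not those used in the paper.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
p, q, T = 3, 20, 50            # state dim, neurons, timepoints (one pairing)
k, dt = 0.1, 1.0               # time constant; bin width Delta (illustrative)
W = rng.standard_normal((p, p)) / np.sqrt(p)  # recurrent weights (illustrative scale)
C = rng.standard_normal((q, p))               # the c_i stacked as rows
d = rng.uniform(-2.0, 0.0, q)                 # d_i chosen so counts are mostly 0/1
Q = 0.01 * np.eye(p)

def f(x):
    # nonlinear dynamics, eq. (10): erf acts element-by-element
    return (1 - k) * x + k * (W @ erf(x))

def rates(x):
    # soft-rectified rates, eq. (11): log(1 + exp(c_i'x + d_i))
    return np.log1p(np.exp(C @ x + d))

# sample a state trajectory and Poisson spike counts from (1a)-(1b)
X = np.zeros((T, p))
Y = np.zeros((T, q), dtype=int)
X[0] = rng.multivariate_normal(np.zeros(p), Q)
Y[0] = rng.poisson(rates(X[0]) * dt)
for t in range(1, T):
    X[t] = rng.multivariate_normal(f(X[t - 1]), Q)
    Y[t] = rng.poisson(rates(X[t]) * dt)
```

With the bias range chosen here the simulated counts come out sparse, mirroring the "typically 0 or 1 spike per bin" regime described above.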
Each panel corresponds to a different number of state variables and observed neurons. For GQ-EP (dotted line) and LP (solid line), we varied the number of forward-backward iterations between one and three; thus, there are three circles for each of these decoders. Across all panels, global Laplace required the least computation time and yielded state
[Fig. 1: six panels (a–f), each plotting log probability (vertical axis) against computation time in seconds (horizontal axis, log scale).]
Fig. 1. Decoding accuracy versus computation time of global Laplace (no line), GQ-EP (dotted line), and LP (solid line). (a) p = 3, q = 20, (b) p = 3, q = 100, (c) p = 3, q = 500, (d) p = 10, q = 20, (e) p = 10, q = 100, (f) p = 10, q = 500. The circles and bars represent mean ± SEM. Variability in computation time is not represented on the plots because it was negligible. The computation times were obtained using a 2.2-GHz AMD Athlon 64 processor with 2 GB RAM running MATLAB R14. Note that the scale of the vertical axes is not the same in each panel and that some error bars are so small that they cannot be seen.
estimates as accurate as, or more accurate than, the other techniques. This is the key result of this report. We also implemented a basic particle smoother [35], where the number of particles (500 to 1500) was chosen such that its computation time was on the same order as those shown in Fig. 1 (results not shown). Although this particle smoother yielded substantially lower decoding accuracy than global Laplace, GQ-EP, and LP, the three deterministic techniques should be compared to more recently developed Monte Carlo techniques, as described in Section 6. Fig. 1 shows that all three techniques have computation times that scale well with the number of state variables p and neurons q. In particular, the required computation time typically scales sub-linearly with increases in p and far sub-linearly with increases in q. As q increases, the accuracies of the techniques become more similar (note that different panels have different vertical scales), and there is less advantage to performing multiple forward-backward iterations for GQ-EP and LP. The decoding accuracy and required computation time both typically increase with the number of iterations. In a few cases (e.g., GQ-EP in Fig. 1(b)), it is possible for the accuracy to decrease slightly when going from two to three iterations, presumably due to one-sided updates. In theory, GQ-EP should require greater computation time than LP because it needs to perform the same modal Gaussian approximation and then use it as a proposal distribution for Gaussian quadrature. In practice, it is possible for LP
to be slower if it requires many one-sided updates (cf. Fig. 1(d)), since one-sided updates are used only when the usual update (9) fails. Furthermore, LP required greater computation time in Fig. 1(d) than in Fig. 1(e) due to the need for many more one-sided updates, despite having five times fewer neurons. It was previously shown that {x*}_1^T is a local optimum of P({x}_1^T | {y}_1^T) (i.e., a solution of global Laplace) if and only if it is a fixed point of LP [33]. Because the modal Gaussian approximation matches local curvature up to second order, it can also be shown that the estimated covariances using global Laplace and LP are equal at {x*}_1^T [33]. Empirically, we found both statements to be true if few one-sided updates were required for LP. Due to these connections between global Laplace and LP, the accuracy of LP after three forward-backward iterations was similar to that of global Laplace in all panels of Fig. 1. Although LP may offer computational savings over global Laplace in certain applications [33], we found that global Laplace was substantially faster for the particular graph structure described by (1).
6 Conclusion
We have presented three deterministic techniques for nonlinear state estimation (global Laplace, GQ-EP, LP) and compared their decoding accuracy versus computational cost in the context of neural decoding, which involves high-dimensional observations of non-negative integers. This work can be extended in the following directions. First, the deterministic techniques presented here should be compared to recently developed Monte Carlo techniques that have yielded increased accuracy and/or reduced computational cost compared to the basic particle filter/smoother in applications other than neural decoding [29]. Examples include the Gaussian particle filter [31], the sigma-point particle filter [30], and the embedded hidden Markov model [36]. Second, we have compared these decoders based on one particular nonlinear trajectory model (10). Other nonlinear trajectory models (e.g., a model describing primate arm movements [37]) should be tested to see whether the decoders have similar accuracy-computational cost tradeoffs to those shown here. Acknowledgments. This work was supported by NIH-NINDS-CRCNS-R01, NDSEG Fellowship, NSF Graduate Research Fellowship, Gatsby Charitable Foundation, Michael Flynn Stanford Graduate Fellowship, Christopher Reeve Paralysis Foundation, Burroughs Wellcome Fund Career Award in the Biomedical Sciences, Stanford Center for Integrated Systems, NSF Center for Neuromorphic Systems Engineering at Caltech, Office of Naval Research, Sloan Foundation and Whitaker Foundation.
References

1. Brown, E.N., Frank, L.M., Tang, D., Quirk, M.C., Wilson, M.A.: A statistical paradigm for neural spike train decoding applied to position prediction from the ensemble firing patterns of rat hippocampal place cells. J. Neurosci. 18(18), 7411–7425 (1998)
2. Zhang, K., Ginzburg, I., McNaughton, B.L., Sejnowski, T.J.: Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol. 79, 1017–1044 (1998)
3. Wessberg, J., Stambaugh, C.R., Kralik, J.D., Beck, P.D., Laubach, M., Chapin, J.K., Kim, J., Biggs, J., Srinivasan, M.A., Nicolelis, M.A.L.: Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408(6810), 361–365 (2000)
4. Schwartz, A.B., Taylor, D.M., Tillery, S.I.H.: Extraction algorithms for cortical control of arm prosthetics. Curr. Opin. Neurobiol. 11, 701–707 (2001)
5. Serruya, M., Hatsopoulos, N., Fellows, M., Paninski, L., Donoghue, J.: Robustness of neuroprosthetic decoding algorithms. Biol. Cybern. 88(3), 219–228 (2003)
6. Brockwell, A.E., Rojas, A.L., Kass, R.E.: Recursive Bayesian decoding of motor cortical signals by particle filtering. J. Neurophysiol. 91(4), 1899–1907 (2004)
7. Wu, W., Black, M.J., Mumford, D., Gao, Y., Bienenstock, E., Donoghue, J.P.: Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Trans. Biomed. Eng. 51(6), 933–942 (2004)
8. Yu, B.M., Kemere, C., Santhanam, G., Afshar, A., Ryu, S.I., Meng, T.H., Sahani, M., Shenoy, K.V.: Mixture of trajectory models for neural decoding of goal-directed movements. J. Neurophysiol. 97, 3763–3780 (2007)
9. Chapin, J.K., Moxon, K.A., Markowitz, R.S., Nicolelis, M.A.L.: Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci. 2, 664–670 (1999)
10. Serruya, M.D., Hatsopoulos, N.G., Paninski, L., Fellows, M.R., Donoghue, J.P.: Instant neural control of a movement signal. Nature 416, 141–142 (2002)
11. Taylor, D.M., Tillery, S.I.H., Schwartz, A.B.: Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002)
12.
Carmena, J.M., Lebedev, M.A., Crist, R.E., O'Doherty, J.E., Santucci, D.M., Dimitrov, D.F., Patil, P.G., Henriquez, C.S., Nicolelis, M.A.L.: Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1(2), 193–208 (2003)
13. Musallam, S., Corneil, B.D., Greger, B., Scherberger, H., Andersen, R.A.: Cognitive control signals for neural prosthetics. Science 305, 258–262 (2004)
14. Santhanam, G., Ryu, S.I., Yu, B.M., Afshar, A., Shenoy, K.V.: A high-performance brain-computer interface. Nature 442, 195–198 (2006)
15. Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006)
16. Wu, W., Gao, Y., Bienenstock, E., Donoghue, J.P., Black, M.J.: Bayesian population decoding of motor cortical activity using a Kalman filter. Neural Comput. 18(1), 80–118 (2006)
17. Kemere, C., Meng, T.: Optimal estimation of feed-forward-controlled linear systems. In: Proc. IEEE ICASSP, pp. 353–356 (2005)
18. Srinivasan, L., Eden, U.T., Willsky, A.S., Brown, E.N.: A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Comput. 18(10), 2465–2494 (2006)
19. Srinivasan, L., Brown, E.N.: A state-space framework for movement control to dynamic goals through brain-driven interfaces. IEEE Trans. Biomed. Eng. 54(3), 526–535 (2007)
20. Shoham, S., Paninski, L.M., Fellows, M.R., Hatsopoulos, N.G., Donoghue, J.P., Normann, R.A.: Statistical encoding model for a primary motor cortical brain-machine interface. IEEE Trans. Biomed. Eng. 52(7), 1313–1322 (2005)
21. Wan, E., van der Merwe, R.: The unscented Kalman filter. In: Haykin, S. (ed.) Kalman Filtering and Neural Networks. Wiley Publishing, Chichester (2001)
22. Julier, S., Uhlmann, J.: Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92(3), 401–422 (2004)
23. Arasaratnam, I., Haykin, S., Elliott, R.: Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature. Proceedings of the IEEE 95(5), 953–977 (2007)
24. Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 362–369 (2001)
25. Heskes, T., Zoeter, O.: Expectation propagation for approximate inference in dynamic Bayesian networks. In: Darwiche, A., Friedman, N. (eds.) Proceedings UAI-2002, pp. 216–223 (2002)
26. Zoeter, O., Ypma, A., Heskes, T.: Improved unscented Kalman smoothing for stock volatility estimation. In: Barros, A., Principe, J., Larsen, J., Adali, T., Douglas, S. (eds.) Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (2004)
27. Ypma, A., Heskes, T.: Novel approximations for inference in nonlinear dynamical systems using expectation propagation. Neurocomputing 69, 85–99 (2005)
28. Yu, B.M., Shenoy, K.V., Sahani, M.: Expectation propagation for inference in nonlinear dynamical models with Poisson observations. In: Proc. IEEE Nonlinear Statistical Signal Processing Workshop (2006)
29. Doucet, A., de Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
30. van der Merwe, R., Wan, E.: Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In: Proceedings of the Workshop on Advances in Machine Learning (2003)
31. Kotecha, J.H., Djuric, P.M.: Gaussian particle filtering. IEEE Transactions on Signal Processing 51(10), 2592–2601 (2003)
32. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
33.
Smola, A., Vishwanathan, V., Eskin, E.: Laplace propagation. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
34. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 352–359 (2002)
35. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–208 (2000)
36. Neal, R.M., Beal, M.J., Roweis, S.T.: Inferring state sequences for non-linear systems with embedded hidden Markov models. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
37. Chan, S.S., Moran, D.W.: Computational model of a primate arm: from hand position to joint angles, joint torques and muscle forces. J. Neural Eng. 3, 327–337 (2006)
Estimating Internal Variables of a Decision Maker's Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima¹ and Kenji Doya²

¹ Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan
[email protected]
² Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
[email protected]
Abstract. A major problem in the search for neural substrates of learning and decision making is that the process is highly stochastic and subject-dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and parameters of learning) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for the variables by particle filtering, a recent method of dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.
1 Introduction

The traditional approach in neuroscience to discovering information processing mechanisms is to correlate neuronal activities with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory, and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive or learning processes have become available that make quantitative predictions of a subject's behavioral responses. Thus a possible new approach is to find neural activities that correlate with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change over time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model, or how to select the best model among a number of candidates.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 596–603, 2008. © Springer-Verlag Berlin Heidelberg 2008
The framework of Bayesian inference can provide coherent solutions to both issues: estimating hidden variables, including meta-parameters, from observable experimental data, and selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).
2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can serve as a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule for taking an action a_t at state s_t, so that the resulting rewards r_t are maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the value function for each state, and then to improve the policy based on the value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function
Q(s_t, a_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | s_t, a_t ],    (1)

which estimates the cumulative future reward when action a_t is taken at state s_t. The discount factor 0 < γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action values, e.g., according to the Boltzmann distribution

π(a | s_t) = exp(β Q(s_t, a)) / Σ_{a′∈A} exp(β Q(s_t, a′)),    (2)
where the inverse temperature β > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as

δ_t = r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t),
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α δ_t,    (3)
where α > 0 is the meta-parameter for the learning rate. Thus, in the case of a Q-learning agent, we have three meta-parameters. Such a reinforcement learning model of behavior learning does not only predict a subject's actions, but can also provide candidates for the brain's internal processes of decision making, which may be captured in neural recording or brain imaging data. However, a big problem is that the predictions depend on the settings of the meta-parameters, such as the learning rate α, action randomness β, and discount factor γ.
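For concreteness, the softmax policy (2) and the update rule (3) can be sketched as below on a hypothetical one-state, two-action bandit; the meta-parameter values and reward probabilities are arbitrary illustrative choices, not those of the experiments discussed later.

```python
import numpy as np

def softmax_policy(q, beta):
    """Boltzmann action-selection probabilities, Eq. (2)."""
    e = np.exp(beta * (q - q.max()))  # subtract max for numerical stability
    return e / e.sum()

def q_update(q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step, Eq. (3); returns the TD error delta_t."""
    delta = r + gamma * q[s_next].max() - q[s, a]
    q[s, a] += alpha * delta
    return delta

# Hypothetical one-state, two-action bandit with fixed reward probabilities.
rng = np.random.default_rng(0)
q = np.zeros((1, 2))                 # Q(s, a): one state, two actions
alpha, beta, gamma = 0.1, 2.0, 0.0   # arbitrary meta-parameter settings
p_reward = [0.1, 0.9]                # hypothetical reward probabilities
for t in range(1000):
    a = rng.choice(2, p=softmax_policy(q[0], beta))
    r = float(rng.random() < p_reward[a])
    q_update(q, 0, a, r, 0, alpha, gamma)
print(q[0])  # with gamma = 0, Q-values track the reward probabilities
```

With γ = 0 each Q-value is an exponentially weighted running average of that action's binary rewards, so the learned values approach the underlying reward probabilities.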
3 Probabilistic Dynamic Evolution of Internal Variables of a Q-Learning Agent

Let us consider the problem of estimating the course of action values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T} and the meta-parameters α, β, and γ of a reinforcement learner by observing only the sequence of states s_t, actions a_t, and rewards r_t. To solve this problem, we use a Bayesian method for estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N}. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N}, with marginal distribution p(y_t | x_t). The problem to solve
in this setting is to estimate recursively in time the posterior distribution of the hidden variable, p(x_{1:t} | y_{1:t}), where x_{0:T} = {x_0, …, x_T} and y_{1:T} = {y_1, …, y_T}. The marginal distribution is given by the recursive procedure of the following prediction and updating steps.

Predicting:

p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

Updating:

p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t
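The prediction and updating steps above can be approximated with a bootstrap filter: particles are pushed through the transition model and reweighted by the likelihood. The one-dimensional random-walk state with Gaussian observations used below is a hypothetical placeholder model, not the Q-learning model, which is treated later.

```python
import numpy as np

def bootstrap_filter_step(particles, transition_sample, likelihood, y, rng):
    """One predicting/updating/resampling cycle of a bootstrap particle filter.

    particles:         (N, d) samples approximating p(x_{t-1} | y_{1:t-1})
    transition_sample: draws x_t ~ p(x_t | x_{t-1}) for each particle
    likelihood:        evaluates p(y_t | x_t) for each particle
    """
    proposed = transition_sample(particles, rng)              # predicting
    w = likelihood(y, proposed)                               # updating
    w = w / w.sum()
    idx = rng.choice(len(proposed), size=len(proposed), p=w)  # resampling
    return proposed[idx]

# Hypothetical 1-D random-walk state observed with Gaussian noise.
rng = np.random.default_rng(1)
transition = lambda x, rng: x + rng.normal(0.0, 0.1, size=x.shape)
likelihood = lambda y, x: np.exp(-0.5 * ((y - x[:, 0]) / 0.5) ** 2)
particles = rng.normal(0.0, 1.0, size=(500, 1))   # samples from p(x_0)
for y in [0.2, 0.3, 0.25, 0.4]:                   # observed sequence
    particles = bootstrap_filter_step(particles, transition, likelihood, y, rng)
print(particles.mean())  # posterior mean estimate of the current state
```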
A numerical method for solving this Bayesian recursion, called the particle filter, has been proposed (Doucet et al., 2001). In a particle filter, the distribution over the sequence of hidden variables is represented by a set of random samples, also called "particles". We use a bootstrap filter to calculate the recursion of prediction and updating over the distribution of particles (Doucet et al., 2001). Figure 1 shows the dynamic Bayesian network representation of the evolution of the internal variables of a Q-learning agent. The hidden variable x_t consists of the action values Q(s, a) for each state-action pair, the learning rate α, the inverse temperature β, and
discount factor γ. The observable variable y_t consists of the states s_t, actions a_t, and rewards r_t. The observation probability p(y_t | x_t) is given by the softmax action selection (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) and an assumption on the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant up to small drifts. Because α, β, and γ should all be positive, we assume random-walk dynamics in logarithmic space:
log(x_{t+1}) = log(x_t) + ε_x,   ε_x ~ N(0, σ_x),    (4)

where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
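Combining the softmax likelihood (2), the Q-learning transition (3), and the log-space random walk (4), a particle filter over the agent's internal variables can be sketched as follows. This is a simplified one-state version with hypothetical particle counts, priors, and an artificial choice/reward sequence; γ is fixed at 0 here rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000          # number of particles (arbitrary)
sigma_x = 0.05    # meta-meta-parameter of Eq. (4), arbitrary

# Each particle: [Q_L, Q_R, log_alpha, log_beta] (one state, two actions;
# gamma is fixed to 0, so it is not estimated in this sketch).
particles = np.column_stack([
    np.zeros(N), np.zeros(N),
    np.log(rng.uniform(0.01, 0.5, N)),   # prior over learning rate alpha
    np.log(rng.uniform(0.5, 5.0, N)),    # prior over inverse temperature beta
])

def filter_step(particles, a_obs, r_obs, rng):
    """Weight by the softmax likelihood of the observed action (Eq. 2),
    resample, then propagate Q-values (Eq. 3) and meta-parameters (Eq. 4)."""
    q, log_ab = particles[:, :2], particles[:, 2:]
    beta = np.exp(log_ab[:, 1])
    logits = beta[:, None] * q
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    w = p[:, a_obs]
    w /= w.sum()                                       # likelihood weights
    idx = rng.choice(len(particles), len(particles), p=w)
    q, log_ab = q[idx].copy(), log_ab[idx].copy()      # resampling
    alpha = np.exp(log_ab[:, 0])
    q[:, a_obs] += alpha * (r_obs - q[:, a_obs])       # Q-learning, gamma = 0
    log_ab += rng.normal(0.0, sigma_x, log_ab.shape)   # Eq. (4) random walk
    return np.column_stack([q, log_ab])

# Hypothetical observed choice/reward sequence: the decision maker mostly
# picks action 1 and is usually rewarded, so the estimated Q(1) should rise.
for a_obs, r_obs in [(1, 1), (1, 1), (0, 0), (1, 1), (1, 0), (1, 1)] * 5:
    particles = filter_step(particles, a_obs, r_obs, rng)
print(particles[:, :2].mean(axis=0))  # posterior mean action values
```

The posterior over the meta-parameters is simply the marginal of the particle cloud over the log_alpha and log_beta columns, which is how subject-by-subject meta-parameters are characterized in the applications below.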
Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, reward probability, state transition, and the update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependencies.
4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (2005) used the internal-variable approach with a Q-learning model for a monkey free-choice task, a two-armed bandit problem
(Figure 2). The task has only one state, two actions, and a stochastic binary reward. The reward probability for each action is fixed within a block of 30–150 trials, but is chosen randomly, block by block, from five probability combinations. The reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected at the beginning of each block from the five settings [P(a=L), P(a=R)] = {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}.
Fig. 2. Two-armed bandit task for the monkey's behavioral choice. Upper: time course of the task. The monkey faced a panel in which three LEDs (right, left, and up) were embedded, with a small LED in the middle. When the small LED was illuminated red, the monkey grasped a handle with its right hand and held it at the center position. If the monkey held the handle at the center position for 1 s, the small LED was turned off as the GO signal. The monkey then turned the handle to either the right or the left side, which shifted the yellow LED illumination from up to the turned direction. After 0.5 s, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. Circles indicate states. Arrows indicate possible actions and state transitions.
The Q-learning model of monkey behavior learns the reward expectation of each action, the action value, so as to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take the next state's value into account, and thus we set the discount factor to γ = 0. Samejima et al. (2005) showed that the estimated internal variable, the action value for a particular movement direction (left/right), inferred from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). The action value is an example of a variable that could not be immediately
Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red = right, blue = left), rewards (dot = small, circle = large), choice ratio (cyan line, Gaussian smoothed, σ = 2.5), and predicted choice probability (black line). The color of the upper bar indicates the reward probability combination. Lower panel: estimated action values (blue = Q-value for left, red = Q-value for right). (From Samejima et al. 2005).
Fig. 4. An example of the activity of a caudate neuron plotted in the space of estimated action values QL(t) and QR(t). Left panel: three-dimensional plot of neural activity against the estimated QL(t) and QR(t). Right panels: 2-D projected plots of the discharge rates of the neuron on the QL axis (left side) and the QR axis (right side). Grey lines are derived from the regression model. Circles and error bars indicate the average and standard deviation of neural discharge rates for each of 10 equally populated action-value bins. (From Samejima et al. 2005).
obvious from observable experimental parameters, but that can be inferred using an action-predicting computational model. Furthermore, the activity of most dorsal striatum projection neurons correlated with the estimated action value of a particular action (Figure 4).

4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g., learning rate, action stochasticity, and discount rate for future reward) can be estimated by this methodology. Although the subjective values of the learning meta-parameters may differ between individual subjects, the model-based approach can track the subjective internal values under different meta-parameters. Especially in human imaging studies, this
flexibility is effective for extracting common neural circuit activation in multi-subject experiments. One problem in the cognitive neuroscience of decision-making tasks is the lack of controllability of internal variables. In conventional analyses in neuroscience and brain-imaging studies, the experimenter tries to control a cognitive state or an assumed internal parameter through a task demand or an experimental setting. Observed brain activities are then compared to the assumed parameter. However, the subjective internal variables may depend on personal behavioral tendencies and may differ from the parameters assumed by the experimenter. The Bayesian estimation method for internal variables, including meta-parameters, can reduce such noise due to personal differences by fitting the meta-parameters. Tanaka et al. (2006) showed that the variety of behavioral tendencies across multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distribution of the three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. Subjects whose estimated γ was lower tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, subjects whose learning rate α and inverse temperature β were estimated to be lower than the others' reported in a post-experiment questionnaire that they could not find any confident action selection in each state, even in later experimental sessions of the task (Figure 5, right panel). Regardless of the variety of the subjects' behavioral tendencies, an fMRI signal correlating with the estimated action value of the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions were completely random, whereas dorsal striatum activity correlated with the action value in the predictable environment, in which the state transitions were deterministic.
This suggests that different cortico-basal ganglia circuits may be involved depending on the predictability of the environment (Tanaka et al., 2006).
Fig. 5. Subject distribution of estimated meta-parameters: learning rate α, action stochasticity (inverse temperature) β, and discount rate γ. Left panel: distribution in α−γ space. Subjects LI, NN, and NT (inside the ellipsoid) were trapped in locally optimal action sequences. Right panel: distribution in α−β space. Subjects BB and LI (inside the ellipsoid) reported that they could not find any confident strategy.
5 Conclusion

The theoretical framework of reinforcement learning for modeling behavioral decision making, together with the Bayesian estimation method for subjective internal variables, provides powerful tools for analyzing both neural recordings (Samejima et al., 2005) and human imaging data (Daw et al., 2006; Pessiglione et al., 2006; Tanaka et al., 2006). In particular, tracking the meta-parameters of RL can capture the behavioral tendencies of animal or human decision making. Recently, a correlation between anterior cingulate cortex activity and the learning rate under uncertain environmental change was reported using a Bayesian decision model with a temporally evolving learning-rate parameter (Behrens et al., 2007). Although not detailed in this paper, the Bayesian estimation framework also provides a way of objectively selecting the best model in reference to the given data. The combination of Bayesian model selection and hidden-variable estimation methods should contribute to a new understanding of the decision mechanisms of our brain through falsifiable hypotheses and objective experimental tests.
References

1. Behrens, T.E., Woolrich, M.W., Walton, M.E., Rushworth, M.F.: Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007)
2. Corrado, G., Doya, K.: Understanding neural coding through the model-based analysis of decision making. J. Neurosci. 27, 8178–8180 (2007)
3. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006)
4. Doucet, A., Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
5. Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., Imamizu, H., Kawato, M.: A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J. Neurosci. 24, 1660–1665 (2004)
6. Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., Frith, C.D.: Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045 (2006)
7. Samejima, K., Doya, K., Ueda, Y., Kimura, M.: Advances in Neural Information Processing Systems, vol. 16. The MIT Press, Cambridge (2004)
8. Samejima, K., Ueda, Y., Doya, K., Kimura, M.: Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005)
9. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
11. Tanaka, S.C., Samejima, K., Okada, G., Ueda, K., Okamoto, Y., Yamawaki, S., Doya, K.: Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw. 19, 1233–1241 (2006)
Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions

Tomohiro Shibata 1, Takashi Bando 2, and Shin Ishii 1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology [email protected]
2 DENSO Corporation
3 Graduate School of Informatics, Kyoto University
Abstract. Because of the inevitable ill-posedness of visual information, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. From a computational point of view, visual tracking is the real-time process of statistical spatiotemporal filtering of target states from an image stream, and incremental Bayesian computation is one of its most important devices. To make the Bayesian computation of the posterior density of the state variables tractable for arbitrary probability distributions, particle filters (PFs) have often been employed in the real-time vision area. In this paper, we briefly review incremental Bayesian computation and PFs for visual tracking, indicate drawbacks of PFs, and then propose our framework, in which hierarchical and parallel predictions are integrated by adaptive sampling to achieve an appropriate balance of tracking accuracy and robustness. Finally, we discuss the proposed model from the viewpoint of neuroscience.
1 Introduction

Because of the inevitable ill-posedness of visual information, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. Prediction is also essential for real-time recognition or visual tracking. Due to the flood of visual data, examining all the data is infeasible, and ignoring irrelevant data is essential. The primate fovea and oculomotor control can be viewed from this perspective: high visual acuity is realized by the narrow foveal region of the retina, and the visual axis has to be actively moved by oculomotor control. Computer vision, in particular real-time vision, faces the same computational problems discussed above, and attractive as well as feasible methods and applications have been developed in the light of particle filters (PFs) [4]. One of the key ideas of PFs is the importance sampling distribution, or proposal distribution, which can be viewed as prediction or attention and helps overcome these computational problems. The aim of this paper is to propose a novel Bayesian visual tracking framework with hierarchically modeled state variables for single-object tracking, and to discuss PFs and our framework from the viewpoint of neuroscience.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 604–613, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Adaptive Sampling from Hierarchical and Parallel Predictions

2.1 Incremental Bayes and Particle Filtering
Particle filtering [4] is an approach to performing Bayesian estimation of intractable posterior distributions from time-series signals with non-Gaussian noise, generalizing traditional Kalman filtering. This approach has been attracting attention in various research areas, including real-time visual processing (e.g., [5]). In clutter, there are usually several competing observations, and these cause the posterior to be multi-modal and therefore non-Gaussian. In reality, using a large number of particles is not feasible, especially for real-time processing, and thus there is a strong demand to reduce the number of particles. Reducing the number of particles, however, can sacrifice the accuracy and robustness of filtering, particularly when the dimension of the state variables is high. How to cope with this trade-off has been one of the most important computational issues in PFs, but there have been few efforts to reduce the dimension of the state variables in the context of PFs. Making use of hierarchy in the state variables seems a natural solution to this problem. For example, in the case of estimating the pose of a head from a video stream, which involves state variables for the three-dimensional head pose and the two-dimensional positions of face features in the image, the high-dimensional state space can be divided into two groups by using the causality in the states, i.e., the head pose strongly affects the positions of the face features (cf. Fig. 1, except the dotted arrows). There are, however, two big problems in this setting. First, the estimation of the lower state variables is strongly dependent on the estimation of the higher state variables. In real applications, it often happens that the assumed generative relation from the higher state variable to the lower state variable is violated. Second, the assumed generative relation from the lower state variable to the input, typically image frames, can also be violated. These two problems lead to failure of the estimation.

2.2 Estimation of Hierarchically-Modeled State Variables
Here we present a novel framework with hierarchically modeled state variables for single-object tracking. The intuition behind our approach is that the higher and lower layers have their own dynamics, and mixing their predictions in the proposal distribution based on their reliability adds both robustness and accuracy to tracking with a smaller number of particles. We assume there are two continuous state vectors at time step t, denoted by a_t ∈ R^{N_a} and x_t ∈ R^{N_x}, which are hierarchically modeled as in Fig. 1. Our goal is then to estimate these unobservable states from an observation sequence z_{1:t}.

State Estimation. According to Bayes' rule, the joint posterior density p(a_t, x_t | z_{1:t}) is given by

p(a_t, x_t | z_{1:t}) ∝ p(z_t | a_t, x_t) p(a_t, x_t | z_{1:t−1}),
Fig. 1. A graphical model of the hierarchical and parallel dynamics. Right panels depict example physical representations processed in the layers.
where p(a_t, x_t | z_{1:t−1}) is the joint prior density and p(z_t | a_t, x_t) is the likelihood. The joint prior density p(a_t, x_t | z_{1:t−1}) is given by the previous joint posterior density p(a_{t−1}, x_{t−1} | z_{1:t−1}) and the state transition model p(a_t, x_t | a_{t−1}, x_{t−1}) as follows:

p(a_t, x_t | z_{1:t−1}) = ∫ p(a_t, x_t | a_{t−1}, x_{t−1}) p(a_{t−1}, x_{t−1} | z_{1:t−1}) da_{t−1} dx_{t−1}.

The state vectors a_t and x_t are assumed to be generated from the hierarchical model shown in Fig. 1. Furthermore, the dynamics of a_t are assumed to form a temporal Markov chain, with conditional independence between a_{t−1} and x_t given a_t. Under these assumptions, the state transition model p(a_t, x_t | a_{t−1}, x_{t−1}) and the joint prior density p(a_t, x_t | z_{1:t−1}) can be represented as

p(a_t, x_t | a_{t−1}, x_{t−1}) = p(x_t | a_t) p(a_t | a_{t−1})

and

p(a_t, x_t | z_{1:t−1}) = p(x_t | a_t) p(a_t | z_{1:t−1}),

respectively. We can then carry out the computation of the joint posterior distribution hierarchically by a PF as follows: first, the higher-layer samples {a_t^(i) | i = 1, ..., N_a} are drawn from the prior density p(a_t | z_{1:t−1}). The lower-layer samples {x_t^(j) | j = 1, ..., N_x} are then drawn from the prior density p(x_t | z_{1:t−1}), described by the proposal distribution p(x_t | a_t^(i)). Finally, the weights of the samples are given by the likelihood p(z_t | a_t, x_t).

Adaptive Mixing of Proposal Distributions. When applied to state estimation problems in the real world, the above proposal distribution can be contaminated by severe non-Gaussian noise and/or violations of the model assumptions. Especially in the case of PF estimation with a small number of particles, a contaminated proposal distribution can give fatal disturbance to the
estimation. In this study, we assume a state transition in the lower layer independent of the upper layer (represented by the dotted arrows in Fig. 1), and aim to enable robust and accurate estimation by adaptively determining the contribution from this hypothetical state transition. The state transition p(a_t, x_t | a_{t−1}, x_{t−1}) can be represented as

p(a_t, x_t | a_{t−1}, x_{t−1}) = p(x_t | a_t, x_{t−1}) p(a_t | a_{t−1}).    (1)

Here, the dynamics model of x_t is assumed to be represented as

p(x_t | a_t, x_{t−1}) = α_{a,t} p(x_t | a_t) + α_{x,t} p(x_t | x_{t−1}),    (2)

with α_{a,t} + α_{x,t} = 1. In our algorithm, p(x_t | a_t, x_{t−1}) is modeled as a mixture of the approximated prediction densities computed in the lower and higher layers, p(x_t | x_{t−1}) and p(x_t | a_t). Its mixture ratio α_t = {α_{a,t}, α_{x,t}}, representing the contribution of each layer, is determined by a method that enables interaction between the layers; we describe how α_t is determined in the following subsection. From Eqs. (1) and (2), the joint prior density p(a_t, x_t | z_{1:t−1}) is given as

p(a_t, x_t | z_{1:t−1}) = p(x_t | a_t, z_{1:t−1}) p(a_t | z_{1:t−1}) = π(x_t | α_t) p(a_t | z_{1:t−1}),

where

π(x_t | α_t) = α_{a,t} p(x_t | a_t) + α_{x,t} p(x_t | z_{1:t−1})    (3)
is the adaptively mixed proposal distribution for x_t, based on the prediction densities p(x_t | a_t) and p(x_t | z_{1:t−1}).

Determination of α_t Using an On-line EM Algorithm. The mixture ratio α_t is the parameter that determines the adaptively mixed proposal distribution π(x_t | α_t), and determining it by minimizing the KL divergence between the lower-layer posterior density p(x_t | a_t, z_{1:t}) and π(x_t | α_t) gives robust and accurate estimation. Determining α_t is equivalent to determining the mixture ratio of a two-component mixture model, and we employ sequential maximum-likelihood (ML) estimation of the mixture ratio. In our method, the index variable for component selection becomes the latent variable, and therefore the sequential ML estimation is implemented by means of an on-line EM algorithm [11]. By resampling from the posterior density p(x_t | a_t, z_{1:t}), we obtain N_x samples x̃_t^(i). Using the latent variable m = {m_a, m_x}, which indicates which prediction density, p(x_t | a_t) or p(x_t | z_{1:t−1}), is trusted, the on-line log likelihood can then be represented as

L(α_t) = η_t Σ_{τ=1}^{t} ( Π_{s=τ+1}^{t} λ_s ) Σ_{i=1}^{N_x} log π(x̃_τ^(i) | α_t)
       = η_t Σ_{τ=1}^{t} ( Π_{s=τ+1}^{t} λ_s ) Σ_{i=1}^{N_x} log Σ_{m} p(x̃_τ^(i), m | α_t),
1. Estimation of the state variable x_t in the lower layer.
   – Obtain α_{a,t} N_x samples x_{a,t}^(i) from p(x_t | a_t).
   – Obtain α_{x,t} N_x samples x_{x,t}^(i) from p(x_t | z_{1:t−1}).
   – Obtain the expectation x̂_t and Std(x_t) of p(x_{n,t} | a_t, z_{1:t}) using the N_x mixture samples x_t^(i), constituted by x_{x,t}^(i) and x_{a,t}^(i).
   The above procedure is applied to each feature, yielding {x̂_{n,t}, Std(x_{n,t})}.
2. Estimation of the state variable a_t in the higher layer.
   – Obtain D_{a,n}(t) based on Std(x_{n,t}), and then estimate p(a_t | z_{1:t}).
3. Determination of the mixture ratio α_{t+1}.
   – Obtain N_x samples x̃_t^(i) from p(x_{n,t} | a_t, z_{1:t}).
   – Calculate α_{t+1} so as to maximize the on-line log likelihood.
   The above procedure is applied to each feature, yielding {α_{n,t+1}}.

Fig. 2. Hierarchical pose estimation algorithm
where λ_s is a decay constant that reduces the adverse influence of earlier inaccurate estimation, and η_t = ( Σ_{τ=1}^{t} Π_{s=τ+1}^{t} λ_s )^{−1} is a normalization constant that works as a learning coefficient. The optimal mixture ratio α*_t in the sense of the KL divergence, which gives the optimal proposal distribution π(x_t | α*_t), can be calculated by maximization of the on-line log likelihood as follows:

α*_{m,t} = ⟨m⟩_t / Σ_{m′∈m} ⟨m′⟩_t,

where

p(x̃_t^(i), m_a | α_t) = α_{a,t} p(x̃_t^(i) | a_t),    p(x̃_t^(i), m_x | α_t) = α_{x,t} p(x̃_t^(i) | z_{1:t−1}),

and

⟨m⟩_t = η_t Σ_{τ=1}^{t} ( Π_{s=τ+1}^{t} λ_s ) Σ_{i=1}^{N_x} p(m | x̃_τ^(i), α_τ),

p(m | x̃_t^(i), α_t) = p(x̃_t^(i), m | α_t) / Σ_{m′∈m} p(x̃_t^(i), m′ | α_t).

Note that ⟨m⟩_t can be calculated incrementally.
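A minimal sketch of sampling from the mixed proposal (3) together with this on-line EM update is given below; the likelihood values and prediction densities are hypothetical placeholders, and the clipping limits mirror the 0.8/0.2 bounds used later in the experiments.

```python
import numpy as np

def sample_mixture_proposal(n, alpha_a, higher, lower, rng):
    """Draw n particles from pi(x_t | alpha_t) of Eq. (3): a fraction
    alpha_{a,t} from p(x_t | a_t), the rest from p(x_t | z_{1:t-1})."""
    n_a = int(round(alpha_a * n))
    return np.concatenate([higher(n_a, rng), lower(n - n_a, rng)])

class OnlineMixtureRatio:
    """On-line EM update of alpha_t = (alpha_{a,t}, alpha_{x,t}) with decay
    constant lam; the statistics <m>_t are accumulated incrementally."""
    def __init__(self, lam=0.5, lo=0.2, hi=0.8):
        self.lam = lam
        self.alpha = np.array([0.5, 0.5])
        self.m = np.zeros(2)       # decayed statistics <m_a>_t, <m_x>_t
        self.lo, self.hi = lo, hi  # limits that preserve robustness

    def update(self, lik_higher, lik_lower):
        """lik_higher[i] = p(x~_i | a_t) and lik_lower[i] = p(x~_i | z_{1:t-1}),
        evaluated at particles resampled from the lower-layer posterior."""
        joint = np.column_stack([self.alpha[0] * lik_higher,
                                 self.alpha[1] * lik_lower])
        resp = joint / joint.sum(axis=1, keepdims=True)  # E step
        self.m = self.lam * self.m + resp.sum(axis=0)    # incremental <m>_t
        self.alpha = self.m / self.m.sum()               # M step
        self.alpha = np.clip(self.alpha, self.lo, self.hi)
        self.alpha /= self.alpha.sum()
        return self.alpha

# Hypothetical scenario: the higher-layer prediction explains the resampled
# particles much better (e.g. the feature is occluded), so alpha_a should
# rise toward its upper limit.
rng = np.random.default_rng(0)
em = OnlineMixtureRatio()
for _ in range(10):
    alpha = em.update(lik_higher=rng.uniform(0.8, 1.0, 50),
                      lik_lower=rng.uniform(0.0, 0.2, 50))
higher = lambda n, rng: rng.normal(10.0, 1.0, n)  # placeholder p(x_t | a_t)
lower = lambda n, rng: rng.normal(11.0, 2.0, n)   # placeholder p(x_t | z_{1:t-1})
particles = sample_mixture_proposal(200, alpha[0], higher, lower, rng)
print(alpha, particles.shape)
```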
2.3 Application to Pose Estimation of a Rigid Object
Here the proposed method is applied to a real problem: pose estimation of a rigid object (cf. Fig. 1). The algorithm is shown in Fig. 2. The N_f features on the image plane at time step t are denoted by x_{n,t} = (u_{n,t}, v_{n,t})^T (n = 1, ..., N_f); the affine camera matrix that projects a 3D model of the object onto the image plane at time step t is, for simplicity, reshaped into a vector form a_t = (a_{1,t}, ..., a_{8,t})^T; and the observed image at time step t is denoted by z_t. When applied to the pose estimation of a rigid object, a Gaussian process is assumed in the higher layer, and the object's pose is estimated by a Kalman filter,
while tracking of the features is performed by a PF because of the severe non-Gaussian noise, e.g., occlusion, in the lower layer. To obtain the samples x_{n,t}^(i) from the mixture proposal distribution, we need the two prediction densities p(x_{n,t} | a_t) and p(x_{n,t} | z_{1:t−1}). The prediction density computed in the higher layer, p(x_{n,t} | a_t), which contains the physical relationship between the features, is given by the affine projection process of the 3D model of the rigid object.
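The higher-layer estimate under a linear-Gaussian (random-walk) model over the affine parameters a_t can be computed with the standard Kalman predict/update cycle sketched below; the dimensions, noise levels, and observation matrix are hypothetical placeholders (the pose angles themselves are later recovered with an extended Kalman filter, which is not shown here).

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One Kalman predict/update cycle for the linear-Gaussian model
    x_t = A x_{t-1} + w (w ~ N(0, Q)), y_t = C x_t + v (v ~ N(0, R))."""
    mu_pred = A @ mu                     # predicted mean
    P_pred = A @ P @ A.T + Q             # predicted covariance
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# Hypothetical setup: 8-D affine parameter vector a_t with random-walk
# dynamics, observed through 2-D positions of eight features stacked into a
# 16-D vector y_t = C a_t + noise.
d, m = 8, 16
rng = np.random.default_rng(4)
A, Q = np.eye(d), 1e-3 * np.eye(d)
C = rng.normal(size=(m, d))     # placeholder for the projection geometry
R = 1e-2 * np.eye(m)
mu, P = np.zeros(d), np.eye(d)
a_true = rng.normal(size=d)     # hypothetical fixed affine parameters
for _ in range(50):
    y = C @ a_true + rng.normal(0.0, 0.1, m)
    mu, P = kalman_step(mu, P, y, A, C, Q, R)
print(np.linalg.norm(mu - a_true))  # estimation error shrinks over frames
```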
2.4 Experiments
Simulations. The goal of this simulation is to estimate the posture of a rigid object, as in Fig. 3A, from an observation sequence of eight features. The rigid object was a hexahedral piece, whose size was 20 (upper base) × 30 (lower base) × 20 (height) × 20 (depth) [mm]; it was rotated 150 mm away from a pin-hole camera (focal length: 40 mm) and was projected onto the image plane (pixel size: 0.1 × 0.1 [mm]) by a perspective projection process disturbed by Gaussian noise (mean: 0, standard deviation: 1 pixel). The four features at the back of the object were occluded by the object itself. An occluded feature was represented by assuming that the standard deviation of the measurement noise grows to 100 pixels from the non-occluded value of 1 pixel. The other four features at the front of the object were not occluded. The length of the observation sequence of the features was 625 frames, i.e., about 21 s. In this simulation, we compared the performance of our proposed method with (i) the method with a fixed αt = {1, 0}, which is equivalent to simple hierarchical modeling in which the prediction density computed in the higher layer is trusted at every time step, and (ii) the method with a fixed αt = {0, 1}, which does not implement the mutual interaction between the layers. The decay constant of the on-line EM algorithm was set at λt = 0.5. The pose estimated using the adaptive proposal distribution and αa,t at each time step are shown in Figs. 3B and 3C, respectively. Here, the object's pose θt = {θX,t, θY,t, θZ,t} was calculated from the estimated at using an extended Kalman filter (EKF). In our implementation, the maximum and minimum values of αa,t were limited to 0.8 and 0.2, respectively, to prevent the robustness from degenerating. As shown in Fig. 3B, our method achieved robust estimation against the occlusion. Concurrently with the robust estimation of the object's pose, appropriate determination of the mixture ratio was exhibited.
For example, in the case of the feature x1, the prediction density computed in the higher layer was emphasized, and the feature was well predicted using the 3D object model during the period in which x1 was occluded, because the observed feature position, contaminated by severe noise, depressed the confidence of the lower prediction density.

Real Experiments. To investigate the performance against the severe non-Gaussian noise existing in real environments, the proposed method was applied to a head pose estimation problem of a driver from a real image sequence captured in a car. Face extraction/tracking from an image sequence is a well-studied problem because of its applicability to various areas, and several
T. Shibata, T. Bando, and S. Ishii
Fig. 3. A: a sample simulation image. B: Time course of the estimated object’s pose. C: Time course of αa,t determined by the on-line EM algorithm. Gray background represents a frame in which the feature was occluded.
PF algorithms have been proposed. However, for accurate estimation of state variables lying in a high-dimensional space, e.g., a human face, especially in the case of real-time processing, some technique for dimensionality reduction is required. The proposed method is expected to enable more robust estimation despite limited computing resources by exploiting the hierarchy of state variables. The real image sequence was captured by a near-infrared camera mounted behind the steering wheel, i.e., the captured images did not contain color information, and the image resolution was 640 × 480. In such a visual tracking task, the true observation process p(zt|xn,t) is unknown because the true positions of the face features are unobservable; hence, a model of the observation process is needed for calculating the particle weights w^(i)_{n,t}. In this study, we employed the normalized correlation with a template as the model of the approximate observation process. Although this observation model seems too simple to apply to problems in real environments, it is sufficient for examining the efficiency of the proposal distribution. We employed the nose, eyes, canthi, eyebrows, and corners of the mouth as the face features. Using the 3D feature positions measured by 3D distance measuring equipment, we constructed the 3D face model. The proposed method was applied by employing 50 particles for each face feature, as in the simulation, and processed by a Pentium 4 (2.8 GHz) Windows 2000 PC with 1048MB RAM. Our system processed one frame in 29.05 msec, and hence achieved real-time processing.
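A normalized-correlation particle weight of the kind described above might look like the following sketch (pixels are flattened into plain lists here for simplicity; the clipping of negative correlations to zero is our assumption, not the paper's exact formulation):

```python
import math

def normalized_correlation(patch, template):
    """Zero-mean normalized cross-correlation between two equally sized
    pixel lists; the result lies in [-1, 1]."""
    mp = sum(patch) / len(patch)
    mt = sum(template) / len(template)
    p = [v - mp for v in patch]
    t = [v - mt for v in template]
    denom = (math.sqrt(sum(v * v for v in p)) *
             math.sqrt(sum(v * v for v in t)))
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(p, t)) / denom

def particle_weight(patch, template):
    """Unnormalized particle weight: negative correlations are clipped to
    zero so the weight remains a valid (unnormalized) likelihood value."""
    return max(normalized_correlation(patch, template), 0.0)

template = [10.0, 20.0, 30.0, 40.0]
w_match = particle_weight([12.0, 22.0, 32.0, 42.0], template)  # brightness-shifted match
w_flip = particle_weight([40.0, 30.0, 20.0, 10.0], template)   # anti-correlated patch
```

The zero-mean normalization makes the weight invariant to additive brightness changes, which matters for the uncontrolled in-car illumination.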
Visual Tracking Achieved by Adaptive Sampling
Fig. 4. Mean estimation error in the case when αt was estimated by EM, fixed at αa,t = {1, 0}, and fixed at αa,t = {0, 1} (100 trials). Since some bars protruded from the figure, they were shortened and the error value is instead displayed at the top of them.
Fig. 5. Tracked face features
Fig. 4 shows the estimation error of the head pose, where the true head pose of the driver was measured by a gyro sensor in the car. Our adaptive mixing proposal distribution achieved robust head pose estimation, as in the simulation task. In Fig. 5, the estimated face features are depicted by "+" marks for various head poses of the driver; the variance of the estimated feature position is represented by the size of the "+" mark, i.e., the larger a "+" is, the lower the confidence the estimator has in that feature.
3 Modeling Primate's Visual Tracking by Particle Filters
Here we note that our computer vision study and primates’ vision share the same computational problems and similar constraints. Namely, they need to perform real-time spatiotemporal filtering of the visual data robustly and accurately as
much as possible with limited computing resources. Although there are huge numbers of neurons in the brain, their firing rate is very noisy and much slower than recent personal computers. We can visually track only around four to six objects simultaneously (e.g., [2]). These facts indicate a limited attention resource. As mentioned at the beginning of this paper, it is widely known that only the foveal region of the retina acquires high-resolution images in primates, and that humans usually make saccades mostly to the eyes, nose, lips, and contours when watching a human face. In other words, primates actively ignore irrelevant information in massive image inputs. Furthermore, many behavioral and computational studies have reported that the brain may compute Bayesian statistics (e.g., [8][10]). As we discussed, however, Bayesian computation is intractable in general, and particle filters (PFs) are an attractive and feasible solution to this problem, as they are flexible, easy to implement, and parallelizable. Importance sampling is analogous to efficient delivery of computing resources. As a whole, we conjecture that the primate brain may employ PFs for visual tracking. Although one major drawback of PFs is that a large number of particles, typically exponential in the dimension of the state variables, is required for accurate estimation, our proposed framework, in which sampling is adapted over hierarchical and parallel predictive distributions, can be a solution. As demonstrated in Section 2 and in others' work [1], adaptive importance sampling from multiple predictions can balance accuracy and robustness of the estimation with a restricted number of particles. Along this line, overt/covert smooth pursuit in primates could be a research target for investigating our conjecture. Based on the model of Shibata et al. [12], Kawawaki et al.
investigated the human brain mechanism of overt/covert smooth pursuit by fMRI experiments and suggested that the activity of the anterior/superior lateral occipito-temporal cortex (a/sLOTC) was responsible for target motion prediction rather than for motor commands for eye movements [7]. Note that the LOTC includes the homologue of the monkey medial superior temporal (MST) area, which is responsible for visual motion processing (e.g., [9][6]). In their study, the mechanism for increasing the a/sLOTC activity remained unclear. The increase in the a/sLOTC activity was observed particularly when subjects pursued a blinking target motion covertly. This blink condition might give rise to two predictions, e.g., one emphasizing the observation and the other its belief, as proposed in [1], and require the computational resources for adaptive sampling. Multiple predictions might also be performed in other brain regions such as the frontal eye field (FEF), the inferior temporal (IT) area, and the fusiform face area (FFA). It is known that the FEF is involved in smooth pursuit (e.g., [3]) and has reciprocal connections to the MST area (e.g., [13]), but how they work together is unclear. Visual tracking of a more general object, rather than of a small spot as a visual stimulus, requires a specific target representation and a representation of distractors including the background. Thus, the IT, FFA, and other areas related to higher-order visual representation could be making parallel predictions to deal with the varying target appearance during tracking.
4 Conclusion
In this paper, we first introduced particle filters (PFs) as an approximate incremental Bayesian computation and pointed out their drawbacks. Then, we proposed a novel framework for visual tracking based on PFs as a solution to these drawbacks. The keys of the framework are: (1) the high-dimensional state space is decomposed into hierarchical and parallel predictors that treat state variables of lower dimension, and (2) their integration is achieved by adaptive sampling. The feasibility of our framework has been demonstrated by real as well as simulation studies. Finally, we have pointed out the computational problems shared between PFs and human visual tracking, presented our conjecture that the primate brain may employ PFs, and discussed its possibility and perspectives for future investigations.
References
1. Bando, T., Shibata, T., Doya, K., Ishii, S.: Switching particle filters for efficient visual tracking. J. Robot. Auton. Syst. 54(10), 873 (2006)
2. Cavanagh, P., Alvarez, G.A.: Tracking multiple targets with multifocal attention. Trends in Cogn. Sci. 9(7), 349–354 (2005)
3. Fukushima, K., Yamanobe, T., Shinmei, Y., Fukushima, J., Kurkin, S., Peterson, B.W.: Coding of smooth eye movements in three-dimensional space by frontal cortex. Nature 419, 157–162 (2002)
4. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. Radar Signal Processing 140, 107–113 (1993)
5. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
6. Kawano, M., Shidara, Y., Watanabe, Y., Yamane, S.: Neural activity in cortical area MST of alert monkey during ocular following responses. J. Neurophysiol. 71(6), 2305–2324 (1994)
7. Kawawaki, D., Shibata, T., Goda, N., Doya, K., Kawato, M.: Anterior and superior lateral occipito-temporal cortex responsible for target motion prediction during overt and covert visual pursuit. Neurosci. Res. 54(2), 112 (2006)
8. Knill, D.C., Pouget, A.: The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosci. 27(12) (2004)
9. Newsome, W.T., Wurtz, R.H., Komatsu, H.: Relation of cortical areas MT and MST to pursuit eye movements. II. Differentiation of retinal from extraretinal inputs. J. Neurophysiol. 60(2), 604–620 (1988)
10. Rao, R.P.N.: Neural models of Bayesian belief propagation. In: The Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge (2006)
11. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12(2), 407–432 (2000)
12. Shibata, T., Tabata, H., Schaal, S., Kawato, M.: A model of smooth pursuit in primates based on learning the target dynamics. Neural Netw. 18(3), 213 (2005)
13. Tian, J.-R., Lynch, J.C.: Corticocortical input to the smooth and saccadic eye movement subregions of the frontal eye field in cebus monkeys. J. Neurophysiol. 76(4), 2754–2771 (1996)
Bayesian System Identification of Molecular Cascades

Junichiro Yoshimoto 1,2 and Kenji Doya 1,2,3

1 Initial Research Project, Okinawa Institute of Science and Technology Corporation, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan {jun-y,doya}@oist.jp
2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
3 ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, "Keihanna Science City", Kyoto 619-0288, Japan
Abstract. We present a Bayesian method for the system identification of molecular cascades in biological systems. The contribution of this study is to provide a theoretical framework for unifying three issues: 1) estimating the most likely parameters; 2) evaluating and visualizing the confidence of the estimated parameters; and 3) selecting the most likely structure of the molecular cascades from two or more alternatives. The usefulness of our method is demonstrated in several benchmark tests. Keywords: Systems biology, biochemical kinetics, system identification, Bayesian inference, Markov chain Monte Carlo method.
1 Introduction
In recent years, the analysis of molecular cascades by mathematical models has contributed to the elucidation of intracellular mechanisms related to learning and memory [1,2]. In such modeling studies, the structure and parameters of the molecular cascades are selected based on the literature and databases¹. However, if reliable information about a target molecular cascade is not obtained from those repositories, we must tune its structure and parameters so as to fit the model behaviors to the available experimental data. The development of a theoretically sound and efficient system identification framework is crucial for making such models useful. In this article, we propose a Bayesian system identification framework for molecular cascades. For a given set of experimental data, the system identification can be separated into two inverse problems: parameter estimation and model selection. The most popular strategy for parameter estimation is to find a single set of parameters based on the least mean-square-error or maximum likelihood criterion [3]. However, we should be aware that the estimated parameters might

¹ DOQCS (http://doqcs.ncbs.res.in/) and BIOMODELS (http://biomodels.net/) are available, for example.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 614–624, 2008. c Springer-Verlag Berlin Heidelberg 2008
suffer from an "over-fitting effect" because the available set of experimental data is often small and noisy. For evaluating the accuracy of the estimators, statistical methods based on asymptotic theory [4] and Fisher information [5] have been proposed independently. Still, we must pay attention to their practical limitations: a large number of data are required for the former method, and the mathematical model should be at least locally linear for the latter. For model selection, Sugimoto et al. [6] proposed a method based on genetic programming. This provided a systematic way to construct a mathematical model, but the criterion for trading off data fitness against model complexity was somewhat heuristic. Below, we present a Bayesian formulation for the system identification of molecular cascades.
2 Problem Setting
In this study, we consider the dynamic behavior of molecular cascades that can be modeled as a well-stirred mixture of I molecular species {S1, . . . , SI}, which chemically interact through J elementary reaction channels {R1, . . . , RJ} inside a compartment with a fixed volume Ω and constant temperature [7]. The dynamical state of this system is specified by a vector X(t) ≡ (X1(t), . . . , XI(t)), where Xi(t) is the number of molecules of species Si in the system at time t. If we assume that there is a sufficiently large number of molecules within the compartment, it is useful to take as the state vector the molar concentrations x(t) ≡ (x1(t), . . . , xI(t)) given by the transformation x(t) = X(t)/(ΩNA), where NA ≈ 6.0 × 10²³ is Avogadro's constant. In this case, the system behavior can be well approximated by the deterministic ordinary differential equations (ODEs) [8]:

$\dot{x}_i(t) = \sum_{j=1}^{J} (\nu''_{ij} - \nu'_{ij})\, k_j \prod_{l=1}^{I} x_l(t)^{\nu'_{lj}}, \quad i = 1, \ldots, I, \qquad (1)$

where $\dot{x} \equiv dx/dt$ denotes the time derivative (velocity) of a variable x. ν′ij and ν″ij are the stoichiometric coefficients of the molecular species Si as a reactant and as a product, respectively, in the reaction channel Rj. kj is the rate constant of the reaction channel Rj. Equation (1) is called the law of mass action. To illustrate this mathematical model, let us consider a simple interaction whose chemical equation is given by

$\mathrm{A} + \mathrm{B} \underset{k_b}{\overset{k_f}{\rightleftharpoons}} \mathrm{AB}. \qquad (2)$

Here, the two values associated with the arrows (kf and kb) are the rate constants of the corresponding forward and backward reactions. The dynamic behavior of this system is given by

$[\dot{\mathrm{AB}}] = -[\dot{\mathrm{A}}] = -[\dot{\mathrm{B}}] = k_f[\mathrm{A}][\mathrm{B}] - k_b[\mathrm{AB}], \qquad (3)$
where the bracket [Si] denotes the concentration of the molecular species Si. Enzymatic reactions, which frequently appear in biological signaling, can be modeled as a series of elementary reactions. In an enzymatic reaction that
changes a substrate S into a product P by the catalysis of an enzyme E, for example, the enzyme-substrate complex ES is regarded as the intermediate product, and the overall reaction is modeled as

$\mathrm{E} + \mathrm{S} \underset{k_b}{\overset{k_f}{\rightleftharpoons}} \mathrm{ES} \overset{k_{cat}}{\longrightarrow} \mathrm{E} + \mathrm{P}. \qquad (4)$
We can derive a system of ODEs corresponding to the chemical equation (4) based on the law of mass action (1), but it is also common to use the approximate form called the Michaelis-Menten equation [8]:

$[\dot{\mathrm{P}}] = -[\dot{\mathrm{S}}] = \frac{k_{cat}[\mathrm{E_{tot}}][\mathrm{S}]}{K_M + [\mathrm{S}]}, \qquad (5)$

where [Etot] ≡ [E] + [ES] is the total concentration of the molecular species including the enzyme E, and KM ≡ (kb + kcat)/kf is called the Michaelis-Menten constant. When considering signaling communications among other compartments and biochemical experiments wherein drugs are injected, it is convenient to separate a control vector u ≡ (u1, . . . , uI), which denotes I external effects on the system, from the state vector x. Let us consider the model (2), in which [B] can be arbitrarily manipulated by an experimenter. In this case, the control vector is u = ([B]), and the state vector is modified into x = ([A], [AB]). Summarizing the above discussion, the dynamic models of molecular cascades concerned in this study assume the general form

$\dot{x}(t) = f(x(t), u(t), q_1) \quad \text{subject to} \quad x(0) = q_0, \qquad (6)$

where $q_0 \in \mathbb{R}_+^I$ is the set of I initial values of the state variables², $q_1 \in \mathbb{R}_+^M$ is a set of M system parameters such as rate constants and Michaelis-Menten constants, and f : (x, u, q1) → ẋ is a function derived based on the law of mass action. We assume that a set of control inputs U ≡ {u(t) | t ∈ [0, Tf]} has been designed before commencing an experiment, where Tf is the final time point of the experiment. Let O ⊂ {1, . . . , I} be a set of indices such that the concentration of any species in {Si | i ∈ O} can be observed through the experiment. Then, let t1, . . . , tN ∈ [0, Tf] be the time points at which observations are made in the experiment. Using these notations, the set of all the data observed in the experiment is defined by Y ≡ {yi(tn) | i ∈ O; n = 1, . . . , N}, where yi(tn) is the concentration of the species Si observed at time tn.
The objective of this study is to identify the mathematical model (6) from a limited number of experimental data Y. This objective can be separated into two inverse problems. The first problem involves fitting a parameter vector q ≡ (q1, q0) so as to minimize the residuals εi(tn) = yi(tn) − xi(tn; q, U), where xi(tn; q, U) denotes the solution of the ODEs (6) at time tn, which depends on the parameter vector q and the control inputs U. In other words, we intend to find q such that model (6) reproduces the observation data Y as closely as possible. This inverse problem is called parameter estimation. The second problem is to infer the most likely candidate among multiple options {C1, . . . , CC} with different model structures (functional forms of f) from the observation data Y.²

² We denote the set of all non-negative real values by R+ in accordance with common usage.
This problem is called model selection, and we call each Cc (c = 1, . . . , C) a “model candidate” in this study.
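The forward problem (6) underlying both inverse problems can be made concrete with a short sketch: integrating the mass-action ODEs (3) for the binding reaction A + B ⇌ AB with a toy fixed-step RK4 integrator (the rate constants kf = 5.0, kb = 0.01 and the initial state (0.4, 0.3, 0.01) are those of the first benchmark in Section 4; the helper names are ours):

```python
def mass_action_rhs(x, kf, kb):
    """d[A]/dt, d[B]/dt, d[AB]/dt for A + B <=> AB by the law of mass action."""
    a, b, ab = x
    v = kf * a * b - kb * ab          # net forward flux, cf. Eq. (3)
    return [-v, -v, v]

def rk4_step(x, dt, kf, kb):
    """One classical Runge-Kutta step of size dt."""
    f = lambda y: mass_action_rhs(y, kf, kb)
    k1 = f(x)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# Parameters and initial state from the first benchmark in Section 4.
kf, kb = 5.0, 0.01
x = [0.4, 0.3, 0.01]                  # ([A](0), [B](0), [AB](0))
dt = 0.01
for _ in range(500):                  # integrate up to t = 5
    x = rk4_step(x, dt, kf, kb)
```

Because the reaction only converts A + B into AB, the totals [A] + [AB] and [B] + [AB] are conserved, which gives a quick correctness check for any solver used in place of this sketch.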
3 Bayesian System Identification
In this section, we first present our Bayesian approach to the parameter estimation problem, and then extend the discussion to the model selection problem. In statistical inference, the residuals {εi(tn) | i ∈ O; n = 1, . . . , N} are modeled as unpredictable noises. First, we give a physical interpretation of the noises. Let us assume that each observation yi(tn) is obtained by sampling particles of the molecular species Si from a small region with a fixed volume ωi (0 < ωi < Ω) inside the model compartment. Since the interior of the model compartment is well-stirred, the particles of Si would be distributed according to a spatial Poisson distribution that is independent of the states of the other molecular species. This implies that the number of particles of the sampled species, Xωi(tn), would be distributed according to a binomial distribution with the total number of trials Xi(tn) and success probability (ωi/Ω) [9]. This distribution can be well approximated by $X_{\omega_i}(t_n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\omega_i X_i(t_n)/\Omega,\, \omega_i X_i(t_n)/\Omega)$³ under the assumption $X_i(t_n) \gg 0$. Using the two relationships $y_i(t_n) = X_{\omega_i}(t_n)/(\omega_i N_A)$ and $x_i(t_n) = X_i(t_n)/(\Omega N_A)$, we have $y_i(t_n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(x_i(t_n),\, x_i(t_n)/(\omega_i N_A))$, that is,

$\epsilon_i(t_n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\; x_i(t_n)/\gamma_i), \qquad (7)$

where $\gamma_i \equiv \omega_i N_A$. In general, we cannot determine the exact values of $\gamma \equiv (\gamma_i)_{i \in O} \in \mathbb{R}_+^{|O|}$; hence, they are regarded as latent variables to be estimated along with the parameter q via the Bayesian inference. Now, we consider the Bayesian inference of the set of all the unknown variables θ ≡ (q, γ), i.e., the posterior distribution of θ. By combining (7) with (6), we have the conditional distribution of the observation data Y:

$p(Y|\theta; U) = \prod_{n=1}^{N} \prod_{i \in O} \mathcal{N}_1\big(y_i(t_n);\, x_i(t_n; q, U),\, x_i(t_n; q, U)/\gamma_i\big), \qquad (8)$

where $\mathcal{N}_p(\cdot;\cdot,\cdot)$ denotes a p-dimensional Gaussian density function⁴. According to Bayes' theorem, the posterior distribution of θ is given by

$p(\theta|Y; U) = \frac{p(Y|\theta; U)\, p(\theta; U)}{p(Y; U)} = \frac{p(Y|\theta; U)\, p(\theta; U)}{\int p(Y|\theta; U)\, p(\theta; U)\, d\theta}, \qquad (9)$

where p(θ; U) is a prior distribution of θ defined before obtaining Y. In this study, we employ a prior distribution of the form

$p(\theta; U) = \prod_{i=1}^{I+M} L(q_i; \xi_0) \prod_{i \in O} G(\gamma_i; \alpha_0, \beta_0), \qquad (10)$

where qi and γi are the i-th elements of the vectors q and γ, respectively. α0, β0, and ξ0 are small positive constants. L(·;·) and G(·;·,·) denote the density

³ $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian (or normal) distribution with mean μ and variance σ².
⁴ $\mathcal{N}_p(\mathbf{x}; \mathbf{m}, \mathbf{S}) \equiv (2\pi)^{-p/2} |\mathbf{S}|^{-1/2} \exp\{-(\mathbf{x} - \mathbf{m})\, \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m})^\top / 2\}$, where it should be noted that all vectors are supposed to be row vectors throughout this article.
functions of a log-Laplace distribution and a gamma distribution, respectively⁵. By substituting (8) and (10) into (9), the posterior distribution is well-defined. The posterior distribution can be regarded as a degree of certainty about θ deduced from the outcome Y [10]. The most "likely" value of θ is given by the maximum a posteriori (MAP) estimator θMAP ≡ argmaxθ p(θ|Y; U). This value is asymptotically equivalent to the solutions of the maximum likelihood method and a generalized least-mean-square method as the number of time points N approaches infinity. We must be aware, however, that the MAP estimator has a variance of O(1/N) as long as N is finite. Accordingly, it is informative to evaluate a high-confidence region of θ. The Bayesian inference can be used for this purpose, as will be shown in Section 4. Although we have so far considered the parameter estimation problem for a given model candidate, the denominator of (9), which is called the evidence, plays an important role in the model selection problem. To make the dependence on model candidates explicit, let us write the evidence p(Y; U) for each candidate Cc (c = 1, . . . , C) as p(Y|Cc; U). The posterior distribution of Cc is given by p(Cc|Y; U) ∝ p(Y|Cc; U) if the prior distribution over {C1, . . . , CC} is uniform. This implies that the model candidate C* = argmaxc p(Y|Cc; U) is the optimal solution in the sense of the MAP estimation. Accordingly, the main task in Bayesian model selection reduces to the calculation of the evidence p(Y; U) for each model candidate Cc. The major difficulty encountered in the Bayesian inference is that the intractable integration in the denominator of the final equation in (9) must be evaluated. For convenience of implementation, we adopt a Markov chain Monte Carlo technique [11] to approximate the posterior distribution and the evidence. Hereafter, we describe only the key essence of our approximation method.
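The quantity that drives both the MAP search and the MCMC approximation is the unnormalized log posterior obtained from (8) and (10). A minimal sketch of its evaluation follows (the `simulate` argument stands in for solving the ODEs (6); all helper names and the one-observation example are ours, and parameters are assumed strictly positive):

```python
import math

def log_gauss(y, mean, var):
    """Log density of a 1-D Gaussian; var must be positive."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)

def log_posterior_unnorm(q, gamma, observations, simulate,
                         xi0=1.0, alpha0=1e-4, beta0=1.0):
    """log p(Y|q,gamma;U) + log p(q,gamma;U), up to the evidence constant.

    observations: list of (species index i, time t, observed value y).
    simulate(q, i, t): model prediction x_i(t; q, U), e.g., an ODE solve.
    """
    lp = 0.0
    # Likelihood (8): Gaussian with state-dependent variance x_i(t)/gamma_i.
    for i, t, y in observations:
        x = simulate(q, i, t)
        lp += log_gauss(y, x, x / gamma[i])
    # Log-Laplace prior on each parameter: exp(-|ln q|/xi) / (2 xi q).
    for qi in q:
        lp += -abs(math.log(qi)) / xi0 - math.log(2.0 * xi0 * qi)
    # Gamma prior on each gamma_i (full log density, including -lgamma(alpha0)).
    for gi in gamma.values():
        lp += (alpha0 * math.log(beta0) + (alpha0 - 1.0) * math.log(gi)
               - beta0 * gi - math.lgamma(alpha0))
    return lp

# Hypothetical one-observation example: the model simply predicts x = q[0].
sim = lambda q, i, t: q[0]
obs = [(0, 0.0, 1.0)]
lp_good = log_posterior_unnorm([1.0], {0: 10.0}, obs, sim)  # prediction matches y
lp_bad = log_posterior_unnorm([2.0], {0: 10.0}, obs, sim)   # prediction misses y
```

A parameter value that reproduces the observation scores a higher log posterior than one that does not, which is all the Metropolis step described next needs.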
The basic framework of our approximation is based on a Gibbs sampling method [11]. We suppose that it is possible to draw the realizations q and γ directly from the two conditional distributions with density functions p(q|γ, Y; U) and p(γ|q, Y; U), respectively. In this case, the following iterative procedure asymptotically generates a set of L (L ≫ 1) random samples Θ1:L ≡ {θ(l) ≡ (q(l), γ(l)) | l = 1, . . . , L} distributed according to the density function p(θ|Y; U).

1. Initialize the value of γ(0) appropriately and set l = 1.
2. Generate q(l) from the distribution with the density p(q|γ(l−1), Y; U).
3. Generate γ(l) from the distribution with the density p(γ|q(l), Y; U).
4. Increment l := l + 1 and go to Step 2.
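The loop can be sketched as follows for a single observed species (schematic only: `metropolis_draw_q` abstracts the Metropolis/adaptive-direction step for q, the gamma draw uses the conditional given in Eq. (11) below, and the defaults α0 = 10⁻⁴, β0 = 1 are the constants used in Section 4):

```python
import random

def gamma_conditional_params(q, observations, simulate, alpha0=1e-4, beta0=1.0):
    """Shape and rate of the gamma conditional p(gamma_i|q, Y; U), cf. Eq. (11),
    for one observed species; observations is a list of (t, y) pairs."""
    n = len(observations)
    alpha = alpha0 + n / 2.0
    beta = beta0 + 0.5 * sum((y - simulate(q, t)) ** 2 / simulate(q, t)
                             for t, y in observations)
    return alpha, beta

def gibbs_sampler(metropolis_draw_q, observations, simulate, gamma0, n_samples):
    """Alternate approximate Metropolis draws of q (Step 2) with exact
    gamma-distributed draws of gamma (Step 3)."""
    samples, gamma = [], gamma0
    for _ in range(n_samples):
        q = metropolis_draw_q(gamma)                 # Step 2 (Metropolis inside)
        a, b = gamma_conditional_params(q, observations, simulate)
        gamma = random.gammavariate(a, 1.0 / b)      # Step 3 (scale = 1/rate)
        samples.append((q, gamma))
    return samples

# Toy run with a constant-prediction model and a degenerate q-sampler.
random.seed(1)
sim = lambda q, t: q[0]
obs = [(0.0, 1.0), (1.0, 1.1)]
chain = gibbs_sampler(lambda gamma: [1.0], obs, sim, gamma0=1.0, n_samples=10)
```

Note that Python's `random.gammavariate` takes a scale parameter, so the rate β of (11) enters as 1/β.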
Since we select a gamma distribution as the prior distribution of γi in this study, p(γ|q, Y; U) always has the form $\prod_{i \in O} G(\gamma_i; \alpha, \beta(q))$, where

$\alpha = \alpha_0 + \frac{N}{2} \quad \text{and} \quad \beta(q) = \beta_0 + \frac{1}{2} \sum_{n=1}^{N} \frac{(y_i(t_n) - x_i(t_n; q, U))^2}{x_i(t_n; q, U)}. \qquad (11)$

This implies that it is possible to draw γ directly from p(γ|q, Y; U) using a gamma-distributed random number generator in Step 3. On the other hand, the
⁵ $L(x; \xi) \equiv \exp\{-|\ln x|/\xi\}/(2\xi x)$ and $G(x; \alpha, \beta) \equiv \beta^\alpha x^{\alpha-1} e^{-\beta x}/\Gamma(\alpha)$, where $\Gamma(\alpha) \equiv \int_0^\infty t^{\alpha-1} e^{-t}\, dt$ is the gamma function.
analytical calculation of p(q|γ, Y; U) is still intractable; however, we know that p(q|γ, Y; U) is proportional to p(Y|θ; U)p(θ; U) up to the normalization constant. Thus, we approximate Step 2 by a Metropolis sampling method [11]. Here, the transition distribution of the Markov chain is constructed based on an adaptive direction sampling method [12] in order to improve the efficiency of the sampling. It is indeed required to compute the solution of the ODEs xi(tn; q(l), U) explicitly in both Steps 2 and 3; however, this is not a significant burden because many numerical ODE solvers are currently available (see [13] for example). After the procedure generates L random samples Θ1:L distributed according to the density function p(θ|Y; U), the target density p(θ|Y; U) can be well reconstructed by the kernel density estimator [14]:

$\tilde{p}(\theta|Y; U) = \frac{1}{L} \sum_{l=1}^{L} \mathcal{N}_{D_\theta}\big(\theta;\, \theta^{(l)},\, V\big); \quad V = \left(\frac{4}{(D_\theta + 2)L}\right)^{2/(D_\theta + 4)} \mathrm{Cov}(\Theta_{1:L}), \qquad (12)$

where Dθ is the dimensionality of the vector θ, and Cov(Θ1:L) is the covariance matrix of the set of samples Θ1:L. Furthermore, assuming that p̃(θ|Y; U) is a good approximation of the true posterior distribution p(θ|Y; U), the importance sampling theory [11] states that the integration in the denominator of (9) (i.e., the evidence p(Y; U)) can be well approximated by

$\tilde{p}(Y; U) = \frac{1}{L} \sum_{l=1}^{L} \frac{p(Y|\theta^{(l)}; U)\, p(\theta^{(l)}; U)}{\tilde{p}(\theta^{(l)}|Y; U)}. \qquad (13)$
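A stripped-down, one-dimensional version of the evidence estimate (13), with a Gaussian kernel density stand-in for p̃(θ|Y; U) (the covariance-scaled multivariate kernel of (12) is reduced to a scalar bandwidth here; function names and the toy sample set are ours):

```python
import math

def kde_pdf(x, samples, bandwidth):
    """1-D Gaussian kernel density estimate at x."""
    s2 = bandwidth ** 2
    return sum(math.exp(-(x - s) ** 2 / (2.0 * s2)) /
               math.sqrt(2.0 * math.pi * s2) for s in samples) / len(samples)

def evidence_estimate(samples, log_joint, bandwidth):
    """Eq. (13): average p(Y|theta)p(theta) / p~(theta|Y) over posterior
    samples; log_joint(theta) returns ln[p(Y|theta; U) p(theta; U)]."""
    return sum(math.exp(log_joint(s)) / kde_pdf(s, samples, bandwidth)
               for s in samples) / len(samples)

# Toy posterior samples and an unnormalized Gaussian joint density.
samples = [-1.0, -0.5, 0.0, 0.5, 1.0]
ev = evidence_estimate(samples, lambda s: -0.5 * s * s, bandwidth=0.5)
```

In practice the ratio in (13) is best accumulated in log space to avoid underflow when the joint density of a long time series is tiny; the plain `exp` here is only for readability.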
4 Numerical Demonstrations
To demonstrate its usefulness, we applied our Bayesian method to several benchmark problems. For all the benchmark problems, the set of constants for the prior distribution was fixed at α0 = 10⁻⁴, β0 = 1, and ξ0 = 1. The total number of Monte Carlo samples was set at L = 10⁷, and every element of the initial sample θ(0) was drawn from a uniform distribution within the range [0.99 × 10⁻¹⁰, 1.01 × 10⁻¹⁰]. In the first benchmark, we supposed a situation where the structure of the molecular cascades was fixed by the chemical equation (2), but [A] was unobservable at any time point. Instead of real experiments, the observations Y with the total number of time points N = 51 were synthetically generated by (3) using the parameters q1 = (kf, kb) = (5.0, 0.01) and the initial state q0 = ([A](0), [B](0), [AB](0)) = (0.4, 0.3, 0.01), where the noises in the observation processes were added according to the model (7) with the parameter γi = 1000. The circles and squares in Fig. 1(a) show the generated observations Y. For the observation data Y, we evaluated the posterior distribution of the set of all the unknown variables θ using the method described in Section 3. Figures 1(b)-(f) show the posterior distributions of the non-trivial unknown variables (kf, kb, [A](0), and γi for species B and AB), respectively, after marginalizing out every other variate. Here, the bold solid lines denote the true values used in generating the data Y, and the bold broken lines are the MAP estimators, approximately obtained as the maximizer over l ∈ {1, . . . , L} of p(Y|θ(l); U)p(θ(l); U). The histograms indicate the distribution of the Monte Carlo samples Θ1:L, and the curves surrounding the histograms
[Fig. 1 panels: (a) observation data Y and the model reconstruction of [B] and [AB]; (b)-(f) marginal posterior distributions P(kf|Y;U), P(kb|Y;U), P([A](0)|Y;U), P(γ2|Y;U) (species B), and P(γ3|Y;U) (species AB), each showing the sample histogram, the kernel density estimate, the true value, the MAP estimator, and the 0.5%/99.5% percentiles.]
Fig. 1. Brief summaries of results in the first benchmark
[Fig. 2 panels: joint posterior distributions P(kf, kb|Y), P(kf, [A](0)|Y), and P(kb, [A](0)|Y).]
Fig. 2. The joint posterior distributions of two out of three parameters (kf , kb , [A](0))
denote the kernel density estimators (12). Though model (3) with the MAP estimators could reproduce the observation data Y well (see the solid and broken lines in Fig. 1(a)), the estimators were slightly different from their true values because the total number of data was not very large. For this reason, the breadth of the posterior distribution provided useful information about the confidence of the inference. To evaluate the confidence more quantitatively, we defined the "99% credible interval" as the interval between the 0.5% and 99.5% percentiles along each axis of Θ1:L. The dotted lines in Figs. 1(b)-(f) show the 99% credible intervals. Since the posterior distribution was originally a joint probability density of all the unknown variables θ, it was also possible to visualize the highly confident parameter space in higher dimensions, which provided useful information about the dependence among the parameters. The three panels in Fig. 2 show the joint posterior distributions of two out of the three parameters (kf, kb, and [A](0)), where the white circles denote the true values. From the panels, we can easily observe that these parameters are strongly correlated. Interestingly, the highly
Table 1. MAP estimators and 99% credible intervals in the second benchmark

Parameter       True  MAP   99% interval
Ys (×10⁻⁵)      7.00  7.00  [6.83, 7.19]
k2 (×10⁺⁶)      6.00  6.22  [5.77, 6.66]
ksyn (×10⁻²)    1.68  1.97  [1.46, 3.97]
KIB (×10⁻²)     1.00  0.74  [0.28, 1.28]
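Credible intervals such as those above are simply percentiles of the marginal Monte Carlo samples; a sketch with a nearest-rank convention (the paper does not specify its percentile interpolation, so this convention is our assumption):

```python
def credible_interval(samples, lo=0.5, hi=99.5):
    """Percentile-based credible interval along one axis of the sample set,
    using a simple nearest-rank rule."""
    s = sorted(samples)
    n = len(s)
    def pct(p):
        # p * n / 100 keeps the arithmetic exact for round percent values.
        return s[min(int(p * n / 100.0), n - 1)]
    return pct(lo), pct(hi)

# Toy marginal: 1000 evenly spaced sample values 1.0 .. 1000.0.
vals = [float(i) for i in range(1, 1001)]
lo_val, hi_val = credible_interval(vals)
```

Applied per coordinate of Θ1:L, this yields the 0.5%/99.5% bounds reported in the tables and figures.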
[Fig. 3 panels: (a)-(b) observation data Y and reconstructed trajectories of [B], [S], [M1], [M2], and [M3]; (c) joint posterior distribution P(ksyn, KIB|Y).]
Fig. 3. (a-b) Observation data Y and state trajectories reconstructed in the second benchmark; and (c) the joint posterior distribution of (ksyn , KIB )
confident spaces over (kf, [A](0)) and (kb, [A](0)) had hyperbolic shapes that reflected the bilinear form in the ODEs (3). To demonstrate the applicability to more realistic molecular cascades, we adopted a simulator of a bioreactor system developed by [15] in the second benchmark. The mathematical model corresponding to this system was given by the following ODEs:

$[\dot{\mathrm{B}}] = (\mu - q_{in})[\mathrm{B}]; \quad [\dot{\mathrm{S}}] = q_{in}(c_{in} - [\mathrm{S}]) - r_1 M_w [\mathrm{B}]; \quad [\dot{\mathrm{M1}}] = r_1 - r_2 - \mu[\mathrm{M1}]; \quad [\dot{\mathrm{M2}}] = r_2 - r_3 - \mu[\mathrm{M2}]; \quad [\dot{\mathrm{M3}}] = r_{syn} - \mu[\mathrm{M3}]$

$r_1 \equiv \frac{r_{1,max}[\mathrm{S}]}{K_S + [\mathrm{S}]}; \quad r_2 \equiv \frac{k_2[\mathrm{M1}]}{K_{M1} + [\mathrm{M1}]}; \quad r_3 \equiv \frac{r_{3,max}[\mathrm{M2}]}{K_{M2} + [\mathrm{M2}]}; \quad r_{syn} \equiv \frac{k_{syn} K_{IB} [\mathrm{M3}]}{K_{IB} + [\mathrm{M2}]}; \quad \mu = Y_s r_1,$
where Mw, r1,max, KS, KM1, r3,max, and KM2 were assumed to be known constants, the same as those reported in [15]. u ≡ (cin, qin) were control inputs in this simulator, and all the state variables x ≡ ([B], [S], [M1], [M2], [M3]) were observable. The observation time series Y were downloaded from the benchmark website⁶. The objective in this benchmark was to estimate the four unknown system parameters q1 ≡ (Ys, k2, ksyn, KIB). As a result of evaluating the posterior distribution over all the elements of θ, the MAP estimators and 99% credible intervals of q1 were obtained, as shown in Table 1. The observation noises in this benchmark were in fact distributed according to $\epsilon_i(t_n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, (0.1\, x_i(t_n))^2)$, which differs from our model (7). This inconsistency and the small amount of data led to MAP estimators that deviated slightly from the true values. However, it is notable that the model with the MAP estimators could reproduce the observation data Y fairly well, as shown in Figs. 3(a)-(b), so that the credible intervals provide good information about the plausibility/implausibility of the

⁶ http://sysbio.ist.uni-stuttgart.de/projects/benchmark
622
J. Yoshimoto and K. Doya

[Fig. 4 panels: (a) observation data Y1 and (b) observation data Y2, each with the state trajectories ([S] and [P]) reconstructed by the three model candidates C1, C2, and C3, together with bar charts of the log-evidence ln p(Y1|C;U) and ln p(Y2|C;U). For Y1: C1 = 373.41, C2 = 306.69, C3 = 168.29; for Y2: C1 = 189.74, C2 = 246.92, C3 = 246.16.]
Fig. 4. Brief summaries of results in the model selection problem
estimated parameters. In addition, we could find a nonlinear dependence between the two parameters (ksyn, KIB) even in this problem, as shown in Fig. 3(c). Finally, our Bayesian method was applied to a simple model selection problem. In this problem, we supposed that the observation data were obtained from the molecular cascades whose chemical equations were written as

Sin −→ S (rate constant k = 1);   S + E ⇌ SE ⇌ P + E (forward rates kf1, kf2; backward rates kb1, kb2),   (14)
where x ≡ ([S], [P], [E], [SE]) was a state vector of our target compartment, and only two elements, [S] and [P], were observable. u ≡ ([Sin]) was a control input to the compartment. We considered three model candidates depending on the reversibility of the reaction:

C1: The ODEs consistent with model (14) were constructed based on the law of mass action without any approximation. In this case, four system parameters (kf1, kb1, kf2, and kb2) were estimated.

C2: kb2 was assumed to be sufficiently small, so that the second part of model (14) was reduced to model (4). The estimated parameters were accordingly reduced to (kf1, kb1, kf2).

C3: In addition to the assumption of C2, the second part of model (14) was well approximated by the ODEs (5). In this case, the system parameters were further reduced to (KM1, kf2), where KM1 ≡ (kb1 + kf2)/kf1.

The goal of this benchmark was to select the most reasonable candidate from {C1, C2, C3} for the given observation data Y. We prepared two sets of observation data, Y1 and Y2, generated by model candidates C1 and C2, respectively. The circles in the first and second panels in Figs. 4(a-b) show the points of data Y1 and Y2. Here, the curves associated with the circles denote the state trajectories
Bayesian System Identification of Molecular Cascades
623
reconstructed by the three model candidates with their MAP estimators. Then, we evaluated the evidence p̃(Yd|Cc; U) for d = 1, 2 and c = 1, ..., 3 by Eq. (13). The results are shown in the bottom panels in Figs. 4(a-b), where the evidence is plotted on a natural-logarithm scale to avoid numerical underflow. As shown in Fig. 4(a), only C1 could satisfactorily reproduce the observation data Y1, while the others could not. Accordingly, the evidence for the model candidate C1 was maximum. For the observation data Y2, all the model candidates could reproduce them satisfactorily in a similar manner, as shown in Fig. 4(b). In this case, the evidence for the simplest model candidate C3 was nearly as high as that for C2, from which the observation data Y2 were generated. These results demonstrated that the evidence provided useful information for determining whether the structure of a molecular cascade could be further simplified.
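Once the log-evidences ln p(Y|C;U) have been computed, the comparison step amounts to picking the candidate with the largest value. A minimal sketch (not the authors' code) that also converts log-evidences into posterior model probabilities under an assumed uniform prior over candidates; the value 373.41 for C1 is the maximum reported for Y1, while the assignment of the two smaller values to C2 and C3 is illustrative:

```python
import math

def model_posteriors(log_evidence: dict) -> dict:
    """Posterior probabilities p(C|Y) from log-evidences ln p(Y|C),
    assuming a uniform model prior (log-sum-exp trick for stability)."""
    m = max(log_evidence.values())
    w = {c: math.exp(v - m) for c, v in log_evidence.items()}
    z = sum(w.values())
    return {c: wi / z for c, wi in w.items()}

# Log-evidence values for observation data Y1 (C1's value is the maximum):
post = model_posteriors({"C1": 373.41, "C2": 306.69, "C3": 168.29})
best = max(post, key=post.get)   # -> "C1", matching the paper's conclusion
```

With differences of tens of nats, the posterior mass concentrates almost entirely on the best candidate; the near-tie between C2 and C3 for Y2 is exactly the situation where these probabilities stay comparable.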
5 Conclusion
In this article, we presented a Bayesian method for the system identification of molecular cascades in biological systems. The advantage of this method is that it provides a theoretically sound framework unifying parameter estimation and model selection. Due to limited space, we have shown only small-scale benchmark results here. However, we have also confirmed that the precision of our MAP estimators was comparable to that of a good existing parameter estimation method in a medium-scale benchmark with 36 unknown system parameters [3]. The current limitations of our method are the difficulty of convergence diagnosis for the Monte Carlo sampling and the time-consuming computation for large-scale problems. We will address these limitations in future work.
References

1. Bhalla, U.S., Iyengar, R.: Emergent properties of networks of biological signaling pathways. Science 283, 381–387 (1999)
2. Doi, T., et al.: Inositol 1,4,5-trisphosphate-dependent Ca2+ threshold dynamics detect spike timing in cerebellar Purkinje cells. The Journal of Neuroscience 25(4), 950–961 (2005)
3. Moles, C.G., et al.: Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome Research 13(11), 2467–2474 (2003)
4. Faller, D., et al.: Simulation methods for optimal experimental design in systems biology. Simulation 79, 717–725 (2003)
5. Banga, J.R., et al.: Computation of optimal identification experiments for nonlinear dynamic process models: A stochastic global optimization approach. Industrial & Engineering Chemistry Research 41, 2425–2430 (2002)
6. Sugimoto, M., et al.: Reverse engineering of biochemical equations from time-course data by means of genetic programming. BioSystems 80, 155–164 (2005)
7. Gillespie, D.T.: The chemical Langevin equation. Journal of Chemical Physics 113(1), 297–306 (2000)
8. Crampin, E.J., et al.: Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Progress in Biophysics & Molecular Biology 86, 77–112 (2004)
9. Daley, D.J., Vere-Jones, D.: An Introduction to the Theory of Point Processes, 2nd edn. Springer, Heidelberg (2003)
10. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. John Wiley & Sons, Chichester (2000)
11. Andrieu, C., et al.: An introduction to MCMC for machine learning. Machine Learning 50(1-2), 5–43 (2003)
12. Gilks, W.R., et al.: Adaptive direction sampling. The Statistician 43(1), 179–189 (1994)
13. Hindmarsh, A.C., et al.: SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software 31, 363–396 (2005)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton (1986)
15. Kremling, A., et al.: A benchmark for methods in reverse engineering and model discrimination: Problem formulation and solutions. Genome Research 14(9), 1773–1785 (2004)
Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classification

Shir Li Wang1, Chen Change Loy2, Chee Peng Lim1,*, Weng Kin Lai2, and Kay Sin Tan3

1 School of Electrical & Electronic Engineering, University of Science Malaysia, Engineering Campus, 14300 Nibong Tebal, Penang, Malaysia
[email protected]
2 Centre for Advanced Informatics, MIMOS Berhad, 57000 Kuala Lumpur, Malaysia
3 Department of Medicine, Faculty of Medicine, University of Malaya, 50603 Kuala Lumpur, Malaysia
Abstract. One of the issues associated with pattern classification using data-based machine learning systems is the "curse of dimensionality". In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration of the circle-segments method with the machine learning systems has been applied to two case studies comprising one benchmark data set and one real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance even with more than 50% of the input features eliminated from the original data sets.

Keywords: Feature selection, circle-segments, data visualization, principal component analysis, machine learning techniques.
1 Introduction

Data-based machine learning systems have wide applications owing to their capability of learning from a set of representative data samples, with performance improving as more and more data samples are used for learning. These systems have been employed to tackle many modeling, prediction, and classification tasks [1-7]. However, one of the crucial issues pertaining to pattern classification using data-based machine learning techniques is the "curse of dimensionality" [2, 6, 8]. This is especially true because it is important to identify an optimal set of input features for learning, e.g. with the support vector machine (SVM) [9]. The same problem arises in other data-based machine learning systems as well.

* Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 625–634, 2008. © Springer-Verlag Berlin Heidelberg 2008
In pattern classification, the main task is to learn and construct an appropriate function that is able to assign a set of input features to a finite set of classes. If noisy and irrelevant input features are used, the learning process may fail to formulate a decision boundary with discriminatory power for data samples of the various classes. As a result, feature selection has a significant impact on classification accuracy [9]. Useful feature selection methods for machine learning systems include principal component analysis (PCA), the genetic algorithm (GA), as well as other data visualization techniques [1-2, 6, 8-13]. Despite the good performance of PCA and GA in feature selection [1-2, 6, 8-9], the circle-segments method, a data visualization technique, is investigated in this paper. Previously, the application of circle-segments was confined to displaying the history of stock data [14]. In this research, the circle-segments method is used in a different way, i.e., to display the possible relationships between the input features and the output classes. More importantly, the circle-segments method allows the involvement of humans (domain users) in the process of data exploration and analysis. Indeed, using the circle-segments method for feature selection addresses not only the accuracy of a classification system, but also the comprehensibility of the system. Here, comprehensibility refers to how easily a system can be assessed by humans [5]. In this regard, the circle-segments method provides a platform for easy interpretation and explanation of the feature selection decision by a domain user. The user can obtain an overall picture of the input-output relationship of the data set, and interpret the possible relationships between the input features and the output classes based on additional, possibly intuitive, information.
In this paper, we apply the circle-segments method as a feature selection method for pattern classification with four different machine learning systems, i.e., Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). Two case studies (one benchmark data set and one real data set) are used to evaluate the effectiveness of the proposed approach. For each case study, the data set is first divided into a training set and a test set. The circle-segments method and PCA are used to select a set of important input features for learning with MLP, SVM, FAM, and kNN. The results are compared with those obtained without any feature selection method. The organization of this paper is as follows. Section 2 presents a description of the circle-segments method. Section 3 describes the case studies; the results are presented, analyzed, and discussed there. A summary of the research findings, contributions, and suggestions for further work is given in Section 4.
2 The Circle-Segments Method

Generally, the circle-segments method comprises three stages, i.e., dividing, ordering, and colouring. In the dividing stage, a circle is divided equally according to the number of features/attributes involved. For example, assume a process consists of seven input features and one output feature (which can take a number of discrete classes). Then, the circle is divided into eight equal segments, with one segment representing the output feature and the others representing the seven input features.
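The dividing stage can be sketched in a few lines (an illustrative helper, not the authors' implementation): one segment per feature, inputs plus the output, each spanning an equal angle of the circle.

```python
import math

def segment_angles(n_inputs):
    """Dividing stage: (n_inputs + 1) equal segments of the circle,
    returned as (start, end) angles in radians."""
    n_segments = n_inputs + 1          # inputs plus one output segment
    width = 2 * math.pi / n_segments
    return [(i * width, (i + 1) * width) for i in range(n_segments)]

angles = segment_angles(7)   # 8 segments of 45 degrees each
```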
In the ordering stage, a proper way of sorting the data samples is needed to ensure that all the features can be placed into the space of the circle. Since the focus is on the effects of each input feature on the output, the correlation between the input features and the output is used to sort the data samples accordingly. Correlation has been used as a feature selection strategy in classification problems [17-18]; the results showed that correlation-based feature selection was able to determine representative input features, and thus improve classification accuracy. In this paper, correlation is used to sort the priority of the input features involved. The input features that have a higher influence on the output are sorted first, followed by the less influential input features. Correlation is defined in Equation (1):

r_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )    (1)
where x_i and y_i refer to the input and output features respectively, and n is the number of samples. For ease of understanding, an example is used to illustrate the ordering stage. Assume a data set of n samples. Each sample consists of seven input features, (x1, x2, …, x7), and one output feature, y, which represents four discrete classes. The combinations of the input-output data can be represented by a matrix, A, with 8 columns and n rows, as in Table 1. The original values of the input-output data are first normalized between 0 and 1. This facilitates mapping of the input features into the colour map, whereby the colour values are between 0 and 1. The correlation of each input (x1, x2, …, x7) with the output is denoted as rx1, rx2, …, rx7.

Table 1. The n samples of input-output data

           1st column (Output) y   2nd column x1   3rd column x2   …   8th column x7
Sample 1   Output class            x11             x12             …   x17
Sample 2   Output class            x21             x22             …   x27
…          …                       …               …               …   …
Sample n   Output class            xn1             xn2             …   xn7
Assume the magnitudes of the correlation are as described in Equation (2):

R: rx1 > rx2 > … > rx7    (2)
then matrix A is sorted based on the output column (1st column). When the output values are equal, the rows of matrix A are further sorted based on the column order specified in the vector Q, which lists the columns in descending order of correlation magnitude, as shown in Equation (3):

Q = [Cx1, Cx2, …, Cx7]    (3)
where C xi refers to the column order for feature i. Based on this example, the rows of matrix A are first sorted by the output column (1st column). When the output values are equal, the rows of A are further sorted by
column x2. When the values in column x2 are equal, the rows are further sorted by column x3, and so on, according to the column order specified in Q. After the ordering stage, matrix A has a new order. The first row of data in matrix A is located at the centre of the circle, while the last row of the data is located at the outside of the circle, as shown in Figure 1. The remaining data between these two rows are placed into the circle-segments based on their row order. In the colouring stage, colour values are used to encode the relevance of each data value to the original value based on a colour map. This relevance measure is in the range of 0 to 1. Based on the colour map located at the right side of Figure 1, the highest and lowest values are represented by dark red and dark blue, respectively. With the help of this pseudo-colour approach, the data samples within each segment are linearly transformed into colours. Therefore, a combination of colours along the perimeter represents a combination of the input-output data.
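The ordering stage described above can be sketched as follows (an illustrative NumPy implementation, not the authors' code): Equation (1) is evaluated directly for each feature, the features are ranked by correlation magnitude, and the rows are then sorted lexicographically with the output as the primary key.

```python
import numpy as np

def pearson(x, y):
    """Sample correlation r_xy, as in Equation (1)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def order_for_circle_segments(X, y):
    """Ordering stage: rank input features by |correlation| with the output,
    then sort rows by (output, most-correlated feature, next feature, ...)."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # to [0, 1]
    r = np.array([pearson(X[:, j], y) for j in range(X.shape[1])])
    feat_order = np.argsort(-np.abs(r))          # highest |r_xy| first
    # np.lexsort treats its LAST key as primary, so pass keys in reverse.
    keys = [X[:, j] for j in feat_order[::-1]] + [y]
    rows = np.lexsort(keys)                      # row 0 goes to the circle's centre
    return rows, feat_order

rng = np.random.default_rng(0)
X = rng.random((200, 7))
y = (X[:, 2] > 0.5).astype(float)   # output driven mainly by feature 2
rows, feats = order_for_circle_segments(X, y)
# feats[0] == 2: the class-driving feature is ranked first,
# and y[rows] is non-decreasing (output is the primary sort key).
```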
Fig. 1. Circle-segments with 7 input features and 1 output feature
3 Experiments

3.1 Data Sets
Two data sets are used to evaluate the effectiveness of the circle-segments method for feature selection. The first is the Iris data set obtained from the UCI Machine Learning Repository [15]. The data set has 150 samples, with 4 input features (sepal length, sepal width, petal length, and petal width) and 3 output classes (Iris Setosa, Iris Versicolour, and Iris Virginica). There are 50 samples in each output class. The second is a real medical data set of suspected acute stroke patients. The input features comprise patients' medical history, physical examination, laboratory test results, etc. The task is to predict the Rankin Scale category of patients upon discharge: either class 1 (Rankin scale between 0 and 1; 141 samples) or class 2 (Rankin scale between 2 and 6; 521 samples). After consultation with medical experts, a total of 18 input features, denoted as V1, V2, ..., V18, were selected.
3.2 Iris Classification
Figure 2 shows the circle-segments of the input-output data for the Iris data set. The circle-segments display for this three-class problem demonstrates the discriminatory power of the input features towards classification. Note that Iris Setosa, Iris Versicolour, and Iris Virginica are represented by dark blue, green, and dark red, respectively. Observing the projection of the Iris data into the circle-segments, the segments for petal length and petal width show significant colour changes as they propagate from the centre (blue) to the perimeter of the circle (red). By comparing these segments with segment Class, it is clear that petal width and petal length have a strong discriminatory power that can segregate the classes well. The discriminatory power of the other two segments is not as obvious, owing to colour overlapping; there is no clear progression of colour changes from blue to red. In segment Sepal Width, both Iris Versicolour (green) and Iris Virginica (dark red) have colour values lower than 0.6 for sepal width. Colour overlapping can also be observed for Iris Versicolour and Iris Virginica in segment Sepal Length, where the colour values of sepal length for both classes are distributed between 0.5 and 0.6. Therefore, sepal width and sepal length are only useful for differentiating Iris Setosa from the other two classes. As a result, the circle-segments method shows that petal width and petal length are important input features that can be used to classify the Iris data samples.
Fig. 2. Circle-segments of Iris data set
PCA is also used to extract significant input features. From the cumulative values shown in Table 2, the first principal component already accounts for 84% of the total variation. The features that tend to have a strong relationship with a principal component are those whose eigenvector coefficients are larger in absolute value than the others [16]. Therefore, the results in Table 3 indicate that the variables with a strong relationship to the first principal component are petal width and petal length, because their eigenvector coefficients are larger in absolute value than the others.

Table 2. Eigenvalues of the covariance matrix of the Iris data set

Principal    PC1    PC2    PC3    PC4
Eigenvalue   0.232  0.032  0.010  0.002
Proportion   0.841  0.117  0.035  0.006
Cumulative   0.841  0.959  0.994  1.000
Table 3. Eigenvectors of the first principal component of the Iris data set

               PC1
Sepal Length   -0.425
Sepal Width     0.146
Petal Length   -0.616
Petal Width    -0.647
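The PCA analysis behind Tables 2-3 (eigendecomposition of the covariance matrix, variance proportions, and the PC1 loadings used to pick features) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code, and the data are a synthetic stand-in for Iris in which two "petal-like" columns carry most of the shared variance:

```python
import numpy as np

def pca_summary(X):
    """Eigenvalues/eigenvectors of the covariance matrix, as in Tables 2-3."""
    C = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(C)            # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # largest eigenvalue first
    proportion = vals / vals.sum()
    cumulative = np.cumsum(proportion)
    return vals, vecs, proportion, cumulative

# Synthetic 4-feature data: columns 2 and 3 dominate the shared variation.
rng = np.random.default_rng(1)
t = rng.normal(size=400)
X = np.column_stack([0.3 * t + 0.05 * rng.normal(size=400),
                     0.1 * rng.normal(size=400),
                     0.9 * t + 0.05 * rng.normal(size=400),
                     1.0 * t + 0.05 * rng.normal(size=400)])
vals, vecs, prop, cum = pca_summary(X)
# Feature selection: keep the features with the largest |loading| on PC1.
important = np.argsort(-np.abs(vecs[:, 0]))[:2]
```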
The results obtained from the circle-segments and PCA methods suggest that petal width and petal length have more discriminatory power than the other two input features. As such, two data sets are produced: one containing the original data samples, and another containing only petal width and petal length. Both data sets are used to train and evaluate the performance of the four machine learning systems.

The free parameters of the four machine learning systems were determined by trial-and-error, as follows. For MLP, the early stopping method was used to adjust the number of hidden nodes and training epochs. The fast learning rule was used to train the FAM network. The baseline vigilance parameter was determined by a fine-tuning process performing leave-one-out cross-validation on the training data sets. The same process was used to determine the number of neighbours for kNN. The radial basis function was selected as the kernel for SVM. A grid search with ten-fold cross-validation was performed to find the best values for the parameters of SVM.

Table 4 shows the overall average classification accuracy rates of 10 runs for the Iris data set. Although the Iris problem is a simple one, the results demonstrate that it is useful to apply feature selection methods to identify the important input features for pattern classification. As shown in Table 4, the accuracy rates are better, if not the same, even with 50% of the input features eliminated.

Table 4. Classification results of the Iris data set

Method   Accuracy (%) (before feature selection)   Accuracy (%) (after feature selection)
MLP      88.67                                     98.00
FAM      96.33                                     97.33
SVM      100                                       100
kNN      100                                       100
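The leave-one-out tuning described above can be sketched with a plain NumPy k-nearest-neighbour classifier. This is illustrative code, not the authors' implementation, and the well-separated two-class data are synthetic, not the Iris set:

```python
import numpy as np

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of a k-nearest-neighbour classifier."""
    n = len(y)
    correct = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # leave sample i out
        nbrs = y[np.argsort(d)[:k]]            # labels of k nearest neighbours
        correct += np.bincount(nbrs).argmax() == y[i]
    return correct / n

def tune_k(X, y, candidates=(1, 3, 5, 7)):
    """Pick the number of neighbours by leave-one-out cross-validation."""
    return max(candidates, key=lambda k: loo_accuracy(X, y, k))

# Two well-separated synthetic classes: any small k scores perfectly,
# so the first candidate (k = 1) is selected.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
best_k = tune_k(X, y)
```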
3.3 Suspected Acute Stroke Patients
Figure 3 shows the circle-segments of the input-output data for the stroke data set. Based on the circle-segments, one can observe that the data samples are dominated by class 0 (dark red). Observing the projection of the data into the circle-segments, segments V8, V16, and V18 show significant colour changes from the centre (class 0) to the perimeter of the circle (class 1). From segment V8, one can see that most of the class 0 samples have colour values equal to or lower than 0.4, while for class 1, they are between 0.5 and 0.8. In segment V16, most of the class 0 samples are distributed within the colour range lower than 0.4, while for class 1, they are around 0.4. By observing segment V18, most of the class 0 samples have colour values lower than 0.7, while for class 1, they are mostly at around 0.4, with only a few samples between 0.7 and 1.0. The rest of the circle-segments do not depict a clear progression of colour changes pertaining to the two output classes.
The PCA method is again used to analyse the data set. According to [16], for a real data set, five or six principal components may be required to account for 70% to 75% of the total variation, and the more principal components that are required, the less useful each one becomes. As shown in Table 5, the cumulative values indicate that six principal components account for 72.1% of the total variation. Thus, six principal components are selected for further analysis. Table 6 presents the eigenvectors of the six principal components. The variables that have strong relationship with each principal component are in bold. These variables, i.e., V2, V3, V4, V6, V7, V8, V16, V17, and V18, have eigenvectors larger (in absolute value) than those of the others. Therefore, they are identified as the important input features.
Fig. 3. Circle-segments of the stroke data

Table 5. Eigenvalues of the covariance matrix of the stroke data set

Principal   Eigenvalue   Proportion   Cumulative
PC1         0.33573      0.262        0.262
PC2         0.15896      0.124        0.386
PC3         0.14386      0.112        0.499
PC4         0.11533      0.090        0.589
PC5         0.09023      0.070        0.659
PC6         0.07917      0.062        0.721
PC7         0.07403      0.058        0.779
PC8         0.06239      0.049        0.828
PC9         0.05390      0.042        0.870
PC10        0.04268      0.033        0.903
PC11        0.03252      0.025        0.929
PC12        0.03080      0.024        0.953
PC13        0.02588      0.020        0.973
PC14        0.01546      0.012        0.985
PC15        0.00894      0.007        0.992
PC16        0.00584      0.005        0.996
PC17        0.00293      0.002        0.999
PC18        0.00160      0.001        1.000
Three data sets are produced: one containing the original data samples, and the other two containing the important input features identified by circle-segments and PCA, respectively. Three performance indicators commonly used in medical diagnostic systems are computed: accuracy (ratio of correct diagnoses to the total number of patients), sensitivity (ratio of correct positive diagnoses to the number of patients with the disease), and specificity (ratio of correct negative diagnoses to the number of patients without the disease).
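These three indicators follow directly from confusion-matrix counts; a small sketch (the example counts are invented, not taken from the stroke data):

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # correct positives / patients with disease
    specificity = tn / (tn + fp)   # correct negatives / patients without disease
    return accuracy, sensitivity, specificity

# E.g. 90 of 100 diseased and 160 of 200 healthy patients classified correctly:
acc, sens, spec = diagnostic_metrics(tp=90, fn=10, tn=160, fp=40)
# acc = 250/300, sens = 0.9, spec = 0.8
```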
Table 6. Eigenvectors of the first six principal components of the stroke data set

       PC1     PC2     PC3     PC4     PC5     PC6
V1     0.098   0.054  -0.083   0.042  -0.038   0.048
V2     0.611  -0.218   0.364   0.584  -0.316   0.024
V3     0.052   0.362  -0.064  -0.262  -0.741  -0.102
V4     0.520  -0.059  -0.188  -0.252   0.333  -0.410
V5    -0.054  -0.029   0.179  -0.244  -0.300  -0.023
V6     0.569   0.128  -0.027  -0.499   0.088   0.332
V7     0.039  -0.116  -0.389   0.090  -0.163  -0.683
V8    -0.035   0.122   0.407  -0.131   0.027  -0.170
V9     0.017   0.003   0.024  -0.049  -0.050   0.008
V10   -0.011   0.015   0.064  -0.073  -0.066   0.079
V11    0.006  -0.013  -0.023   0.045   0.114   0.136
V12    0.002  -0.011  -0.023   0.019   0.027   0.026
V13    0.018  -0.046  -0.054   0.059   0.082   0.142
V14    0.002  -0.015  -0.029   0.020   0.023   0.017
V15   -0.013   0.011   0.104  -0.072  -0.010  -0.071
V16    0.065   0.064   0.616  -0.217   0.148  -0.328
V17    0.086   0.870  -0.018   0.355   0.214  -0.083
V18    0.056   0.072  -0.267  -0.013  -0.136   0.216
Table 7 summarises the average results of 10 runs for the stroke data set. In general, the classification performance improves with feature selection, whether using PCA or circle-segments. The circle-segments method yields the best accuracy rates for all four machine learning systems, despite the fact that only three features are used for classification (a reduction of 83% of the number of input features). In terms of sensitivity, the circle-segments method also shows improvements (except for FAM) of 17%-35% as compared with those before feature selection. The specificity rates are better than those before feature selection for MLP and FAM, but inferior for SVM and kNN.

Table 7. Classification results and the number of input features for the stroke data set

          Before feature selection       Feature selection with PCA     Feature selection with circle-segments
          (18 input features)            (8 input features)             (3 input features)
Methods   Acc (%)   Sen (%)   Spe (%)    Acc (%)   Sen (%)   Spe (%)    Acc (%)   Sen (%)   Spe (%)
MLP       81.79     44.29     91.70      82.39     33.57     95.28      86.57     61.79     93.11
FAM       72.67     57.14     76.79      72.39     49.29     78.49      73.81     51.07     79.81
SVM       80.60     21.43     96.23      82.09     25.00     97.17      86.57     57.14     94.34
kNN       82.09     32.14     95.28      81.42     32.14     94.43      85.45     57.14     92.92

(Acc = Accuracy, Sen = Sensitivity, Spe = Specificity)
From Table 7, it seems that there is a trade-off between sensitivity and specificity with the use of the circle-segments method, i.e., a more substantial improvement in sensitivity with a marginal degradation in specificity for SVM and kNN, and a less substantial improvement in sensitivity with a marginal improvement in specificity for MLP and FAM. This observation is interesting, as it is important for a medical
diagnostic system to have high sensitivity as well as high specificity rates, so that patients with and without the disease can be identified accurately. Comparing the results of circle-segments and PCA, the accuracy and sensitivity rates of circle-segments are better than those of PCA. The trade-off is that PCA generally yields better specificity rates. Again, it is interesting to note the substantial improvement in sensitivity with marginally inferior performance in specificity for both the circle-segments and PCA results. Another observation is that the PCA results do not show substantial improvements in accuracy, sensitivity, or specificity as compared with those from the original data set.
4 Summary and Further Work

Feature selection in a data set based on data visualization is an interactive process involving the judgments of domain users. By identifying the patterns in the circle-segments, the important input features are distinguished from the less important ones. The circle-segments method enables domain users to carry out the necessary filtering of input features, rather than feeding the whole data set, for learning with machine learning systems. With the circle-segments, domain users can visualize the relationships of the input-output data and comprehend the rationale for how the selection of input features is made. The results obtained from the two case studies positively demonstrate the usefulness of the circle-segments method for feature selection in pattern classification problems using four machine learning systems.

Although the circle-segments method is useful, selection of the important features is highly dependent on the users' knowledge, expertise, interpretation, and judgement. As such, it would be beneficial if some objective function could be integrated with the circle-segments method to quantify the information content of the selected features with respect to the original data sets. In addition, the use of the circle-segments method may be difficult when the problem involves high-dimensional data, as users face limitations in analyzing and extracting patterns from massive data sets. It is infeasible to carry out the data analysis process when the dimension of the data is high, covering hundreds to thousands of variables. Therefore, a better characterization method for reducing the dimensionality of the data set is needed.
References

1. Lerner, B., Levinstein, M., Rosenberg, B., Guterman, H., Dinstein, I., Romem, Y.: Feature Selection and Chromosome Classification Using a Multilayer Perceptron. IEEE World Congress on Computational Intelligence 6, 3540–3545 (1994)
2. Kavzoglu, T., Mather, P.M.: Using Feature Selection Techniques to Produce Smaller Neural Networks with Better Generalisation Capabilities. Geoscience and Remote Sensing Symposium 7, 3069–3071 (2000)
3. Lou, W.G., Nakai, S.: Application of artificial neural networks for predicting the thermal inactivation of bacteria: a combined effect of temperature, pH and water activity. Food Research International 34, 573–579 (2001)
4. Spedding, T.A., Wang, Z.Q.: Study on modeling of wire EDM process. Journal of Materials Processing Technology 69, 18–28 (1997)
5. Meesad, P., Yen, G.G.: Combined Numerical and Linguistic Knowledge Representation and Its Application to Medical Diagnosis. IEEE Transactions on Systems, Man and Cybernetics, Part A 33, 202–222 (2003)
6. Harandi, M.T., Ahmadabadi, M.N., Arabi, B.N., Lucas, C.: Feature selection using genetic algorithm and its application to face recognition. In: Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent Systems, pp. 1368–1373 (2004)
7. Wu, T.K., Huang, S.C., Meng, Y.R.: Evaluation of ANN and SVM classifiers as predictors to the diagnosis of students with learning disabilities. Expert Systems with Applications (Article in Press)
8. Melo, J.C.B., Cavalcanti, G.D.C., Guimarães, K.S.: PCA feature selection for protein structure prediction. In: Proceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 2952–2957 (2003)
9. Huang, C.L., Wang, C.J.: A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications 31, 231–240 (2006)
10. Johansson, J., Treloar, R., Jern, M.: Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data. In: Proceedings of the Eighth International Conference on Information Visualization (IV 2004), pp. 52–57 (2004)
11. McCarthy, J.F., Marx, K.A., Hoffman, P.E., Gee, A.G., O'Neil, P., Ujwal, M.L., Hotchkiss, J.: Applications of Machine Learning and High-Dimensional Visualization in Cancer Detection, Diagnosis, and Management. Annals of the New York Academy of Sciences 1020, 239–262 (2004)
12. Rutkowska, D.: IF-THEN rules in neural networks for classification. In: Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC 2005), vol. 2, pp. 776–780 (2005)
13. Hoffman, P., Grinstein, G., Marx, K., Grosse, I., Stanley, E.: DNA visual and analytic data mining. In: Proceedings of Visualization 1997, pp. 437–441 (1997)
14. Ankerst, M., Keim, D.A., Kriegel, H.P.: 'Circle Segments': A Technique for Visually Exploring Large Multidimensional Data Sets. In: Proc. Visualization 1996, Hot Topics Session (1996)
15. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
16. Johnson, D.E.: Applied Multivariate Methods for Data Analysts. Duxbury Press, USA (1998)
17. Michalak, K., Kwasnicka, H.: Correlation-based Feature Selection Strategy in Neural Classification. In: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA 2006), vol. 1, pp. 741–746 (2006)
18. Kim, K.-J., Cho, S.-B.: Ensemble Classifiers based on Correlation Analysis for DNA Microarray Classification. Neurocomputing 70(1-3), 187–199 (2006)
Extraction of Approximate Independent Components from Large Natural Scenes

Yoshitatsu Matsuda¹ and Kazunori Yamaguchi²

¹ Department of Integrated Information Technology, Aoyama Gakuin University, 5-10-1 Fuchinobe, Sagamihara-shi, Kanagawa, 229-8558, Japan
[email protected]
http://www-haradalb.it.aoyama.ac.jp/~matsuda
² Department of General Systems Studies, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1, Komaba, Meguro-ku, Tokyo, 153-8902, Japan
[email protected]
Abstract. Linear multilayer ICA (LMICA) is an approximate algorithm for independent component analysis (ICA). In LMICA, approximate independent components are efficiently estimated by optimizing only highly-dependent pairs of signals. Recently, a new method named "recursive multidimensional scaling (recursive MDS)" has been proposed for selecting pairs of highly-dependent signals. In recursive MDS, the signals are first sorted by one-dimensional MDS; the sorted signals are then divided into two sections, each of which is sorted by MDS recursively. Because recursive MDS is based on adaptive PCA, it does not need step-size control and its global optimality is guaranteed. In this paper, the LMICA algorithm with recursive MDS is applied to large natural scenes, and the extracted independent components of large scenes are compared with those of small scenes in terms of four statistics: the positions, the orientations, the lengths, and the length to width ratios of the generated edge detectors. While there are no distinct differences in the positions and the orientations, the lengths and the length to width ratios of the components from large scenes are greater than those from small ones. In other words, longer and sharper edges are extracted from large natural scenes.
1 Introduction
Independent component analysis (ICA) is a widely-used method in signal processing [1,2,3]. It solves blind source separation problems under the assumption that the source signals are statistically independent of each other. Though many efficient ICA algorithms have been proposed [4,5], ICA nevertheless requires heavy computation for optimizing nonlinear functions. In order to avoid this problem, linear multilayer ICA (LMICA) has recently been proposed [6]. LMICA is a variation of the Jacobi methods, where the sources are extracted by maximizing the independency of each pair of signals [7].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 635–642, 2008. © Springer-Verlag Berlin Heidelberg 2008

The difference is that LMICA optimizes
only pairs of highly-dependent signals instead of all pairs. LMICA is based on the intuition that optimizations of highly-dependent pairs probably increase the independency of all the signals more than those of low-dependent pairs, and its validity was verified by numerical experiments on natural scenes. In addition, a method named "recursive multidimensional scaling (MDS)" has been proposed for improving the selection of highly-dependent signals [8]. The method is based on the repetition of the simple MDS: it sorts the signals into a one-dimensional array by MDS, then divides the array into former and latter sections and sorts each of them recursively. In consequence, highly-correlated signals are brought into a neighborhood. Because the simple MDS is equivalent to PCA [9], it can be solved efficiently, without step-size control, by adaptive PCA algorithms (e.g., PAST [10]). The global optimality of adaptive PCA is guaranteed if the number of steps is sufficient. In this paper, the above LMICA algorithm with recursive MDS was applied to large natural scenes, and the results were compared with those for small natural scenes in terms of the following four statistics: the positions, the orientations, the lengths, and the length to width ratios of the generated edge detectors. As a result, it was observed that the independent components from large natural scenes are longer and sharper edge detectors than those from small ones. This paper is organized as follows. Section 2 gives a brief description of LMICA with recursive MDS. Section 3 shows the results of numerical experiments on small and large natural scenes. The paper is concluded in Sect. 4.
2 LMICA with Recursive MDS
Here, LMICA with recursive MDS is described in brief. See [6] and [8] for details.

2.1 MaxKurt Algorithm
LMICA is an algorithm for extracting approximate independent components from signals. It is based on MaxKurt [7], which minimizes a contrast function of kurtoses by optimally rotating pairs of signals. The observed signals $x = (x_i)$ are assumed to be prewhitened. One iteration of MaxKurt is then given as follows: pick up every pair $(i, j)$ of signals, and find the optimal rotation $\hat{\theta}_{kurt}$ given as

$$\hat{\theta}_{kurt} = \arg\min_{\theta} \; - E\left\{ (x_i')^4 + (x_j')^4 \right\}, \qquad (1)$$

where $E\{\cdot\}$ is the expectation operator, $x_i' = \cos\theta \cdot x_i + \sin\theta \cdot x_j$, and $x_j' = -\sin\theta \cdot x_i + \cos\theta \cdot x_j$. $\hat{\theta}_{kurt}$ is determined analytically and calculated easily. The contrast $E\{(x_i')^4\}$ can be generalized to $E\{G(x_i')\}$, where $G(u)$ is any suitable function such as $\log(\cosh(u))$. In this case, $\theta$ is initially set to $\hat{\theta}_{kurt}$, because it is expected to give a good initial value,
then θ is incrementally updated by Newton's method to minimize $E\{G(x_i') + G(x_j')\}$ w.r.t. θ. Though this is much more time-consuming, it is expected to be much more robust to outliers.
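As an illustration of this pairwise rotation, the sketch below searches numerically for the contrast-minimizing angle on a grid; note that this grid search is only an illustrative stand-in of our own (MaxKurt itself computes $\hat{\theta}_{kurt}$ analytically [7]), and the function name is ours:

```python
import numpy as np

def maxkurt_pair(xi, xj, n_grid=360):
    """Find the rotation angle minimizing -E[(xi')^4 + (xj')^4] for one
    pair of prewhitened signals by grid search (an illustrative stand-in
    for the analytic MaxKurt angle)."""
    thetas = np.linspace(0.0, np.pi / 2, n_grid)  # the contrast has period pi/2
    best_theta, best_val = 0.0, np.inf
    for th in thetas:
        c, s = np.cos(th), np.sin(th)
        u = c * xi + s * xj           # rotated pair
        v = -s * xi + c * xj
        val = -np.mean(u ** 4) - np.mean(v ** 4)
        if val < best_val:
            best_val, best_theta = val, th
    c, s = np.cos(best_theta), np.sin(best_theta)
    return best_theta, c * xi + s * xj, -s * xi + c * xj
```

Since the grid includes the identity rotation, the returned pair never has a worse contrast value than the input pair.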
2.2 Recursive MDS
In MaxKurt, all pairs are optimized. In contrast, LMICA optimizes only pairs that are highly correlated in the higher-order statistics, so that approximate components can be extracted quite efficiently. In order to select such pairs, LMICA uses recursive MDS to form a one-dimensional array in which highly-correlated signals are near to each other, and then optimizes only the nearest-neighbor pairs of signals in the array. In recursive MDS, a one-dimensional mapping is first formed by a simple MDS method [9], where the original distance $D_{ij}$ between the $i$th and $j$th signals is defined by

$$D_{ij} = E\left\{ \left( x_i^2 - x_j^2 \right)^2 \right\}. \qquad (2)$$

Because $D_{ij}$ is greater when $x_i$ and $x_j$ are more independent of each other in the higher-order statistics, highly-dependent signals are globally close together in the formed map. Such a one-dimensional map is easily transformed into a discrete array by sorting it. Because the simple MDS utilizes all the distances among the signals instead of only the neighbor relations, the nearest-neighbor pairs in the formed array do not always correspond to the pairs with the smallest distance $D_{ij}$. In order to alleviate this problem, the array is divided into former and latter parts and MDS is applied to each part recursively. If a part includes only two signals, the recursion is terminated and pair optimization is applied. If a part includes three signals, the pair of the first and second signals and that of the second and third ones are optimized after MDS is applied to the three signals. The whole algorithm RMDS(signals) is described as follows:
RMDS(signals):
  If the number of signals N is 2, optimize the pair and return.
  Otherwise:
    1. Sort the signals into a one-dimensional array by the simple MDS.
    2. If N is 3, optimize the pair of the first and second signals and then that of the second and third ones; otherwise, apply RMDS(the former section of signals) and RMDS(the latter section of signals).
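A compact sketch of this recursion follows (naming is ours, and the one-dimensional simple MDS is solved here with an exact SVD rather than the adaptive PAST algorithm, purely for brevity):

```python
import numpy as np

def rmds(indices, X, optimize_pair):
    """Recursive MDS over the signals selected by `indices`.
    X has shape (samples, signals); `optimize_pair(i, j)` stands in for
    the pairwise MaxKurt rotation applied to signals i and j."""
    n = len(indices)
    if n < 2:
        return
    if n == 2:
        optimize_pair(indices[0], indices[1])
        return
    # 1-D simple MDS: order the signals by the first principal component
    # of the centered squared signals z_i = x_i^2 - mean_i(x_i^2)
    Z = X[:, indices] ** 2
    Z = Z - Z.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    order = [indices[i] for i in np.argsort(Vt[0])]
    if n == 3:
        optimize_pair(order[0], order[1])
        optimize_pair(order[1], order[2])
        return
    rmds(order[: n // 2], X, optimize_pair)
    rmds(order[n // 2:], X, optimize_pair)
```

For a power of two, e.g. 8 signals, the recursion bottoms out in 4 leaf pairs that partition the signals.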
Note that recursive MDS also does not guarantee that the $D_{ij}$ of a nearest-neighbor pair is the smallest, but numerical experiments have verified the validity of the method [8]. Regarding the algorithm of the simple MDS in the one-dimensional space, its solution is equivalent to the first principal component of the covariance matrix of the transformed signals $z$, each component of which is given as

$$z_i = x_i^2 - \frac{\sum_i x_i^2}{N},$$

where $N$ is the number of signals [9]. Therefore, the MDS problem can be solved efficiently by applying an adaptive PCA algorithm to a sequence of signals $z$. Here, the well-known PAST algorithm [10] is employed. It is fast and does not need step-size control. PAST repeats the following procedure $T$ times, where $y = (y_i)$ holds the coordinates of the signals in the one-dimensional space:
1. Pick up a $z$ randomly.
2. Calculate $\alpha = \sum_i y_i z_i$ and update $\beta := \beta + \alpha^2$.
3. Calculate $e = (e_i)$ as $e_i = z_i - \alpha y_i$.
4. Update $y := y + \frac{\alpha}{\beta} e$.
The initial y is given randomly and the initial β is set to 1.0. This algorithm is guaranteed to converge to the global optimum. The value of T is the only parameter to set empirically.
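The steps above can be sketched as follows (the function name is ours, and we use the standard PAST accumulation $\beta \leftarrow \beta + \alpha^2$):

```python
import numpy as np

def past_first_component(Z, T=5000, seed=0):
    """Estimate the first principal component of the rows of Z with the
    PAST algorithm [10]; no step-size control is required because beta
    accumulates the projected energy."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(Z.shape[1])   # random initial coordinates
    y /= np.linalg.norm(y)
    beta = 1.0
    for _ in range(T):
        z = Z[rng.integers(len(Z))]       # pick up a z randomly
        alpha = y @ z                     # projection onto the current y
        beta += alpha ** 2                # accumulated projected energy
        e = z - alpha * y                 # residual
        y = y + (alpha / beta) * e        # PAST update
    return y / np.linalg.norm(y)
```

The effective step size α/β shrinks automatically as β grows, which is why no step-size schedule has to be tuned.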
3 Results

3.1 Experimental Settings
Two experiments were carried out to compare the independent components of small natural scenes with those of large ones. In the first experiment, FastICA [4] was applied to 10 sets of 30000 small natural scenes of 12 × 12 pixels with the symmetrical approach and $G(u) = \log(\cosh(u))$. Because no dimension-reduction method was applied, the total number of extracted components is 1440 (144 × 10). Second, LMICA with recursive MDS was applied to 100000 large images of 64 × 64 pixels for 1000 layers with $G(u) = \log(\cosh(u))$. The original images were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. The large images were prewhitened by ZCA. Note that an image of 64 × 64 pixels is not regarded as "large" in general; since such images are large enough in comparison with the image patches usually used in other ICA algorithms, they are referred to as large images in this paper.

3.2 Experimental Results
Figure 1 shows that the decreasing curve of the contrast function nearly converged around 1000 layers, which means that LMICA reached an approximately optimal solution at the 1000th layer. Figure 2 displays the independent components extracted in the two experiments; it shows that edge detectors were generated in both cases. For comparative experiments on the efficiency of our method against other ICA algorithms, see [8]. In order to examine the statistical properties of the generated edge detectors, they were analyzed in a similar way to [11]. First, the envelope of each detector was calculated by the Hilbert transformation. Next, the dominant elements were strengthened by raising each element to the fourth power. Then, the means and
Fig. 1. Decreasing curve of the contrast function $E[\log\cosh x]$ by LMICA for large natural scenes, plotted over 1000 layers

Fig. 2. Independent components extracted from (a) small natural scenes and (b) large natural scenes
covariances on each detector were calculated by regarding the values of the elements as a probability distribution on the discrete two-dimensional space ($[0.5, 11.5] \times [0.5, 11.5]$ for small scenes or $[0.5, 63.5] \times [0.5, 63.5]$ for large scenes, because the detectors exist only in these ranges). By approximating each edge as a Gaussian distribution with the same means and covariances on the continuous two-dimensional space, its position (the means), its orientation (the angle of the principal axis), its length (the full width at half maximum (FWHM) along the principal axis), and its width (the FWHM perpendicular to the principal axis) were calculated. The results are shown in Figs. 3-6 and Table 1. The scatter diagrams of the positions of edges in Fig. 3 show that they are uniformly distributed in both cases and there are no distinct differences. Figure 4 displays the histograms of the orientations of edges from 0 to π. It shows that edges with the horizontal (0 and π) and vertical (0.5π) orientations are dominant, as reported in [11], and that there are no distinct differences between small and large scenes.
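A rough sketch of this per-detector measurement step is given below; the Hilbert-transform envelope is omitted here for brevity (we take |w| directly before the fourth power), so it is only an approximation of the procedure above, and the naming is ours:

```python
import numpy as np

def edge_statistics(w):
    """Position, orientation, length and width of a 2-D edge detector w,
    following the Gaussian-fit procedure described in the text (the
    envelope step is replaced by a simple absolute value)."""
    p = np.abs(w) ** 4
    p = p / p.sum()                                  # probability mass
    ys, xs = np.mgrid[0:w.shape[0], 0:w.shape[1]]
    ys = ys + 0.5                                    # pixel centers, so a
    xs = xs + 0.5                                    # 12x12 grid spans [0.5, 11.5]
    mx, my = (p * xs).sum(), (p * ys).sum()          # position (means)
    cxx = (p * (xs - mx) ** 2).sum()
    cyy = (p * (ys - my) ** 2).sum()
    cxy = (p * (xs - mx) * (ys - my)).sum()
    C = np.array([[cxx, cxy], [cxy, cyy]])
    evals, evecs = np.linalg.eigh(C)                 # ascending eigenvalues
    angle = np.arctan2(evecs[1, 1], evecs[0, 1]) % np.pi  # principal axis
    fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0) * evals)  # FWHM = 2*sqrt(2 ln 2)*sigma
    length, width = fwhm[1], fwhm[0]
    return (mx, my), angle, length, width
```

The length to width ratio is then simply length/width, which is the "sharpness" statistic compared in Table 1.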
Table 1. Means and medians of the lengths and the length to width ratios for small and large scenes

                       small scenes   large scenes
  mean of lengths      1.84           2.15
  median of lengths    1.53           1.23
  mean of lw ratios    1.69           2.53
  median of lw ratios  1.55           1.70
Fig. 3. Scatter diagrams of the positions of edges over the two-dimensional spaces: (a) distribution for small scenes over [0.5, 11.5] × [0.5, 11.5]; (b) distribution for large scenes over [0.5, 63.5] × [0.5, 63.5]

Fig. 4. Histograms of the orientations of edges from 0 to π for (a) small scenes and (b) large scenes
On the contrary, Figs. 5 and 6 show that the statistical properties of edges for large scenes obviously differ from those for small ones. In Fig. 5, short edges of length 1-1.5 are much more dominant for large scenes than for small ones; yet the rate of long edges (over 3 in length) is also higher for large scenes than for small ones. Table 1 reflects this seemingly contradictory result: edges for large scenes are shorter in the median but longer in the mean than those for small ones. In Fig. 6, the length to width ratios for large scenes are greater than those for
small scenes. Table 1 also shows that both the mean and the median of the ratios for large scenes are greater than those for small scenes.

3.3 Discussion
The results show that there are significant differences in the distributions of the lengths and widths of edges. First, a few long edges were observed for large scenes in Fig. 5. This shows that large images include some intrinsic components of long edges and that the division into small images hides them. Second, it was also observed in Fig. 5 that the rate of short edges for large images is greater than that for small ones. This seemingly strange phenomenon may be caused by the approximation in LMICA: because LMICA optimizes only nearest-neighbor pairs, it is expected to be biased in favor of locally short edges. Third, Fig. 6 shows that the length to width ratios for large natural scenes are greater than those for small ones; in other words, the edges from large scenes are "sharper" than those from small ones. The utilization of large scenes without any division
Fig. 5. Histograms of the lengths of edges from 0 to 10 for (a) small scenes and (b) large scenes. Edges longer than 10 are counted in the rightmost bar.

Fig. 6. Histograms of the length to width ratios of edges from 1 to 11 for (a) small scenes and (b) large scenes. Edges beyond 11 in ratio are counted in the rightmost bar.
drastically weakens the effect of the constraint that edges cannot exist beyond the borders, which may be the reason why many sharper edges were generated.
4 Conclusion
In this paper, the method of LMICA with recursive MDS was described first. The method was then applied to two datasets of small and large natural scenes, and the generated edge detectors were compared with respect to several statistical properties. Consequently, it was observed in the experiment on large scenes that there are a few long edges and that many edges are sharper than those generated from small scenes. We are now planning additional experiments to verify the speculations in this paper. Besides, we plan to compare our results with the statistical properties observed in real brains in a similar way to [11]. In addition, we plan to apply this algorithm to movies, where a sequence of large natural scenes is given as a sample.

Acknowledgments. This work is supported by a Grant-in-Aid for Young Scientists (KAKENHI) 19700267.
References

1. Jutten, C., Herault, J.: Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing 24(1), 1–10 (1991)
2. Comon, P.: Independent component analysis - a new concept? Signal Processing 36, 287–314 (1994)
3. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
4. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
5. Cardoso, J.F., Laheld, B.: Equivariant adaptive source separation. IEEE Transactions on Signal Processing 44(12), 3017–3030 (1996)
6. Matsuda, Y., Yamaguchi, K.: Linear multilayer ICA generating hierarchical edge detectors. Neural Computation 19, 218–230 (2007)
7. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Computation 11(1), 157–192 (1999)
8. Matsuda, Y., Yamaguchi, K.: Linear multilayer ICA with recursive MDS (preprint, 2007)
9. Cox, T.F., Cox, M.A.A.: Multidimensional Scaling. Chapman & Hall, London (1994)
10. Yang, B.: Projection approximation subspace tracking. IEEE Transactions on Signal Processing 43(1), 95–107 (1995)
11. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B 265, 359–366 (1998)
Local Coordinates Alignment and Its Linearization

Tianhao Zhang¹,³, Xuelong Li², Dacheng Tao³, and Jie Yang¹

¹ Institute of IP & PR, Shanghai Jiao Tong Univ., P.R. China
² Sch. of Comp. Sci. and Info. Sys., Birkbeck, Univ. of London, U.K.
³ Dept. of Computing, Hong Kong Polytechnic Univ., Hong Kong
{z.tianhao,dacheng.tao}@gmail.com, [email protected], [email protected]
Abstract. Manifold learning has been demonstrated to be an effective way to discover the intrinsic geometrical structure of a number of samples. In this paper, a new manifold learning algorithm, Local Coordinates Alignment (LCA), is developed based on the alignment technique. LCA first obtains the local coordinates as representations of a local neighborhood by preserving the proximity relations on the patch, which is Euclidean; the extracted local coordinates are then aligned to yield the global embeddings. To solve the out-of-sample problem, the linearization of LCA (LLCA) is also proposed. Empirical studies on both synthetic data and face images show the effectiveness of LCA and LLCA in comparison with existing manifold learning algorithms and linear subspace methods.

Keywords: Manifold learning, Local Coordinates Alignment, dimensionality reduction.
1 Introduction

Manifold learning addresses the problem of discovering the intrinsic structure of a manifold from a number of samples. The generic problem of manifold learning can be described as follows. Consider a dataset $X$, which consists of $N$ samples $x_i$ ($1 \le i \le N$) in a high-dimensional Euclidean space $\mathbb{R}^m$, that is, $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$. The sample $x_i$ is obtained by embedding a sample $z_i$, which is drawn from $M^d$ (a low-dimensional nonlinear manifold), into $\mathbb{R}^m$ (a higher-dimensional Euclidean space, $d < m$), i.e., $\varphi : M^d \to \mathbb{R}^m$ and $x_i = \varphi(z_i)$. A manifold learning algorithm aims to find the corresponding low-dimensional representation $y_i$ of $x_i$ in a low-dimensional Euclidean space $\mathbb{R}^d$ so as to preserve the geometric properties of $M^d$. Recently, a number of manifold learning algorithms have been developed for pattern analysis, classification, clustering, and dimensionality reduction. Each algorithm detects a specific intrinsic, i.e., geometrical, structure of the underlying manifold of the raw data. Representative ones include ISOMAP [6], locally linear embedding (LLE) [7], Laplacian eigenmaps (LE) [1], local tangent space alignment (LTSA) [3], etc.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 643–652, 2008. © Springer-Verlag Berlin Heidelberg 2008
ISOMAP is a variant of MDS [13]. Unlike MDS, the distance measure between two samples is not the Euclidean distance but the geodesic distance, i.e., the shortest path from one sample to another, computed by Dijkstra's algorithm. ISOMAP is intrinsically a global approach, since it attempts to preserve the global geodesic distances of all pairs of samples. Different from ISOMAP, LLE, LE, and LTSA are all local methods, which preserve the geometrical structure of every local neighborhood. LLE uses the linear coefficients that reconstruct a given point from its neighbors to represent the local structure, and then seeks a low-dimensional embedding in which these coefficients remain suitable for reconstruction. LE is developed based on the Laplace-Beltrami operator from spectral graph theory [1,4]. This operator provides an optimal embedding for a manifold; the algorithm preserves proximity relationships by manipulations on an undirected weighted graph, which encodes the neighbor relations of the pairwise points. LTSA exploits the local tangent information as a representation of the local structure, and this local information is then aligned to give the global coordinates. Inspired by LTSA [3], we propose a new manifold learning algorithm, Local Coordinates Alignment (LCA). LCA obtains the local coordinates as representations of a local neighborhood by preserving the proximity relations on the patch. The extracted local coordinates are then aligned by the alignment trick [3] to yield the global embeddings. In LCA, the local representation and the global alignment are explicitly implemented as two steps for intrinsic structure discovery. In addition, to solve the out-of-sample problem, a linear approximation is applied to LCA, called Linear LCA or LLCA. Experimental results show that LCA discovers the manifold structure, and that LLCA outperforms representative subspace methods for face recognition.
The rest of the paper is organized as follows. Section 2 describes LCA, and its linearization is given in Section 3. Section 4 presents the empirical studies over synthetic data and face databases. Section 5 concludes.
2 LCA: Local Coordinates Alignment

The proposed algorithm is based on the assumption that each local neighborhood of the manifold $M^d$ is intrinsically Euclidean, i.e., it is homeomorphic to an open subset of the Euclidean space of dimension $d$. LCA extracts local coordinates by preserving the neighbor relationships of the raw data, and then obtains the optimal embeddings by aligning these local coordinates globally.

2.1 Local Representations
Given an arbitrary sample $x_i \in \mathbb{R}^m$ and its $k$ nearest neighbors $x_{i_1}, \ldots, x_{i_k}$, measured in terms of the Euclidean distance, let $X_i = [x_{i_0}, x_{i_1}, \ldots, x_{i_k}] \in \mathbb{R}^{m \times (k+1)}$ denote a neighborhood in the manifold $M^d$, where $x_{i_0} = x_i$. For $X_i$, we have the local map $\psi_i : \mathbb{R}^m \to \mathbb{R}^d$, i.e., $\psi_i : X_i \mapsto Y_i = [y_{i_0}, y_{i_1}, \ldots, y_{i_k}] \in \mathbb{R}^{d \times (k+1)}$. Here, $Y_i$ holds the local coordinates that parameterize the corresponding points in $X_i$. To yield faithful maps, we expect nearby points to remain nearby, that is, the point $y_{i_0} \in \mathbb{R}^d$ should be close to $y_{i_1}, \ldots, y_{i_k}$, i.e.,

$$\arg\min_{y_{i_0}} \sum_{j=1}^{k} \left\| y_{i_0} - y_{i_j} \right\|^2 . \qquad (1)$$

To capture the geometrical structure of the neighborhood, a weighting vector $w_i \in \mathbb{R}^k$ is introduced into (1):

$$\arg\min_{y_{i_0}} \sum_{j=1}^{k} \left\| y_{i_0} - y_{i_j} \right\|^2 (w_i)_j , \qquad (2)$$

where the $j$th element of $w_i$ is obtained by the heat kernel [4] with parameter $t \in \mathbb{R}$, i.e.,

$$(w_i)_j = \exp\left( - \left\| x_{i_0} - x_{i_j} \right\|^2 / t \right) . \qquad (3)$$

The vector $w_i$ is a natural measure of the distances between $x_{i_0}$ and its neighbors: the larger the value $(w_i)_j$ (i.e., the closer $x_{i_0}$ is to its $j$th neighbor), the more important the corresponding neighbor. Setting $t = \infty$ reduces Eq. (2) to Eq. (1). For further derivation, Eq. (2) is transformed to

$$\arg\min_{y_{i_0}} \mathrm{tr}\left( \begin{bmatrix} (y_{i_0} - y_{i_1})^{T} \\ \vdots \\ (y_{i_0} - y_{i_k})^{T} \end{bmatrix} \left[ y_{i_0} - y_{i_1}, \ldots, y_{i_0} - y_{i_k} \right] \mathrm{diag}(w_i) \right) , \qquad (4)$$

where $\mathrm{tr}(\cdot)$ denotes the trace operator and $\mathrm{diag}(\cdot)$ denotes the diagonal matrix whose diagonal entries are the corresponding components of the given vector. Let $M_i = [-e_k \; I_k]^{T} \in \mathbb{R}^{(k+1) \times k}$, where $e_k = [1, \ldots, 1]^{T} \in \mathbb{R}^{k}$ and $I_k$ is the $k \times k$ identity matrix. Then, Eq. (4) reduces to

$$\arg\min_{Y_i} \mathrm{tr}\left( (Y_i M_i)^{T} Y_i M_i \, \mathrm{diag}(w_i) \right) = \arg\min_{Y_i} \mathrm{tr}\left( Y_i M_i \, \mathrm{diag}(w_i) \, M_i^{T} Y_i^{T} \right) = \arg\min_{Y_i} \mathrm{tr}\left( Y_i L_i Y_i^{T} \right) , \qquad (5)$$

where $L_i = M_i \, \mathrm{diag}(w_i) \, M_i^{T}$ encapsulates the local geometric information. The matrix $L_i$ is equivalent to

$$L_i = \begin{bmatrix} -e_k^{T} \\ I_k \end{bmatrix} \mathrm{diag}(w_i) \left[ -e_k \; I_k \right] = \begin{bmatrix} \sum_{j=1}^{k} (w_i)_j & -w_i^{T} \\ -w_i & \mathrm{diag}(w_i) \end{bmatrix} . \qquad (6)$$
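Equations (3)-(6) translate directly into code. The sketch below (our own naming) builds $L_i$ for one neighborhood; the result can be checked against the block form in Eq. (6):

```python
import numpy as np

def local_patch_matrix(Xi, t=5.0):
    """L_i = M_i diag(w_i) M_i^T for one neighborhood Xi whose columns are
    [x_{i0}, x_{i1}, ..., x_{ik}], with heat-kernel weights from Eq. (3)."""
    k = Xi.shape[1] - 1
    d2 = ((Xi[:, 1:] - Xi[:, [0]]) ** 2).sum(axis=0)  # ||x_{i0} - x_{ij}||^2
    w = np.exp(-d2 / t)                               # Eq. (3)
    M = np.vstack([-np.ones((1, k)), np.eye(k)])      # M_i in R^{(k+1) x k}
    return M @ np.diag(w) @ M.T                       # Eqs. (5)-(6)
```

By construction the result has the block structure of Eq. (6): sum of the weights in the top-left corner, minus the weight vector on the first row and column, and diag(w_i) in the lower-right block.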
Therefore, according to (5), the local coordinates $Y_i$ for the $i$th neighborhood can be obtained.

2.2 Global Alignment
According to Section 2.1, the local coordinates are obtained for all local neighborhoods respectively, namely $Y_1, Y_2, \ldots, Y_N$. In this section, these local coordinates are aligned to yield the global ones. Let $Y = [y_1, \ldots, y_N]$ denote the embedding coordinates, which are faithful representations of $X = [x_1, \ldots, x_N]$ sampled from the manifold. Define the selection matrix $S_i \in \mathbb{R}^{N \times (k+1)}$:

$$(S_i)_{pq} = \begin{cases} 1 & \text{if } p = (h_i)_q \\ 0 & \text{otherwise,} \end{cases} \qquad (7)$$

where

$$h_i = [i_0, i_1, \ldots, i_k] \in \mathbb{R}^{k+1} \qquad (8)$$

denotes the set of indices for the $i$th point and its neighbors. Then we can express the local coordinates $Y_i$ as a selective combination of the embedding coordinates $Y$ by using the selection matrix $S_i$:

$$Y_i = Y S_i . \qquad (9)$$

Now, Eq. (5) can be written as

$$\arg\min_{Y} \mathrm{tr}\left( Y S_i L_i S_i^{T} Y^{T} \right) . \qquad (10)$$

For all samples, we have the global optimization

$$\arg\min_{Y} \sum_{i=1}^{N} \mathrm{tr}\left( Y S_i L_i S_i^{T} Y^{T} \right) = \arg\min_{Y} \mathrm{tr}\left( Y \sum_{i=1}^{N} S_i L_i S_i^{T} \, Y^{T} \right) = \arg\min_{Y} \mathrm{tr}\left( Y L Y^{T} \right) , \qquad (11)$$

where $L = \sum_{i=1}^{N} S_i L_i S_i^{T}$. The matrix $L$ is termed the alignment matrix [3]; it is symmetric, positive semi-definite, and sparse. We can form $L$ by the iterative procedure

$$L(h_i, h_i) \leftarrow L(h_i, h_i) + L_i , \qquad (12)$$

for $1 \le i \le N$, with the initialization $L = 0$.
To uniquely determine $Y$, we impose the constraint $Y Y^{T} = I_d$, where $I_d$ is the $d \times d$ identity matrix. Now, the objective function is

$$\arg\min_{Y} \mathrm{tr}\left( Y L Y^{T} \right) \quad \text{s.t.} \quad Y Y^{T} = I_d . \qquad (13)$$

The above minimization problem can be converted into the eigenvalue decomposition problem

$$L f = \lambda f . \qquad (14)$$

The column vectors $f_1, f_2, \ldots, f_d \in \mathbb{R}^{N}$ are the eigenvectors associated with the ordered nonzero eigenvalues $\lambda_1 < \lambda_2 < \cdots < \lambda_d$. Therefore, we obtain the embedding coordinates as

$$Y = \left[ f_1, f_2, \ldots, f_d \right]^{T} . \qquad (15)$$
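Putting Eqs. (12)-(15) together, the alignment and embedding steps can be sketched as follows (the function name is ours, and an exact dense eigensolver stands in for a sparse one):

```python
import numpy as np

def align_and_embed(N, neighborhoods, local_Ls, d):
    """Accumulate the alignment matrix L by Eq. (12), then return the
    d-dimensional embedding Y built from the eigenvectors of L with the
    smallest nonzero eigenvalues (Eqs. (13)-(15)).
    `neighborhoods[i]` is the index vector h_i; `local_Ls[i]` is L_i."""
    L = np.zeros((N, N))
    for h, Li in zip(neighborhoods, local_Ls):
        L[np.ix_(h, h)] += Li              # Eq. (12)
    evals, evecs = np.linalg.eigh(L)       # ascending eigenvalues
    nonzero = np.where(evals > 1e-8)[0][:d]
    return evecs[:, nonzero].T             # Y in R^{d x N}, Eq. (15)
```

Because the eigenvectors returned by a symmetric eigensolver are orthonormal, the constraint $Y Y^{T} = I_d$ of Eq. (13) is satisfied automatically.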
3 LLCA: Linear Local Coordinates Alignment

LCA aims to detect the intrinsic features of nonlinear manifolds, but it fails on the out-of-sample problem. To solve this problem, the linear approximation of LCA, Linear LCA (LLCA), is developed in this section. LLCA finds the transformation matrix $A \in \mathbb{R}^{m \times d}$ that maps the dataset $X$ to the dataset $Y$, i.e., $Y = A^{T} X$. Thus, the linearized objective function (11) is

$$\arg\min_{A} \mathrm{tr}\left( A^{T} X L X^{T} A \right) . \qquad (16)$$

Imposing the column-orthonormality constraint on $A$, i.e., $A^{T} A = I_d$, we have

$$\arg\min_{A} \mathrm{tr}\left( A^{T} X L X^{T} A \right) \quad \text{s.t.} \quad A^{T} A = I_d . \qquad (17)$$

The solution of (17) is given by the eigenvalue decomposition

$$X L X^{T} \alpha = \lambda \alpha . \qquad (18)$$

The transformation matrix $A$ is

$$A = \left[ \alpha_1, \alpha_2, \ldots, \alpha_d \right] , \qquad (19)$$

where $\alpha_1, \alpha_2, \ldots, \alpha_d$ are the eigenvectors of $X L X^{T}$ associated with the first $d$ smallest eigenvalues. LLCA can be implemented in either an unsupervised or a supervised mode. In the latter case, the label information is utilized, i.e., each neighborhood is replaced by the points that share the same class label. For a given point $x_i$ and its same-class
points $x_{i_1}, \ldots, x_{i_{n_i - 1}}$, where $n_i$ denotes the number of samples in this class, one can construct the vector $w_i \in \mathbb{R}^{n_i - 1}$ as follows:

$$(w_i)_j = \exp\left( - \left\| x_{i_0} - x_{i_j} \right\|^2 / t \right) , \quad j = 1, \ldots, n_i - 1 . \qquad (20)$$

Furthermore, to denote the new index set for the $i$th point and its same-class points, one needs to reset

$$h_i = \left[ i_0, i_1, \ldots, i_{n_i - 1} \right] \in \mathbb{R}^{n_i} . \qquad (21)$$
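The LLCA solution of Eqs. (16)-(19) is a single symmetric eigenproblem; a minimal sketch (our own naming), which also maps the data through the learned projection:

```python
import numpy as np

def llca_projection(X, L, d):
    """Solve X L X^T alpha = lambda alpha (Eq. (18)) and collect the d
    eigenvectors with the smallest eigenvalues into A (Eq. (19)); any
    samples, including out-of-sample ones, are then mapped by A^T X."""
    S = X @ L @ X.T
    S = (S + S.T) / 2.0                  # guard against round-off asymmetry
    evals, evecs = np.linalg.eigh(S)     # ascending order
    A = evecs[:, :d]                     # d smallest eigenvalues, Eq. (19)
    return A, A.T @ X
```

Because `eigh` returns orthonormal eigenvectors, the constraint $A^{T} A = I_d$ of Eq. (17) holds by construction, and new test samples can be projected with the same $A$.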
4 Experiments

In this section, several tests are performed to evaluate LCA and LLCA, respectively.
Fig. 1. Synthetic Data: the left sub-figure is the S-curve dataset and the right is the Punctured Sphere dataset
4.1 Non-linear Dimensionality Reduction Using LCA

We employ two synthetic datasets, shown in Fig. 1, which are randomly sampled from the S-curve and the Punctured Sphere [5]. LCA is compared with PCA, ISOMAP, LLE, LE, and LTSA. For LCA, the reduced dimension is 2, the number of neighbors is 15, and the parameter t in the heat kernel is 5. Figs. 2 and 3 illustrate the experimental results: PCA, which only sees the global Euclidean structure, fails to detect the underlying structure of the raw data, while LCA unfolds the non-linear manifolds as well as ISOMAP, LLE, LE, and LTSA do.

4.2 Face Recognition Using LLCA
Since face images, parameterized by continuous variables such as pose, illumination, and expression, often lie on an intrinsically low-dimensional submanifold [2,12], LLCA is implemented for effective face manifold learning and recognition. Our face recognition experiments consist of three steps. First, LLCA is conducted on the training face images to learn the transformation matrix. Second, each test face image is mapped into a low-dimensional subspace via the transformation matrix. Finally, the test images are classified by the Nearest Neighbor classifier with the Euclidean measure.
Fig. 2. Embeddings of the S-curve dataset (panels: PCA, ISOMAP, LLE, LE, LTSA, LCA)

Fig. 3. Embeddings of the Punctured Sphere dataset (panels: PCA, ISOMAP, LLE, LE, LTSA, LCA)
We compare LLCA with PCA [8], LDA [9], and LPP [14] over the publicly available ORL [10] and YALE [11] databases. PCA and LDA are two of the most popular traditional dimensionality reduction methods, while LPP is a newly proposed manifold learning method. Here, LPP is used in its supervised version, introduced in [15] as "LPP1". LLCA is also implemented in the supervised mode, and the parameter t in the heat kernel is +∞. For all experiments, images are cropped based on the centers of the eyes, and the cropped images are normalized to 40 × 40 pixel arrays with 256 gray levels per pixel.

4.2.1 ORL

The ORL database [10] contains 400 images of 40 individuals, including variations in facial expression and pose. Fig. 4 illustrates a sample subject of ORL along with all its
Fig. 4. Sample face images from ORL
Fig. 5. Recognition rate vs. dimensionality reduction on ORL: the left sub-figure is achieved by selecting 3 images per person for training and the right sub-figure by selecting 5 images per person for training

Table 1. Best recognition rates (%) of four algorithms on ORL
Method    3 Train        5 Train
PCA       79.11 (113)    88.15 (195)
LDA       87.20 (39)     94.68 (39)
LPP       88.09 (41)     94.77 (48)
LLCA      91.48 (70)     97.42 (130)
10 views. For each person, p (= 3, 5) images are randomly selected for training and the rest are used for testing. For each given p, we average the results over 20 random splits. Fig. 5 plots the average recognition rates versus subspace dimension. The best average results and the corresponding reduced dimensions are listed in Table 1. As can be seen, LLCA outperforms the other algorithms in this experiment, and is better able to discover the intrinsic structure of the raw face images.

4.2.2 YALE

The YALE database [11] contains 15 subjects, each with 11 samples of varying facial expression and illumination. Fig. 6 shows the sample images of one individual. Following the strategy adopted on ORL, p (= 3, 5) images per person are randomly selected for training and the rest are used for testing. All tests are repeated over 20 independent random splits, and the average recognition results are then calculated. The recognition results are shown in Fig. 7 and Table 2, and support a similar conclusion.
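The evaluation protocol (p randomly chosen training images per person, accuracy averaged over repeated splits) can be sketched as follows; the function name and the `classify` callback are illustrative placeholders, not the authors' code:

```python
import numpy as np

def average_accuracy(X, y, classify, p=3, runs=20, seed=0):
    """Average recognition rate over `runs` random splits with p training
    images per person. `classify` is any function
    (train_X, train_y, test_X) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(runs):
        train_idx, test_idx = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            train_idx.extend(idx[:p])     # p random images per person for training
            test_idx.extend(idx[p:])      # the rest for testing
        pred = classify(X[train_idx], y[train_idx], X[test_idx])
        rates.append(float(np.mean(pred == y[test_idx])))
    return float(np.mean(rates))
```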
Fig. 6. Sample face images from YALE
Fig. 7. Recognition rate vs. dimensionality reduction on YALE: the left sub-figure is achieved by selecting 3 images per person for training and the right sub-figure is achieved by selecting 5 images per person for training

Table 2. Best recognition rates (%) of four algorithms on YALE
Method    3 Train       5 Train
PCA       50.71 (44)    58.44 (74)
LDA       61.42 (14)    74.44 (14)
LPP       66.79 (15)    75.94 (19)
LLCA      67.50 (15)    78.22 (33)
5 Conclusions

This paper presents a new manifold learning algorithm called Local Coordinates Alignment (LCA). It first expresses the local coordinates as parameterizations of each local neighborhood, and then achieves global optimization by performing an eigenvalue decomposition on the alignment matrix, which is obtained by an iterative procedure. Meanwhile, a linearization of LCA (LLCA) is also proposed to solve the out-of-sample problem. Experiments on both synthetic datasets and real face datasets have shown the effectiveness of LCA and LLCA.

Acknowledgments. The authors would like to thank the anonymous reviewers for their constructive comments on the first version of this paper. The research was supported by the Internal Competitive Research Grants of the Department of Computing, The Hong Kong Polytechnic University (project number A-PH42), the National Science Foundation of China (No. 60675023) and the China 863 High Tech. Plan (No. 2007AA01Z164).
References

1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems (NIPS), pp. 585–591. MIT Press, Cambridge (2001)
2. Saul, L.K., Weinberger, K.Q., Ham, J.H., Sha, F., Lee, D.D.: Spectral methods for dimensionality reduction. In: Chapelle, O., Schoelkopf, B., Zien, A. (eds.) Semi-supervised Learning. MIT Press, Cambridge (to appear)
3. Zhang, Z.Y., Zha, H.Y.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal of Scientific Computing 26(1), 313–338 (2004)
4. Rosenberg, S.: The Laplacian on a Riemannian Manifold. Cambridge University Press, Cambridge (1997)
5. Lafon, S.: Diffusion Maps and Geometric Harmonics. Ph.D. dissertation, Yale University (2004)
6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
7. Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of nonlinear manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
8. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
9. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 711–720 (1997)
10. Available at http://www.uk.research.att.com/facedatabase.html
11. Available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html
12. Shakhnarovich, G., Moghaddam, B.: Face recognition in subspaces. In: Li, S.Z., Jain, A.K. (eds.) Handbook of Face Recognition. Springer, Heidelberg (2004)
13. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Inc., Chichester (2001)
14. He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems (NIPS) (2003)
15. Cai, D., He, X., Han, J.: Using graph model for face analysis. Department of Computer Science Technical Report No. 2636 (UIUCDCS-R-2005-2636), University of Illinois at Urbana-Champaign (September 2005)
Walking Appearance Manifolds without Falling Off

Nils Einecke1, Julian Eggert2, Sven Hellbach1, and Edgar Körner2

1 Technical University of Ilmenau, Department of Neuroinformatics and Cognitive Robotics, 98684 Ilmenau, Germany
2 Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
Abstract. Having a good description of an object’s appearance is crucial for good object tracking. However, modeling the whole appearance of an object is difficult because of the high dimensional and nonlinear character of the appearance. To tackle the first problem we apply nonlinear dimensionality reduction approaches on multiple views of an object in order to extract the appearance manifold of the object and to embed it into a lower dimensional space. The change of the appearance of the object over time then corresponds to a walk on the manifold, with view prediction reducing to a prediction of the next step on the manifold. An inherent problem here is to constrain the prediction to the embedded manifold. In this paper, we show an approach towards solving this problem by applying a special mapping which guarantees that low dimensional points are mapped only to high dimensional points lying on the appearance manifold.
1 Introduction
One focus of current research in computer vision is finding ways to represent the appearance of objects. Attempts at fully modeling an object's 3D shape have turned out to be impractical: they are computationally intensive, and learning or generating appropriate models is laborious. According to the viewer-centered theory [1,2,3], the human brain stores multiple views of an object in order to be able to recognize the object from various view points. For example, in [4] an approach is introduced that uses multiple views of objects to model their appearance: the desired object is tracked, and at each time step the pose of the object is estimated and a view is inserted into the appearance model if it holds new or better information. Unfortunately, this is very time consuming, as the approach works directly with the high dimensional views. Actually, the different views of an object are samples of the object's appearance manifold. This manifold is a nonlinear subspace of the space of all possible appearances (the appearance space) in which all the views of this particular object are located. In general, the appearance manifold has a much lower dimensionality than the appearance space it is embedded in. Non-Linear Dimensionality Reduction (NLDR) algorithms, like Locally Linear Embedding (LLE)

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 653–662, 2008.
© Springer-Verlag Berlin Heidelberg 2008
[5], Isometric Feature Mapping (Isomap) [6] or Local Tangent Space Alignment (LTSA) [7], can embed a manifold into a lower dimensional space (the embedding space) by means of a sufficient number of samples of the manifold. Elgammal and Lee [8] use embedded appearance manifolds for 3D body pose estimation of humans, based on silhouettes of persons and LLE. Pose estimation is realized via an RBF (Radial Basis Function) motivated mapping from the visual input to the embedded manifold, and from there to the pose space. Note that they model the embedded manifold with cubic splines in order to be able to project points mapped into the embedding space onto the manifold. Lim et al. [9] follow a similar approach, but in contrast to Elgammal and Lee they use the model of the embedded manifold to predict the next appearance. Both approaches are limited to one-dimensional manifolds, as the views were sampled during motion sequences and the modeling of the manifold is based on the available time information. In [10], Liu et al. use an aligned mixture of linear subspace models to generate the embedding of the appearance manifold, which does not depend on additional time information. Using a Dynamic Bayesian Network they infer the next position in the embedding space and, based on this, the position and scale parameters. The approach of Liu et al. is able to handle manifolds with more than one dimension, but the prediction process is not constrained to the structure of the manifold. This constraint, however, is very important for predictions over a larger time span: without it, the prediction tends to leave the manifold, leading to awkward views when projected back to the appearance space, or to wrong pose parameter estimates. Unfortunately, imposing this constraint is quite difficult because of the highly nonlinear shape of the manifold. In the work presented here, we do not attempt to tackle this problem directly.
Instead, we just use a simple unconstrained linear predictor in the low dimensional embedding space and leave the work of imposing the manifold constraint to the mapping procedure between the low dimensional embedding space and the high dimensional appearance space. The rest of this paper is organized as follows. In Sect. 2 we describe the objects used to investigate our approach and discuss the shape of appearance manifolds of rigid objects in the light of our way of sampling views. Section 3 introduces our approach for mapping between the spaces, which guarantees that points are mapped only onto the manifold and its embedding. Sect. 4 then outlines the workflow of our view prediction approach. In Sect. 5 we describe the experiments conducted to analyze our view prediction approach and present the results. Finally, Sect. 6 summarizes this paper and outlines future work.
2 Appearance Manifolds and Embedding
All possible views of an object together form the so-called appearance manifold. By embedding such a manifold in a low dimensional space one gets a low dimensional equivalent of this manifold. If one is able to correctly map between the spaces, one can work efficiently in the low dimensional space and project the
Fig. 1. A trajectory on a two-dimensional band-like manifold. On the left we see the actual manifold and on the right its two-dimensional embedding.
results to the high dimensional space. For example, a series of views appears as a trajectory on the appearance manifold. Mapping such a trajectory into the low dimensional space eases the processing, as the trajectory's dimensionality and nonlinearity are reduced. Figure 1 shows a simple band-like manifold residing in three-dimensional space and its embedding in two-dimensional space. We used the POV-Ray2 tool for generating views of virtual objects3 (see Fig. 2). This way we are able to verify our approach under ideal conditions and, for now, we do not have to deal with problems like segmenting the object from the background. A view of an object is described mainly by the orientation parameters of the object. These could, for example, comprise scaling, rotation about the three axes of three-dimensional space, deformation and translation. However, we concentrate only on rotation here. While tracking deformation would considerably blow up the complexity of the problem, scaling can be handled by a resolution pyramid. Furthermore, it makes sense to use centered views, since centering can be handled by a preprocessing step, like a translational tracker. So we are left with a three-dimensional parameter space spanned by the three rotation angles. However, sampling views over all three angles is not feasible, as this would lead to too large a number of views. Therefore we decided to sample views by varying only two axes. Unfortunately, experiments have shown that, in general, views sampled by varying two axes cannot be embedded without self-intersections in a low dimensional (three-dimensional) space. Hence we restricted the full 360° rotation to a single axis. We sampled views every 5° while rotating the object 360° about its vertical axis (y-axis) and tilting it from −45° to +45° about its horizontal axis (x-axis). Each 360° rotation by itself leads to a cyclic trajectory in the appearance space.
As these trajectories are neighboring, all views together form a cylindric appearance manifold. This can be seen exemplarily in the embeddings of the views of the bee and the bird in Fig. 3. For embedding the appearance manifold into a low dimensional space we use the Isomap approach, because comparisons of LLE [5], LTSA [7] and Isomap [6] have shown that Isomap is most appropriate for this purpose.

2 POV-Ray is a freely available tool for rendering 3D scenes.
3 The objects we used are templates from http://objects.povworld.org
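The sampling scheme above can be written out directly; the grid below simply follows the stated 5° spacing (72 yaw angles times 19 tilt angles):

```python
import numpy as np

# Sampling scheme described above: a full 360-degree rotation about the
# vertical axis in 5-degree steps (72 views per ring), for tilts from -45
# to +45 degrees about the horizontal axis (19 rings of views).
yaw = np.arange(0, 360, 5)        # 72 rotation angles about the y-axis
tilt = np.arange(-45, 46, 5)      # 19 tilt angles about the x-axis
views = [(ry, rx) for rx in tilt for ry in yaw]
n_views = len(views)              # 72 * 19 = 1368 sampled views per object
```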
Fig. 2. The objects used for analyzing our view prediction approach
Fig. 3. Three-dimensional embeddings of the views of the bee (left) and the bird (right) generated with Isomap. Views were sampled in an area of 360◦ vertical and from −45◦ to +45◦ horizontal. The colors encode the rotation angle about the vertical axis from blue 0◦ to red 360◦ . As each full rotation about the vertical axis exhibits as a cyclic trajectory in the appearance space and since all cyclic trajectories are neighboring, the embedding of the views leads to a cylinder-like structure (appearance manifold).
3 Mapping between the Spaces
We prefer not to predict views directly in the high dimensional appearance space but on the low dimensional embedding of the appearance manifold. Two problems arise: first, most NLDR algorithms do not yield a function for mapping between appearance space and embedding space; second, it is difficult to ensure that the prediction does not leave the manifold. In order to ensure that the prediction is done only along the manifold, one has to constrain the prediction with the nonlinear shape of the manifold. This, however, is very problematic because appearance manifolds often exhibit highly nonlinear and wavy shapes. Take for example a simple linear prediction. Such a prediction is quite likely to predict positions that do not lie on the manifold, as can be seen in Fig. 4 a). Leaving the manifold in the low dimensional space also means leaving the appearance manifold, i.e. for a point in the low dimensional space which is not lying on the embedded manifold there is simply no valid corresponding view of the object. Usual interpolation methods cannot handle this problem: they just try to find an appropriate counterpart, but in doing so they are not directly constrained to the appearance manifold. This means that the views they map such points to are not valid views of the object and often show heavy distortions.
Fig. 4. These two figures show a subsection of a one-dimensional cyclic manifold. a) A linear prediction using the last two positions (light blue) on the manifold leads to a point (red) not belonging to the manifold. Reconstructing this point by convex combination of its nearest neighbors (orange) projects it back to the manifold. b) Reconstruction using the LLE idea does not ensure positive weights. However, iterative repetition of the reconstruction (yellow-to-green points) makes the weights converge to positive values. The reconstruction weights after 4 iterations are displayed.
A possible way out of this dilemma is the reconstruction idea upon which LLE [11] is based. What we want to do is to map between two structures, where one is a manifold in a high dimensional space and the other its embedding in a low dimensional space. By assuming local linearity (which is a fundamental assumption of most NLDR algorithms anyway), it is possible to calculate reconstruction weights for a point on one of these structures that account for both spaces, i.e. one can calculate the reconstruction weights for a point in either of the two spaces and, by means of these, reconstruct the counterpart of this point in the other space. The weights in the appearance space are calculated by minimizing the following energy function

    E(w_i) = || x_i − Σ_{j∈N_i} w_i^j · x_j ||²   with   Σ_{j∈N_i} w_i^j = 1,          (1)

where x_i is the D-dimensional point in the appearance space to reconstruct, N_i = {j | x_j is a k-nearest neighbor of x_i}, and w_i is the vector holding the reconstruction weights. After the weights are determined, the counterpart y_i of x_i in the embedding space can be calculated by

    y_i = Σ_{j∈N_i} w_i^j · y_j,          (2)

with the y_j being the d-dimensional embedding counterparts of the x_j. Naturally d < D, and in general d ≪ D. Reconstructing an x_i from a y_i works in an analogous way. The neighbors N_i of a data point are chosen only among those data points whose mapping is known, namely the data points that were used for the nonlinear dimensionality reduction.
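The constrained minimization in eq. (1) and the mapping in eq. (2) can be sketched as follows, using the standard LLE reduction to a linear system on the local Gram matrix; the small regularizer is an assumption added for numerical stability, and function names are illustrative:

```python
import numpy as np

def reconstruction_weights(x, nbrs, reg=1e-9):
    """Eq. (1): minimize ||x - sum_j w_j x_j||^2 subject to sum_j w_j = 1.

    As in LLE, this reduces to solving a linear system on the local Gram
    matrix G_jk = (x - x_j)·(x - x_k); `reg` regularizes near-singular G.
    (Illustrative sketch, not the authors' exact implementation.)
    """
    Z = x - nbrs                                  # shift neighbors to the point
    G = Z @ Z.T
    G = G + reg * (np.trace(G) + 1.0) * np.eye(len(nbrs))
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()                            # enforce the sum-to-one constraint

def map_to_embedding(x, nbr_X, nbr_Y):
    """Eq. (2): reuse the appearance-space weights in the embedding space."""
    return reconstruction_weights(x, nbr_X) @ nbr_Y
```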
If one demands that the reconstruction weights be larger than zero and sum up to one, then the reconstructed points always lie on the manifold. The reason is that this corresponds to a convex combination, whose result is constrained to lie in the convex hull of the support points. Together with the local linearity assumption, this leads to reconstruction results where the reconstructed points always lie on the manifold. So even if a point beyond the manifold is predicted, the mapping by reconstruction ensures that only valid views of the object are generated, because it inherently projects the point onto the manifold. This can be seen in Fig. 4 a). In [11] it has been shown that the energy function (1) can be rewritten as a system of linear equations. This makes it possible to calculate the weights directly using matrix operations. Although the calculated weights are constrained to sum up to one, they are not constrained to be positive. This is a problem, as it violates the convex combination criterion, so it is not ensured that a reconstructed point lies on the manifold. However, iteratively repeating the reconstruction, i.e. reconstructing the reconstructed point, projects the reconstructed point onto the manifold. During this process the weights converge to positive values. Figure 4 b) depicts an example.
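The iterative repetition can be demonstrated on a toy 1-D manifold, a coarsely sampled unit circle with k = 2 neighbors; function names and parameters are hypothetical, chosen only to illustrate how an off-manifold point is pulled onto the manifold while the weights settle to positive values:

```python
import numpy as np

def sum_to_one_weights(p, nbrs, reg=1e-9):
    # Constrained least-squares weights via the local Gram matrix (cf. eq. (1)).
    Z = p - nbrs
    G = Z @ Z.T
    G = G + reg * (np.trace(G) + 1.0) * np.eye(len(nbrs))
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()

def iterative_projection(y, Y, k=2, iters=5):
    """Repeatedly re-reconstruct y from its k nearest stored embedding points.

    On a 1-D manifold (k = 2 neighbors) this pulls an off-manifold point onto
    the polygon through the samples, and the weights converge to positive
    values (a convex combination), as described in the text above.
    """
    for _ in range(iters):
        nbrs = np.argsort(np.linalg.norm(Y - y, axis=1))[:k]
        w = sum_to_one_weights(y, Y[nbrs])
        y = w @ Y[nbrs]
    return y, w
```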
4 View Prediction
Embedding a set of views of an object into a low dimensional space leads to tuples (x_i, y_i) of views x_i in the appearance space and their low dimensional counterparts y_i. With this representation of the object's appearance, the process of view prediction is as follows:

1) At each time step t the current view x_t of the object is provided, e.g. from a tracking or a detection stage.
2) Determine the k nearest neighbors among the represented views:
   N_t = {i | x_i is a k-nearest neighbor of x_t}.
3) Calculate the reconstruction weights w_t in the appearance space:
   ŵ_t = argmin_{w_t} || x_t − Σ_{i∈N_t} w_t^i · x_i ||²,  subject to  Σ_{i∈N_t} w_t^i = 1.
4) Calculate the mapping to the embedding space by reconstructing the low dimensional counterpart of view x_t:
   y_t = Σ_{i∈N_t} ŵ_t^i · y_i.
5) Predict the next position in the low dimensional embedding space, e.g. using the last two views:
   pred(y_{t−1}, y_t) → y_{t+1}^pred.
6) Determine the reconstruction weights w_a in the embedding space by iterative reconstruction. Set y_a = y_{t+1}^pred and repeat the following steps:
   (i)   N_a = {i | y_i is a k-nearest neighbor of y_a}
   (ii)  ŵ_a = argmin_{w_a} || y_a − Σ_{i∈N_a} w_a^i · y_i ||²,  subject to  Σ_{i∈N_a} w_a^i = 1
   (iii) y_a = Σ_{i∈N_a} ŵ_a^i · y_i
7) Map back to the appearance space:
   x_{t+1}^pred = Σ_{i∈N_a} ŵ_a^i · x_i.

As explained in the last section, the iterative reconstruction assures that only valid object views are generated. We denote this procedure embedding view prediction.
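Steps 1 to 7 can be put together into a single sketch. This is an illustrative reading of the procedure (helper names, k, the linear predictor, and the number of iterations are assumptions), not the authors' code:

```python
import numpy as np

def _weights(p, nbrs, reg=1e-9):
    # Sum-to-one reconstruction weights via the local Gram matrix (eq. (1)).
    Z = p - nbrs
    G = Z @ Z.T
    G = G + reg * (np.trace(G) + 1.0) * np.eye(len(nbrs))
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()

def predict_view(X, Y, x_prev, x_t, k=4, iters=4):
    """Embedding view prediction: map the last two views into the embedding
    (steps 1-4), extrapolate linearly (step 5), project onto the embedded
    manifold by iterative reconstruction (step 6), and reconstruct the
    predicted view in the appearance space (step 7)."""
    def to_embedding(x):
        nbrs = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        return _weights(x, X[nbrs]) @ Y[nbrs]
    y_prev, y_t = to_embedding(x_prev), to_embedding(x_t)
    y = y_t + (y_t - y_prev)                   # step 5: linear prediction
    for _ in range(iters):                     # step 6: iterative reconstruction
        nbrs = np.argsort(np.linalg.norm(Y - y, axis=1))[:k]
        w = _weights(y, Y[nbrs])
        y = w @ Y[nbrs]
    return w @ X[nbrs]                         # step 7: back to appearance space
```

Used on a toy cyclic manifold (views sampled along a circle), two consecutive views yield a prediction close to the next sampled view.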
5 Experiments
In order to analyze the embedding view predictor, we conducted experiments comparing it with two view predictors working directly in the high dimensional appearance space. The first predictor linearly predicts the next view directly in the appearance space from the last two views. In general, this predicted view will lie beyond the manifold of the views. To be comparable to the embedding view predictor, the nearest neighbor of the linearly predicted view is determined and returned as the actual predicted view. We denote this view predictor the nearest neighbor view predictor. The second view predictor works like our embedding view predictor, but in contrast it operates directly in the high dimensional appearance space: it linearly predicts views in the appearance space and projects the predicted views onto the appearance manifold using the iterative reconstruction idea. We denote this view predictor the iterative reconstruction view predictor. To validate our view prediction we generated two trajectories in the appearance space for each object. The trajectories are depicted exemplarily with the views of the bird in Fig. 5. It can be seen that the views of the trajectories do not correspond to views already represented in the set of sampled views introduced in Sect. 2. The tests we conducted assessed only the prediction ability of the embedding view predictor compared to the other two view predictors. The view predictors had to predict the views along the discussed trajectories; each view is predicted from its two predecessors in the trajectory. The predicted views are compared with the actual next views by means of the sum of absolute differences. Figure 6 shows the prediction error of the three view predictors applied to the two trajectories rotate and whirl (see Fig. 5) of the bee and the bird.
It can be observed that the prediction in the low dimensional space is comparable to that of the predictors operating directly in the high dimensional appearance space. In general, the embedding view predictor is even slightly better. Sometimes, however, it predicts views with a large error, which appear as single high peaks in the error curve. A closer look revealed that this may be due to topological defects of the embedded appearance manifolds. The strong peaks occur more often when predicting the bird than the bee, and indeed the embedding of the bird's appearance manifold is more distorted than that of the bee (see Fig. 3).
Fig. 5. From left to right, the views in the two rows show the two variants of trajectories the view predictors are tested with. The upper is a simple rotation about the vertical axis. The lower starts at 320° horizontal and 2.5° vertical, goes straight to 40° horizontal and 360° vertical, and consists of 72 equally distributed views. To distinguish between these two trajectories, the first is called "rotate" and the second "whirl". The degrees in the top left corners of the images denote the horizontal rotation, and those in the top right corners the vertical rotation.
Fig. 6. This figure displays the prediction error of the embedding view predictor, the nearest neighbor view predictor and the iterative reconstruction view predictor for the two trajectories rotate and whirl of the bird and the bee. The error is the sum of absolute differences between the predicted and the actual view. Almost all of the time, the embedding view predictor is superior to the other two.
Furthermore, we analyzed the three predictors concerning their ability to predict further views without being updated with actual views, i.e. we simulated an occlusion of the objects. To this end, the three view predictors were again applied to the whirl and rotate trajectories, but this time they had to rely solely on their own predictions from the 10th time step on. The results are shown in Fig. 7. Strikingly, the embedding view predictor is able to reliably predict up to 10 further views, while the other two predictors are only able to predict 2-3 further views. A possible explanation could be the higher ambiguity in the high
Fig. 7. This figure shows the prediction error of the three view predictors applied to the two trajectories rotate and whirl of the bird and the bee. From the 10th time step (view) on the objects are considered to be completely occluded. This means that the predictors have to rely entirely on their own prediction. It can be observed that the embedding predictor can reliably predict up to 10 further views while the other two predictors cannot predict more than two to three views.
dimensional appearance space. This is a hint that predicting on the embedding of the appearance manifold in a low dimensional space is more appropriate for tracking the appearance of objects than predicting directly in the high dimensional appearance space.
6 Conclusion
We introduced an approach for predicting views of an object by means of its appearance manifold. By applying Isomap to the various views of an object the appearance manifold of that object can be extracted and embedded into a lower dimensional space. A change of object appearance corresponds to a trajectory on the appearance manifold as well as its embedding. By keeping track of the position of the object on the embedded manifold it is possible to forecast the upcoming appearance. We used an iterative version of the reconstruction idea of LLE in order to map points from the embedding space back into the appearance space and showed that this maps points from the embedding space only to points on the appearance manifold, i.e. only valid views of the object are predicted. Simulations have shown that following the trajectory (and by doing so predicting views) is less error prone using the embedded manifold than its high dimensional equivalent. Furthermore, we have shown that predicting the appearance for several following time steps is also more accurate using the low dimensional embedding. We want to stress that the introduced approach is
not a full-fledged object tracking system but rather a scheme for predicting complex views. In future work we want to investigate the possibility of using the simplex method for calculating the reconstruction weights, as it implicitly constrains the weights to a convex combination. Furthermore, we want to analyze our approach with real objects and integrate it into a tracking architecture based on a view prediction and confirmation model, hopefully boosting the performance of the tracker considerably.
References

1. Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263–266 (1990)
2. Edelman, S., Buelthoff, H.: Orientation dependence in the recognition of familiar and novel views of 3D objects. Vision Research 32, 2385–2400 (1992)
3. Ullman, S.: Aligning pictorial descriptions: An approach to object recognition. Cognition 32(3), 193–254 (1989)
4. Morency, L.P., Rahimi, A., Darrell, T.: Adaptive View-Based Appearance Models. In: Proceedings of CVPR 2003, vol. 1, pp. 803–812 (2003)
5. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by Locally Linear Embedding. Science 290(5500), 2323–2326 (2000)
6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
7. Zhang, Z., Zha, H.: Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment. SIAM J. Sci. Comput. 26(1), 313–338 (2004)
8. Elgammal, A., Lee, C.S.: Inferring 3D Body Pose from Silhouettes Using Activity Manifold Learning. In: Proceedings of CVPR 2004, vol. 2, pp. 681–688 (2004)
9. Lim, H., Camps, O.I., Sznaier, M., Morariu, V.I.: Dynamic Appearance Modeling for Human Tracking. In: Proceedings of CVPR 2006, pp. 751–757 (2006)
10. Liu, C.B., et al.: Object Tracking Using Globally Coordinated Nonlinear Manifolds. In: Proceedings of ICPR 2006, pp. 844–847 (2006)
11. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
Inverse-Halftoning for Error Diffusion Based on Statistical Mechanics of the Spin System

Yohei Saika

Wakayama National College of Technology, 77 Noshima, Nada, Gobo, Wakayama 644-0023, Japan
[email protected]
Abstract. On the basis of statistical mechanics of the Q-Ising model with ferromagnetic interactions under random fields, we formulate the problem of inverse-halftoning for error diffusion using the Floyd-Steinberg kernel. Then, using the Monte Carlo simulation for a set of snapshots of the Q-Ising model and a standard image, we estimate the performance of our method based on the mean square error and on edge structures of the reconstructed image, such as the edge length and the gradient of the gray-level. We clarify that the optimal performance of the MPM estimate is achieved by suppressing the gradient of the gray-level on the edges of the halftone image and by removing a part of the halftone image, if the parameters are set appropriately. Keywords: Bayes inference, digital halftoning, error diffusion, inverse-halftoning, Monte Carlo simulation.
1 Introduction

For many years, researchers have investigated information processing techniques such as image analysis, spatial data analysis and Markov random fields [1-5]. In recent years, based on the analogy between probabilistic information processing and statistical mechanics, statistical-mechanical methods have been applied to image restoration [6] and error-correcting codes [7]. Pryce and Bruce [8] formulated the threshold posterior marginal (TPM) estimate based on statistical mechanics of the classical spin system. Sourlas [9,10] pointed out the analogy between error-correction of the Sourlas codes and statistical mechanics of spin glasses. Nishimori and Wong [11] then constructed a unified framework of image restoration and error-correcting codes based on statistical mechanics of the Ising model. Recently, statistical-mechanical techniques have been applied to various problems, such as mobile communication [12]. In the field of print technology, many techniques from information processing have played important roles in printing a multi-level image with high quality. In particular, digital halftoning is an essential technique to convert a multi-level image into a bi-level image which is visually similar to the original image [13]. Various techniques for digital halftoning have been established, such as the threshold mask method [14], the dither method [15], the blue noise mask method [16] and error diffusion [17,18]. Inverse-halftoning is, in turn, an important technique to reconstruct the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 663–672, 2008.
© Springer-Verlag Berlin Heidelberg 2008
664
Y. Saika
multi-level image from the halftone image [19]. For this purpose, various techniques [20] for image restoration have been used. In recent years, the MAP estimate [21] has been applied both to the threshold mask method and to the error diffusion method, and statistical-mechanical methods have been applied to the threshold mask method [21-23].

In this article, we give a statistical-mechanical formulation of the problem of inverse-halftoning for the error diffusion method using the maximizer of the posterior marginal (MPM) estimate. This method is based on Bayes inference: through the Bayes formula, the posterior probability is estimated from the model prior and the likelihood. In this study, we use a model prior expressed by the Boltzmann factor of the Q-Ising model and a likelihood expressed by the Boltzmann factor of random fields enhancing the halftone image. Then, using Monte Carlo simulation both for a set of snapshots of the Q-Ising model and for a standard image, we estimate the performance of our method based on the mean square error and on edge structures observed in the original, halftone and reconstructed images, such as the edge length and the gradient of the gray level. We investigate the edge structures of the reconstructed image because the dot pattern with complex structures appearing in the halftone image is considered to influence the performance of inverse-halftoning. The simulation clarifies that the MPM estimate works effectively for inverse-halftoning of the halftone image converted by the error diffusion method using the Floyd-Steinberg kernel, if we set the parameters appropriately. We also clarify that the optimal performance of our method is achieved by suppressing the gray-level difference between neighboring pixels and by removing a part of the edges which are embedded in the halftone image through the procedure of error diffusion.
Further, we clarify the dynamical properties of the MPM estimate for inverse-halftoning. If the parameters are set appropriately, the mean square error smoothly converges to the optimal value irrespective of the choice of the initial condition. Otherwise, the convergent value of the MPM estimate depends on the initial condition of the Monte Carlo simulation.

The present article is organized as follows. In Section 2, we give the statistical-mechanical formulation of the problem of inverse-halftoning for error diffusion. In Section 3, using Monte Carlo simulation both for a set of snapshots of the Q-Ising model and for a gray-level standard image, we estimate the performance of the MPM estimate based on the mean square error and on the edge structures of the original, halftone and reconstructed images, such as the edge length and the gradient of the gray level. Section 4 is devoted to a summary and discussion.
2 General Formulation

Here we give a statistical-mechanical formulation of the problem of inverse-halftoning for a halftone image which is generated by error diffusion using the Floyd-Steinberg kernel. First, we consider a gray-level image {ξx,y} in which all pixels are arranged on the points of a square lattice. Here we set ξx,y = 0,…,Q−1 and x, y = 1,…,L.
Inverse-Halftoning for Error Diffusion
Fig. 1. (a) A sample snapshot of the Q=4 Ising model with 100×100 pixels; (b) the halftone image converted from (a) by the error diffusion method using the Floyd-Steinberg kernel; (c) the 4-level image reconstructed by the MPM estimate when hs=1, Ts=1, h=1, T=0.1, J=0.875; (d) the 256-level standard image “girl” with 100×100 pixels; (e) the halftone image converted from (d) by the error diffusion method using the Floyd-Steinberg kernel; (f) the 256-level image reconstructed from (e) by the MPM estimate when h=1, T=0.1, J=1.40.
In this study, we treat two kinds of original images. One kind is the set of gray-level images generated by a true prior expressed by the Boltzmann factor of the Q-Ising model:

$$ \Pr(\{\xi_{x,y}\}) = \frac{1}{Z_s} \exp\Biggl[ -\frac{h_s}{T_s} \sum_{\mathrm{n.n.}} \bigl(\xi_{x,y} - \xi_{x',y'}\bigr)^2 \Biggr], \qquad (1) $$
where Zs is the normalization factor and n.n. denotes nearest-neighbor pairs. As shown in Fig. 1(a), we can generate gray-level images which have smooth structures, as seen in natural images, if we set the parameters hs and Ts appropriately. The other original image is the 256-level standard image “girl” shown in Fig. 1(d).

Next, in the procedure of digital halftoning based on error diffusion, we convert the gray-level image {ξx,y} into a halftone image {τx,y} which is visually similar to the original gray-level image, where τx,y = 0 or Q−1 and x, y = 1,…,L. The halftone images obtained by the error diffusion method are shown in Figs. 1(b) and (e). The error diffusion algorithm is performed following the block diagram in Fig. 2 and the Floyd-Steinberg kernel in Fig. 3. As shown in these figures, the algorithm proceeds through the image in a raster scan, and a binary decision at each pixel is made based on the input gray level at the (x,y)-th pixel and the filtered errors from previously thresholded pixels. At the (x,y)-th pixel the gray level ux,y is rewritten into the modified gray level u'x,y as
$$ u'_{x,y} = u_{x,y} - \sum_{(k,l)\in S} h_{k,l}\, e_{x-k,\,y-l}. \qquad (2) $$
666
Y. Saika
Here {hk,l} is the Floyd-Steinberg kernel and S is the region of sites that support the site (x,y) through the Floyd-Steinberg kernel. Then ex,y is the error of the halftone value τx,y relative to the modified gray level u'x,y at the site (x,y):
$$ e_{x,y} = \tau_{x,y} - u'_{x,y}. \qquad (3) $$
Here the pixel value τx,y of the halftone image is obtained using the threshold procedure as
$$ \tau_{x,y} = \begin{cases} Q-1 & \bigl(u'_{x,y} \ge (Q-1)/2\bigr) \\ 0 & (\text{otherwise}). \end{cases} \qquad (4) $$
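The raster-scan procedure of Eqs. (2)-(4) can be sketched in a few lines. The following is a minimal Python illustration (our own sketch, not the authors' code), assuming the standard Floyd-Steinberg weights 7/16, 3/16, 5/16 and 1/16 and gray levels 0,…,Q−1:

```python
import numpy as np

# Floyd-Steinberg kernel: (dx, dy) offsets of not-yet-visited pixels and their weights
FS_KERNEL = [((1, 0), 7 / 16), ((-1, 1), 3 / 16), ((0, 1), 5 / 16), ((1, 1), 1 / 16)]

def error_diffusion(gray, Q=256):
    """Convert a gray-level image (values 0..Q-1) into a bi-level halftone
    image (values 0 or Q-1) by Floyd-Steinberg error diffusion."""
    u = gray.astype(float).copy()     # modified gray levels u'_{x,y}
    tau = np.zeros_like(u)            # halftone image
    H, W = u.shape
    for y in range(H):                # raster scan
        for x in range(W):
            # threshold decision, Eq. (4)
            tau[y, x] = (Q - 1) if u[y, x] >= (Q - 1) / 2 else 0
            # quantization error at (x, y); diffuse it to unvisited neighbors, Eq. (2)
            e = u[y, x] - tau[y, x]
            for (dx, dy), w in FS_KERNEL:
                if 0 <= x + dx < W and 0 <= y + dy < H:
                    u[y + dy, x + dx] += w * e
    return tau.astype(int)
```

For a flat mid-gray input the output is a binary dot pattern whose mean approximately preserves the input gray level, which is exactly the property that makes the halftone image "visually similar" to the original.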
Next, using the MPM estimate based on the statistical mechanics of the Q-Ising model, we reconstruct a gray-level image from the halftone image converted by the error diffusion method using the Floyd-Steinberg kernel. In this method, we use a model system which is expressed by a set of Q-Ising spins {zx,y} (zx,y = 0,…,Q−1; x, y = 1,…,L) located on the square lattice. The procedure of inverse-halftoning is carried out so as to maximize the posterior marginal probability:

$$ \hat{z}_{x,y} = \arg\max_{z_{x,y}} \sum_{\{z\}\neq z_{x,y}} P(\{z\} \mid \{\tau\}), \qquad (5) $$
where the posterior probability is estimated by the Bayes formula,

$$ P(\{z\} \mid \{\tau\}) \propto P(\{z\})\, P(\{\tau\} \mid \{z\}), \qquad (6) $$
using the model prior and the likelihood. In this study, we assume a model prior expressed by the Boltzmann factor of the Q-Ising model:
$$ P(\{z\}) = \frac{1}{Z_m} \exp\Biggl[ -\frac{J}{T_m} \sum_{\mathrm{n.n.}} \bigl(z_{x,y} - z_{x',y'}\bigr)^2 \Biggr]. \qquad (7) $$
This model prior is expected to enhance smooth structures which can be seen in natural images. Then, we assume the likelihood:
$$ P(\{\tau\} \mid \{z\}) \propto \exp\Biggl[ -\frac{h}{T_m} \sum_{x,y} \bigl(z_{x,y} - \hat{\tau}_{x,y}\bigr)^2 \Biggr], \qquad (8) $$
which generally enhances the gray-level image:
$$ \hat{\tau}_{x,y} = \sum_{i,j=-1}^{1} a_{i,j}\, \tau_{x+i,\,y+j}, \qquad (9) $$
where {ai,j} is the kernel of a conventional filter. In this study, we use the halftone image itself, i.e., the image obtained by the error diffusion method using the Floyd-Steinberg kernel. We note that our method reduces to the MAP estimate in the limit T→0.
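In practice the MPM estimate of Eq. (5) is evaluated by sampling the posterior and taking the per-pixel mode of the sampled marginals. The following Python sketch (a hypothetical simplification, not the authors' implementation; the function name `mpm_estimate` and the default sweep counts are ours) uses single-site Gibbs sampling of the posterior of Eqs. (6)-(8), with τ̂ taken as the halftone image itself:

```python
import numpy as np

def mpm_estimate(tau, Q, J, h, T, sweeps=200, burn_in=50, seed=0):
    """MPM inverse-halftoning sketch: Gibbs-sample the posterior
    P({z}|{tau}) ∝ exp[-(J/T) Σ_nn (z - z')² - (h/T) Σ (z - tau)²]
    and return the per-pixel mode of the sampled marginals, Eq. (5)."""
    rng = np.random.default_rng(seed)
    H, W = tau.shape
    z = rng.integers(0, Q, size=(H, W))
    counts = np.zeros((H, W, Q))          # sampled marginal histograms
    levels = np.arange(Q)
    for s in range(sweeps):
        for y in range(H):
            for x in range(W):
                # local energy of every candidate level at pixel (x, y)
                nn = [z[y2, x2] for y2, x2 in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                      if 0 <= y2 < H and 0 <= x2 < W]
                E = J * sum((levels - n) ** 2 for n in nn) + h * (levels - tau[y, x]) ** 2
                p = np.exp(-(E - E.min()) / T)
                z[y, x] = rng.choice(levels, p=p / p.sum())
        if s >= burn_in:                  # accumulate marginals after burn-in
            counts[np.arange(H)[:, None], np.arange(W)[None, :], z] += 1
    return counts.argmax(axis=2)          # maximizer of the posterior marginal
```

Taking the per-pixel histogram mode rather than the minimum-energy configuration is what distinguishes the MPM estimate from the MAP estimate; the two coincide only in the T→0 limit noted above.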
Fig. 2. The block diagram of the error diffusion algorithm
Fig. 3. The Floyd-Steinberg kernel
Next, in order to estimate the performance of our method for a standard image, we use the mean square error as
$$ \sigma = \frac{1}{N Q^2} \sum_{x,y=1}^{L} \bigl(\hat{z}_{x,y} - \xi_{x,y}\bigr)^2, \qquad (10) $$

where ξx,y and ẑx,y are the pixel values of the original gray-level and reconstructed images, and N = L² is the number of pixels. On the other hand, when we estimate the performance for gray-level images generated by the true prior P({ξ}), we evaluate the mean square error averaged over the true prior:

$$ \sigma = \sum_{\{\xi\}} P(\{\xi\}) \frac{1}{N Q^2} \sum_{x,y=1}^{L} \bigl(\hat{z}_{x,y} - \xi_{x,y}\bigr)^2. \qquad (11) $$
This value becomes zero if each pixel value of every reconstructed image is identical to that of the corresponding original image. As noted above, because the edge structures embedded in the halftone image are considered to influence the performance of inverse-halftoning, we estimate the edge length appearing both in the halftone and reconstructed images as
$$ L^{GH}_{\mathrm{edge}} = \sum_{\{\xi\}} P(\{\xi\}) \Biggl[ \sum_{\mathrm{n.n.}} \bigl(1 - \delta_{\tau_{x,y},\tau_{x',y'}}\bigr)\bigl(1 - \delta_{z_{x,y},z_{x',y'}}\bigr) \Biggr], \qquad (12) $$
which is averaged over the snapshots of the Q-Ising model. We also estimate the gradient of the gray level on the edges appearing both in the halftone and reconstructed images as

$$ |\delta z^{GH}_{x,y}| = \sum_{\{\xi\}} P(\{\xi\}) \Biggl[ \sum_{\mathrm{n.n.}} \bigl(1 - \delta_{\tau_{x,y},\tau_{x',y'}}\bigr)\, |z_{x,y} - z_{x',y'}| \Biggr], \qquad (13) $$
which is averaged over the snapshots of the Q-Ising model. Next, in order to clarify how the edge structures of the original image are observed in the reconstructed image, we estimate the edge length on the edges shared by the original, halftone and reconstructed images:

$$ L^{GHO}_{\mathrm{edge}} = \sum_{\{\xi\}} P(\{\xi\}) \Biggl[ \sum_{\mathrm{n.n.}} \bigl(1 - \delta_{\xi_{x,y},\xi_{x',y'}}\bigr)\bigl(1 - \delta_{\tau_{x,y},\tau_{x',y'}}\bigr)\bigl(1 - \delta_{z_{x,y},z_{x',y'}}\bigr) \Biggr], \qquad (14) $$
which is averaged over the original images. We also estimate the gradient of the gray level on the edges shared by the original, halftone and reconstructed images:

$$ |\delta z^{GHO}_{x,y}| = \sum_{\{\xi\}} P(\{\xi\}) \Biggl[ \sum_{\mathrm{n.n.}} \bigl(1 - \delta_{\xi_{x,y},\xi_{x',y'}}\bigr)\bigl(1 - \delta_{\tau_{x,y},\tau_{x',y'}}\bigr)\, |z_{x,y} - z_{x',y'}| \Biggr], \qquad (15) $$

which is averaged over the original images.
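For a single pair of images, the bond sums inside Eqs. (12) and (13) reduce to plain counts over nearest-neighbor pixel pairs. A minimal Python sketch (illustrative only; the helper names are ours, and the prior average over {ξ} is omitted):

```python
import numpy as np

def bond_pairs(img):
    """List the nearest-neighbor pixel-value pairs (horizontal, then vertical bonds)."""
    a = np.asarray(img)
    return list(zip(a[:, :-1].ravel(), a[:, 1:].ravel())) + \
           list(zip(a[:-1, :].ravel(), a[1:, :].ravel()))

def edge_length_GH(tau, z):
    """Single-image version of Eq. (12): count bonds that are edges in both
    the halftone image tau and the reconstructed image z."""
    return sum((t1 != t2) and (z1 != z2)
               for (t1, t2), (z1, z2) in zip(bond_pairs(tau), bond_pairs(z)))

def gradient_GH(tau, z):
    """Single-image version of Eq. (13): gray-level gradient of z summed over
    the bonds that are edges in tau."""
    return sum((t1 != t2) * abs(int(z1) - int(z2))
               for (t1, t2), (z1, z2) in zip(bond_pairs(tau), bond_pairs(z)))
```

The three-factor quantities of Eqs. (14) and (15) follow the same pattern, with an additional Kronecker-delta factor on the bonds of the original image {ξ}.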
3 Performance

In order to estimate the performance of the MPM estimate, we carry out Monte Carlo simulations both for the set of gray-level images generated by the Boltzmann factor of the Q-Ising model and for the gray-level standard image.

First, we estimate the performance of the MPM estimate for the set of snapshots of the Q=4 Ising model, such as the one shown in Fig. 1(a). The halftone image shown in Fig. 1(b) is obtained by the error diffusion method using the Floyd-Steinberg kernel. When we estimate the performance of the MPM estimate, we use the mean square error averaged over 10 samples of snapshots of the Q=4 Ising model with hs=1 and Ts=1.

Now we investigate the static properties of the MPM estimate based on the mean square error and the edge structures observed in the original, halftone and reconstructed images. First we confirm the static properties of the MPM estimate when the posterior probability takes the same form as the likelihood (at J=0) or as the model prior (in the limit J→∞). At J=0, as the likelihood is assumed to enhance the halftone image, the MPM estimate reconstructs a gray-level image which is almost the same as the halftone image. Therefore the reconstructed image at J=0 has an edge length which is
Fig. 4. The mean square error as a function of J obtained by the MPM estimate for the halftone image obtained from the set of snapshots of the Q=4 Ising model by the error diffusion method using the Floyd-Steinberg kernel, when hs=1, Ts=1, h=1, T=0.1
Fig. 5. The edge structures observed in the reconstructed image obtained by the MPM estimate using the Q-Ising model for the set of snapshots of the Q=4 Ising model, when hs=1, Ts=1, h=1, T=1
longer than that of the original image, because the halftone image is expressed by a dot pattern which is visually similar to the original image. For instance, the edge lengths averaged over the set of Q=4 Ising model snapshots and over the halftone images obtained by error diffusion using the Floyd-Steinberg kernel are 6021.3 and 12049.8, respectively. On the other hand, in the limit J→∞, as the model prior is assumed to enhance smooth structures, the MPM estimate reconstructs a flat pattern.
Fig. 6. The mean square error as a function of J obtained by the MPM estimate for error diffusion for the Q=256 standard image “girl”, when h=1, T=1
Fig. 7. The edge length and the gradient of the gray level in the gray-level image restored by the MPM estimate for the Q=256 standard image “girl”, when h=1, T=1
Then we investigate the performance of the MPM estimate when the posterior probability is composed of both the model prior and the likelihood. In Fig. 4 we show the mean square error as a function of J for the snapshots of the Q=4 Ising model. The figure indicates that the mean square error takes its minimum at J=0.875. This means that the MPM estimate reconstructs the gray-level image by suppressing the gradient of the gray level on the edges embedded in the halftone image through the Boltzmann factor of the Q-Ising model. The figure also indicates that the mean square error rapidly decreases as we increase J from 0.025 to 0.3. The origin of this rapid drop of the mean square error can be explained by the edge structures of the reconstructed image, such
as the edge length and the gradient of the gray level, L^{GH}_{edge}, L^{GHO}_{edge}, |δz^{GH}| and |δz^{GHO}|, which are defined in Eqs. (12)-(15), as follows. Now we evaluate the edge structures of the reconstructed images, namely the edge length and the gradient of the gray level. Figure 5 shows how L^{GH}_{edge}, L^{GHO}_{edge}, |δz^{GH}| and |δz^{GHO}| depend on the parameter J in the MPM estimate for the set of Q=4 Ising model snapshots. Figure 5 indicates that L^{GH}_{edge} is steady for 0<J<0.6, then gradually decreases with increasing J from 0.6, and becomes zero for J>1.175. This means that the edge structures of the halftone image survive for J<0.6 and that they are removed by the model prior for J>0.6. On the other hand, |δz^{GH}| decreases with increasing J from 0.025 to 0.3 and then becomes almost equal to the edge length L^{GH}_{edge} at J=0.3. Then, as shown in Fig. 5, although L^{GH}_{edge} is steady for 0<J<0.6, L^{GH}_{edge} shortens with increasing J from 0.6, taking L^{GH}_{edge}=4096.3 at J=J_{opt}=0.875, and, as we further increase J, finally becomes zero at J=1.075. These results indicate that |δz^{GH}| is rapidly suppressed for J<0.6 by the model prior, which enhances smooth structures, and that the edges embedded in the halftone image are removed by the model prior for J>0.6. Further, these results show that the optimal performance is achieved by suppressing the gradient of the gray level and by removing a part of the edges embedded in the halftone image, if we set the parameters appropriately.

We also numerically estimate the dynamical properties of the MPM estimate using Monte Carlo simulation for the Q=4 Ising model on the square lattice. The simulation clarifies that the mean square error smoothly converges to the same value within the statistical uncertainty, but that the convergent value depends on the initial condition if we set J>J_{opt}.

Next, we estimate the performance of the MPM estimate for the 256-level standard image “girl” with 100×100 pixels shown in Fig. 1(d). The halftone image in Fig. 1(e) is obtained from the original image by the error diffusion method using the Floyd-Steinberg kernel. As shown in Figs. 6 and 7, we estimate the performance in terms of the mean square error, the edge length and the gradient of the gray level. Figure 6 indicates that the MPM estimate shows optimal performance at J=1.45. Then, as shown in Fig. 7, the Monte Carlo simulation clarifies that the edge length of the reconstructed image is longer than that of the halftone image, and further that the gray-level difference is larger than the edge length in the reconstructed image. These results indicate that the fine structures introduced into the halftone image remain in the reconstructed image, although the gradient of the gray level is sharply suppressed with increasing J for 0<J<1.0. As shown above, these results are similar in behavior to those of the Q=4 model.
4 Summary and Discussion

In the preceding sections, we formulated the problem of inverse-halftoning for error diffusion using the Floyd-Steinberg kernel based on the MPM estimate. Then, using Monte Carlo simulation for snapshots of the Q-Ising model and for the standard image, we estimated the performance of our method in terms of the mean square error and the edge structures of the reconstructed image. We clarified that the optimal performance of the MPM estimate using the above model prior and likelihood can be obtained for the problem of inverse-halftoning if we tune the parameters. This also means that the assumed model prior serves to suppress the fine structures introduced into the halftone image and to reconstruct the original gray-level image.
Acknowledgment. The author was financially supported by a Grant-in-Aid for Scientific Research on Priority Areas “Deepening and Expansion of Statistical Mechanical Informatics (DEX-SMI)” from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), No. 18079001.
References
1. Besag, J.: Journal of the Royal Statistical Society B 48, 259–302 (1986)
2. Winkler, G.: Image Analysis, Random Fields and Dynamic Monte Carlo Methods: A Mathematical Introduction. Springer, Berlin (1995)
3. Cressie, N.A.: Statistics for Spatial Data. Wiley, New York (1993)
4. Possolo, A. (ed.): Spatial Statistics and Imaging. Lecture Notes-Monograph Series, vol. 20. Institute of Mathematical Statistics, Hayward, California (1991)
5. Ogata, Y., Tanemura, M.: Ann. Inst. Stat. Math. B 33, 131 (1981)
6. Nishimori, H.: Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, Oxford (2001)
7. Tanaka, K.: Journal of Physics A: Mathematical and General 35, R81–R150 (2002)
8. Pryce, J.M., Bruce, A.D.: Journal of Physics A 28, 511–532 (1995)
9. Sourlas, N.: Nature 339, 693–695 (1989)
10. Sourlas, N.: Europhys. Letters 25, 159–164 (1994)
11. Nishimori, H., Wong, K.Y.M.: Statistical mechanics of image restoration and error-correcting codes. Physical Review E 60, 132–144 (1999)
12. Tanaka, T.: Statistical mechanics of CDMA multiuser demodulation. Europhysics Letters 54, 540–546 (2001)
13. Ulichney, R.: Digital Halftoning. The MIT Press, Cambridge, Massachusetts (1987)
14. Bayer, B.E.: An optimum method for two-level rendition of continuous-tone pictures. In: ICC Conf. Record, vol. 26, pp. 11–15 (1973)
15. Ulichney, R.: Dithering with blue noise. Proc. IEEE 76(1), 56–79 (1988)
16. Mitsa, T., Parker, K.J.: Digital halftoning technique using a blue noise mask. Journal of the Optical Society of America A 9(11), 1920–1929 (1992)
17. Yao, M., Parker, K.J.: The blue noise mask and its comparison with error diffusion. SID 94 Digest 37(3), 805–807 (1994)
18. Yao, M., Parker, K.J.: Modified approach to the construction of a blue noise mask. Journal of Electronic Imaging 3(1), 92–97 (1994)
19. Miceli, C.M., Parker, K.J.: Inverse halftoning. Journal of Electronic Imaging 1, 143–151 (1992)
20. Wong, P.W.: Inverse halftoning and kernel estimation for error diffusion. IEEE Trans. Image Processing 4, 486–498 (1995)
21. Saika, Y., Inoue, J.: Probabilistic inference to the problem of inverse-halftoning based on statistical mechanics of spin systems. In: Proc. SICE-ICCAS 2006, pp. 4563–4568 (2006)
22. Saika, Y., Inoue, J.: Probabilistic inference to the problem of inverse-halftoning based on the Q-Ising model. In: Proc. IEEE FOCI 2007, pp. 429–433 (2007)
23. Inoue, J., Saika, Y., Okada, M.: Accepted for the 7th International Conference on Intelligent Systems Design and Applications (ISDA 2007) (2007)
Chaotic Motif Sampler for Motif Discovery Using Statistical Values of Spike Time-Series

Takafumi Matsuura and Tohru Ikeguchi

Graduate School of Science and Engineering, Saitama University, 225 Shimo-Ohkubo, Sakura-ku, Saitama-city 338-8570, Japan
[email protected]
Abstract. One of the most important issues in bioinformatics is to discover a common and conserved pattern, called a motif, from biological sequences. We have already proposed a motif extraction method called the Chaotic Motif Sampler (CMS), which uses chaotic dynamics. Exploring the search space while avoiding undesirable local minima, the CMS discovers motifs very effectively. During the search process, the chaotic neurons generate very complicated spike time-series. In the present paper, we analyze the complexity of the spike time-series observed from each chaotic neuron by using statistical measures, namely the coefficient of variation and the local variation of interspike intervals, which are frequently used in the field of neuroscience. As a result, if a motif is embedded in a sequence, the corresponding spike time-series show characteristic behavior. Using these characteristics, multiple motifs can be identified.
1 Introduction
The Human Genome Project was completed in April 2003. The project generated a large amount of genomic sequence data. The human genome consists of approximately three billion base pairs and approximately 25,000 genes. One of the primary issues in the field of bioinformatics is therefore how to identify the biologically important parts. It is generally considered that biologically important parts appear repeatedly in a biological sequence, for example, in DNA, RNA, and protein sequences. Therefore, one of the most popular analyses involves extracting a common and conserved pattern, which is often called a motif.

The problem of extracting motifs (the motif extraction problem, MEP) from biological sequences can be mathematically described as follows. We have a data set S = {s1, s2, ..., sN}, where si is the ith biological sequence (Fig. 1). The ith sequence consists of mi (i = 1, 2, ..., N) elements. In the case of DNA or RNA, the elements correspond to the four bases, while in the case of protein sequences they correspond to the 20 amino acids. The most basic case includes exactly one motif of length L in each sequence. However, real-life problems have several variations; the number of motifs embedded in each sequence can be more than one or random (including zero).

To extract motifs from a biological sequence, one of the simplest and most natural methods is a brute-force method, or an enumeration method, which enumerates

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 673–682, 2008. © Springer-Verlag Berlin Heidelberg 2008
674
T. Matsuura and T. Ikeguchi
s1: ATGG CTGAGTA AG ...... ACTGAGTCAGTT
s2: CTATCCAGATT ...... ATAGCGCGC CTGAGTA CTA
...
sN−1: AGATCAGGA CTGAGTA GCCTATTTG ...... AGTT
sN: ACTACAACAATACA CTGAGTA TGC ...... AGATCTC
Fig. 1. An example data set of DNA sequences. A, C, G, and T denote Adenine, Cytosine, Guanine, and Thymine, respectively. The bold face alignments indicate a motif. In this example, the motif “CTGAGTA” is embedded in each sequence.
all possible patterns. However, when using the brute-force method for DNA or RNA sequences, 4^L possible patterns have to be considered. For a protein sequence, this number becomes 20^L because a protein sequence has 20 amino acids. If L becomes large, the number of motif patterns diverges exponentially, which means that it is almost impossible to explore all possible motif patterns in a reasonable time frame. Indeed, it has been proved that the MEP belongs to the class of NP-hard problems [2]. These facts indicate that it is essential to develop effective approximation algorithms for extracting motifs, and many such algorithms have been proposed [9].

To search for near-optimal solutions of combinatorial optimization problems, many heuristic algorithms have been proposed [3, 4]. It has also been shown that algorithms which introduce chaotic neurodynamics for exploring solution spaces are very effective [6, 7, 8, 10, 11]. The central idea of involving chaotic neurodynamics is that, if we use chaotic dynamics, the volume of the search space is reduced drastically. Unlike a random search algorithm, the volume of the state space explored by chaotic dynamics has Lebesgue measure zero, because the state space generally consists of a chaotic attractor. To realize an effective chaotic search, algorithms for controlling a local search by chaotic dynamics have been proposed. In these algorithms, the execution of a local search, such as 2-opt for solving traveling salesman problems (TSPs) [5, 7] or position movement for MEPs [10, 11], is driven by chaotic neurodynamics. It has been shown that chaotic neurodynamics finds good near-optimal solutions for TSPs [5, 7], MEPs [10, 11] and quadratic assignment problems (QAPs) [8].
In addition, it has been clarified that this high search ability depends on a statistical property, the refractoriness of the chaotic neuron [1], in the case of MEPs [12, 13]. To solve MEPs, we have already proposed a motif extraction method, the “Chaotic Motif Sampler (CMS)”, which uses chaotic neurodynamics [11]. To realize the chaotic neurodynamics, we use a chaotic neural network composed of chaotic neurons [1]. In the CMS, a chaotic neuron is assigned to every motif candidate position; thus, a motif candidate is decided by whether the corresponding chaotic neuron is firing or resting. It has been shown that the CMS can find motifs at a very high rate if a single motif is embedded in each sequence.
Exploring the search space while avoiding undesirable local minima, the CMS detects the motifs through firing neurons, and each neuron generates a complicated spike time-series. The spike time-series of the chaotic neuron corresponding to a correct motif position may then exhibit a characteristic response. In this paper, in order to reveal this characteristic property, we analyze the complexity of the spike time-series from each chaotic neuron by using statistical measures, namely the coefficient of variation (CV) and the local variation of interspike intervals (LV), which are frequently used in the field of neuroscience. As a result, if motifs are embedded in a sequence, the corresponding spike time-series show characteristic behavior: a chaotic neuron to which a correct motif position is assigned has a higher CV value than the other neurons, while its LV value is lower than that of the other neurons. Consequently, even if multiple motifs are embedded in a sequence, the CMS can extract them.
2 Chaotic Motif Sampler
The CMS uses chaotic neurons [1], which generate chaotic neurodynamics, to explore better solutions embedded in a state space [11]. In other words, the motif candidates are decided by the firing of the chaotic neurons. To realize the CMS, we first construct a chaotic neural network [1] composed of $\sum_{i=1}^{N} (m_i - L + 1)$ neurons (Fig. 2). In this neural network, the firing of a chaotic neuron encodes a movement of the head position of a motif candidate to the corresponding position (Fig. 2).
(N: the number of sequences; L: the motif length; mi: the number of elements in each sequence)

Fig. 2. How to code motif positions to neuron firings. In this example, the motif length is five.
The firing of the (i, j)th neuron is then defined by $x_{i,j}(t) = f(y_{i,j}(t)) > 1/2$, where $f(y) = 1/(1 + \exp(-y/\varepsilon))$ and $y_{i,j}(t)$ is the internal state of the (i, j)th neuron at time t. The internal state of the chaotic neuron [1] is decomposed into two parts, $\xi_{i,j}(t)$ and $\zeta_{i,j}(t)$, which represent the different effects determining the firing of a neuron in the algorithm: a gain effect and a refractory effect, respectively.
The first part, $\xi_{i,j}(t)$, which expresses the gain effect, is defined by

$$ \xi_{i,j}(t+1) = \beta \bigl( E_{i,j}(t) - \hat{E} \bigr), \qquad (1) $$

$$ E_{i,j}(t) = \frac{1}{L} \sum_{k=1}^{L} \sum_{\omega \in \Omega} f_k(\omega) \log_2 \frac{f_k(\omega)}{p(\omega)}, \qquad (2) $$
where β is a scaling parameter of the effect; $E_{i,j}(t)$ represents the objective function of the CMS, namely the relative entropy score when the motif candidate position is moved to the jth position in the sequence $s_i$; $\hat{E}$ is the entropy score of the current state; $f_k(\omega)$ is the frequency of appearance of an element (one of the four bases in the case of DNA or RNA sequences, or one of the 20 amino acids in the case of a protein sequence) $\omega \in \Omega$ at the kth position of the subsequences; $p(\omega)$ is the background probability of appearance of the element ω; and Ω is the set of bases (Ω_BASE = {A, C, G, T}) or the set of amino acids (Ω_ACID = {N, S, D, Q, E, T, R, H, G, K, Y, W, C, M, P, F, A, V, L, I}). The right-hand side of Eq. (1) becomes positive if the new motif candidate position is better than the current state.

The second part, $\zeta_{i,j}(t)$, qualitatively realizes the refractoriness of the neuron. Refractoriness is one of the important properties of real biological neurons: once a neuron fires, it can hardly fire again for a certain period of time. The second part is then expressed as follows:

$$ \zeta_{i,j}(t+1) = -\alpha \sum_{d=0}^{t} k_r^d\, x_{i,j}(t-d) + \theta \qquad (3) $$
$$ \phantom{\zeta_{i,j}(t+1)} = -\alpha\, x_{i,j}(t) + k_r\, \zeta_{i,j}(t) + \theta (1 - k_r), \qquad (4) $$
where α is a scaling parameter that decides the strength of the refractory effect after neuron firing (0 < α); $k_r$ is a decay parameter that takes values between 0 and 1; and θ is a threshold value. Thus, in Eq. (3), $\zeta_{i,j}(t+1)$ expresses the refractory effect with decay factor $k_r$: the more frequently the neuron has fired in its past, the more negative the first term on the right-hand side of Eq. (3) becomes, which in turn depresses the value of $\zeta_{i,j}(t+1)$ and leads the neuron to a relatively resting state. Although the refractoriness realized in the chaotic neuron [1] has a similar effect to the tabu effect [3, 4], we have already shown that the refractoriness can control the search of a solution in the state space better than the tabu effect [12, 13].

Using these two internal states, we construct an algorithm for extracting motifs, as described in the following. Consider a set of N sequences of lengths $m_i$ (i = 1, 2, ..., N), and let the length of a subsequence (motif) be L (Fig. 2). We proceed as follows:

1. The position of an initial subsequence $t_{i,j}$ (i = 1, 2, ..., N; j = 1, 2, ..., $m_i - L + 1$) is randomly set in the ith sequence.
2. The ith sequence $s_i$ is selected cyclically.
3. For the sequence $s_i$ selected in step 2, in order to move the motif candidate to a new position, $y_{i,j}(t+1)$ is calculated from the first neuron (j = 1) to the last neuron (j = $m_i - L + 1$) in the sequence $s_i$. Then the neuron whose internal state is maximal ($y_{i,\max}$) is selected. If the (i, max)th neuron fires ($x_{i,\max}(t+1) > 1/2$), the motif position is moved to the (i, max)th position, and the value of $\hat{E}$ is updated.
4. Repeat steps 2 and 3 for a sufficient number of iterations.
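The refractory recursion of Eq. (4) and the firing decision of step 3 can be sketched compactly. The following Python fragment is an illustrative sketch, not the authors' implementation; the names `chaotic_neuron_step` and `cms_update_sequence` are ours, and the gain part ξ is taken as a precomputed vector:

```python
import numpy as np

def chaotic_neuron_step(zeta, x_prev, alpha, k_r, theta):
    """Refractory part of the internal state, Eq. (4):
    zeta(t+1) = -alpha * x(t) + k_r * zeta(t) + theta * (1 - k_r),
    which is the recursive form of the exponentially weighted sum in Eq. (3)."""
    return -alpha * x_prev + k_r * zeta + theta * (1 - k_r)

def cms_update_sequence(xi, zeta, eps=0.01):
    """One CMS move for a sequence (step 3 of the algorithm): select the
    neuron with the largest internal state y = xi + zeta and move the motif
    candidate there if that neuron fires, i.e. f(y) = 1/(1+exp(-y/eps)) > 1/2."""
    y = xi + zeta
    j = int(np.argmax(y))
    fires = 1.0 / (1.0 + np.exp(-y[j] / eps)) > 0.5
    return (j if fires else None), y
```

The recursive form is preferred in practice because it updates the full firing history in O(1) per neuron per step instead of re-summing Eq. (3) from t = 0.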
3 Statistical Measures CV and LV [14]
The coefficient of variation (CV) of interspike intervals is a measure of the randomness of interval variation. The CV is defined as

$$ CV = \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (T_i - \bar{T})^2}}{\bar{T}}, \qquad (5) $$

where $T_i$ is the ith interspike interval (ISI), n is the number of ISIs, and $\bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_i$ is the mean ISI. For spike intervals that are independently exponentially distributed, CV tends to 1 in the limit of a large number of intervals. For a regular spike time-series in which $T_i$ is constant, CV = 0.

Next, the local variation (LV) of interspike intervals is a measure of the spiking characteristics of an individual neuron. The LV is defined as

$$ LV = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{3\, (T_i - T_{i+1})^2}{(T_i + T_{i+1})^2}. \qquad (6) $$

For spike intervals that are independently exponentially distributed, LV = 1. For a regular spike time-series in which $T_i$ is constant, LV = 0.
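Equations (5) and (6) translate directly into code. A short Python sketch (the helper names `cv` and `lv` are ours), using the sample standard deviation with the n−1 denominator as in Eq. (5):

```python
import numpy as np

def cv(isi):
    """Coefficient of variation, Eq. (5): sample standard deviation of the
    interspike intervals divided by their mean."""
    isi = np.asarray(isi, dtype=float)
    return isi.std(ddof=1) / isi.mean()

def lv(isi):
    """Local variation, Eq. (6): average of 3(T_i - T_{i+1})^2 / (T_i + T_{i+1})^2
    over consecutive pairs of interspike intervals."""
    isi = np.asarray(isi, dtype=float)
    t1, t2 = isi[:-1], isi[1:]
    return np.mean(3.0 * (t1 - t2) ** 2 / (t1 + t2) ** 2)
```

A perfectly regular spike train gives 0 for both measures, while a long Poisson (exponentially distributed) train gives values close to 1, matching the limiting cases stated above.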
4 Results

4.1 Investigation of the Statistical Measures CV and LV
In order to investigate the values of CV and LV [14] for the chaotic neurons in the CMS solving an MEP, we used real protein sequences [9, 15]. First, we used a real protein data set consisting of N = 30 sequences. The ith sequence is composed of $m_i$ (i = 1, 2, ..., N) amino acids, and one motif (L = 18) is embedded in each sequence. Because the correct locations of the motif in these real protein sequences [15] have already been clarified by another method [9, 15], we can compare the analysis results between the neurons at the correct motif positions and the other neurons. For more details of the sequence data and how the correct motif positions were identified, see Ref. [9] and the references therein. Table 1 shows the correct locations of the motif in each sequence. To extract motifs from this data set, the parameters of the CMS are set to $k_r$ = 0.8, α = 0.25, θ = 1.0, β = 30.0, and ε = 0.01.
T. Matsuura and T. Ikeguchi Table 1. Correct location of the motif in each real protein sequence [15]
[Table 1 body not reliably recoverable from the extraction; it lists the correct motif location for each of the 30 sequences (e.g., location 22 for the 3rd sequence). Figure 3 appears here; see the caption below.]
Fig. 3. (a) Raster plot, the values of (b) the firing rate, (c) CV and (d) LV for all neurons in the 3rd sequence of Table 1
It is considered that the frequently firing neurons correspond to the location of a correct motif, because the gain effect of the chaotic neurons is determined by the degree of similarity to the other motif candidates. Namely, the neurons corresponding to motifs fire frequently. In this sense, it might be enough to observe the firing rates of the neurons. However, the firing rate alone does not always give a correct result. Figure 3 shows such an example: the results for the 3rd sequence (s_3) of the data in Table 1. For the 3rd sequence, although the correct motif location is 22, the most frequently firing neurons are not the 22nd, but the middle (160 ∼ 170) and end (270 ∼ 280) neurons of the sequence (Fig. 3(b)). If we examine the raster plot of all the neurons in the 3rd sequence, the frequently firing neurons seem to fire randomly, while the firing pattern of the 22nd neuron shows characteristic behavior (Fig. 3(a)). To detect such characteristic firing behavior, we calculated the statistical values of the output spike time-series. As a result, it is clearly seen that the CV of the 22nd neuron is higher than that of the other neurons and its LV is lower (Figs. 3(c) and (d)). On the other hand, for the frequently firing neurons, the values of CV are low and the values of LV are high. For the other
Chaotic Motif Sampler for Motif Discovery
Fig. 4. The values of (a) the firing rate, (b) CV, and (c) LV of all neurons for the real protein data set. The correct locations of the motifs are shown in Table 1.
sequences, the neurons corresponding to a correct motif show the same tendency (Fig. 4 and Table 1). Furthermore, if the neuron at a correct motif position fires frequently, its CV becomes not low but high, and its LV becomes not high but low. These results indicate that the spike time-series generated by the chaotic neuron at a correct motif position has different characteristics from those of the other neurons.

4.2 Multiple Motif Case
The original CMS [11, 12] cannot always find multiple motifs in a sequence, because only the most similar subsequence is extracted as a motif from each sequence. However, multiple motifs might be extracted from each sequence by using

Table 2. Correct locations of the motif in each artificial protein sequence
[Table 2 body not reliably recoverable from the extraction; it lists one or two correct motif locations for each of the 25 sequences.]
Fig. 5. (a) Raster plot, the values of (b) the firing rate, (c) CV and (d) LV for all neurons in the 15th sequence of Table 2
Fig. 6. The values of (a) the firing rate, (b) CV, and (c) LV of all neurons for the artificial protein data set. The correct locations of the motifs are shown in Table 2.
characteristics of spike time-series. To verify this hypothesis, we prepared an artificial data set. This artificial data set has 25 sequences (N = 25), and the ith sequence is composed of m_i (i = 1, 2, ..., N) amino acids. In each sequence, one
or two motifs are embedded. Table 2 shows the correct locations of the motif (L = 18) in each sequence. To extract motifs from this data set, the parameters are set to k_r = 0.8, α = 0.25, θ = 0.9, β = 25.0, and ε = 0.01. Figure 5 shows the results for the 15th sequence of Table 2, in which two motifs are embedded. For the 15th sequence, although one neuron of the motif (the 23rd) shows a high firing rate, the other (the 156th) shows a low one (Figs. 5(a) and (b)). However, both values of CV are higher than those of the other neurons, and both values of LV are lower (Figs. 5(c) and (d)). For the other sequences, the neurons of correct motifs show the same tendency (Fig. 6 and Table 2).
5 Conclusions
In this paper, to improve the performance of solving the motif extraction problem by chaotic neurodynamics, we analyzed the spike time-series produced by the chaotic neurons in the CMS, quantifying their complexity with statistical measures frequently used in the field of neuroscience: the coefficient of variation (CV) and the local variation (LV) of interspike intervals. As a result, the CV of the neurons corresponding to correct motif positions becomes higher than that of the other neurons, and their LV becomes lower, independently of the firing rates of the neurons. Using these characteristics, we can solve the MEP and extract the motifs even when multiple motifs are embedded in each sequence. Although the CV and LV values of spikes produced by the neurons corresponding to the correct locations depend on each sequence, their values are statistically distinctive within the sequence. Therefore, if we evaluate the motif positions more quantitatively by using the statistical values of CV and LV, we could develop a new CMS algorithm that eliminates the sequences with no motifs. The research of T.I. is partially supported by a Grant-in-Aid for Scientific Research (B) (No. 16300072) and a research grant from the Mazda Foundation.
References
[1] Aihara, K., et al.: Chaotic Neural Networks. Physics Letters A 144, 333–340 (1990)
[2] Akutsu, T., et al.: On Approximation Algorithms for Local Multiple Alignment. In: Proceedings of the 4th Annual International Conference on Computational Molecular Biology, pp. 1–7 (2000)
[3] Glover, F.: Tabu Search I. ORSA Journal on Computing 1(3), 190–206 (1989)
[4] Glover, F.: Tabu Search II. ORSA Journal on Computing 2(1), 4–32 (1990)
[5] Hasegawa, M., et al.: Combination of Chaotic Neurodynamics with the 2-opt Algorithm to Solve Traveling Salesman Problems. Physical Review Letters 79(12), 2344–2347 (1997)
[6] Hasegawa, M., et al.: Exponential and Chaotic Neurodynamical Tabu Searches for Quadratic Assignment Problems. Control and Cybernetics 29(3), 773–788 (2000)
[7] Hasegawa, M., et al.: Solving Large Scale Traveling Salesman Problems by Chaotic Neurodynamics. Neural Networks 15(2), 271–283 (2002)
[8] Hasegawa, M., et al.: A Novel Chaotic Search for Quadratic Assignment Problems. European Journal of Operational Research 139(3), 543–556 (2002)
[9] Lawrence, C.E., et al.: Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262, 208–214 (1993)
[10] Matsuura, T., et al.: A Tabu Search for Extracting Motifs from DNA Sequences. In: Proceedings of Nonlinear Circuits and Signal Processing 2004, pp. 347–350 (2004)
[11] Matsuura, T., et al.: A Tabu Search and Chaotic Search for Extracting Motifs from DNA Sequences. In: Proceedings of the Metaheuristics International Conference 2005, pp. 677–682 (2005)
[12] Matsuura, T., Ikeguchi, T.: Refractory Effects of Chaotic Neurodynamics for Finding Motifs from DNA Sequences. In: Corchado, E.S., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 1103–1110. Springer, Heidelberg (2006)
[13] Matsuura, T., et al.: Analysis on Memory Effect of Chaotic Dynamics for Combinatorial Optimization Problem. In: Proceedings of the Metaheuristics International Conference (2007)
[14] Shinomoto, S., et al.: Differences in Spiking Patterns Among Cortical Neurons. Neural Computation 15, 2823–2842 (2003)
[15] Information available at http://www.genome.jp/
A Thermodynamical Search Algorithm for Feature Subset Selection
Félix F. González and Lluís A. Belanche
Languages and Information Systems Department, Polytechnic University of Catalonia, Barcelona, Spain
{fgonzalez,belanche}@lsi.upc.edu
Abstract. This work tackles the problem of selecting a subset of features in an inductive learning setting, by introducing a novel Thermodynamic Feature Selection algorithm (TFS). Given a suitable objective function, the algorithm makes use of a specially designed form of simulated annealing to find a subset of attributes that maximizes the objective function. The new algorithm is evaluated against one of the most widespread and reliable algorithms, the Sequential Forward Floating Search (SFFS). Our experimental results in classification tasks show that TFS achieves significant improvements over SFFS in the objective function with a notable reduction in subset size.
1 Introduction
The main purpose of Feature Selection is to find a reduced set of attributes from a data set described by a feature set. This is often carried out with a subset search process in the power set of possible solutions. The search is guided by the optimization of a user-defined objective function. The generic purpose pursued is the improvement of the generalization capacity of the inductive learner by reduction of the noise generated by irrelevant or redundant features. The idea of using powerful general-purpose search strategies to find a subset of attributes in combination with an inductive learner is not new. There has been considerable work using genetic algorithms (see for example [1], or [2] for a recent successful application). On the contrary, simulated annealing (SA) has received comparatively much less attention, probably because of its prohibitive cost, which can be especially acute when it is combined with an inducer to evaluate the quality of every explored subset. In particular, [3] developed a rule-based system based on the application of a generic SA algorithm with standard bitwise random mutation. We are of the opinion that such algorithms must be tailored to feature subset selection if they are to be competitive with state-of-the-art stepwise search algorithms. In this work we contribute to tackling this problem by introducing a novel Thermodynamic Feature Selection algorithm (TFS). The algorithm makes use of a specially designed form of simulated annealing (SA) to find a subset of attributes that maximizes the objective function. TFS has a number of distinctive characteristics compared with other search algorithms for feature subset selection. First, the
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 683–692, 2008. © Springer-Verlag Berlin Heidelberg 2008
probabilistic capability of SA to momentarily accept worse solutions is enhanced by the concept of an ε-improvement, explained below. Second, the algorithm is endowed with a feature search window in the backward steps that limits the neighbourhood size while retaining efficacy. Third, the algorithm accepts any objective function for evaluating the quality of generated subsets. In this paper, three different inducers have been tried to assess this quality. TFS is evaluated against one of the most reliable stepwise search algorithms, the Sequential Forward Floating Search method (SFFS). Our experimental results in classification tasks show that TFS notably improves over SFFS in the objective function in all cases, with a very notable reduction in subset size compared to the full set size. We also find that the computational cost compares favorably: it is affordable for small numbers of features and increasingly advantageous as the number of features grows.
2 Feature Selection
Let X = {x_1, ..., x_n}, n > 0, denote the full feature set. Without loss of generality, we assume that the objective function J : P(X) → R⁺ ∪ {0} is to be maximized, where P denotes the power set. We denote by X_k ⊆ X a subset of selected features, with |X_k| = k. Hence, by definition, X_0 = ∅ and X_n = X. The feature selection problem can be expressed as: given a set of candidate features, select a subset¹ defined by one of two approaches: 1. Set 0 < m < n. Find X_m ⊂ X such that J(X_m) is maximum. 2. Set a value J_0, the minimum acceptable J. Find the X_m ⊆ X with the smallest m such that J(X_m) ≥ J_0. Alternatively, given α > 0, find the X_m ⊆ X with the smallest m such that J(X_m) ≥ J(X) or |J(X_m) − J(X)| < αJ(X). In the literature, several suboptimal algorithms have been proposed for feature selection. A wide family is formed by those algorithms which, departing from an initial solution, iteratively add or delete features by locally optimizing the objective function. Among them we find sequential forward generation (SFG) and sequential backward generation (SBG), the plus-l-minus-r (also called plus l - take away r) [12], the Sequential Forward Floating Search (SFFS) and its backward counterpart SFBS [4]. The high trustworthiness and performance of SFFS has been ascertained experimentally (see e.g. [5], [6]).
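As an illustration, the simplest member of this family, sequential forward generation (SFG) under approach 1, might be sketched as follows (J is a user-supplied objective over feature sets; the code is illustrative, not taken from [12]):

```python
def sfg(features, J, m):
    """Sequential forward generation: starting from the empty set,
    greedily add the feature whose inclusion maximizes J, until the
    desired size m is reached."""
    selected = frozenset()
    while len(selected) < m:
        best = max((f for f in features if f not in selected),
                   key=lambda f: J(selected | {f}))
        selected = selected | {best}
    return selected
```

SFFS extends this greedy scheme with conditional backward (floating) steps after each inclusion.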
3 Simulated Annealing
Simulated Annealing (SA) is a stochastic technique inspired by statistical mechanics for finding near globally optimal solutions to large (combinatorial) optimization problems. SA is a weak method in that it needs almost no information about the structure of the search space. The algorithm works by assuming that some part of the current solution belongs to a potentially better one, and thus this part
¹ Such an optimal subset of features always exists but is not necessarily unique.
should be retained by exploring neighbors of the current solution. Assuming the objective function is to be minimized, SA can jump from hill to hill and escape or simply avoid sub-optimal solutions. When a system S (considered as a set of possible states) is in thermal equilibrium at a given temperature T, the probability P_T(s) that it is in a certain state s depends on T and on the energy E(s) of s. This probability follows a Boltzmann distribution:

$$ P_T(s) = \frac{1}{Z}\exp\!\left(-\frac{E(s)}{kT}\right), \qquad Z = \sum_{s' \in S} \exp\!\left(-\frac{E(s')}{kT}\right), $$

where k is the Boltzmann constant and Z acts as a normalization factor. Metropolis and his co-workers [7] developed a stochastic relaxation method that works by simulating the behavior of a system at a given temperature T. With s the current state and s' a neighboring state, the probability of making a transition from s to s' is the ratio P_T(s → s') between the probability of being in s' and the probability of being in s:

$$ P_T(s \to s') = \frac{P_T(s')}{P_T(s)} = \exp\!\left(-\frac{\Delta E}{kT}\right), $$

where we have defined ΔE = E(s') − E(s). Therefore, the acceptance or rejection of s' as the new state depends on the difference of the energies of both states at temperature T. If P_T(s') ≥ P_T(s), the "move" is always accepted. If P_T(s') < P_T(s), it is accepted with probability P_T(s → s') < 1 (this situation corresponds to a transition to a higher-energy state). Note that this probability depends upon and decreases with the current temperature. In the end, there will be a temperature low enough (the freezing point) at which these transitions become very unlikely and the system is considered frozen. In order to maximize the probability of finding states of minimal energy at every temperature, thermal equilibrium must be reached. The SA algorithm proposed in [8] consists of applying the Metropolis idea at each temperature for a finite amount of time. In this algorithm the temperature is first set at an initially high value, spending enough time at it so as to approximate thermal equilibrium. Then a small decrement of the temperature is performed and the process is iterated until the system is considered frozen. If the cooling schedule is well designed, the final reached state may be considered a near-optimal solution.
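A minimal sketch of one Metropolis transition as described above (names are ours; the energy and proposal functions are supplied by the user):

```python
import math
import random

def metropolis_step(energy, s, propose, T, k=1.0):
    """One Metropolis transition at temperature T: a move that does not
    increase the energy is always accepted; an uphill move is accepted
    with probability exp(-dE / (k*T))."""
    s_new = propose(s)
    dE = energy(s_new) - energy(s)
    if dE <= 0 or random.random() < math.exp(-dE / (k * T)):
        return s_new
    return s
```

Iterating this step while lowering T according to a cooling schedule yields the SA algorithm of [8].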
4 TFS: A Thermodynamic Feature Selection Algorithm
In this section we introduce TFS (Thermodynamic Feature Selection). Considering SA as a combinatorial optimization process [10], TFS finds a subset of attributes that maximizes the value of a given objective function. A special-purpose feature selection mechanism is embedded into the SA technique that takes advantage of the probabilistic acceptance capability of worse scenarios over a finite time. This characteristic is enhanced in TFS by the notion of an ε-improvement: a feature ε-improves a current solution if it yields a higher value of the objective function or a value worse by less than ε%. This mechanism is intended
to account for noise in the evaluation of the objective function due to finite sample sizes. The algorithm is also endowed with a feature search window (of size l) in the backward step, as follows. In forward steps, the best feature is always added (by looking at all possible additions). In backward steps this search is limited to l tries at random (without repetition). The value of l is incremented by one at every thermal equilibrium point. This mechanism is an additional source of non-determinism and a bias towards adding a feature only when it is the best option available. In contrast, to remove a feature, it suffices that its removal ε-improves the current solution. A direct consequence is of course a considerable speed-up of the algorithm. Note that the design of TFS is such that it grows more and more deterministic, informed and costly as it converges towards the final configuration.
Fig. 1. TFS algorithm for feature selection. For the sake of clarity, only those variables that are modified are passed as parameters to Forward and Backward.
The pseudo-code of TFS is depicted in Figs. 1 to 3. The algorithm consists of two major loops. The outer loop waits for the inner loop to finish and then updates the temperature according to the chosen cooling schedule. When the outer loop reaches Tmin , the algorithm halts. The algorithm keeps track of the best solution found (which is not necessarily the current one). The inner loop is the core of the algorithm and is composed of two interleaved procedures: Forward and Backward, that iterate until a thermal equilibrium point is found, represented by reaching the same solution before and after. These procedures work independently of each other, but share information about the results of their respective searches in the form of the current solution. Within them, feature selection takes place and the mechanism to escape from local minima starts
PROCEDURE Forward (var Z, J_Z)
  Repeat
    x := argmax{ J(Z ∪ {x_i}) }, x_i ∈ X_n \ Z
    If ε-improves(Z, x, true) then accept := true
    else
      ΔJ := J(Z ∪ {x}) − J(Z)
      accept := rand(0, 1) < e^(ΔJ/t)
    endif
    If accept then Z := Z ∪ {x} endif
    If J(Z) > J_Z then J_Z := J(Z) endif
  Until not accept
END Forward

Fig. 2. TFS Forward procedure (note Z, J_Z are modified)

PROCEDURE Backward (var Z, J_Z)
  A := ∅; AB := ∅
  Repeat
    For i := 1 to min(l, |Z|) do
      Select x ∈ Z \ AB randomly
      If ε-improves(Z, x, false) then A := A ∪ {x} endif
      AB := AB ∪ {x}
    EndFor
    x_0 := argmax{ J(Z \ {x}) }, x ∈ AB
    If x_0 ∈ A then accept := true
    else
      ΔJ := J(Z \ {x_0}) − J(Z)
      accept := rand(0, 1) < e^(ΔJ/t)
    endif
    If accept then Z := Z \ {x_0} endif
    If J(Z) > J_Z then J_Z := J(Z) endif
  Until not accept
END Backward

FUNCTION ε-improves (Z, x, b) RETURNS boolean
  If b then Z' := Z ∪ {x} else Z' := Z \ {x} endif
  Δx := J(Z') − J(Z)
  If Δx > 0 then return true
  else return −Δx / J(Z) < ε endif
END ε-improves

Fig. 3. Left: TFS Backward procedure (note Z, J_Z are modified and x_0 can be efficiently computed while in the For loop). Right: The function for ε-improvement.
working, as follows: these procedures iteratively add or remove features one at a time in such a way that an ε-improvement is accepted unconditionally, whereas a non-ε-improvement is accepted probabilistically.
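The ε-improvement test of Fig. 3 (right) can be transcribed directly; a sketch (it assumes J(Z) > 0, as holds for classification accuracy):

```python
def eps_improves(J, Z, x, adding, eps=0.01):
    """A move epsilon-improves Z if it strictly increases J, or decreases
    it by less than a fraction eps of the current value."""
    Z2 = Z | {x} if adding else Z - {x}
    dx = J(Z2) - J(Z)
    return dx > 0 or -dx / J(Z) < eps
```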
5 An Experimental Study
In this section we report on empirical work. There are nine problems, taken mostly from the UCI repository and chosen to be a mixture of different real-life feature selection processes in classification tasks. In particular, full feature size
ranges from just a few (17) to dozens (65), sample size from a few dozen (86) to the thousands (3,175), and feature type is either continuous, categorical or binary. The problems have also been selected with a criterion in mind not very commonly found in similar experimental work, namely, that these problems are amenable to feature selection. By this it is meant that performance benefits clearly from a good selection process (and less clearly, or even worsens, with a bad one). Their main characteristics are summarized in Table 1.

Table 1. Problem characteristics. Instances is the number of instances, Features that of features, Origin is real/artificial and Type is categorical/continuous/binary.

Name                 Instances  Features  Origin      Type
Breast Cancer (BC)        569        30   real        continuous
Ionosphere (IO)           351        34   real        continuous
Sonar (SO)                208        60   real        continuous
Mammogram (MA)             86        65   real        continuous
Kdnf (KD)                 500        40   artificial  binary
Knum (KN)                 500        30   artificial  binary
Hepatitis (HP)            129        17   real        categorical
Splice (SP)             3,175        60   real        categorical
Spectrum (SC)             187        22   real        categorical
5.1 Experimental Setup
Each data set was processed with both TFS and SFFS in wrapper mode [11], using the accuracy of several classifiers as the objective function: 1-Nearest Neighbor (1NN) for categorical data sets; 1-Nearest Neighbor, Linear and Quadratic discriminant analysis (LDA and QDA) for continuous data sets [12]. We chose these for being fast, parameter-free and not prone to overfit (specifically, we discarded neural networks). Other classifiers (e.g., Naive Bayes or Logistic Regression) could be considered, this choice being user-dependent. Decision trees are not recommended since they perform their own feature selection in addition to that done by the algorithm, which can hinder an accurate assessment of the results. The important point here is that the compared algorithms do not specifically depend on this choice. Despite taking more time, leave-one-out cross-validation is used to obtain reliable generalization estimates, since it is known to have low bias [9]. The TFS parameters are as follows: ε = 0.01, T_0 = 0.1 and T_min = 0.0001. These settings were chosen after some preliminary trials and are kept constant for all the problems. The cooling function was chosen to be geometric, α(t) = ct, taking c = 0.9, following recommendations in the literature [10]. Note that SFFS needs the specification of the desired size of the final solution [4], acting also as a stop criterion. This parameter is very difficult to estimate in practice. To overcome this problem, we let SFFS run over all possible sizes 1...n, where n is the number of features of each data set. This is a way of getting the best performance out of this algorithm. In all cases, a limit of 100 hours of computing time was imposed on the executions.
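The wrapper evaluation used here can be illustrated with a leave-one-out 1NN accuracy (Hamming distance, suitable for categorical data; a sketch, not the authors' implementation):

```python
def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier:
    each instance is classified by its nearest other instance."""
    n = len(X)
    correct = 0
    for i in range(n):
        # (distance, index) to every other instance; min picks the nearest
        _, j = min((sum(a != b for a, b in zip(X[i], X[j])), j)
                   for j in range(n) if j != i)
        correct += y[j] == y[i]
    return correct / n
```

In wrapper mode, J(X_m) is this accuracy computed on the data restricted to the candidate feature subset X_m.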
Table 2. Performance results. Jopt is the value of the objective function for the final solution and k its size. For categorical problems only 1NN is used. Note NR are the corresponding results with no reduction of features. The results for SFFS on the SP data set (marked with a star) were the best after 100 hours of computing time, when the algorithm was cut (this is the only occasion this happened).
          TFS                           NR                 SFFS
     1NN       LDA       QDA       1NN  LDA  QDA      1NN        LDA       QDA
     Jopt  k   Jopt  k   Jopt  k   Jopt Jopt Jopt     Jopt   k   Jopt  k   Jopt  k
BC   0.96 14   0.98 11   0.99  9   0.92 0.95 0.95     0.94  11   0.98 13   0.97  7
IO   0.95 12   0.90 13   0.95  8   0.87 0.85 0.87     0.93  13   0.89  6   0.94 14
SO   0.91 25   0.84  9   0.88 10   0.80 0.70 0.60     0.88   7   0.79 23   0.85 11
MA   0.91  6   0.99 10   0.94  6   0.70 0.71 0.62     0.84  16   0.93  6   0.93  5
SP   0.91  8   n/a       n/a       0.78 n/a  n/a      0.90*  6*  n/a       n/a
SC   0.74  7   n/a       n/a       0.64 n/a  n/a      0.73   9   n/a       n/a
KD   0.70 11   n/a       n/a       0.60 n/a  n/a      0.68  11   n/a       n/a
KN   0.86 14   n/a       n/a       0.69 n/a  n/a      0.85  12   n/a       n/a
HP   0.91  4   n/a       n/a       0.82 n/a  n/a      0.91   4   n/a       n/a
A number of questions are raised prior to the realization of the experiments: 1. Does the feature selection process help to find solutions of similar or better accuracy using fewer features? Is there any systematic difference in performance between the various classifiers? 2. Does TFS find better solutions in terms of the objective function J? Does it find better solutions in terms of the size k?

5.2 Discussion of the Results
Performance results for TFS and SFFS are displayed in Table 2, including the results obtained with no feature selection of any kind, as a reference. From the results we can answer the previously raised questions: 1. The feature selection process indeed helps to find solutions of similar or better accuracy using (much) lower numbers of features. This is true for both algorithms and all three classifiers used. Regarding a systematic difference in performance between the classifiers, the results are non-conclusive, as can reasonably be expected, this matter being problem-dependent in general [11]. 2. In terms of the best value of the objective function, TFS outperforms SFFS in all the tasks, no matter the classifier, except in HP for 1NN, where there is a tie. The difference is sometimes substantial (more than 10%). Recall that SFFS was executed for all possible final size values and the figure reported is the best overall. In this sense, SFFS never came across the subsets obtained by TFS in the search process (otherwise they would have been recorded). It is hence conjectured that TFS has better access to hidden good subsets than SFFS does. In terms of the final subset size, the results are quite interesting.
Both TFS and SFFS find solutions of smaller size using QDA, then using LDA and finally using 1NN, which can be checked by computing the column totals. TFS gives priority to the optimization of the objective function, without any specific concern for reducing the final size. However, it finds very competitive solutions both in terms of accuracy and size. This fact should not be overlooked, since it is by no means clear that solutions of bigger size should have a better value of the objective function and vice versa. In order to avoid the danger of losing relevant information, it is often better to accept solutions of somewhat bigger size if these entail a significantly higher value of the objective function. Finally, it could be argued that the conclusion that TFS consistently yields better final objective function values is overly optimistic: possibly the results for other sizes are consistently equal or even worse. To check whether this is the case, we show the entire solution path for one of the runs, representative of the experimental behaviour found (Fig. 4). It is seen that for (almost) all sizes, TFS offers better accuracy, especially at the lowest values.
Accuracy
2 1.00 0.99 0.98 0.97 0.96 0.95 0.94 0.93
4
6
8
10
12
14
16
18
20
22
TFS-QDC
(9, 0.99)
24
26
28
30 1.00 0.99 0.98 0.97 0.96 0.95 0.94 0.93
SFFS-QDC
(7, 0.97)
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
K-TFS
Fig. 4. Comparative full performance for all k with QDA on the BC data set. The best solutions (those shown in Table 2) are marked by a bigger circle or square.
5.3 Computational Cost
The computational cost of the performed experiments is displayed in Table 3. It is seen that SFFS takes less time for smaller-sized problems (e.g., HP, BC, IO and KN), whereas TFS is more efficient for medium to bigger-sized ones (e.g., SO, MA, KD and SP), where time becomes more of an issue. In this vein, the two-phase interleaved mechanism for forward and backward exploration carries out a good neighborhood exploration, thereby contributing to a fast relaxation of the algorithm. Further, the setup of the algorithm is easier than in a conventional SA, since the time spent at each temperature is automatically and dynamically set. In our experiments this mechanism did not lead to stagnation of the process. The last analyzed issue concerns the distribution of features as selected by TFS. In order to determine this, a perfectly known data set is needed. An artificial problem f has been explicitly designed, as follows: letting x_1, ..., x_n be the relevant features, f(x_1, ..., x_n) = 1 if the majority of the x_i equal 1, and 0 otherwise. Next, completely irrelevant features and redundant features (taken as copies of
Table 3. Computational costs: Jeval is the number of times the function J is called and Time is the total computing time (in hours)
Fig. 5. Distribution of features for the Majority data set as selected by TFS
relevant ones) are added. Ten different data samples of 1000 examples each are generated with n = 21. The truth about this data set is: 8 relevant features (1-8), 8 irrelevant (9-16) and 5 redundant (17-21). We were interested in analyzing the frequency distribution of the features selected by TFS according to their type. The results are shown in figure 5: it is remarkable that TFS gives priority to all the relevant features and rejects all the irrelevant ones. Redundant features are sometimes allowed, compensating for the absence of some relevant ones. Average performance as given by 1NN is 0.95. This figure should be compared to the performance of 0.77 obtained with the full feature set.
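A generator for this Majority data set as described (1000 examples, 8 relevant, 8 irrelevant and 5 redundant features; breaking ties of the majority vote toward class 0 is our assumption):

```python
import random

def majority_dataset(n_samples=1000, seed=0):
    """Features 0-7: relevant bits; 8-15: irrelevant bits;
    16-20: redundant copies of the first five relevant bits.
    Label: majority vote over the 8 relevant bits (ties -> 0)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        rel = [rng.randint(0, 1) for _ in range(8)]
        irr = [rng.randint(0, 1) for _ in range(8)]
        red = rel[:5]                       # redundant copies
        X.append(rel + irr + red)
        y.append(1 if sum(rel) > 4 else 0)  # strict majority of 8 bits
    return X, y
```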
6 Conclusions
An algorithm for feature selection based on simulated annealing has been introduced. A notable characteristic over other search algorithms for this task is its capability to accept momentarily worse solutions. The algorithm has been evaluated against the Sequential Forward Floating Search (SFFS), one of the most reliable algorithms for moderate-sized problems. In comparative results with SFFS (and using a number of inducers as wrappers) the feature selection process shows remarkable results, superior in all cases to the full feature set and
substantially better than those achieved by SFFS alone. The proposed algorithm finds higher-evaluating solutions, whether their size is bigger or smaller than that found by SFFS, and offers a solid and reliable framework for feature subset selection tasks. As future work, we plan to use the concept of thermostatistical persistency [13] to improve the algorithm while reducing its computational cost.
References
1. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. In: Motoda, H., Liu, H. (eds.) Feature Extraction, Construction, and Subset Selection: A Data Mining Perspective. Kluwer, New York (1998)
2. Schlapbach, A., Kilchherr, V., Bunke, H.: Improving Writer Identification by Means of Feature Selection and Extraction. In: 8th Int. Conf. on Document Analysis and Recognition, pp. 131–135 (2005)
3. Debuse, J.C., Rayward-Smith, V.: Feature Subset Selection within a Simulated Annealing Data Mining Algorithm. J. of Intel. Inform. Systems 9, 57–81 (1997)
4. Pudil, P., Ferri, F., Novovicova, J., Kittler, J.: Floating search methods for feature selection. Pattern Recognition Letters 15(11), 1119–1125 (1994)
5. Jain, A., Zongker, D.: Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19(2), 153–158 (1997)
6. Kudo, M., Somol, P., Pudil, P., Shimbo, M., Sklansky, J.: Comparison of classifier-specific feature selection algorithms. In: Procs. of the Joint IAPR Intl. Workshop on Advances in Pattern Recognition, pp. 677–686 (2000)
7. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equations of state calculations by fast computing machines. J. of Chem. Phys. 21 (1953)
8. Kirkpatrick, S.: Optimization by simulated annealing: Quantitative studies. Journal of Statistical Physics 34 (1984)
9. Bishop, C.: Neural networks for pattern recognition. Oxford Press, Oxford (1996)
10. Reeves, C.R.: Modern Heuristic Techniques for Combinatorial Problems. McGraw Hill, New York (1995)
11. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
12. Duda, R.O., Hart, P., Stork, G.: Pattern Classification. Wiley & Sons, Chichester (2001)
13. Chardaire, P., Lutton, J.L., Sutter, A.: Thermostatistical persistency: A powerful improving concept for simulated annealing algorithms. European Journal of Operational Research 86(3), 565–579 (1995)
Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation

Mikio Hasegawa(1) and Ken Umeno(2,3)

(1) Tokyo University of Science, Tokyo 102-0073, Japan
(2) ChaosWare Inc., Koganei-shi, 183-8795, Japan
(3) NICT, Koganei-shi, 183-8795, Japan
Abstract. Adding chaotic sequences to a neural network that solves combinatorial optimization problems improves its performance much more than adding random number sequences does. A previous study has already shown that a specific autocorrelation of the chaotic noise has a positive effect on this high performance. The autocorrelation of such effective chaotic noise takes a negative value at lag 1 and decays with a damped oscillation as the lag increases. In this paper, we generate stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, similar to the effective chaotic noise, and evaluate the performance of the neural network with this stochastic noise. First, we show that the appropriate amplitude of the additive noise changes depending on the negative autocorrelation parameter r. We also show that the performance with negative autocorrelation noise is better than with white Gaussian noise or positive autocorrelation noise, and almost the same as with chaotic noise. From these results, we conclude that the high solvable performance of additive chaotic noise is due to its negative autocorrelation.
1 Introduction
Chaotic dynamics have been shown to be effective for combinatorial optimization problems in many studies [1,2,3,4,5,6,7,8,9]. In one of these approaches, which adds chaotic sequences to each neuron of a mutually connected neural network solving an optimization problem in order to avoid local minima, it has been clearly shown that chaotic noise improves the performance more effectively than random noise [8,9]. To realize higher performance using chaotic dynamics, chaos has also been utilized in optimization methods based on heuristic algorithms, which are applicable to large-scale problems [3], and this approach has been shown to be more effective than tabu search and simulated annealing [5,6]. Experimental analyses of chaotic search have identified several factors enabling its high performance. One is that chaotic dynamics close to the edge of chaos perform well [7]. Another is that, in the approach that adds chaotic sequences as additive noise to a mutually connected neural network, a specific autocorrelation function of the chaotic dynamics has a positive effect [9].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 693–702, 2008. © Springer-Verlag Berlin Heidelberg 2008
As another application of chaotic dynamics, chaos has been applied to CDMA communication systems [10,11,12]. For the spreading codes, the cross-correlation between code sequences should be as small as possible to reduce the bit error rate caused by mutual interference. In the chaotic CDMA approach, it has been shown that chaotic sequences having a negative autocorrelation at lag 1 make the cross-correlation smaller [11]. For generating such optimum code sequences, whose autocorrelation is C(τ) ≈ C × (−r)^τ, an FIR filter has also been proposed, and the advantages of such code sequences have been shown experimentally [12]. In this paper, we analyze the effects of chaotic noise on combinatorial optimization by introducing stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, the same as the sequences utilized in chaotic CDMA [12], because the effective chaotic noise shown in Ref. [9] has a similar autocorrelation. We evaluate the performance while changing the variance and the parameters of the chaotic and stochastic noise, and investigate which property of chaotic noise benefits combinatorial optimization.
2 Optimization by Neural Networks with Chaotic Noise
We introduce the Traveling Salesman Problems (TSPs) and the Quadratic Assignment Problems (QAPs) to evaluate the performance of each type of noise. In this paper, the Hopfield-Tank neural network [13] is the basis of the solution search. It is well known that the energy function,

E(t) = −(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} Σ_{l=1}^{n} w_{ikjl} x_{ik}(t) x_{jl}(t) + Σ_{i=1}^{n} Σ_{k=1}^{n} θ_{ik} x_{ik}(t),    (1)

is always decreasing when the neurons are updated by the following equation,

x_{ik}(t+1) = f[ Σ_{j=1}^{n} Σ_{l=1}^{n} w_{ikjl} x_{jl}(t) + θ_{ik} ],    (2)
where x_{ik}(t) is the output of the (i,k)th neuron at time t, w_{ikjl} is the connection weight between the (i,k)th and (j,l)th neurons, and θ_{ik} is the threshold of the (i,k)th neuron. Because the original Hopfield-Tank neural network stops searching at a local minimum, chaotic noise and other random dynamics have been added to the neurons to avoid trapping at such undesirable states and to achieve much higher performance. As in many conventional studies, the energy function for solving the TSPs [13,1,2,8,9] can be defined by the following equation,

Etsp = A[ Σ_{i=1}^{N} ( Σ_{k=1}^{N} x_{ik}(t) − 1 )² + Σ_{k=1}^{N} ( Σ_{i=1}^{N} x_{ik}(t) − 1 )² ]
     + B Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{N} d_{ij} x_{ik}(t){ x_{j,k+1}(t) + x_{j,k−1}(t) },    (3)
Fig. 1. Solvable performance (%) of chaotic noise (a = 3.82, 3.92, 3.95) and white Gaussian noise on a 20-city TSP, plotted against the noise amplitude β
where N is the number of cities, d_{ij} is the distance between cities i and j, and A and B are the weights of the constraint term (formation of a closed tour) and the objective term (minimization of total tour length), respectively. From Eqs. (1) and (3), the connection weights w_{ijkl} and the thresholds θ_{ij} can be obtained as follows,

w_{ijkl} = −A{ δ_{ij}(1 − δ_{kl}) + δ_{kl}(1 − δ_{ij}) } − B d_{ij} ( δ_{l,k+1} + δ_{l,k−1} ),    (4)

θ_{ij} = 2A,    (5)
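As a concrete illustration, the weight construction of Eqs. (4)–(5) can be sketched in Python. The function name, the cyclic (mod N) treatment of the tour-position deltas δ_{l,k+1} and δ_{l,k−1}, and the default A = B = 1 (the TSP values used in the experiments below) are our own choices for this sketch, not taken from the paper.

```python
import numpy as np

def tsp_weights(d, A=1.0, B=1.0):
    """Connection weights and thresholds for the TSP energy of Eq. (3),
    following Eqs. (4)-(5).  Neuron (i, k) means "city i at tour position k";
    tour positions are treated cyclically (k +/- 1 taken mod N), the usual
    convention for a closed tour."""
    N = d.shape[0]
    w = np.zeros((N, N, N, N))
    theta = np.full((N, N), 2.0 * A)          # Eq. (5): theta_ij = 2A
    for i in range(N):
        for k in range(N):
            for j in range(N):
                for l in range(N):
                    dij = 1.0 if i == j else 0.0
                    dkl = 1.0 if k == l else 0.0
                    # delta_{l,k+1} + delta_{l,k-1}, taken cyclically
                    adj = float(l == (k + 1) % N) + float(l == (k - 1) % N)
                    w[i, k, j, l] = (-A * (dij * (1.0 - dkl) + dkl * (1.0 - dij))
                                     - B * d[i, j] * adj)
    return w, theta
```

For a symmetric distance matrix the resulting weights satisfy w[i,k,j,l] = w[j,l,i,k], so the energy of Eq. (1) is well defined.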
where δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise. For the QAPs, whose objective function is

F(p) = Σ_{i=1}^{N} Σ_{j=1}^{N} a_{ij} b_{p(i)p(j)},    (6)

we use the following energy function,

Eqap = A[ Σ_{i=1}^{N} ( Σ_{k=1}^{N} x_{ik}(t) − 1 )² + Σ_{k=1}^{N} ( Σ_{i=1}^{N} x_{ik}(t) − 1 )² ]
     + B Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{N} Σ_{l=1}^{N} a_{ij} b_{kl} x_{ik}(t) x_{jl}(t).    (7)
From Eqs. (1) and (7), the connection weights and thresholds for the QAPs are obtained as follows,

w_{ijkl} = −A{ δ_{ij}(1 − δ_{kl}) + δ_{kl}(1 − δ_{ij}) } − B a_{ij} b_{kl},    (8)

θ_{ij} = 2A.    (9)
Fig. 2. Solvable performance (%) of chaotic noise (a = 3.82, 3.92, 3.95) and white Gaussian noise on a QAP with 12 nodes (Nug12), plotted against the noise amplitude β
In order to introduce additive noise into the update equation of each neuron, we use the following equation,

x_{ik}(t+1) = f[ Σ_{j=1}^{N} Σ_{l=1}^{N} w_{ikjl} x_{jl}(t) + θ_{ik} + β z_{ik}(t) ],    (10)
where z_{ij}(t) is the noise sequence added to the (i,j)th neuron, β is the amplitude of the noise, and f is the sigmoidal output function f(y) = 1/(1 + exp(−y/ε)). The noise sequence introduced as z_{ij}(t) is normalized to zero mean and unit variance. In Figs. 1 and 2, we compare the solvable performances of the above neural networks with the logistic map chaos, z^c_{ij}(t+1) = a z^c_{ij}(t)(1 − z^c_{ij}(t)), and with white Gaussian noise as the additive noise sequence z_{ij}(t). For the chaotic noise, we used 3.82, 3.92, and 3.95 for the parameter a. The abscissa is the amplitude of the noise, β in Eq. (10). The solvable performance on the ordinate is defined as the percentage of successful runs obtaining the optimum solution among 1000 runs with different initial conditions; a successful run is one that hits the optimum solution state at least once within a fixed number of iterations. For the problems introduced in this paper, the exact optimum solution of each is known, obtained by an algorithm that guarantees exactness of the solution but requires a much larger amount of computational time. In this paper, the solvable performance of each type of noise is evaluated using a small, fixed computational time. We set the cutoff time longer for the QAP because it is more difficult than the TSP and includes it as a sub-problem. The cutoff times for each run are set to 1024 iterations for the TSP and 4096 iterations for the QAP, respectively. The parameters of the neural network are A = 1, B = 1, ε = 0.3 for the TSP, and A = 0.35, B = 0.2, ε = 0.075 for the QAP, respectively. The problems introduced in this paper are a 20-city TSP from [9] and a QAP with 12 nodes, Nug12, from QAPLIB [14].
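A minimal sketch of the noisy update of Eq. (10) with normalized logistic-map noise follows; the initial value z0 = 0.3 and the discarded transient are illustrative choices not specified in the text.

```python
import numpy as np

def logistic_noise(a, length, z0=0.3, discard=100):
    """Logistic-map sequence z(t+1) = a z(t)(1 - z(t)), normalized to zero
    mean and unit variance as described in the text.  z0 and the discarded
    transient length are illustrative assumptions."""
    z, seq = z0, []
    for t in range(length + discard):
        z = a * z * (1.0 - z)
        if t >= discard:
            seq.append(z)
    seq = np.asarray(seq)
    return (seq - seq.mean()) / seq.std()

def noisy_update(x, w, theta, z, beta, eps=0.3):
    """One synchronous update of Eq. (10) with additive noise z and the
    sigmoid f(y) = 1 / (1 + exp(-y / eps))."""
    y = np.einsum('ikjl,jl->ik', w, x) + theta + beta * z
    return 1.0 / (1.0 + np.exp(-y / eps))
```

In use, a separate logistic-map sequence would be generated for each neuron and one value fed into `noisy_update` per time step.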
Fig. 3. Solvable performance of the logistic map chaos z(t+1) = a z(t)(1 − z(t)) on a 20-city TSP, plotted against the logistic map parameter a; the top panel shows the map itself, and the lower panels show the solvable performance (%) for noise amplitudes β = 0.325, 0.375, and 0.425
The results in Figs. 1 and 2 show that the neural network with chaotic noise performs better than that with white Gaussian noise, comparing the best solvable performance of each noise. The results also show that the noise amplitude yielding the highest solvable performance differs among the noise sequences. This is not due to differences in the variances of the original noise sequences, because each sequence is normalized before being added to the neurons as z_{ik}(t), as described above. Figure 3 shows the solvable performance of the neural network with the logistic map chaos for different noise amplitudes β. It can be seen that the value of the parameter a yielding high performance changes depending on the noise amplitude. Since the temporal structure of the chaotic time series obviously changes with a, both the amplitude and the autocorrelation should be investigated at the same time to analyze the effects of the additive noise.
Fig. 4. Autocorrelation coefficients (lags τ = 0 to 10) of the chaotic noise (a = 3.82, 3.92, 3.95) that has high solvable performance, compared with white Gaussian noise
3 Negative Autocorrelation Noise
The autocorrelation coefficients of the chaotic sequences used for the results of Figs. 1 and 2 are shown in Fig. 4. The figure shows that the autocorrelation of the effective chaotic sequences takes a negative value at lag 1. In other research using chaos, chaotic CDMA utilizes similar sequences whose autocorrelation is C(τ) ≈ C × (−r)^τ [12]. Such an autocorrelation has been shown to lead to small cross-correlation among the sequences; therefore, chaotic codes with this autocorrelation are effective for reducing bit error rates caused by mutual interference. In this paper, we introduce stochastic noise with the autocorrelation C(τ) ≈ C × (−r)^τ to evaluate the effect of negative autocorrelation on the solvable performance of the neural networks, because it is similar to the effective autocorrelation function of the chaotic noise shown in Fig. 4 and makes it possible to isolate the effect of the negatively damped oscillation in the autocorrelation. We generate a stochastic sequence with autocorrelation C(τ) by the following procedure. First, the target autocorrelation sequence C(τ) is generated. By applying the FFT to C(τ), its power spectrum sequence is obtained. Only the phase of the spectrum is randomized, without changing the power spectrum. A stochastic sequence whose autocorrelation is C(τ) is then generated by applying the IFFT to the phase-randomized spectrum. When we apply the generated noise sequence to the neural network as z_{ik}(t) in Eq. (10), the sequence is normalized to zero mean and unit variance, as already described. We evaluate the solvable performance of this stochastic noise while changing the autocorrelation parameter r and the noise amplitude β. Figures 5 and 6 show the results on the TSP and the QAP solved by the neural networks with stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ.
The upper figures (1) show the solvable performance as both the noise amplitude β and the negative autocorrelation parameter r are changed, and the lower figures (2) show the same data viewed along the r axis.
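The FFT-based generation procedure described above can be sketched as follows; the symmetric circular extension of C(τ) and the Hermitian-symmetric phases (so that the inverse FFT is real) are implementation details the text leaves open, and are our own choices.

```python
import numpy as np

def neg_autocorr_noise(r, n, rng=None):
    """Noise whose (circular) autocorrelation is approximately
    C(tau) = (-r)^tau, generated by the FFT procedure described in the
    text.  Assumes |r| < 1 and even n."""
    rng = np.random.default_rng() if rng is None else rng
    # symmetric target autocorrelation over circular lags 0 .. n-1
    lags = np.minimum(np.arange(n), n - np.arange(n))
    c = (-r) ** lags
    # power spectrum of the target autocorrelation (real, non-negative)
    power = np.clip(np.fft.fft(c).real, 0.0, None)
    # random phases, Hermitian-symmetric so that the inverse FFT is real
    phase = rng.uniform(0.0, 2.0 * np.pi, n)
    phase[0] = 0.0
    phase[n // 2] = 0.0
    phase[n // 2 + 1:] = -phase[1:n // 2][::-1]
    z = np.fft.ifft(np.sqrt(power) * np.exp(1j * phase)).real
    # normalize to zero mean and unit variance before adding to the neurons
    return (z - z.mean()) / z.std()
```

For r > 0 the resulting sequence has a negative lag-1 autocorrelation close to −r, with the sign alternating at higher lags as in the damped oscillation of Fig. 4.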
Fig. 5. Solvable performance of a neural network with stochastic additive noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, on a 20-city TSP, as the noise amplitude β and the negative autocorrelation parameter r are changed
Figures 5 (1) and 6 (1) demonstrate that the amplitude β giving the best solvable performance varies depending on the negative autocorrelation parameter r: the best β increases as r increases. In conventional stochastic resonance, only the amplitude of the noise has been considered important, and temporal structure such as the autocorrelation has received little attention. The results in Figs. 5 (1) and 6 (1), however, suggest that the temporal structure of the additive noise is also important when analyzing its effect. Figures 5 (2) and 6 (2) clearly demonstrate that positive r (negative autocorrelation) induces higher performance. Noise with a negatively damped oscillating autocorrelation, corresponding to r > 0, performs better than the white Gaussian noise corresponding to r = 0 and the positive autocorrelation noise corresponding to r < 0. Comparing the best results of Figs. 5 and 6 with the results in Figs. 1 and 2, we can see that the stochastic noise with negative autocorrelation
Fig. 6. Solvable performance of a neural network with stochastic additive noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, on a QAP with 12 nodes (Nug12), as the noise amplitude β and the negative autocorrelation parameter r are changed
has almost the same high performance as the chaotic noise. As shown in Fig. 4, the chaotic noise also has a negative autocorrelation with a damped oscillation, similar to the introduced stochastic noise with C(τ) ≈ C × (−r)^τ. Based on these results, we conclude that the effectiveness of chaotic dynamics as additive sequences for the optimization neural network is due to its negative autocorrelation.
4 Conclusion
Chaotic noise has been shown to be more effective than white Gaussian noise in conventional chaotic-noise approaches to combinatorial optimization problems; we reproduced this result in Figs. 1 and 2. In a previous study, specific characteristics of the autocorrelation function of chaotic noise were shown to be effective by using the method of surrogate data [9].
To investigate the essential effect of the additive noise, we introduced stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, because the chaotic sequences effective for optimization have a similar autocorrelation with a negatively damped oscillation. Using this noise, we showed that the solvable performance depends not only on the noise amplitude but also on the negative autocorrelation parameter r, and that the best amplitude of the additive noise varies with the autocorrelation of the noise. We also showed that noise whose autocorrelation takes negative values with a damped oscillation is more effective than white Gaussian noise and positive autocorrelation noise. Moreover, such stochastic noise leads to almost the same high solvable performance as chaotic noise. From these results, we conclude that the essential effect of the specific autocorrelation of chaotic noise shown in Ref. [9] is this negative autocorrelation with damped oscillation. In chaotic CDMA, noise sequences whose autocorrelation takes negative values with a damped oscillation have been shown to be effective for minimizing the cross-correlation among the sequences. This paper showed that adding such a set of noise sequences, whose mutual cross-correlation is low, leads to high performance of the neural network. Clarifying how this cross-correlation property of the noise relates to the solvable performance is important future work. In this paper, we dealt with just one approach to chaotic optimization, based on mutually connected neural networks. This approach, however, performs poorly compared with heuristic approaches using chaos [3,6,5]. It is therefore also important future work to create a new chaotic algorithm within a heuristic approach, using the obtained results on the effective features of chaotic dynamics.
References
1. Nozawa, H.: A neural network model as a globally coupled map and applications based on chaos. Chaos 2, 377–386 (1992)
2. Chen, L., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6), 915–930 (1995)
3. Hasegawa, M., Ikeguchi, T., Aihara, K.: Combination of Chaotic Neurodynamics with the 2-opt Algorithm to Solve Traveling Salesman Problems. Physical Review Letters 79(12), 2344–2347 (1997)
4. Ishii, S., Niitsuma, H.: Lambda-opt neural approaches to quadratic assignment problems. Neural Computation 12(9), 2209–2225 (2000)
5. Hasegawa, M., Ikeguchi, T., Aihara, K., Itoh, K.: A Novel Chaotic Search for Quadratic Assignment Problems. European Journal of Operational Research 139, 543–556 (2002)
6. Hasegawa, M., Ikeguchi, T., Aihara, K.: Solving Large Scale Traveling Salesman Problems by Chaotic Neurodynamics. Neural Networks 15, 271–283 (2002)
7. Hasegawa, M., Ikeguchi, T., Matozaki, T., Aihara, K.: Solving Combinatorial Optimization Problems by Nonlinear Neural Dynamics. In: Proc. of IEEE International Conference on Neural Networks, pp. 3140–3145 (1995)
8. Hayakawa, Y., Marumoto, A., Sawada, Y.: Effects of the Chaotic Noise on the Performance of a Neural Network Model for Optimization Problems. Physical Review E 51(4), 2693–2696 (1995)
9. Hasegawa, M., Ikeguchi, T., Matozaki, T., Aihara, K.: An Analysis on Additive Effects of Nonlinear Dynamics for Combinatorial Optimization. IEICE Trans. Fundamentals E80-A(1), 206–212 (1997)
10. Umeno, K., Kitayama, K.: Spreading Sequences using Periodic Orbits of Chaos for CDMA. Electronics Letters 35(7), 545–546 (1999)
11. Mazzini, G., Rovatti, R., Setti, G.: Interference minimization by autocorrelation shaping in asynchronous DS-CDMA systems: chaos-based spreading is nearly optimal. Electronics Letters 35(13), 1054–1055 (1999)
12. Umeno, K., Yamaguchi, A.: Construction of Optimal Chaotic Spreading Sequence Using Lebesgue Spectrum Filter. IEICE Trans. Fundamentals E85-A(4), 849–852 (2002)
13. Hopfield, J., Tank, D.: Neural Computation of Decision in Optimization Problems. Biological Cybernetics 52, 141–152 (1985)
14. Burkard, R., Karish, S., Rendl, F.: QAPLIB – A Quadratic Assignment Problem Library. Journal of Global Optimization 10, 391–403 (1997), http://www.seas.upenn.edu/qaplib
Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks

Xiaolin Hu(1) and Jun Wang(2)

(1) Tsinghua National Lab of Information Science & Technology, State Key Lab of Intelligent Technology & Systems, and Department of Computer Science & Technology, Tsinghua University, Beijing 100084, China
[email protected]
(2) Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
[email protected]
Abstract. The k-winners-take-all (k-WTA) problem is to select the k largest inputs from a set of inputs in a network, an operation with many applications in machine learning. The Cournot-Nash equilibrium is an important problem in economic models. The two problems can be formulated as linear variational inequalities (LVIs). In this paper, a linear case of the general projection neural network (GPNN) is applied to solving the resulting LVIs, and consequently the two practical problems. Compared with existing recurrent neural networks capable of solving these problems, the designed GPNN is superior in its stability results and architecture complexity.
1 Introduction
Following the seminal work of Hopfield and Tank [1], numerous neural network models have been developed for solving optimization problems, from earlier proposals such as the neural network proposed by Kennedy and Chua [2] based on the penalty method and the switched-capacitor neural network by Rodríguez-Vázquez et al. [3] to the latest developments by Xia and Wang et al. [5,8]. In recent years, new research in this direction has focused on solving the variational inequality (VI), a problem closely related to optimization that itself has many applications in a variety of disciplines (see, e.g., [13]). For solving VIs, a recurrent neural network called the projection neural network has been developed (see [5,8] and references therein). An extension of this neural network, termed the general projection neural network (GPNN), is presented in [6]; it is primarily designed for solving general variational inequalities (GVIs) with bound constraints on the variables. Based on that work, it is found that the GPNN with proper formulations is capable of solving a linear case of the VI (for short, LVI) with linear constraints. In view of this observation, we apply the formulated GPNNs to solving two typical LVI problems, the k-WTA problem and the oligopoly Cournot-Nash equilibrium problem, described in Section 3 and Section 4, respectively.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 703–712, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Neural Network Model
Consider the following linear variational inequality (LVI): find x* ∈ Ω such that

(M x* + p)^T (x − x*) ≥ 0,  ∀x ∈ Ω,    (1)

where M ∈ R^{n×n}, p ∈ R^n, and

Ω = { x ∈ R^n | Ax ∈ Y, Bx = c, x ∈ X }    (2)

with A ∈ R^{m×n}, B ∈ R^{r×n}, c ∈ R^r. In the above, X and Y are two box sets defined as X = { x ∈ R^n | x̲ ≤ x ≤ x̄ } and Y = { y ∈ R^m | y̲ ≤ y ≤ ȳ }, where x̲, x̄ ∈ R^n and y̲, ȳ ∈ R^m. Note that any component of x̲, y̲ may be set to −∞ and any component of x̄, ȳ may be set to ∞. Without loss of generality, we assume y̲ < ȳ, since if y̲_i = ȳ_i for some i, the corresponding inequality constraint can be incorporated into Bx = c. Denote by Ω* the solution set of (1). Throughout the paper, it is assumed that Ω* ≠ ∅. Moreover, we assume Rank(B) = r in (2), which is always true for a well-posed problem. The above LVI is closely related to the following quadratic programming problem:

minimize (1/2) x^T M x + p^T x
subject to x ∈ Ω,    (3)

where the parameters are defined as in (1). It is well known (e.g., [13]) that if M is symmetric and positive semi-definite, the two problems are equivalent. If M = 0, the above problem degenerates to a linear programming problem. Since Rank(B) = r, without loss of generality, B can be partitioned as [B_I, B_II], where B_I ∈ R^{r×r}, B_II ∈ R^{r×(n−r)} and det(B_I) ≠ 0. Then Bx = c can be decomposed into

(B_I, B_II) (x_I; x_II) = c,

where x_I ∈ R^r, x_II ∈ R^{n−r}, which yields x_I = −B_I^{−1} B_II x_II + B_I^{−1} c and

x = Q x_II + q,    (4)

where

Q = [ −B_I^{−1} B_II ; I ] ∈ R^{n×(n−r)},  q = [ B_I^{−1} c ; 0 ] ∈ R^n.    (5)
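A small sketch of the variable elimination of Eqs. (4)–(5); the function name is ours, and we assume the columns of B are already ordered so that the leading block B_I is nonsingular.

```python
import numpy as np

def eliminate_equality(B, c):
    """Build Q and q of Eq. (5) so that x = Q u + q satisfies B x = c for
    every u.  Assumes the leading r x r block B_I of B is nonsingular."""
    r, n = B.shape
    BI, BII = B[:, :r], B[:, r:]
    BI_inv = np.linalg.inv(BI)
    Q = np.vstack([-BI_inv @ BII, np.eye(n - r)])     # n x (n - r)
    q = np.concatenate([BI_inv @ c, np.zeros(n - r)])
    return Q, q
```

By construction, B(Qu + q) = c for any u, which is exactly why the equality constraints disappear from the reduced LVI (6) below.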
Similarly, we have x* = Q x*_II + q. Substituting x and x* into (1) gives a new LVI. Let

u = x_II,  M̄ = Q^T M Q,  p̄ = Q^T M q + Q^T p,  Ā = [ A Q ; −B_I^{−1} B_II ],
V = { v ∈ R^{m+r} | ( y̲ − Aq ; x̲_I − B_I^{−1} c ) ≤ v ≤ ( ȳ − Aq ; x̄_I − B_I^{−1} c ) },
U = { u ∈ R^{n−r} | x̲_II ≤ u ≤ x̄_II },

where x̲_I = (x̲_1, ⋯, x̲_r)^T, x̲_II = (x̲_{r+1}, ⋯, x̲_n)^T, x̄_I = (x̄_1, ⋯, x̄_r)^T, x̄_II = (x̄_{r+1}, ⋯, x̄_n)^T. Then the new LVI can be rewritten in the following compact form:

(M̄ u* + p̄)^T (u − u*) ≥ 0,  ∀u ∈ Ω̄,    (6)

where

Ω̄ = { u ∈ R^{n−r} | Ā u ∈ V, u ∈ U }.

Compared with the original LVI (1), the new formulation (6) has fewer variables, while the equality constraints are absorbed. By following techniques similar to those in [10], it is not difficult to obtain the following theorem.

Theorem 1. u* ∈ R^{n−r} is a solution of (6) if and only if there exists v* ∈ R^{m+r} such that w* = (u*^T, v*^T)^T satisfies Ñ w* ∈ W and

(M̃ w* + p̃)^T (w − Ñ w*) ≥ 0,  ∀w ∈ W,    (7)

where

M̃ = [ M̄, −Ā^T ; O, I ],  Ñ = [ I, O ; Ā, O ],  p̃ = ( p̄ ; 0 ),  W = U × V.
(8a) (8b)
where λ > 0 is a scalar, Q, q are defined in (5) and PW (·) is a standard projection operator [5, 8]. The architecture of the neural network can be drawn in a similar fashion in [6], and is omitted here for brevity. ¯ ≥ 0, M ˜ +N ˜ is nonsingular, and M ˜ TN ˜ is positive semiClearly, when M definite. Then we have the following theorem which follows from [6, Corollary4] directly. Theorem 2. Consider neural network (8) for solving LVI (6) with Ω defined in (2). ¯ ≥ 0, then the state of the neural network is stable in the sense of (i) If M Lyapunov and globally convergent to an equilibrium, which corresponds to an exact solution of the LVI by the output equation. ¯ > 0, then the output x(t) of the neural network is globally (ii) Furthermore, if M asymptotically stable at the unique solution x∗ of the LVI. Other neural networks existing in the literature can be also adopted to solve (6). In Table 1 we compare the proposed GPNN with several salient neural network models in terms of the stability conditions and the number of neurons. It is seen
Table 1. Comparison with Several Salient Neural Networks for Solving LVI (6)

             | Refs. [7]   | Ref. [4]          | Ref. [11]         | Present paper
Conditions   | M̄ > 0      | M̄ ≥ 0, M̄^T = M̄ | M̄ > 0, M̄^T = M̄ | M̄ ≥ 0
Neurons      | n + 2m + r  | n + 2m + r        | n + m             | n + m
that the GPNN requires the weakest conditions and fewest neurons. Note that fewer neurons imply fewer connective weights in the network, and consequently lower structural complexity of the network.
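For illustration, one step of a simple explicit-Euler discretization of the GPNN dynamics (8a) can be sketched as follows; we assume W is a box so that P_W reduces to a componentwise clip, and the step size and the Euler scheme are our own choices, not part of the paper's continuous-time model.

```python
import numpy as np

def gpnn_step(w, Mt, Nt, pt, lo, hi, lam=1.0, dt=1e-3):
    """One explicit-Euler step of the GPNN state equation (8a),
    dw/dt = lam (Mt + Nt)^T { -Nt w + P_W((Nt - Mt) w - pt) },
    where P_W clips componentwise to the box [lo, hi]."""
    proj = np.clip((Nt - Mt) @ w - pt, lo, hi)
    return w + dt * lam * (Mt + Nt).T @ (-Nt @ w + proj)
```

A state w with −Ñw + P_W((Ñ − M̃)w − p̃) = 0 is an equilibrium, and the step leaves it unchanged.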
3 k-Winners-Take-All Network

Consider the following k-winners-take-all (k-WTA) problem:

x_i = f(σ_i) = { 1, if σ_i ∈ {k largest elements of σ},
                 0, otherwise,
where σ ∈ R^n stands for the input vector and x ∈ {0,1}^n stands for the output vector. The k-WTA operation accomplishes the task of selecting the k largest inputs from n inputs in a network. It has been shown to be a computationally powerful operation compared with standard neural network models with threshold logic gates [14]. In addition, it has important applications in machine learning as well, such as k-neighborhood classification and k-means clustering. Many attempts have been made to design neural networks to perform the k-WTA operation [15,16,17]. Recently, the problem was formulated as a quadratic programming problem and a simplified dual neural network with n neurons was developed to perform k-WTA [11]. The limitation of that approach lies in its finite resolution: when the k-th largest input σ_k is equal to the (k+1)-th largest input, the network cannot be applied. The k-WTA problem is equivalent to the following integer programming problem:

minimize −σ^T x
subject to Bx = k, x ∈ {0,1}^n,    (9)

where B = (1, ⋯, 1). Supposing that σ_k ≠ σ_{k+1}, by arguments similar to those given in [11], the k-WTA problem can be shown to be equivalent to the following linear programming problem:

minimize −σ^T x
subject to Bx = k, x ∈ [0,1]^n.    (10)

Then neural network (8) with M = 0, p = −σ, c = k, x̲ = 0, x̄ = 1 can be used to solve the problem. Explicitly, the neural network can be written in the following component form:
– State equation:

du_i/dt = λ{ −u_i + Σ_{j=1}^{n−1} u_j + P_{U_i}( u_i − v − σ_1 + σ_{i+1} ) + P_V( −Σ_{j=1}^{n−1} u_j − v ) },  ∀i = 1, ⋯, n−1,

dv/dt = λ{ 2 Σ_{j=1}^{n−1} u_j − Σ_{j=1}^{n−1} P_{U_j}( u_j − v − σ_1 + σ_{j+1} ) + P_V( −Σ_{j=1}^{n−1} u_j − v ) },

– Output equation:

x_1 = −Σ_{j=1}^{n−1} u_j + k,
x_{i+1} = u_i,  ∀i = 1, ⋯, n−1,
where U = { u ∈ R^{n−1} | 0 ≤ u ≤ 1 } and V = { v ∈ R | −k ≤ v ≤ 1−k }. It is seen that, in contrast to [11], no parameter needs to be chosen. Next, we consider another scenario. Suppose that there are in total s identical inputs and that some of them, but not all, should be selected as "winners": l inputs among them are to be selected, where l < s. Then the proposed neural network will definitely output k − l ones and n − (k − l) − s zeros, which correspond to σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_{k−l} and σ_{s−l+k+1} ≥ ⋯ ≥ σ_n. The other outputs, which correspond to the s identical inputs σ_{k−l+1} = ⋯ = σ_{s−l+k}, might be neither ones nor zeros, but the entire output x* still attains the minimum of (10). Denoting the minimum value of the objective function by p*, we have
p* = −Σ_{i=1}^{k−l} σ_i − σ̄ Σ_{i=k−l+1}^{s−l+k} x*_i,

where σ̄ = σ_{k−l+1} = ⋯ = σ_{s−l+k}. From the constraints, we have

Σ_{i=k−l+1}^{s−l+k} x*_i = l,  x*_i ∈ [0, 1].
Therefore, the optimum value of the objective function does not change when the x*_i, i = k−l+1, ⋯, s−l+k, vary in [0,1], as long as their sum equals l. Clearly, we can let any l of these outputs equal 1 and the rest equal 0. Under this rule, the output x* will be an optimum of (10) and, consequently, an optimum of (9). So, for this particular application, a rectification module is needed, placed between the output layer of the GPNN and the desired output layer, as shown in Fig. 1.

Example 1. In the k-WTA problem, let k = 2, n = 4 and the inputs σ_i = 10 sin[2π(t + 0.2(i − 2))], ∀i = 2, 3, 4 and σ_1 = σ_2, where t increases from 0 to
1 continuously, which leads to four sinusoidal input signals σ_i for i = 1, ⋯, 4 with σ_1 = σ_2 (see the top panel in Fig. 2). We use the rectified GPNN to accomplish the k-WTA operation. The other four panels in Fig. 2 record the final outputs of the proposed network at each time instant t. Note that when t is approximately within [0.05, 0.15] and [0.55, 0.65], either σ_1 or σ_2 may be selected as a winner, and the figure shows just one possible selection.
Fig. 1. Diagram of the rectified GPNN for k-WTA operation in Example 1: the inputs σ_1, ⋯, σ_n feed a general projection neural network whose outputs pass through an output rectification stage to produce the final outputs x_1, ⋯, x_n
Fig. 2. Inputs σ_1, ⋯, σ_4 (top panel) and outputs x_1, ⋯, x_4 of the rectified GPNN for k-WTA operation in Example 1, over t ∈ [0, 1]
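The LP equivalence underlying (10) can be checked by brute force: an optimum of the LP is attained at a vertex of the feasible set, and the vertices of { x | Bx = k, 0 ≤ x ≤ 1 } are exactly the 0-1 vectors with k ones. The helper below is our own sanity check, not the paper's network.

```python
import itertools
import numpy as np

def kwta_by_lp_vertices(sigma, k):
    """Brute-force check of LP (10): enumerate the 0-1 vectors with exactly
    k ones (the vertices of the feasible set) and return the one that
    minimizes -sigma^T x."""
    n = len(sigma)
    best, best_val = None, np.inf
    for ones in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(ones)] = 1.0
        val = -float(np.dot(sigma, x))
        if val < best_val:
            best, best_val = x, val
    return best

sigma = np.array([3.0, 9.0, 1.5, 7.0])
x = kwta_by_lp_vertices(sigma, 2)   # selects the two largest inputs, 9.0 and 7.0
```

For distinct σ_k and σ_{k+1} this always returns the indicator of the k largest inputs, which is exactly the k-WTA output the GPNN is designed to produce.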
4 Oligopoly Cournot-Nash Equilibrium
In this section, we apply the GPNN to solving a spatial oligopoly model in economics formulated as a variational inequality [18, pp. 94-97]. Assume that a homogeneous commodity is produced by m firms and demanded by n markets that are generally spatially separated. Let Q_ij denote the commodity shipment from firm i to demand market j; q_i denote the commodity produced by firm i, satisfying $q_i = \sum_{j=1}^{n} Q_{ij}$; d_j denote the demand for the commodity at market j, satisfying $d_j = \sum_{i=1}^{m} Q_{ij}$; q denote a column vector with entries q_i, i = 1, ..., m; d denote a column vector with entries d_j, j = 1, ..., n; f_i denote the production cost associated with firm i, which is a function of the entire production pattern, i.e., f_i = f_i(q); p_j denote the demand price associated with market j, which is a function of the entire consumption pattern, i.e., p_j = p_j(d); c_ij denote the transaction cost associated with trading (shipping) the commodity from firm i to market j, which is a function of the entire shipment pattern, i.e., c_ij = c_ij(Q); and u_i denote the profit or utility of firm i, given by

$$u_i(Q) = \sum_{j=1}^{n} p_j Q_{ij} - f_i - \sum_{j=1}^{n} c_{ij} Q_{ij}.$$
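As a quick numerical check of this profit expression, the utility of each firm can be computed directly from a shipment matrix. This is a toy sketch with made-up prices and costs (here taken as fixed numbers for illustration), not the data of Example 2:

```python
def utility(i, Q, p, f, c):
    """Profit of firm i: sum_j p_j*Q_ij - f_i - sum_j c_ij*Q_ij."""
    n = len(Q[i])
    revenue = sum(p[j] * Q[i][j] for j in range(n))
    transaction = sum(c[i][j] * Q[i][j] for j in range(n))
    return revenue - f[i] - transaction

Q = [[1.0, 2.0], [0.5, 0.0]]   # shipments: 2 firms x 2 markets
p = [4.0, 3.0]                 # demand prices
f = [2.0, 1.0]                 # production costs
c = [[1.0, 1.0], [2.0, 1.0]]   # transaction costs
print(utility(0, Q, p, f, c))  # 4*1 + 3*2 - 2 - (1*1 + 1*2) = 5.0
```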
Clearly, for a well-defined model, Q_ij must be nonnegative. In addition, other constraints can be added to make the model more practical. For instance, a constrained set can be defined as follows:

$$S = \{Q_{ij} \in \mathbb{R}^{m \times n} \mid f_i \ge 0,\; \underline{p}_j \le p_j \le \bar{p}_j,\; c_{ij} \ge 0,\; 0 \le Q_{ij} \le \bar{Q}_{ij},\; \forall i = 1, \dots, m;\; j = 1, \dots, n\},$$

where $\underline{p}_j$ and $\bar{p}_j$ are respectively the lower and upper bounds of the price p_j, and $\bar{Q}_{ij}$ is the capacity associated with the shipment Q_ij.

An important issue in oligopoly problems is to determine the so-called Cournot-Nash equilibrium [18]. According to Theorem 5.1 in [18, p. 97], assuming that for each firm i the profit function u_i(Q) is concave with respect to the variables {Q_i1, ..., Q_in} and continuously differentiable, Q* is a Cournot-Nash equilibrium if and only if it satisfies the following VI:
$$-\sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial u_i(Q^*)}{\partial Q_{ij}}\,(Q_{ij} - Q^*_{ij}) \ge 0 \quad \forall Q \in S. \qquad (11)$$
Then, in the linear case, this equilibrium problem can be solved by using the GPNN designed in Section 2.

Example 2. In the above Cournot-Nash equilibrium problem, let m = 3, n = 5 and define the cost functions

$$f_1(q) = 2q_1^2 + 2q_2 q_3 + q_2 + 4q_3 + 1$$
$$f_2(q) = 3q_2^2 + 2q_3^2 + 2q_2 q_3 + q_1 + 2q_2 + 5q_3 + 3$$
$$f_3(q) = 2q_1^2 + q_3^2 + 6q_2 q_3 + 2q_1 + 2q_2 + q_3 + 5$$
X. Hu and J. Wang
and demand price functions

$$p_1(d) = -3d_1 - d_2 + 5$$
$$p_2(d) = -2d_2 + d_3 + 8$$
$$p_3(d) = -d_3 + 6$$
$$p_4(d) = d_2 - d_3 - 2d_4 + 6$$
$$p_5(d) = -d_3 - d_5 + 3.$$

For simplicity, let the transaction costs c_ij be constants; i.e.,

$$(c_{ij}) = \begin{pmatrix} 5 & 2 & 3 & 2 & 9 \\ 4 & 8 & 2 & 1 & 5 \\ 6 & 5 & 4 & 1 & 4 \end{pmatrix}.$$

In the constrained set S, let $\underline{p}_j = 1$, $\bar{p}_j = 5$, and $\bar{Q}_{ij} = 1$. By these definitions, the constraints f_i ≥ 0 and c_ij ≥ 0 in S are automatically satisfied. By splitting Q into a column vector $x \in \mathbb{R}^{15}$, we can write (11) in the form of the LVI (1) with
$$M = \begin{bmatrix}
8 & 2 & 3 & 2 & 2 & 3 & 0 & 1 & 0 & 0 & 3 & 0 & 1 & 0 & 0 \\
2 & 6 & 1 & 1 & 2 & 0 & 2 & -1 & 0 & 0 & 0 & 2 & -1 & 0 & 0 \\
3 & 1 & 4 & 3 & 3 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
2 & 1 & 3 & 6 & 2 & 0 & -1 & 1 & 2 & 0 & 0 & -1 & 1 & 2 & 0 \\
2 & 2 & 3 & 2 & 4 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
3 & 0 & 1 & 0 & 0 & 9 & 3 & 4 & 3 & 3 & 4 & 1 & 2 & 1 & 1 \\
0 & 2 & -1 & 0 & 0 & 3 & 7 & 2 & 2 & 3 & 1 & 3 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 0 & 4 & 2 & 5 & 4 & 4 & 1 & 1 & 2 & 1 & 1 \\
0 & -1 & 1 & 2 & 0 & 3 & 2 & 4 & 7 & 3 & 1 & 0 & 2 & 3 & 1 \\
0 & 0 & 1 & 0 & 1 & 3 & 3 & 4 & 3 & 5 & 1 & 1 & 2 & 1 & 2 \\
3 & 0 & 1 & 0 & 0 & 6 & 3 & 4 & 3 & 3 & 7 & 1 & 2 & 1 & 1 \\
0 & 2 & -1 & 0 & 0 & 3 & 5 & 2 & 3 & 3 & 1 & 5 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 3 & 3 & 4 & 3 & 3 & 2 & 0 & 3 & 2 & 2 \\
0 & -1 & 1 & 2 & 0 & 3 & 2 & 4 & 5 & 3 & 1 & 0 & 2 & 5 & 1 \\
0 & 0 & 1 & 0 & 1 & 3 & 3 & 4 & 3 & 4 & 1 & 1 & 2 & 1 & 3
\end{bmatrix},$$

$$p = [0, -6, -3, -4, 6, 1, 2, -2, -3, 4, 2, -2, -1, -4, 2]^T, \quad \underline{x} = [0, \dots, 0]^T, \quad \bar{x} = [1, \dots, 1]^T,$$

$$A = \begin{bmatrix}
-3 & 0 & -1 & 0 & 0 & -3 & 0 & -1 & 0 & 0 & -3 & 0 & -1 & 0 & 0 \\
0 & -2 & 1 & 0 & 0 & 0 & -2 & 1 & 0 & 0 & 0 & -2 & 1 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & -2 & 0 & 0 & 1 & -1 & -2 & 0 & 0 & 1 & -1 & -2 & 0 \\
0 & 0 & -1 & 0 & -1 & 0 & 0 & -1 & 0 & -1 & 0 & 0 & -1 & 0 & -1
\end{bmatrix}, \quad \underline{y} = \begin{bmatrix} -4 \\ -7 \\ -5 \\ -5 \\ -2 \end{bmatrix}, \quad \bar{y} = \begin{bmatrix} 0 \\ -3 \\ -1 \\ -1 \\ 2 \end{bmatrix},$$
and B = 0, c = 0. For this LVI there are no equality constraints; consequently, there is no need to "eliminate" the equality constraints by reformulating the problem into another LVI as in (6), and a GPNN similar to (8) can be designed to solve the LVI directly. It can be checked that M > 0; the designed GPNN is therefore globally asymptotically stable. All simulation results show that with
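The GPNN model (8) itself is defined earlier in the paper and is not reproduced in this excerpt. The fixed-point idea behind projection networks can, however, be illustrated with the classical projection iteration x ← P_Ω(x − α(Mx + p)) for an LVI over a box Ω, which converges to the solution for positive definite M and a small step α. The data below are a made-up 2-D toy problem, not Example 2:

```python
def solve_box_lvi(M, p, lo, hi, alpha=0.1, iters=2000):
    """Projection iteration x <- P_Omega(x - alpha*(M x + p)) on the box
    Omega = [lo, hi]^n; a discrete-time sketch of projection dynamics."""
    n = len(p)
    x = [0.0] * n
    for _ in range(iters):
        Fx = [sum(M[i][j] * x[j] for j in range(n)) + p[i] for i in range(n)]
        x = [min(max(x[i] - alpha * Fx[i], lo), hi) for i in range(n)]
    return x

# M positive definite, unconstrained solution (0.5, 1.5); the box clips x2 to 1.
x = solve_box_lvi([[2.0, 0.0], [0.0, 2.0]], [-1.0, -3.0], 0.0, 1.0)
print([round(v, 3) for v in x])  # → [0.5, 1.0]
```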
any positive λ and any initial state the GPNN always converges to the unique solution

$$Q^* = \begin{pmatrix} 0.000 & 1.000 & 0.919 & 0.058 & 0.000 \\ -0.000 & 0.000 & 0.081 & 0.221 & 0.000 \\ 0.000 & 1.000 & 0.000 & 0.721 & 0.000 \end{pmatrix}.$$

Fig. 3 depicts one simulation result of the neural network states. Clearly, all trajectories reach steady states after a period of time. It is seen that with this numerical setting, in the Cournot-Nash equilibrium state, no firm ships any commodity to market 1 or market 5.
Fig. 3. State trajectories of the GPNN with a random initial point in Example 2
5 Concluding Remarks
In this paper, a general projection neural network (GPNN) is designed that is capable of solving a general class of linear variational inequalities (LVIs) with linear equality and two-sided linear inequality constraints. The GPNN is then applied to two problems formulated as LVIs: the k-winners-take-all problem and the oligopoly Cournot-Nash equilibrium problem. The designed GPNN is of lower structural complexity than most existing ones. Numerical examples are discussed to illustrate the good applicability and performance of the proposed method.
Acknowledgments. The work was supported by the National Natural Science Foundation of China under Grant 60621062 and by the National Key Foundation R&D Project under Grants 2003CB317007, 2004CB318108 and 2007CB311003.
References

1. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233, 625–633 (1986)
2. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. Circuits Syst. 35, 554–562 (1988)
3. Rodríguez-Vázquez, A., Domínguez-Castro, R., Rueda, A., Huertas, J.L., Sánchez-Sinencio, E.: Nonlinear Switched-Capacitor Neural Networks for Optimization Problems. IEEE Trans. Circuits Syst. 37, 384–397 (1990)
4. Tao, Q., Cao, J., Xue, M., Qiao, H.: A High Performance Neural Network for Solving Nonlinear Programming Problems with Hybrid Constraints. Physics Letters A 288, 88–94 (2001)
5. Xia, Y., Wang, J.: A Projection Neural Network and Its Application to Constrained Optimization Problems. IEEE Trans. Circuits Syst. I 49, 447–458 (2002)
6. Xia, Y., Wang, J.: A General Projection Neural Network for Solving Monotone Variational Inequalities and Related Optimization Problems. IEEE Trans. Neural Netw. 15, 318–328 (2004)
7. Xia, Y.: On Convergence Conditions of an Extended Projection Neural Network. Neural Computation 17, 515–525 (2005)
8. Hu, X., Wang, J.: Solving Pseudomonotone Variational Inequalities and Pseudoconvex Optimization Problems Using the Projection Neural Network. IEEE Trans. Neural Netw. 17, 1487–1499 (2006)
9. Hu, X., Wang, J.: A Recurrent Neural Network for Solving a Class of General Variational Inequalities. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 37, 528–539 (2007)
10. Hu, X., Wang, J.: Solving Generally Constrained Generalized Linear Variational Inequalities Using the General Projection Neural Networks. IEEE Trans. Neural Netw. 18, 1697–1708 (2007)
11. Liu, S., Wang, J.: A Simplified Dual Neural Network for Quadratic Programming with Its KWTA Application. IEEE Trans. Neural Netw. 17, 1500–1510 (2006)
12. Pang, J.S., Yao, J.C.: On a Generalization of a Normal Map and Equations. SIAM J. Control Optim. 33, 168–184 (1995)
13. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. I and II. Springer, New York (2003)
14. Maass, W.: On the Computational Power of Winner-Take-All. Neural Comput. 12, 2519–2535 (2000)
15. Wolfe, W.J.: K-Winner Networks. IEEE Trans. Neural Netw. 2, 310–315 (1991)
16. Calvert, B.A., Marinov, C.: Another k-Winners-Take-All Analog Neural Network. IEEE Trans. Neural Netw. 11, 829–838 (2000)
17. Marinov, C.A., Hopfield, J.J.: Stable Computational Dynamics for a Class of Circuits with O(N) Interconnections Capable of KWTA and Rank Extractions. IEEE Trans. Circuits Syst. I 52, 949–959 (2005)
18. Nagurney, A., Zhang, D.: Projected Dynamical Systems and Variational Inequalities with Applications. Kluwer, Boston (1996)
Optimization of Parametric Companding Function for an Efficient Coding

Shin-ichi Maeda1 and Shin Ishii2

1 Nara Institute of Science and Technology, 630-0192 Nara, Japan
2 Kyoto University, 611-0011 Kyoto, Japan
Abstract. Designing a lossy source code remains one of the important topics in information theory, and has a lot of applications. Although plain vector quantization (VQ) can realize any fixed-length lossy source coding, it has a serious drawback in the computation cost. Companding vector quantization (CVQ) reduces the complexity by replacing vector quantization with a set of scalar quantizations. It can represent a wide class of practical VQs, while the structure in CVQ restricts it from representing every lossy source coding. In this article, we propose an optimization method for parametrized CVQ by utilizing a newly derived distortion formula. To test its validity, we applied the method especially to transform coding. We found that our trained CVQ outperforms Karhunen-Loève transformation (KLT)-based coding not only in the case of linear mixtures of uniform sources, but also in the case of low bit-rate coding of a Gaussian source.
1 Introduction
The objective of lossy source coding is to reduce the coding length while the distortion between the original data and the reproduction data remains constant. It is indispensable for compressing various kinds of sources such as images and audio signals to save memory, and may serve as a useful tool for extracting features from high-dimensional inputs, as the well-known Karhunen-Loève transformation (KLT) does. Plain vector quantization (VQ) has been known to be able to realize any fixed-length lossy source coding, implying that VQ can be optimal among fixed-length lossy source codes. However, it has a serious drawback in the computation cost (encoding, memory amount, and optimization of the coder), which increases exponentially as the input dimensionality increases. Companding vector quantization (CVQ) has the potential to reduce the complexity inherent in non-structured VQ by replacing vector quantization with a set of scalar quantizations, and can represent a wide class of practically useful VQs such as transform coding, shape-gain VQ, and tree-structured VQ. On the other hand, the structure in CVQ restricts it from representing every lossy source coding. So far, optimal CVQ has been calculated analytically only for a very limited class. KLT is one of them: it is optimal for encoding Gaussian sources among the class of transform coding in the limit of high rate [1] [2]. Although an analytical derivation of optimal CVQ has been attempted, numerical optimization to obtain suboptimal CVQ has hardly been done except for the case that CVQ has a limited structure. In this article, we propose a general optimization method for the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 713–722, 2008.
© Springer-Verlag Berlin Heidelberg 2008
parametrized CVQ, which includes a bit allocation algorithm based on a newly derived distortion formula (a generalization of Bennett's formula), and we test its validity through numerical simulations.
2 Companding Vector Quantization

2.1 Architecture

The architecture of our CVQ is shown in Fig. 1.
Fig. 1. Architecture of CVQ
First, an n-dimensional original datum $x = [x_1, \dots, x_n]^T \in \mathbb{R}^n$ is transformed into an m-dimensional feature vector $y = [y_1, \dots, y_m]^T \in [0, 1]^m$ as y = ψ(x) by compressor ψ; T denotes a transpose. Then, quantizer Γ quantizes feature vector y into $\hat{y} = [\hat{y}_1, \dots, \hat{y}_m]^T = \Gamma(y)$. Quantizer Γ consists of m uniform scalar quantizers Γ_1, ..., Γ_m, each of which individually quantizes the corresponding element of y as ŷ_i = Γ_i(y_i) with quantization level a_i, as shown in Fig. 2. The set of quantization levels is represented by the quantization vector $a = [a_1, \dots, a_m]^T$. Finally, expander φ transforms the quantized feature vector ŷ into the reproduction vector $\hat{x} = [\hat{x}_1, \dots, \hat{x}_n]^T \in \{\mu_k \in \mathbb{R}^n \mid k = 1, \dots, M\}$. The number of reproduction vectors M is related to the quantization vector a as $M = \prod_{i=1}^{m} a_i$. The reproduction vector x̂ is hence obtained by expander φ as x̂ = φ(ŷ). The compressor and expander are parameterized as ψ(x; θ_α) and φ(ŷ; θ_β), respectively, and are assumed to be differentiable with respect to their parameters θ_α and θ_β and their inputs x and ŷ. A pair of a compressor and an expander is called a compander. As a whole, the reproduction vector x̂ is calculated as x̂ := ρ(x) = φ(Γ(ψ(x))). We also write ρ(x; a, θ_α, θ_β) for ρ(x) when we emphasize the dependence of the coder ρ(x) on the parameters θ_α and θ_β and the quantization vector a.
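The pipeline x → ψ → Γ → φ → x̂ can be sketched as follows. The logistic compressor/expander pair here is a made-up stand-in for the parametrized compander introduced later (Section 3), and Γ_i is a uniform quantizer on [0, 1] that outputs cell midpoints:

```python
import math

def gamma_i(y, a):
    """Uniform scalar quantizer on [0,1] with a levels: return the midpoint
    of the cell [l/a, (l+1)/a] that contains y."""
    l = min(int(y * a), a - 1)         # cell index 0 .. a-1
    return (l + 0.5) / a

def psi(x):                            # compressor: componentwise logistic (made-up)
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def phi(yhat):                         # expander: inverse logistic
    return [math.log(v / (1.0 - v)) for v in yhat]

def rho(x, a):                         # whole coder: x_hat = phi(Gamma(psi(x)))
    return phi([gamma_i(y, ai) for y, ai in zip(psi(x), a)])

out = rho([0.4, -1.2], a=[8, 8])
print([round(v, 3) for v in out])      # each component close to the input
```

With a_i = 8 levels per component, the reconstruction already lands near the original datum, and the error shrinks as the a_i grow.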
2.2 Optimization

Since we consider fixed-rate coding, the optimization problem is given by

$$\min_{\theta_\alpha, \theta_\beta, a} E[d(x, \hat{x})] = \min_{\theta_\alpha, \theta_\beta, a} \int p(x)\, d(x, \rho(x; \theta_\alpha, \theta_\beta, a))\, dx \qquad (1)$$
$$\text{s.t. } M \le \bar{M}, \quad \forall i \in \{1, \dots, m\},\ a_i \in \mathbb{N},$$
Fig. 2. Quantizer ŷ_i = Γ_i(y_i)
where M̄ is the maximum permissible number of reproduction vectors, E[·] denotes an expectation with respect to the original data distribution (source) p(x), and d(x, x̂) denotes a distortion between an original datum and a reproduction datum. Although we are provided with N independent and identically distributed samples {x^(1), ..., x^(N)} from source p(x), the source p(x) itself is usually unknown in practice, and so the expected distortion in equation (1) is approximated by the sample mean:

$$E[d(x, \rho(x; \theta_\alpha, \theta_\beta, a))] \approx D_N(\theta_\alpha, \theta_\beta, a) \equiv \frac{1}{N} \sum_{k=1}^{N} d\big(x^{(k)}, \rho(x^{(k)}; \theta_\alpha, \theta_\beta, a)\big). \qquad (2)$$
Since the expected distortion can be estimated correctly by using enough samples, we try to minimize the average distortion (2). This optimization consists of two parts: one is the optimization of the companding parameters θ_α and θ_β, and the other is the optimization of the quantization vector a, i.e., bit allocation. First, we explain the optimization of the companding parameters θ_α and θ_β. Here, we present an iterative optimization as follows.

1. θ_β is optimized as
$$\min_{\theta_\beta} \sum_{i=1}^{N} d\big(x^{(i)}, \phi(\hat{y}^{(i)}; \theta_\beta)\big), \qquad (3)$$
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}))$, while θ_α is fixed.

2. θ_α is optimized as
$$\min_{\theta_\alpha} \sum_{i=1}^{N} d\big(x^{(i)}, \phi(\hat{y}^{(i)})\big), \qquad (4)$$
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}; \theta_\alpha))$, while θ_β is fixed.
The two optimization steps above correspond to the well-recognized conditions that optimal VQ should satisfy, centroid and nearest neighbor, because the parameters θ_β and θ_α determine the reproduction vectors $\mu^{(k)} = \phi(\nu^{(k)}; \theta_\beta)$ and the partition regions $S^{(k)} = \{x \mid \nu^{(k)} = \Gamma(\psi(x; \theta_\alpha))\}$, respectively, and the centroid and nearest neighbor conditions are the conditions that optimal reproduction vectors and an optimal partition should satisfy, respectively. Optimization for θ_β in equation (3) is performed by conventional gradient-based optimization methods such as steepest descent or Newton's method. On the other hand, the cost function in equation (4) is not differentiable due to the discontinuous function Γ; therefore, we replace the discontinuous function Γ with a continuous one, e.g.,

$$\Gamma_i(y_i) \approx y_i. \qquad (5)$$

The dotted line in Fig. 2 depicts this approximation, which becomes exact when the quantization level a_i becomes sufficiently high. With this approximation, the optimization problem given by equation (4) becomes

$$\min_{\theta_\alpha} \sum_{i=1}^{N} d\big(x^{(i)}, \phi(\psi(x^{(i)}; \theta_\alpha))\big), \qquad (6)$$
which can be solved by a gradient-based method. In particular, if the expander φ is invertible, optimality is always realized when θ_α forces the compressor ψ to be the inverse of the expander φ, ψ = φ^{-1}, because the distortion takes its minimum value d(x, φ(ψ(x; θ_α))) = d(x, x) = 0 only in that condition. In this case, we need no parameter search for θ_α and hence can avoid the local minima problem. Although this simple optimization scheme works well as expected, we have found that it sometimes fails to minimize the cost function (4). To improve the optimization for θ_α, therefore, we estimate a local curvature of the cost function given by (4), and employ a line search along the estimated steepest gradient.

Next, we present our bit allocation scheme. Here, we consider the special case in which the distortion measure is the mean squared error (MSE), $d(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$, and construct an efficient search method for a based on an
estimated distortion:

$$E[d(x, \hat{x})] \approx \frac{1}{n} \sum_{j=1}^{m} \frac{1}{12 a_j^2} E[G_j(y)], \qquad (7)$$

where $G_j(y) = \sum_{i=1}^{n} \left( \frac{\partial \phi_i(y)}{\partial y_j} \right)^2$. The derivation of equation (7) is based on Taylor
expansion, but is omitted due to the space limitation. This estimation becomes accurate if the companded source pdf p(y) is well approximated as a uniform distribution and the expander φ(y) is well approximated as a linear function within each small m-dimensional hyper-rectangle $C^{(k)} \equiv L_1^{(k)} \times \cdots \times L_m^{(k)}$ that includes ν^(k), where the interval in each dimension is given by $L_i^{(k)} \equiv \left[ \frac{l_i^{(k)} - 1}{a_i}, \frac{l_i^{(k)}}{a_i} \right]$ and
717
(k)
li ∈ {1, · · · , ai } for each i = 1, · · · , m. These conditions are both satisfied in the case of an asymptotic high rate. Note that expectation with respect to feature vector y in equation (7) can be replaced practically with sample average. Since equation (7) is identical to Bennett’s original integral [3] if the dimensionalities of original datum x and feature vector y are both one (i.e., a scalar companding case), equation (7) is a multidimensional version of the Bennett’s integral. The estimated distortion is useful for improving quantization vector a in our CVQ, as shown below. Using estimated distortion (7), the optimization problem for non-negative real-valued a is given by
min
m E [Gi (y)]
a
a2i
i=1
s.t.
m
log ai ≤ log M, ∀ i ∈ {1, · · · , m}, ai > 0, ai ∈ R. (8)
i=1
If we allow a non-negative real-valued solution, we can obtain the solution by conventional optimization methods as done by Segall [4]. The non-negative real-valued solution above does not guarantee level ai to be a natural integer. We then employ a two-stage optimization scheme. The first stage obtains a non-negative real-valued solution and rounds it to the nearest integer solution. This stage allows a global search for the quantization vector but provides a rough solution for the original cost function given in equation (1). The second stage then tries to optimize over integer solutions around the solution obtained in the first stage, finely but locally. The second stage utilizes rate-distortion cost function:
min J(a) = min a
a
s.t.
∀
m
m log2 ai + γE[d(x, x ˆ)] ≈ min log2 ai + a
i=1
i ∈ {1, · · · , m}, ai ∈ N,
i=1
γ E[Gi (y)] , 12na2i (9)
where parameter γ takes balance between the rate and distortion. γ is set so that the real-valued solution of the cost function (9) corresponds to the solution
n1 n 2 Mn of (8), as γ = 2C log 2 , where C = E [Gj (y)] . j=1
We choose the quantization vector that minimizes the rate-distortion cost function (9) among 2m quantization vectors each of which sets a component ai (i = 1, · · · , m) at one level larger or smaller than that of the first stage solution a. After re-optimizing the companding parameters such to be suitable for the new quantization vector a, we check whether the new quantization vector and re-optimized companding parameters {a, θα , θβ } actually decrease the average distortion (2). The optimization procedure in the case of MSE distortion measure is summarized in the panel.
718
S. Maeda and S. Ishii Summary of optimization procedure ˆ N at a sufficiently large value so that 1. Initialize D parameter update at step 5 is done at least once, and ˆ } as an initial guess. set {θˆα , θˆβ , a 2. Select a trial quantization vector a according to rough global search. Initialize companding parameters {θα , θβ } so a trial parameter set is {θα , θβ , a}. 3. Minimize DN (θα , θβ , a) with respect to parameters {θα , θβ }. ˆ N , then replace D ˆ N and 4. If DN (θα , θβ , a) < D ˆ } by DN (θα , θβ , a) and {θα , θβ , a}, {θˆα , θˆβ , a respectively, and go to step 2. Otherwise, go to step 5. 5. Improve a trial quantization vector a according to fine local search. 6. Minimize DN (θα , θβ , a) with respect to parameter {θα , θβ }. ˆ N , then replace D ˆ N and 7. If DN (θα , θβ , a) < D ˆ } by DN (θα , θβ , a) and {θα , θβ , a}, {θˆα , θˆβ , a respectively, and go to step 5. Otherwise, terminate the algorithm.
3 3.1
Applications to Transform Coding Architecture
Here, we apply the proposed learning method in particular to transform coding that utilizes linear matrices and componentwise nonlinear functions for the companding functions,ψ(x; θα ) = f (W x; ωα ), and φ(ˆ y; θβ ) = Ag(ˆ y; ωβ ) where θα = {W, ωα } and θβ = {A, ωβ }. W and A are m × n and n × m matrices, respectively, and f and g are componentwise nonlinear functions with parameters ωα and ωβ , respectively. We use a scaled sigmoid function for f which expands a certain interval of a sigmoid function so that the range becomes [0, 1], and g is chosen as the inverse function of f ; α di (aα i , bi ) α α α α )) + ci (ai , bi ) 1 + exp(−a (x − b i i
i 1 di (aβi , bβi ) gi (xi ; ωβ ) = − β log − 1 + bβi , ai xi − ci (aβi , bβi )
fi (xi ; ωα ) =
ai (1−bi )
α where ci = − e eai −1+1 , di = −ci eai bi , and ωα = {aα i , bi |i = 1, · · · , m} and β β ωβ = {ai , bi |i = 1, · · · , m} are parameters of f and g, respectively. Note that β β α when aα i = ai and bi = bi , each of the functions fi and gi is an inverse function of the other. The above transform coding is optimized according to the method described in section 2.2. However, the matrix A can be obtained analytically, because A
Optimization of Parametric Companding Function
is a solution of min
n
A j=1
ˆ T Aj Xj − Z
T
719
ˆ T Aj , where Ai is the i-th row Xj − Z
vector of matrix A, Xj is a vector consisting of the j-th components of the N ˆ j is a matrix whose column is a transformed vector original data samples, and Z corresponding to the i-th original datum x(i) . The MSE solution of this cost function is given as a set of row vectors [A1 , · · · , An ], each of which is obtained ˆ −T Xi , where Z ˆ −T is the pseudo inverse of matrix Z ˆT . individually as A∗i = Z 3.2
Simulation Results
To examine the performance of our optimization for CVQ, we compared the transform coding trained by our method with KLT-based transform coding, which uses a scalar quantizer trained by the Lloyd-I algorithm [5] for each feature component. Optimization of our CVQ and the trained KLT was each performed for 30 data sets, each consisting of 10,000 samples. The performance of each coding scheme was evaluated as average distortion using 500,000 samples independently sampled from the true source distribution. Hereafter, average distortion using 10,000 samples and 500,000 independent samples are called training error and test error, respectively. In our CVQ, matrix A was initially set at a value randomly chosen from uniform distribution and normalized so that the variance of each transformed component was 1. Initial matrix W was determined as an β inverse of A. The other companding parameters were set as aα i = ai = 6 and β α bi = bi = 0.5, which are well suited to a signal whose distribution is bell-shaped with variance being 1. First, we examined the performance of our CVQ by using two kinds of twodimensional sources: a linear mixture of uniform source or a Gaussian source. They have different characteristics as coding targets because the former cannot be decomposed into independent signals by an orthogonal transformation, whereas the latter can be. In the case of high bit-rate coding, KLT-based transform coding is optimal when an orthogonal transformation exists as that decomposes the source data into independent signals [6]. Therefore, the coding performance for a linear mixture of uniform source, where the optimality of KLT is not assured, is of great interest. Moreover, it is worth examining whether our CVQ can find any superior transformation to the KLT for low bit-rate encoding of a Gaussian source because non-orthognal matrices are permitted to be a transformation matrix in our CVQ. 
Figures 3(a) and (b) compare the arrangements of reproduction vectors by our CVQ and the trained KLT for two different linear mixtures of uniform sources; both of them have their density on the gray shaded parallelograms, slanted (a) 45◦ and (b) 60◦ . Each figure shows results with minimum training error among 30 runs by the two methods. Black circles indicate arranged reproduction vectors, and two axes indicate two feature vectors. Upper panels show results of 2 bit-rate coding, and the lower ones show 4 bit-rate coding. Each title denotes signal-tonoise ratio (SNR) for test data set. As seen in this figure, our CVQ obtained feature vector axes that were almost independent of each other and efficiently arranged the reproduction vectors, while
Fig. 3. Arrangements of reproduction vectors for linear mixtures of uniform sources. (Test SNR: at 2 bit-rate, CVQ 10.70 dB vs. trained KLT 10.24 dB for the 45° source (a), and CVQ 11.44 dB vs. trained KLT 10.18 dB for the 60° source (b); at 4 bit-rate, CVQ 22.70 dB vs. 20.88 dB (a), and CVQ 23.40 dB vs. 21.73 dB (b))
the trained KLT code failed. The test performance of the latter was then degraded, especially in high-rate coding, because reproduction vectors were placed on areas without original data. We also see that our CVQ modifies the feature vector axes as the bit-rate changes even for the same source, as typically illustrated in Fig. 3(a). Thus, our CVQ adaptively seeks an efficient allocation of reproduction vectors, which is difficult to perform analytically. Results with various bit-rates indicated that the test error of our CVQ was likely the best especially when the bit-rate was high. On the other hand, the KLT code is guaranteed to be the best transform coder for encoding a Gaussian source at high bit-rate. However, there remains a possibility that a non-orthogonal transform coding might outperform KLT at low bit-rate by changing the transformation matrix according to the bit-rate.
Fig. 4. Encoding result of a Gaussian source in low bit-rate (1.6 bit-rate: best CVQ, SNR 6.327 dB; optimal KLT code, SNR 6.237 dB)
In fact, we found that the best transformation matrix in our CVQ was superior to the analytical KLT code at low bit-rate, as shown in Fig. 4. In this figure, we compare our trained CVQ with the optimal KLT code, which can be numerically calculated at low bit-rate; the SNR of our CVQ was estimated by using 5,000,000 samples (three times the estimated standard deviation is indicated in brackets).

Finally, we examined the performance when the source dimension became large. A linear mixture of uniform distributions was used for the original source because it prevents the high-dimensional vector quantization from being replaced with a set of scalar
Fig. 5. Encoding results of a linear mixture of uniform sources with various dimensionality, in (a) 1 bit and (b) 2 bits per dimension (ordinate: SNR relative to the best CVQ [dB]; abscissa: source dimension 2, 4, 8, 16; median curves shown for CVQ and the trained KLT)
quantizations by an orthogonal transformation, while a Gaussian source does not. We tested 1 and 2 bits per dimension, at dimensionalities of 2, 4, 8, and 16. In Fig. 5, the abscissa denotes the dimension of the source, and the ordinate denotes the test error relative to the best CVQ (dB) among 30 runs. The test error of CVQ over 30 runs is shown by a box-whisker plot, and the median test errors of CVQ, the analytical KLT, and the trained KLT are denoted by a solid line, a dashed line, and a dot-and-dash line, respectively. The number of poor results by CVQ, which are not plotted in the figure, is shown in brackets on each x label if they exist. As seen from Fig. 5, the performance variance became large as the data dimensionality increased. We found that the initial setting of the parameters largely affected the performance of our CVQ when the data dimensionality was large. However, the best trained CVQ (the best run among the 30 runs) was in all cases superior to both the trained and analytical KLTs, and this superiority was consistent regardless of data dimensionality.
4 Discussion
The idea of companding has existed for a long time, and many theoretical analyses have been done. The optimal companding function for scalar quantization was analyzed by Bennett [3] and has been used to examine the ability of an optimal scalar quantizer. On the other hand, an optimal companding function for VQ has not been derived analytically except for very limited cases; moreover, it is known that optimal CVQ does not correspond to optimal VQ except for very limited cases [7] [8] [9]. These negative results show the difficulty of analytically deriving the optimal companding function. Yet, the optimization of CVQ is attractive because CVQs constitute a wide class of practically useful codes and avoid an exponential increase in the coder's complexity, referred to as the 'curse of dimensionality.' Through the numerical simulations, we noticed that theoretical analysis based on the high-rate assumption often deviates from the real situation, and an analytically derived companding function based on high-rate analysis does not show very good performance even when the bit-rate is quite high (e.g., 4 bit-rate). We guess
that such substantial disagreement stems from the high-rate assumption. Since high-rate coding implies that the objective function of lossy source coding comprises solely the distortion, neglecting the coding length, high-rate analysis may lead to large disagreement when the coding length cannot be completely neglected. These findings suggest the importance of optimizing a practically useful code by learning from sample data. Fortunately, plenty of data are usually available in the case of practical lossy source coding, and the computation cost of the optimization is not very large in comparison to various machine learning problems, because the code optimization needs to be performed just once. Although we could show the potential ability of the CVQ, it should be noted that the initial parameters for training the parametrized CVQ must be chosen carefully, especially in the case of high dimensionality. To find good initial parameters, the idea presented by Hinton for training deep networks [10] may be useful.

Acknowledgement. This work was supported by KAKENHI 19700219.
References

1. Huang, J.H., Schultheiss, P.M.: Block quantization of correlated Gaussian random variables. IEEE Trans. Comm. CS-11, 289–296 (1963)
2. Goyal, V.K.: Theoretical foundations of transform coding. IEEE Signal Processing Mag. 18(5), 9–21 (2001)
3. Bennett, W.R.: Spectra of quantized signals. Bell Syst. Tech. J. 27, 446–472 (1948)
4. Segall, A.: Bit allocation and encoding for vector sources. IEEE Trans. Inform. Theory 22(2), 162–169 (1976)
5. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28(2), 129–137 (1982)
6. Goyal, V.K., Zhuang, J., Vetterli, M.: Transform coding with backward adaptive updates. IEEE Trans. Inform. Theory 46(4), 1623–1633 (2000)
7. Gersho, A.: Asymptotically optimal block quantization. IEEE Trans. Inform. Theory 25, 373–380 (1979)
8. Bucklew, J.A.: Companding and random quantization in several dimensions. IEEE Trans. Inform. Theory 27(2), 207–211 (1981)
9. Bucklew, J.A.: A note on optimal multidimensional companders. IEEE Trans. Inform. Theory 29(2), 279 (1983)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data

Jiann-Der Lee1, Chung-Hsien Huang1, Li-Chang Liu1, Shih-Sen Hsieh1, Shuen-Ping Wang1, and Shin-Tseng Lee2

1 Department of Electrical Engineering, Chang Gung University, Tao-Yuan, Taiwan
[email protected], [email protected], [email protected], {m9321039,m9321042}@stmail.cgu.edu.tw
2 Department of Neurosurgery, Chang Gung Memorial Hospital, Lin-Kuo, Taiwan
[email protected]
Abstract. A facial point data registration system based on ICP is presented. The reference facial point data are extracted from the patient's pre-stored CT images, and the floating facial point data are captured from the patient directly by using a touch or non-touch capture device. A modified soft-shape-context ICP, which includes an adaptive dual AK-D tree for searching the closest point and a modified objective function, is embedded in this system. The adaptive dual AK-D tree searches for the closest-point pair and discards insignificant control coupling points by an adaptive threshold on the distance between the two returned closest points, which are found by using the AK-D tree search algorithm in two different partition orders. In the objective function of ICP, we utilize the modified soft-shape-context information, a kind of projection information, to enhance the robustness of the objective function. By registering the floating data to the reference data, the system provides the geometric relationship for a medical assistance system and preoperative training. Experimental results of using touch and non-touch capture devices to capture the floating point data are presented to show the superiority of the proposed system.

Keywords: ICP, K-D tree, shape context, registration.
1 Introduction

Besl and McKay [8] proposed the Iterative Closest Point (ICP) algorithm, which has become popular especially in 3-D point data registration [10-12], and suggested using the K-D tree for the nearest-neighbor search. Later, Greenspan [7] proposed the Approximate K-D tree search algorithm (AK-D tree) by excluding the backtracking in the K-D tree and searching more bin space. Greenspan claimed that, at best performance, the computation time of the AK-D tree is 7.6% and 39% of the computation time of the K-D tree and of the Elias algorithm [6], respectively. One weakness of using the K-D tree and AK-D tree is that a false nearest-neighbor point may be found, because only one projection plane is considered when partitioning a k-dimensional space into two groups to build the node tree in one partition iteration.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 723–732, 2008. © Springer-Verlag Berlin Heidelberg 2008

In order to improve the
search for the best closest-point pair, which affects the translation matrices in ICP, we use the AK-D tree twice in two different geometrical projection orders to determine the true nearest-neighbor point and to form the significant coupling points used in later ICP stages. An adaptive threshold is used in the proposed Adaptive Dual AK-D tree search algorithm in order to reserve sufficient coupling points for a valid result. In the objective function of ICP, we modify the soft-shape-context idea proposed by Liu and Chen [2]. In soft-shape-context ICP (SICP) [2], each point generates a bin histogram, and a low-pass filter is used to smooth the neighboring histogram values. For 3-D point data, the computation time to generate bin histograms for all points is huge, so to reduce the computation time of SICP we compute only two bin histograms, for the centroid points of the reference and floating data. We propose a registration system for facial point data by combining the modified soft-shape-context concept and the ADAK-D tree. The floating facial point data are captured on-site from a touch or non-touch capture device, and the reference point data are extracted from pre-stored CT images in DICOM format. Experimental results will show that the registration results of the proposed system are more accurate than those of ICP and SICP. After the registration, surgeons can use the mouse to click on the desired location on any slice of the CT images in the user's interface, or use the digitizer to touch the desired location on the patient's face, and these locations will be shown in the registration-result sub-window for inspection. The purpose of the proposed system is to build a medical virtual reality environment to assist surgeons with preoperative assistance and training.
2 Background of ICP Algorithm

2.1 The Approximate K-D Tree Search Algorithm

ICP registers the floating data to the reference data by finding the best matching relation with the minimum distance between the two data sets. The iterations in ICP are summarized as the following 5 steps.

Step 1. Search for the closest neighbor point Xk from the reference data set X to a given floating point PK of the floating data set P, i.e. d(PK, Xk) = min{d(PK, X)}.
Step 2. Compute the transformation, which comprises a rotation matrix R and a translation matrix T, by using the least-squares method.
Step 3. Calculate the mean square error of the objective function.

    e(R, T) = (1/Np) Σ_{i=1}^{Np} ‖xi − R(pi) − T‖²    (1)
Step 4. Apply R and T to P. If the stopping criteria are reached, then the algorithm stops. Otherwise it goes to the second step.
Step 5. Find the best transform parameters R and T with the minimum e(R, T).

In the closest-point-search step, the K-D tree and AK-D tree [7] are widely used. In the K-D tree, one of the k dimensions is used as a key to split the complete space, and the key is stored in a node to generate a binary node tree. The root node contains the complete bin region, and the leaf nodes contain sub-bin regions. To a query point
Pi, the key of Pi is compared with the keys in the nodes of the tree to find the best-matched leaf node qb. All points in the node bin region of qb are examined to find the closest point. The Ball-within-bounds (BWB) test [7] is performed to avoid a false closest point in the K-D tree. In the AK-D tree, more bin regions are searched for an approximate closest point, and the BWB test is discarded to reduce the computations.

2.2 The Soft-Shape-Context ICP Registration

The shape context [9] is a description of the similarity of the corresponding shapes between two images. A transformation relation between two shapes is obtained by deforming one shape according to the minimum square error of the transformation matrix equation. For a given image D, the bin information of every point di ∈ D is computed to generate a bin histogram defined below.
    hi(k) = #{dj ≠ di : (dj − di) ∈ bin(k)}, k = 1, …, B    (2)
where B is the total bin number. The number of segmented bins affects the similarity result: if a segmented bin area is too small, the similarity information is too noisy; if a segmented bin area is too big, the similarity information contains only rough global information without local difference information. For two given images D and F, a cost function defined in (3) is computed for the shape similarity.

    CDF = (1/2) Σ_{k=1}^{B} [hD(k) − hF(k)]² / (hD(k) + hF(k))    (3)
Liu and Chen [2] proposed the soft-shape-context ICP (SICP) by adding the shape-context information to the objective function of ICP. A symmetric triangular function TriangK(l), centered at bin K, is used to spread accumulations into neighboring bins and thus smooth the histogram. This triangular function acts like a low-pass filter that reduces the sensitivity around the bin boundaries. The resulting histogram is called the soft shape context (SSC) and is defined in (4).

    SSCdj(l) = Σ_{K=label(dj)} TriangK(l) · hdj(l)    (4)
where l runs from 1 to B for a B-bin diagram centered at dj. Each point dj generates a B-bin histogram, which is one kind of projection information. The objective function of ICP, augmented with the soft-shape-context information, is rewritten in (5).

    e(R, T) = Σ_i { ‖qj − R(pi) − T‖ + α · Eshape(qj, pi) }    (5)

    Eshape(qj, pi) = Σ_{l=1}^{B} ( SSCqj(l) − SSCpi(l) )    (6)
where α is a weighting value balancing the two terms. Liu and Chen claimed that, for 2-D images, the experimental results of ICP with soft shape context are superior to those of ICP and of ICP with shape context [2].
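Step 2 of the ICP loop in Sect. 2.1 (the least-squares rotation R and translation T for a set of coupled point pairs) is commonly solved in closed form with an SVD. The papers cited here only say "least square method", so the following is a hedged sketch of that standard (Kabsch-style) solver:

```python
import numpy as np

def best_rigid_transform(P, X):
    """Closed-form least-squares solution of Step 2: the rotation R and
    translation T minimizing sum_i ||x_i - (R p_i + T)||^2 over paired
    (N x 3) point sets. The SVD solver is an assumption of this sketch."""
    pc, xc = P.mean(axis=0), X.mean(axis=0)
    H = (P - pc).T @ (X - xc)                # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    s = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, s]) @ U.T  # proper rotation, det(R) = +1
    T = xc - R @ pc
    return R, T
```

Given exact correspondences this recovers the original rigid motion; inside ICP it is applied to the coupling pairs found by the closest-point search.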
3 The Proposed Adaptive Registration System

3.1 The Proposed Registration System

To assist surgeons in tracking the regions of interest in the pre-stored CT images, we propose a registration system that registers the facial point data captured from 3-D range scanning equipment to the facial point data of the pre-stored CT images. The flowchart of the proposed registration system is shown in Fig. 1.
Fig. 1. The proposed system flowchart
The first process is the point-data capture process. The reference facial point data are extracted from the pre-stored CT imaging according to the grayscale values and the desired width, and the floating facial point data are captured immediately from a patient in the operation space by using 3-D range scanning equipment. The second process is the trimming process, which discards undesired areas of the floating point data to reduce the computation time and to increase the registration accuracy. The third process is the registration process. We propose a modified soft-shape-context ICP registration, including an adaptive dual AK-D tree search algorithm (ADAK-D) for the nearest-neighbor point search and an ICP with modified soft-shape-context information in the objective function. The capture functions of a laser range scanner named LinearLaser (LSH II 150) [14] and a 3-D digitizer MicroScribe3D G2X [15] are embedded in the system to capture on-site floating point data. The system interface is shown in Fig. 2. The red line in the left sub-window of the user's interface is used to assign the desired face width to be extracted from the pre-stored CT imaging or other DICOM images, and we can select the desired slice, as well as the brightness and contrast of tissues such as skin or bone, to obtain the model data set. The right sub-window is used to discard unwanted areas of the floating point
data and to display the registration result. After the registration, a new target imported from the 3-D digitizer, or a target inside the CT image pointed at by the mouse, can also be displayed in the right sub-window. An example is shown in Fig. 2, where the black point in the registration result of the right sub-window corresponds to the clicked point in the CT image.
Fig. 2. The system interface. The pointed target inside a CT slice is displayed in the registered coordinate as the black point.
3.2 The Adaptive Dual AK-D Tree Search Algorithm

In the search structure of the K-D tree and AK-D tree, the node tree is generated by splitting a k-dimensional space along a fixed order of geometrical projection directions. Because only one projection direction, or one projection plane of the k dimensions, is considered to split the space each time, two truly nearest-neighbor points may be located under two far-apart sub-root nodes in the binary node tree. It is assumed that if a nearest-neighbor point is true, then the same point should be returned for any geometrical projection plane order when using the AK-D tree or K-D tree. Based on this assumption, we utilize the AK-D tree twice, in the two projection axis orders "x, y, z" and "z, y, x", to examine the queried results. The AK-D tree is used here because of its runtime efficiency. If the two queried results returned by the AK-D tree in the two different projection plane orders are very close, then the query point and the returned queried point are reserved as significant coupling points; otherwise the query point is rejected. The proposed ADAK-D tree is a 5-step process to search for and determine the significant coupling points used in ICP. In each iteration of ICP, the nearest-neighbor points are found in the following 5 steps. It is assumed that there is a minimum 3-D rectangular box C containing the complete floating data, and that M is the area of the biggest plane of C. The initial P is the number of total floating points.

Step 1. The node tree of the reference/model data is built as Database 1 by using the AK-D tree in the x-y-z projection axis order iteratively.
Step 2. The node tree of the reference/model data is built as Database 2 by using the AK-D tree in the z-y-x projection axis order iteratively.
728
J.-D. Lee et al.
Step 3. The threshold T is computed by Eq. (7).

    T = M / P    (7)
Step 4. For a query point from the floating data set, if the distance between the two returned queried points from Database 1 and Database 2 is smaller than the threshold T, this query point and the returned queried point from Database 1 are reserved as a pair of significant coupling points.
Step 5. After all floating points are queried, P is updated to the number of reserved pairs.

The significant points reserved by the ADAK-D tree are the coupling points [12] used to compute the best translation and rotation parameters in the later ICP stages. In order to avoid falling into a local minimum during ICP because of insufficient coupling points, T in Step 3 is automatically adjusted using the information of the previous iteration. If there are not sufficient coupling points, i.e. P decreases in this iteration, then T increases in the next iteration, which causes P to increase in the next iteration.

3.3 The Modified Soft-Shape-Context ICP

The SICP has a practical weakness when it is used on 3-D point data. For 3-D point data, each point of the floating data and reference data generates its own bin histogram in the soft-shape-context ICP. This processing consumes huge computation time and causes this ICP variant to appear weak compared with other registration algorithms such as those using Mutual Information [3][4] or a Genetic Algorithm (GA) [1][5]. To reduce the computation time of generating bin histograms, only the two bin histograms of the centroid points of the floating and reference data are calculated, and equations (5) and (6) are rewritten as (8) and (9).
    e(R, T) = { Σ_i ‖qj − R(pi) − T‖ } + α · Eshape(qc, pc)    (8)

    Eshape(qc, pc) = Σ_{l=1}^{B} ( SSCqc(l) − SSCpc(l) )    (9)
where qc and pc are the centroid points of the data sets q and p. For the closest-point search in the modified SICP (MSICP), the ADAK-D tree is used. The proposed system utilizes the modified soft-shape-context ICP, together with procedures and functions suggested by interested people in the medical field, to build a brain virtual reality environment.
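The ADAK-D selection of Sect. 3.2, together with the adaptive threshold of Eq. (7), can be sketched as follows. SciPy has no AK-D tree, so this sketch only emulates the two partition orders by building k-d trees on axis-reversed copies of the data and using approximate queries (eps > 0); all names are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def adakd_pairs(ref, flt, T):
    """One ADAK-D selection pass (Steps 1-5): query a nearest neighbor twice,
    under the 'x,y,z' and 'z,y,x' coordinate orders, and keep a (query,
    neighbor) pair only if the two answers lie within distance T."""
    tree_xyz = cKDTree(ref)                  # Database 1
    tree_zyx = cKDTree(ref[:, ::-1])         # Database 2 (reversed axis order)
    _, i1 = tree_xyz.query(flt, eps=0.5)     # eps emulates approximate AK-D search
    _, i2 = tree_zyx.query(flt[:, ::-1], eps=0.5)
    keep = np.linalg.norm(ref[i1] - ref[i2], axis=1) < T
    return flt[keep], ref[i1[keep]], int(keep.sum())

def adaptive_threshold(flt, P):
    """Eq. (7): T = M / P, with M the area of the biggest face of the
    bounding box C of the floating data and P the current pair count."""
    ext = flt.max(axis=0) - flt.min(axis=0)
    M = max(ext[0] * ext[1], ext[1] * ext[2], ext[0] * ext[2])
    return M / max(P, 1)
```

Each ICP iteration would call adakd_pairs, feed the kept pairs to the least-squares transform step, and recompute T from the updated pair count P, so that a shrinking P automatically relaxes the threshold in the next iteration.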
4 Experimental Results

Synthetic and real experiments are performed to compare the proposed ICP registration algorithms against other methods. In the synthetic experiments, the reference point data set is translated by 10 mm along the x, y, and z axes as well as rotated 10 degrees around each of the three axes to generate 6 floating point data sets, which are
labeled as (a), (b), (c), (d), (e), and (f) respectively. In the real experiments, the reference data sets are obtained from the pre-stored CT images, and the floating data sets are captured from a touch or non-touch capture device.

4.1 The Comparisons of ICP with AK-D Tree and ADAK-D Tree

In the synthetic experiments, the reference point data set with 11291 points is captured by LinearLaser. Six floating data sets are then generated from the reference point data set and listed in Table 1. The root-mean-square (RMS) distance [13] is used to measure the performance. The RMS of the registration result using the AK-D tree in Table 1(a) is high because the solution was trapped in a local minimum. The results in Table 1 show that the proposed method improves the accuracy over the AK-D method in most of the cases.

Table 1. The comparison of registration results of using laser-scan surface data

        AK-D tree method [11]        ADAK-D tree method
        RMS (mm)   Runtime (sec.)    RMS (mm)   Runtime (sec.)
(a)     16.53      2.25              0.19       5.5
(b)      6.47      1.73              0.19       5.53
(c)      2.54      1.98              0.19       5.75
(d)      2.55      2.00              0.19       5.79
(e)      2.53      2.00              0.19       5.64
(f)      2.52      1.98              1.06       5.65

In the real experiments, CT facial point data are extracted from the pre-stored CT data first. The reference data are CT facial point data with 17876 points as shown in Fig. 3(a), and the first floating data are laser-scan facial point data with 9889 points as shown in Fig. 3(b).

Fig. 3. The illustrations of registration results from laser scan facial surface data to CT facial surface data

Fig. 4. The illustrations of registration results of using different capture ways by a 3D digitizer

The RMS of the registration result using AK-D tree as
in Fig. 3(c) is 2.47 mm and the runtime is 3.21 seconds. The RMS of the registration result using ADAK-D tree as in Fig. 3(d) is 1.16 mm and the runtime is 4.43 seconds. In the second experiment, the floating facial point data are captured by MicroScribe3D G2X. We test two different capture ways: the first is a discrete way as shown in Fig. 4(a), and the second is a continuous way as shown in Fig. 4(b). The registration results are 3.24 mm with 0.43 seconds and 0.989 mm with 0.578 seconds as shown in Fig. 4(c) and 4(d) respectively. A good registration result occurs when the capture way contains more 3-D direction information, even though the floating point data contain only a few hundred points.

4.2 The Comparisons of ICP, Soft-Shape-Context ICP and the Proposed Algorithm

The comparisons of ICP, the soft-shape-context ICP (SICP), and the proposed modified soft-shape-context ICP (MSICP) are performed in the synthetic and real experiments. We use the K-D tree in ICP and SICP, and 12*6 shell-model bins covering the complete 3-D space are used in SICP and MSICP. In the synthetic experiment, in Table 2, the reference point data set is the 3-D digitized data with 729 points captured by MicroScribe3D G2X in the first experiment and artificially moved along and around the axes. In the real experiment, the reference data are CT facial point data with 31053 points, and the floating facial point data with 984 points are captured by MicroScribe3D G2X. The registration results of using ICP, SICP and MSICP are 2.62 mm, 1.16 mm, and 0.69 mm as shown in Fig. 5(c), (d) and (e) respectively, and in Table 3.

Table 2. The comparison of registration results of using digitized point data

        ICP                         SICP                        MSICP
        RMS (mm)  Runtime (sec.)    RMS (mm)  Runtime (sec.)    RMS (mm)  Runtime (sec.)
(a)      6.14     0.1                2.77     32                 0.97     0.1
(b)      8.99     0.1                1.46     30                 0.97     0.4
(c)      7.45     0.1                1.36     31                 1.16     0.1
(d)     10.71     0.1                1.36     31                 0.91     0.2
(e)     11.37     0.1                1.37     35                 0.78     1.2
(f)     10.81     0.1                1.41     32                 1.18     1.1

Table 3. The registration results of the real experiment

ICP                         SICP                        MSICP
RMS (mm)  Runtime (sec.)    RMS (mm)  Runtime (sec.)    RMS (mm)  Runtime (sec.)
2.62      0.5               1.16      103.5             0.69      0.93

Fig. 5. The illustrations of registration results of using ICP, SICP and MSICP in Table 3
5 Conclusion

An adaptive ICP registration system which utilizes the ADAK-D tree for the closest-point search and a modified soft-shape-context objective function is presented in this paper to assist surgeons in finding a surgery target within a medical virtual reality environment. The proposed system registers floating facial point data, captured by a touch or non-touch capture device, to reference facial point data extracted from pre-stored CT imaging. In the closest-point search process, an adaptive dual AK-D tree search algorithm (ADAK-D tree) is proposed, which utilizes the AK-D tree twice in different partition axis orders to search for the nearest-neighbor point and to determine the significant coupling points used as the control points in ICP. An adaptive threshold for the determination of significant coupling points in the ADAK-D tree also maintains sufficient control points during the iterations. Experimental results illustrated the superiority of the proposed system using the ADAK-D tree over the registration system using the AK-D tree. In the objective function of ICP, the proposed system adopts the soft-shape-context objective function, which contains the shape projection information and the distance error information, to improve the accuracy. We then modified the soft-shape-context objective function to reduce the computation time while maintaining the accuracy. Experimental results of synthetic and real experiments have shown that the proposed
system is more robust than ICP or soft-shape-context ICP. Additional functions, such as tracking the desired location in the registration result, were suggested by surgeons and are embedded in the user's interface. In the future, we would like to try other optical or electromagnetic 3-D digitizers for better captured data, and to export the registration information to a stereo display system.
References 1. Chow, C.K., Tsui, H.T., Lee, T.: Surface registration using a dynamic genetic algorithm. Pattern Recognition 37(1), 105–117 (2004) 2. Liu, D., Chen, T.: Soft shape context for iterative closest point registration. In: IEEE International Conference on Image Processing (ICIP), pp. 1081–1084 (2004) 3. Tomazevic, D., Likar, B., Pernus, F.: 3-D/2-D registration by integrating 2-D information in 3-D. IEEE Transactions on Medical Imaging 25(1), 17–27 (2006) 4. Chen, H., Varshney, P.K., Arora, M.K.: Performance of mutual information similarity measure for registration of multitemporal remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 41(11), 2445–2454 (2003) 5. Zhang, H., Zhou, X., Sun, J., Sun, J.: A novel medical image registration method based on mutual information and genetic algorithm. In: International Conference on Computer Graphics, Imaging and Vision: New Trends, pp. 221–226 (2005) 6. Cleary, J.G.: Analysis of an algorithm for finding nearest neighbours in Euclidean space. ACM Transactions on Mathematical Software 5(2), 183–192 (1979) 7. Greenspan, M., Yurick, M.: Approximate K-D tree search for efficient ICP. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 442–448 (2003) 8. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992) 9. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. PAMI 24(4), 509–522 (2002) 10. Bhandarkar, S.M., Chowdhury, A.S., Tang, Y., Yu, J., Tollner, E.W.: Surface matching algorithms for computer aided reconstructive plastic surgery. In: Proceedings of the IEEE International Symposium on Biomedical Imaging: Macro to Nano, vol. 1, pp. 740–743 (2004) 11. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 145–152 (2001) 12. Jost, T., Hugli, H.: A multi-resolution ICP with heuristic closest point search for fast and robust 3D registration of range images. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 427–433 (2003) 13. Zagrodsky, V., Walimbe, V., Castro-Pareja, C.R., Qin, J.X., Song, J.M., Shekhar, R.: Registration-assisted segmentation of real-time 3-D echocardiographic data using deformable models. IEEE Transactions on Medical Imaging 24(9), 1089–1099 (2005) 14. 3DFamily Technology Corporation, http://www.3dfamily.com/ 15. Immersion Corporation, http://www.immersion.com/digitizer/
Solution Method Using Correlated Noise for TSP Atsuko Goto and Masaki Kawamura Yamaguchi University, 1677-1 Yoshida, Yamaguchi, Japan {kawamura,atsuko21}@is.sci.yamaguchi-u.ac.jp
Abstract. We suggest a solution method for optimization problems using correlated noise. Correlated noise has been introduced to neural networks to discuss the mechanism of synfire chains. Kawamura and Okada introduced correlated noise to associative memory models and analyzed its dynamics. In the associative memory models, memory patterns are memorized as attractors at minima of the system. They found that the correlated noise can make the state transit between the attractors; however, the mechanism of this state transition is not yet well understood. On the other hand, for combinatorial optimization problems, the energy function of a problem can be defined, so finding an optimum solution means finding a minimum of the energy function. The steepest descent method searches for a solution by going down along the gradient direction. With this method, however, the state is usually trapped in a local minimum of the system. In order to escape from the local minimum, simulated annealing, i.e. the Metropolis method, or chaotic disturbance is introduced. These methods can be represented by adding thermal noise or chaotic inputs to the dynamic equation. In this paper, we show that correlated noise introduced to neural networks can be applied to solve optimization problems. We solve the TSP, a typical NP-hard combinatorial optimization problem, and evaluate solutions obtained by the steepest descent method, simulated annealing, and the proposed method with correlated noise. As a result, in the case of ten cities, the proposed method with correlated noise obtains optimum solutions more often than the steepest descent method and simulated annealing. In the case of larger numbers of cities, where it is hard to find an optimum solution, our method obtains solutions at least on the same level as simulated annealing.
1 Introduction
In the activities of nerve cells, synfire chains, i.e. synchronous firings of neurons, can often be observed [1]. To analyze the mechanism of synchronous firings, the conditions for propagating them between layers have been investigated in layered neural networks [2,3]. In the layered neural networks, it has been proved that spatial correlation between neurons is necessary [4]. Aoki and Aoyagi [5] M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 733–741, 2008. © Springer-Verlag Berlin Heidelberg 2008
have shown that the state transition in associative memory models is invoked not by thermal independent noise but by synchronous spikes. Kawamura and Okada [6] have proposed associative memory models to which common external inputs are introduced, and found that the state can transit between attractors by these inputs. The synchronous spikes of the Aoki and Aoyagi model correspond to the common external inputs. In associative memory models, memory patterns are memorized as attractors. When we consider the energy function, or cost function, of the associative memory models, the attractors are represented by minima of the system. The states of the neurons are attracted into the memory pattern nearest to the initial state. An optimization problem is a problem of minimizing an energy function. In engineering and social science, optimization problems are important. A combinatorial optimization problem is an optimization problem of finding the solution minimizing the value of an objective function in a feasible area. Since the number of feasible solutions is finite, the optimal solutions could be obtained if we could search all feasible solutions. Such solution methods are known as enumeration methods, e.g. the branch-and-bound method and dynamic programming. However, combinatorial optimization problems are NP-hard, and thus we cannot obtain solutions within a reasonable time by these methods. Therefore, instead of searching the whole feasible area for optimum solutions, methods that can find optimum or quasi-optimum solutions have been developed. In these methods, the optimum solutions are designed as minima of an energy function, and the problems are formulated as finding the global minimum of the energy function. The steepest descent method (SDM), simulated annealing (SA) [7,8,9], and chaotic methods [10,11] have been introduced in order to find the global minimum.
Since the steepest descent method obtains solutions along the gradient direction, the state cannot escape from local minima. Therefore, thermal independent noise or chaotic noise is introduced to escape from the local minima. Simulated annealing is the method using thermal independent noise. The optimum solution can be found by decreasing the temperature T according to T_{t+1} ≥ c / log(1 + t), where c is a constant and t represents time [9]. We consider the correlated noise introduced to associative memory models by Kawamura and Okada [6], since the correlated noise can make the state transit between attractors. The optimum and quasi-optimum solutions of an optimization problem can be regarded as attractors, and it is expected that better solutions can be easily obtained using the correlated noise. We therefore propose a method with correlated noise to solve combinatorial optimization problems. The thermal noise used in simulated annealing corresponds to independent noise, since the noise is fed to each element independently. The correlated noise that we propose is fed to all elements mutually; therefore, the states of the elements have spatial correlation. We show that better solutions are obtained more efficiently by the proposed method than by simulated annealing and the steepest descent method.
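The Metropolis acceptance rule and the logarithmic cooling schedule quoted above can be sketched as follows; this is a minimal illustration with c = 1.0, and the function names are our own choices, not from the paper:

```python
import math
import random

def sa_temperature(t, c=1.0):
    """Logarithmic cooling schedule T_t = c / log(1 + t); keeping
    T_{t+1} >= c / log(1 + t) guarantees convergence in the limit [9]."""
    return c / math.log(1.0 + t)

def metropolis_accept(dE, T, u=None):
    """Metropolis rule: always accept a move that lowers the energy,
    and accept an uphill move with probability exp(-dE / T)."""
    if dE <= 0:
        return True
    u = random.random() if u is None else u  # injectable for testing
    return u < math.exp(-dE / T)
```

At high T almost every move is accepted and the search is nearly random; as T decays logarithmically, uphill moves become rare and the dynamics approach the steepest descent method.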
2 TSP
The traveling salesman problem (TSP) is one of the typical combinatorial optimization problems: a salesman must visit each city exactly once and find the shortest tour. There are (N − 1)!/2 different cyclic paths for N cities. In this paper, we show that the correlated noise can be applied to this combinatorial optimization problem. This kind of problem is formulated as obtaining the minimum values of an energy function. The state variable Vxi takes 1 when the salesman visits the x-th city at the i-th order, and 0 when he does not. The energy function of the TSP is defined as

    E = α·Ec + β·Eo,    (1)

    Ec = (1/2) Σ_{x=1}^{N} ( Σ_{i=1}^{N} Vxi − 1 )² + (1/2) Σ_{i=1}^{N} ( Σ_{x=1}^{N} Vxi − 1 )² + Σ_{x=1}^{N} Σ_{i=1}^{N} Vxi (1 − Vxi),    (2)
    Eo = (1/2) Σ_{x=1}^{N} Σ_{y=1, y≠x}^{N} Σ_{i=1}^{N} (dxy / ⟨d⟩) ( Vxi Vy,i−1 + Vxi Vy,i+1 ),    (3)

where the energies Ec and Eo represent the constraint condition and the objective function, respectively. The constant dxy represents the distance between the x-th and y-th cities, and the average distance ⟨d⟩ over all pairs of different cities is given by

    ⟨d⟩ = (1 / (N(N − 1))) Σ_{x=1}^{N} Σ_{y=1}^{N} dxy.    (4)

The coefficient α is usually set to α = 1, and β is decided according to the cities' locations and their number. The optimum solutions are obtained by searching for the minima of the energy function E. The minima which satisfy Ec = 0 are called solutions, and the solutions which give the shortest paths are called optimum solutions.
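The energy (1)-(4) can be evaluated directly for a 0-1 state matrix V[x, i]; the following is a vectorized sketch (the cyclic treatment of the order index i and all names are our assumptions):

```python
import numpy as np

def tsp_energy(V, D, alpha=1.0, beta=1.0):
    """Energy E = alpha*Ec + beta*Eo of eqs. (1)-(4) for a 0-1 state matrix
    V[x, i] (city x visited at order i) and a symmetric distance matrix D."""
    N = V.shape[0]
    Ec = (0.5 * np.sum((V.sum(axis=1) - 1.0) ** 2)   # one order per city
          + 0.5 * np.sum((V.sum(axis=0) - 1.0) ** 2) # one city per order
          + np.sum(V * (1.0 - V)))                   # push V toward {0, 1}
    d_avg = (D.sum() - np.trace(D)) / (N * (N - 1))  # <d>, eq. (4)
    W = D / d_avg
    np.fill_diagonal(W, 0.0)                         # the y != x condition
    Vprev = np.roll(V, 1, axis=1)                    # V[y, i-1], cyclic
    Vnext = np.roll(V, -1, axis=1)                   # V[y, i+1], cyclic
    Eo = 0.5 * np.sum(V * (W @ (Vprev + Vnext)))     # eq. (3)
    return alpha * Ec + beta * Eo
```

For a valid tour (a permutation matrix) Ec = 0 and Eo equals the cyclic tour length divided by ⟨d⟩, so minimizing E under Ec = 0 minimizes the tour length.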
3 Proposed Method
In order to obtain one of the local minima of the energy function E by the steepest descent method, the state Vxi(t) is updated by

    μ duxi(t)/dt = −uxi(t) + Σ_{y=1}^{N} Σ_{j=1}^{N} Wxiyj Vyj(t) + θxi,    (5)

    Vxi(t) = F(uxi(t)),    (6)
where the function F is the output function which decides the output Vxi(t) according to the internal state uxi(t). We used the output function

    F(u) = 1 (1 < u);  u (0 < u ≤ 1);  0 (u ≤ 0).    (7)

From the energy function, the constant Wxiyj is given by

    Wxiyj = −δ_{x,y}(1 − δ_{i,j}) − δ_{i,j}(1 − δ_{x,y}) − β (dxy / ⟨d⟩)(δ_{i−1,j} + δ_{i+1,j})(1 − δ_{x,y}),    (8)

and the external input θxi is the constant θxi = 1. The delta function δ_{x,y} is defined as

    δ_{x,y} = 1 (x = y);  0 (x ≠ y).    (9)

When independent noise ζxi(t) is introduced to (5), the equation corresponds to simulated annealing. When correlated noise η(t) is introduced, the equation gives the proposed method. Therefore, we consider the equations given by

    μ duxi(t)/dt = −uxi(t) + Σ_{y=1}^{N} Σ_{j=1}^{N} Wxiyj Vyj(t) + θxi + ζxi(t) + η(t),    (10)

    Vxi(t) = F(uxi(t)).    (11)
We note that the independent noise ζxi(t) is fed to each neuron independently, while the correlated noise η(t) is fed to all neurons mutually. We assume that the independent noise obeys a normal distribution with mean 0 and variance σζ², and the correlated noise obeys a normal distribution with mean 0 and variance ση².
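One Euler-integration step of the noisy dynamics (10)-(11) can be sketched as below; the integration scheme, the step size, and the names are assumptions of this sketch (the paper does not state its numerical scheme):

```python
import numpy as np

def update_state(u, W, theta, mu=1.0, dt=0.01,
                 sigma_ind=0.0, sigma_corr=0.0, rng=None):
    """One Euler step of eqs. (10)-(11) for flattened states u, theta of
    shape (N*N,) and coupling W of shape (N*N, N*N). sigma_ind scales the
    independent noise zeta_xi(t) (one sample per unit); sigma_corr scales
    the correlated noise eta(t) (one scalar shared by all units)."""
    if rng is None:
        rng = np.random.default_rng()
    V = np.clip(u, 0.0, 1.0)                         # output function F, eq. (7)
    zeta = sigma_ind * rng.standard_normal(u.shape)  # independent noise
    eta = sigma_corr * rng.standard_normal()         # common scalar noise
    du = (-u + W @ V + theta + zeta + eta) / mu      # eq. (10)
    return u + dt * du                               # eq. (11) via clip next step
```

With sigma_ind > 0 and sigma_corr = 0 this reduces to the simulated-annealing-style independent-noise dynamics; sigma_corr > 0 gives the proposed correlated-noise dynamics.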
4 Simulation Results

4.1 Locations of Cities

Figure 1 shows the locations of 10 cities that are arranged in random order, and of 29 cities named bayg29 in TSPLIB [12]. The shortest path for which a salesman visits the 10 cities is -A-D-B-E-J-H-I-G-F-C-, and for the 29 cities -1-28-6-12-9-26-3-29-5-21-2-20-10-4-15-18-14-17-22-11-19-25-7-23-8-27-16-13-24-. The distance of the shortest path for the 10 cities is 2.69, and that for the 29 cities is 9074.15, given to two decimal places.

4.2 Experimental Procedure
The initial values of internal state uxi (t) are determined at random with uniform distribution on the interval [−0.01, 0.01). Using the steepest descent method, the
Fig. 1. Locations of (a) 10 cities and (b) 29 cities. The paths show the optimal solutions.
method with independent noise, and the proposed method, we perform computer simulations with α = 1 in (1). We run the 10-city case 100 times and the 29-city case 1000 times. We evaluate the ratio R of the path length of an obtained solution to the optimum path length,

R = \frac{\text{[path length of obtained solution]}}{\text{[optimum path length]}}.   (12)
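The evaluation quantities of Eqs. (12)-(14) can be sketched as follows. This is a minimal illustration; `tour_length` assumes Euclidean city coordinates, which matches the TSPLIB instance used here but is not stated explicitly in this excerpt.

```python
import math

def tour_length(coords, order):
    """Total length of the closed tour that visits the cities in `order`.
    `coords` maps a city name to its (x, y) coordinates."""
    n = len(order)
    return sum(math.dist(coords[order[k]], coords[order[(k + 1) % n]])
               for k in range(n))

def path_ratio(obtained, optimum):
    """Ratio R of Eq. (12): obtained path length over the optimum length."""
    return obtained / optimum

def binarize_state(u):
    """Read-out of Eqs. (13)-(14): V_xi = H(u_xi), a 0-1 step function."""
    return [[1 if uxi > 0 else 0 for uxi in row] for row in u]
```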
Since the state must satisfy the 0-1 condition when it converges, the final state is given by

V_{xi} = H(u_{xi}),   (13)

where the function H(u_{xi}) is given by

H(u_{xi}) = \begin{cases} 1, & u_{xi} > 0 \\ 0, & u_{xi} \le 0 \end{cases}.   (14)

4.3 Results
For the 10 cities, we set β = 0.35, the variance of the independent noise to σζ² = 0.08, and the variance of the correlated noise to ση² = 0.08. We count the number of optimum solutions over 100 trials for this location. The histogram of the path lengths of the obtained solutions is shown in Fig. 2. The abscissa represents the ratio R in (12), and the ordinate the percentage of solutions with ratio R. The number of optimum solutions obtained by the steepest descent method is 17 times, that by the method with independent noise is 55 times, and that by the proposed
A. Goto and M. Kawamura
Fig. 2. Histogram of path lengths of obtained solutions for 10 cities. Solid, dotted, and broken lines represent results obtained by correlated noise (CN), independent noise (IN), and the steepest descent method (SDM), respectively.
Fig. 3. Residual energy with (a) independent noise (IN) and (b) correlated noise (CN) for 10 cities
method with correlated noise is 90 times. The proposed method obtains the most solutions among these methods. Next, Figure 3 shows the transition of the residual energy for the methods with independent noise and with correlated noise, where the internal states are updated 1,000,000 times. The residual energy Eres is the difference between the energy E(t) of Vxi at time t and the energy of the optimum solution, Eopt:

E_{res} = E(t) − E_{opt}.   (15)

We found that the residual energy does not reach 0 with the method with independent noise, but it does with the proposed method.
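The residual energy of Eq. (15) can be sketched as below. The paper's exact energy function E (Eq. (1), with the constraint weight α) is not reproduced in this excerpt, so a generic Hopfield-type energy is assumed here, with the TSP constraint and distance terms folded into W and θ.

```python
import numpy as np

def hopfield_energy(V, W, theta):
    """Generic Hopfield-type energy E = -1/2 * sum W V V - sum theta V,
    for a state V of shape (N, N); the paper's exact E is assumed to have
    this form with the TSP terms encoded in W and theta."""
    return -0.5 * np.einsum('xi,xiyj,yj->', V, W, V) - np.sum(theta * V)

def residual_energy(V, W, theta, E_opt):
    """Residual energy of Eq. (15): E_res = E(t) - E_opt."""
    return hopfield_energy(V, W, theta) - E_opt
```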
Fig. 4. Histogram of path lengths for 29 cities. We assumed β = 0.35, a variance of independent noise of σζ² = 0.04, and a variance of correlated noise of ση² = 0.04. Solutions were obtained 979 times by CN, 931 times by IN, and 883 times by SDM.
Fig. 5. Solution ratio (%) for 29 cities
For the 29 cities, we calculate solutions with variances of the independent noise of σζ² = 0.01–0.10 and variances of the correlated noise of ση² = 0.01–0.10. The optimum solution could not be obtained by any of these methods for this location. Therefore, we calculate the path lengths of the obtained solutions. Figure 4 shows the histogram of path lengths when solutions are obtained. The abscissa represents the ratio R, and the ordinate the percentage of solutions having ratio R. The method with independent noise
Table 1. The optimum variance and the solution ratio for β = 0.35 and β = 0.5

                       β = 0.35                   β = 0.5
optimum variance       σζ² = 0.03   ση² = 0.03    σζ² = 0.05   ση² = 0.05
solution ratio (%)     93           98            56           75
and the proposed method can obtain better solutions than the steepest descent method. We compare the solution ratio of the method with independent noise with that of the proposed method. Figure 5 shows the solution ratio for the variances σζ² and ση² in the cases of β = 0.35 and 0.50. Table 1 shows the optimum variance and the solution ratio. In the case of β = 0.35, the number of solutions obtained by the method with independent noise at σζ² = 0.03 is 93, and the number obtained by the proposed method at ση² = 0.03 is 98; there is not much difference between them. On the other hand, in the case of β = 0.5, the number of solutions obtained by the method with independent noise at σζ² = 0.05 is 56, while that obtained by the proposed method at ση² = 0.05 is 75. Namely, the proposed method obtains better solutions than the method with independent noise. We therefore found that, depending on β, the proposed method can be much more effective than the method with independent noise in obtaining solutions.
5 Conclusion
In associative memory models, correlated noise is effective for state transitions. In this paper, we proposed a solution method using correlated noise and applied it to the TSP, one of the typical NP-hard combinatorial optimization problems. As a result, for the 10-city case, the proposed method with correlated noise obtains more solutions than both the steepest descent method and the method with independent noise. For the 29-city case, none of these methods obtained the optimum solution. However, we found that the proposed method can obtain better solutions than the existing methods, depending on β. From these results, we showed that correlated noise is also effective for combinatorial optimization problems.
Acknowledgments This work was partially supported by a Grant-in-Aid for Young Scientists (B) No. 16700210. The computer simulation results were obtained using the PC cluster system at Yamaguchi University.
References

1. Abeles, M.: Corticonics. Cambridge Univ. Press, Cambridge (1991)
2. Diesmann, M., Gewaltig, M.-O., Aertsen, A.: Stable Propagation of Synchronous Spiking in Cortical Neural Networks. Nature 402, 529–533 (1999)
3. Câteau, H., Fukai, T.: Fokker-Planck Approach to the Pulse Packet Propagation in Synfire Chain. Neural Networks 14, 675–685 (2001)
4. Amari, S., Nakahara, H., Wu, S., Sakai, Y.: Synchronous Firing and Higher-Order Interactions in Neuron Pool. Neural Comp. 15, 127–143 (2003)
5. Aoyagi, T., Aoki, T.: Possible Role of Synchronous Input Spike Trains in Controlling the Function of Neural Networks. Neurocomputing 264, 58–60 (2004)
6. Kawamura, M., Okada, M.: Stochastic Transitions of Attractors in Associative Memory Models with Correlated Noise. J. Phys. Soc. Jpn. 75, 124603 (2006)
7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220, 671–680 (1983)
8. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
9. Geman, S., Geman, D.: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
10. Zhou, C.-S., Chen, T.-L.: Chaotic Annealing for Optimization. Phys. Rev. E 55, 2580–2587 (1997)
11. Tokuda, I., Nagashima, T., Aihara, K.: Global Bifurcation Structure of Chaotic Neural Networks and its Application to Traveling Salesman Problems. Neural Networks 10, 1673–1690 (1997)
12. TSPLIB, http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/
Bayesian Collaborative Predictors for General User Modeling Tasks

Jun-ichiro Hirayama1, Masashi Nakatomi2, Takashi Takenouchi1, and Shin Ishii1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology, Takayama 8916-5, Ikoma, Nara {junich-h,ttakashi}@is.naist.jp
2 Ricoh Company, Ltd. [email protected]
3 Graduate School of Informatics, Kyoto University [email protected]
Abstract. A collaborative approach is of crucial importance in user modeling to improve individual prediction performance when only an insufficient amount of data is available for each user. Existing methods such as collaborative filtering or multitask learning, however, have the limitation that they cannot readily incorporate situations where individual tasks are required to model a complex dependency structure among the task-related variables, such as one represented by Bayesian networks. Motivated by this issue, we propose a general approach to collaboration which can be applied to Bayesian networks, based on a simple use of the Bayesian principle. We demonstrate that the proposed method can improve both the prediction accuracy and its variance in many cases with insufficient data, in an experiment with a real-world dataset related to user modeling. Keywords: User modeling, collaborative method, Bayesian network, Bayesian inference.
1 Introduction
Predicting users' actions based on past observations of their behaviors is an important topic for developing personalized systems. Such prediction usually needs a user model that effectively represents the knowledge of a user or a group of users and is useful for the prediction. User modeling (or user profiling) [12,10,3] is currently an active research area for this aim, which seeks methods for acquiring user models and making predictions based on them. Recently, probabilistic graphical models such as Bayesian networks (BNs) [8,6] have attracted attention as an effective modeling tool in this field, because of their capacity to deal with relatively complex dependency structures among variables, which is a favorable property in general user modeling tasks. One crucial demand in user modeling is to develop "collaborative" methods which utilize the other users' information to improve the prediction of a target M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 742–751, 2008. c Springer-Verlag Berlin Heidelberg 2008
Bayesian Collaborative Predictors
user. This is due partly to the limited sample size for individual users [10], and partly to the assumption that users may share common intentions in their decisions. Consider an interactive system that is repeatedly used by many users, such as some sorts of web sites, say e-commerce ones, and publicly-used electronic devices, say network printers. Not all users necessarily interact with the system actively, so there may be an insufficient amount of data to construct reliable models for some users. This is also the case with new users of the system; prediction about new users would often fail due to their limited sample sizes. Furthermore, in large-scale problems, it would be almost impossible in practice to collect a sufficient amount of data for every user. This is actually the case in many e-commerce recommendation systems [9]. Collaborative methods are particularly important to complement the lack of individual data by considering the relationships to other users. Collaborative filtering (CF) [2] is probably the best known collaborative method in the context of recommendation, which estimates users' unknown ratings of items based on the known ratings of similar users. The problem setting of rating estimation is, however, a limited one in that it does not usually consider general dependency structures among multiple variables (including ratings and content-related ones), so the usual CF methods cannot be directly applied to more general tasks. Alternatively, multi-task learning (MTL) has recently been an active research topic in machine learning, which solves multiple related prediction tasks simultaneously to improve each task's performance by exploiting the generalization among the tasks. However, the application of existing MTL schemes to general graphical models is not so straightforward, especially when one considers the learning of graph structure in addition to the parameters.
One attempt at the MTL of BNs has recently been reported in [7], in which a prior distribution over multiple graph structures of BNs is employed to force them to have similar structures to each other. However, the joint determination of multiple model structures is computationally expensive, and the method does not allow the heterogeneity among users which usually exists in user modeling tasks. In this article, we propose a simple framework for collaborative user modeling, with particular interest in its application to Bayesian networks. The principal aim here is to improve the typically low prediction performance in the initial phase after a user starts to interact with the system. Our approach flexibly realizes knowledge sharing among individual models that have already been learned individually, instead of following the usual MTL setting in which the models are learned simultaneously. A similar post-sharing approach has recently been investigated in a different context in [11], which focused on the selection of a relevant subset of individual models. In this article, we also evaluate the proposed method with a real dataset, which has been collected in a real-world user modeling task.
2 Bayesian Network
For the general purpose of user modeling, BN is one of the key tools for representing knowledge about users and making predictions of their behavior [12]. While our proposed approach
J. Hirayama et al.
Fig. 1. An example of DAG
described in the next section is not limited to a specific learning model, we focus on its application to (discrete) BNs in this article, considering the importance of BNs in user modeling. In this section, we briefly review learning and prediction with BNs. For more details, see [8,6]. A BN is a probabilistic graphical model that defines a class of parametric distributions over multiple random variables in terms of a directed acyclic graph (DAG). Fig. 1 shows an example of a DAG. Each node corresponds to a single variable, and each edge represents conditional dependence between the variables. BNs have a relatively high representational power in conjunction with effective prediction and marginalization algorithms such as belief propagation (BP) [8] or the junction tree algorithm (JT) [4]. In addition, BNs are attractive because of their human-interpretable knowledge representation in terms of a DAG. Let v be the set of variables of interest. The probability distribution of a BN can be written, corresponding to a specific DAG G, as

p(v \mid \theta, G) = \prod_i p(v(i) \mid \mathrm{pa}_i, \theta_i, G),   (1)
where v(i) denotes the i-th element (node i) of v and pa_i the set of parent variables of node i. The model parameters are denoted by θ = {θ_i}, where θ_i are the parameters that define the local conditional distribution on v(i). Note that both pa_i and θ_i are defined under the specification of the DAG G. The training of a BN is conducted in two steps. First, the structure of the BN is determined according to a certain scoring criterion (structure learning). Second, the conditional multinomial probability distribution of each node given its parents is estimated by Bayesian inference, assuming a conjugate Dirichlet prior (parameter learning). In this study, we assume no hidden variables. The most basic (but rather computationally expensive) scoring criterion in structure learning is then the (log-)marginal likelihood of the graph structures. Once a structure is given, the parameter learning can be done in a straightforward manner.
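The factorization of Eq. (1) can be sketched as a short function. This is an illustrative dict-based representation of a DAG and its conditional probability tables, not a full BN library:

```python
def bn_joint_prob(v, parents, cpts):
    """Joint probability of Eq. (1): the product over nodes i of
    P(v[i] | values of i's parents in the DAG G).

    v       : dict node -> observed value
    parents : dict node -> tuple of parent nodes (encodes the DAG G)
    cpts    : dict node -> {(parent-values tuple, value): probability}
    """
    p = 1.0
    for i in v:
        pa = tuple(v[j] for j in parents[i])   # parent configuration pa_i
        p *= cpts[i][(pa, v[i])]               # local conditional p(v(i) | pa_i)
    return p
```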
3 Bayesian Collaborative Predictors

3.1 Prediction with Individual User Models
Consider that a user u makes repeated interactions with a target system, generating a sample v_u^n = {x_u^n, t_u^n} in the n-th usage, where x_u and t_u denote
the sets of input and target variables, respectively. Given a set of observations, D_u = {v_u^n | n = 1, 2, ..., N_u}, the individual prediction task is stated generally as predicting a new instance of the target, t_u^new, given a new input x_u^new. In this article, we assume both x_u and t_u are discrete variables (allowing the use of discrete BNs), while our approach can be applied in a straightforward manner to other cases in which the input and/or target variables are continuous. A key requirement of our approach is to construct an individual user model as a conditional probability distribution, p_u(t_u | x_u), where we explicitly put the subscript u to indicate the user. This can be done by any kind of probabilistic regressor or classifier, but we focus on the use of discrete BNs to realize it. Given a BN joint distribution p_u(v_u) that has already been trained on the individual dataset D_u, which can be done as described in the previous section, there are several ways to obtain the conditional distribution. One naive way is to directly calculate the conditional probabilities for all possible realizations of t_u and x_u with the BN parameters, and then normalize them for each condition of x_u. We refer to this approach as the enumerative method. Such exhaustive enumeration becomes intractable as the number of variables increases, but this approach is easy to implement while enabling exact computation; it is thus useful in cases with a moderate number of variables. When there are a large number of variables, however, the exact calculation is not realistic, and other ways are required. State-of-the-art methods of probabilistic reasoning such as BP or JT can efficiently calculate the marginals of single target nodes or a subset/clique of target nodes conditional on a given input, instead of the joint distribution over all the target variables.
The conditional distribution p_u(t_u | x_u) (which considers the joint probability over t_u) can then be approximated by the product of the resultant marginals. Based on the conditional distribution obtained by these methods, prediction for each individual can simply be made by taking its conditional mode:

\hat{t}_u^{new} = \mathrm{argmax}_{t_u}\, p_u(t_u \mid x_u = x_u^{new}).   (2)
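The enumerative method and the individual prediction of Eq. (2) can be sketched as follows. This is a minimal illustration in which `joint` stands in for the trained BN joint distribution p_u(v_u):

```python
from itertools import product

def enumerative_conditional(joint, t_domains, x):
    """Enumerative method: p(t | x) for every realization of the target
    variables, obtained by normalizing the joint p(t, x) over t at fixed x.

    joint     : function (t_tuple, x_tuple) -> joint probability
    t_domains : list of value lists, one per target variable
    """
    probs = {t: joint(t, x) for t in product(*t_domains)}
    z = sum(probs.values())                    # normalizer for condition x
    return {t: p / z for t, p in probs.items()}

def individual_prediction(joint, t_domains, x_new):
    """Eq. (2): the conditional mode argmax_t p(t | x = x_new)."""
    cond = enumerative_conditional(joint, t_domains, x_new)
    return max(cond, key=cond.get)
```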
In this article, this type of prediction is referred to as individual prediction, in contrast to collaborative ones.

3.2 Bayesian Collaboration of Pre-learned Predictors
Consider that individual user models, i.e., the conditional distributions, have already been learned for U users. The number of training samples used in the learning, however, may be quite different for each user. The prediction accuracy for users with small sample sizes can become low, probably with a high uncertainty/variance of learning. The aim of this study is to deal with this problem by introducing a framework of knowledge sharing among the pre-learned individual user models, based on the Bayesian perspective. Our approach is based on a rather tricky use of the Bayesian principle. For convenience of explanation, we assume for a moment a virtual agent who makes predictions for a specific user s by collecting the information of the other users.
Suppose the agent can freely access any other user's model and dataset. The available information for the agent is then the U conditional models, p_1(t_1 | x_1), p_2(t_2 | x_2), ..., p_U(t_U | x_U), each of which is for a different task but somehow related to the others, and also the U original datasets, D_1, D_2, ..., D_U. The trick here is to regard each of the other users' models as another hypothesis that also describes the behaviors of user s. The agent then turns out to have U conditional hypotheses, p_1(t_s | x_s), p_2(t_s | x_s), ..., p_U(t_s | x_s), for the specific prediction task of user s. If the agent is a Bayesian, a natural choice in such a situation is to compute the posterior distribution over the U hypotheses given the corresponding dataset D_s, and then form a predictive distribution by taking the posterior average over the conditional models. Using the notations T_s = {t_s^n | n = 1, 2, ..., N_s} and X_s = {x_s^n | n = 1, 2, ..., N_s}, the posterior distribution over the U models is given as

\pi_s(u \mid D_s) = \frac{p_u(T_s \mid X_s)\,\pi_s(u)}{\sum_{u'=1}^{U} p_{u'}(T_s \mid X_s)\,\pi_s(u')} = \frac{\prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}{\sum_{u'=1}^{U} \prod_{n=1}^{N_s} p_{u'}(t_s^n \mid x_s^n)\,\pi_s(u')},   (3)
where π_s denotes the subjective belief of the agent for user s; π_s(u) is the prior belief over the models. With this posterior, the Bayesian predictive distribution is given as

\bar{p}_s(t_s \mid x_s) \equiv \sum_{u=1}^{U} \pi_s(u \mid D_s)\, p_u(t_s \mid x_s).   (4)
The final prediction is made, according to the predictive distribution, as

\hat{t}_s^{new} = \mathrm{argmax}_{t_s}\, \bar{p}_s(t_s \mid x_s = x_s^{new}),   (5)
which we refer to as (Bayesian) collaborative prediction in this study. We note that in Eq. (4) the predictive distribution no longer depends on the other users' data. This may be an advantage in some distributed environments, for example when the users are in different sites and the communication between them is costly. The posterior probability of other users becomes large when user s has only a limited number of training data and there exist users similar to user s, in the sense that they generate similar outputs given the same inputs. The posterior integration will then incorporate additional knowledge from the similar users' models into the prediction of user s. The introduced knowledge is expected to be effective in improving the prediction accuracy. In addition, the model averaging would reduce the uncertainty/variance of the model. On the other hand, if user s has a sufficient amount of data or there are no similar users, the posterior probability of u = s would be close to one, which reduces the collaborative predictor to the individual predictor; this is quite natural, because the model is successfully learned individually or user s is isolated.
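The collaborative predictor of Eqs. (3)-(5) can be sketched as below. This is a minimal illustration, not the authors' implementation; the log-domain accumulation is an added numerical-stability detail, and the uniform prior matches the experimental setting of Sec. 4.

```python
import math

def collaborative_predict(models, Ds, x_new, t_domain, prior=None):
    """Bayesian collaborative prediction, Eqs. (3)-(5).

    models   : list of U pre-learned conditionals; models[u](t, x) -> p_u(t | x)
    Ds       : target user s's data, a list of (t, x) pairs
    t_domain : iterable of candidate target values
    Returns the predicted target (Eq. (5)) and the posterior over models.
    """
    U = len(models)
    prior = prior or [1.0 / U] * U
    # Posterior over the U hypotheses given D_s, Eq. (3), in the log domain.
    log_w = [math.log(prior[u]) + sum(math.log(models[u](t, x)) for t, x in Ds)
             for u in range(U)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    post = [wi / z for wi in w]
    # Predictive mixture, Eq. (4), and its conditional mode, Eq. (5).
    pred = {t: sum(post[u] * models[u](t, x_new) for u in range(U))
            for t in t_domain}
    return max(pred, key=pred.get), post
```

Note that once the posterior weights are computed, prediction needs only the U models and not the other users' raw data, which is the distributed-environment advantage mentioned above.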
4 Experimental Results

4.1 Printer Usage Data
We evaluated the proposed approach on a real-world user modeling task. The dataset used here, which we refer to as the "printer usage" dataset, is a set of electronic log data collected through the daily use of printer devices shared via a network within a section of a company to which one of the authors belongs. The motivation has been presented in detail in [5], while the current task setting and dataset are slightly different from the previous ones. The log data consist of many records, where a single record corresponds to one printing job output via the network to one of the shared printers. Each record originally consists of a number of attributes, including user ID, context-related identifiers such as date or time, and content- or function-related identifiers such as the number of document pages or the use of duplex/simplex printing. The aim of the experiment here, however, is not to construct full user models but to evaluate the basic performance of our approach. We therefore pre-processed the original log data to produce a rather compact task setting appropriate for the evaluation purpose. The log data were first separated into those of each individual according to the user ID. Then, in each individual log, only the records that meet the following condition were extracted: paper size is A4, with only one copy and only one page per sheet. This is usually the default setting of printer interfaces, and the usage of most users is strongly biased toward this setting. By limiting to the default condition, the distribution of the other attributes became regularly balanced, and thus the dataset became suitable for normative performance evaluation. In this reduced dataset, we consider the five attributes other than the fixed ones, which are summarized in Table 1. The values of the attribute modelName were replaced by anonymous ones.
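The record filtering described above can be sketched as follows. The field names (`paperSize`, `copies`, `pagesPerSheet`) are hypothetical, chosen only for illustration; the actual log schema is not given in the paper.

```python
def filter_default_records(records):
    """Keep only records matching the default printing condition described
    in the text: A4 paper, one copy, one page per sheet, a frequent document
    extension, and no missing values. Records are dicts; field names other
    than docExt are hypothetical."""
    frequent_ext = {'doc', 'html', 'pdf', 'ppt', 'xls'}
    kept = []
    for r in records:
        if None in r.values():
            continue                      # drop records with missing values
        if (r.get('paperSize') == 'A4' and r.get('copies') == 1
                and r.get('pagesPerSheet') == 1
                and r.get('docExt') in frequent_ext):
            kept.append(r)
    return kept
```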
We quantized the values of the attribute docPages, which can take any natural number, as shown in the table. The attribute docExt may originally take various values, but the number of frequently-appearing values is small, such as pdf, xls, html, etc. We thus extracted only the records that include these frequent values and removed the others from the experiment. Finally, we removed the records including missing values. In this experiment, we also fixed both the input and target variables within the five attributes; that is, we consistently set x = {docExt, docPages} and t = {modelName, duplex, colorMode}, where the numbers of possible realizations are |x| = 25 and |t| = 20. After the pre-processing, the total number of users became 76, and the numbers of individual training data were quite different, ranging from 2 to 1,192 (Fig. 2).

4.2 Simulation Setting
Although our method is particularly expected to improve the prediction performance of the users located toward the right of Fig. 2 (those with small sample sizes) in real
Table 1. Five attributes in printer logs

Attribute   Description                            Values
modelName   Anonymous form of model names          Pr1, Pr2, Pr3, Pr4, Pr5
docExt      File extensions of original document   doc, html, pdf, ppt, xls
docPages    Number of pages in original document   1, 2-5, 6-20, 21-50, 51-over
duplex      Duplex or simplex                      duplex, simplex
colorMode   Color mode                             monochrome, fullColor
Fig. 2. The original numbers of data for 76 users.
situations, the limited number of data for such users prevents a quantitative evaluation of prediction performance. In this experiment, therefore, we tested the collaborative predictors for the top 16 users (the leftmost users in Fig. 2), artificially varying the number of training data, N_u, from relatively small to large by random selection. For each setting of N_u, twenty runs were conducted, in each of which both the collaborative and individual predictors were constructed based on the same data and then their performances were evaluated (see below). In contrast, the number of test data was commonly set at 200 in every run. In each single run for a user s, both the training data and the test data were randomly selected from the original data without overlap. Then the BN for user s was first constructed based on the training dataset. The BN was trained in the standard manner described in Sec. 2, where our implementation used the "deal" package written in the language R. For the structure learning, we simply used the heuristic optimization provided by "deal" as the function heuristic() (see [1] for details), which searches for a good structure by a partially-greedy stepwise ascent of the score function starting from multiple initial conditions. The Dirichlet hyperparameters were commonly set at a constant such that the total number of prior pseudo-counts [6] was five. To obtain the conditional model p_s(t_s | x_s) from the resultant joint distribution, we used the enumerative method. After learning, with this individual predictor p_s, we constructed the corresponding collaborative predictor as follows. First, prior to the simulation, we trained individual predictors for all 76 users based on the data of the numbers shown in Fig. 2, without limiting the number of the top 16 users' data. Then an ensemble of 76 individual predictors, p_1^org, p_2^org, ..., p_U^org, which we refer to as the original ensemble, was prepared in advance. The collaborative predictor p̄_s was then constructed by first replacing p_s^org in the original ensemble with p_s, and then forming the collaborative predictor p̄_s based on the replaced
ensemble. The prior π_s(u) was simply set as uniform. To evaluate the prediction performance on the test dataset, we calculated the test accuracy, defined as the fraction of test cases in which the predictions of the three target variables were all correct.

4.3 Results
Fig. 3 shows the test accuracies by individual and collaborative predictors. Each panel corresponds to one of the 16 users, where the vertical axis denotes the
Fig. 3. Test accuracy for 16 users. The vertical axis denotes the test accuracy and the horizontal the number of training data. The solid (black) and dash (gray) lines respectively show the mean over the 20 runs by the collaborative (proposed) and individual predictors. The errorbar represents ±SD (standard deviation).
Fig. 4. Left: The difference in test accuracy (Collaborative − Individual). This figure collectively shows the 16 users’ results. A mark x denotes an actual value of test accuracy. A solid line denotes the mean value, with an errorbar of ±SD. Right: The individual variance of test accuracy. Solid and dash lines respectively denote the collaborative and individual predictors. An errorbar denotes the standard deviation over the 16 users.
test accuracy and the horizontal the number of training data; only the cases with Nu = 5, 10, 20, 30, 40, and 50 are shown. In this figure, the individual predictors often exhibited relatively lower accuracy and larger variances when Nu was less than about 20, in comparison to the cases with larger Nu. In contrast, the collaborative predictors improved these undesirable results of the individual predictors: they showed a higher mean accuracy for a number of users, and also smaller variances for almost all users. Fig. 4 (left) shows the improvement in test accuracy by the Bayesian collaboration over the individual predictor (i.e., the accuracy of the collaborative predictor minus that of the corresponding individual one). This figure collectively plots the results of the 16 users, where each point denotes the improvement in a single run for a single user. The improvement was achieved in many runs, especially with small samples. Fig. 4 (right) shows the variance of the test accuracy of each individual user against the number of training data. Each errorbar denotes the standard deviation over the 16 users. This figure clearly shows that the variance of the test performance was substantially reduced by collaborative prediction in comparison to individual prediction, especially in the cases with small samples.
5 Summary
In this article, we proposed a new collaborative framework for user modeling, with special interest in its application to BNs, which have recently been a popular modeling tool in general user modeling tasks. Our method is essentially a simple use of the Bayesian principle, but the key idea of regarding the other users' models as the target user's own is consistent with the basic assumption of collaborative methods, i.e., that there may be some other users similar to the target user. The effectiveness of the proposed method was demonstrated with a real-world dataset related to user modeling. While the improvements by our
method were shown only in a limited range of sample sizes, say less than 20, it should be noted that this range is likely to be extended in more realistic problems having a larger number of variables, where the required amount of training data increases. More detailed performance evaluation, and the investigation of remaining issues such as computational cost and the effective setting of the prior distribution π_s(u), are our future tasks.
References

1. Böttcher, S.G., Dethlefsen, C.: deal: A package for learning Bayesian networks. Journal of Statistical Software 8(20) (2003)
2. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proc. 14th Conf. on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann, San Francisco (1998)
3. Godoy, D., Amandi, A.: User profiling in personal information agents: a survey. Knowl. Eng. Rev. 20(4), 329–361 (2005)
4. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50(2), 157–224 (1988)
5. Nakatomi, M., Iga, S., Shinnishi, M., Nagatsuka, T., Shimada, A.: What affects printing options? - Toward personalization & recommendation system for printing devices. In: International Conference on Intelligent User Interfaces (Workshop: Beyond Personalization 2005) (2005)
6. Neapolitan, R.E.: Learning Bayesian Networks. Prentice-Hall, Inc., Upper Saddle River (2003)
7. Niculescu-Mizil, A., Caruana, R.: Inductive transfer for Bayesian network structure learning. In: Proc. 11th International Conf. on AI and Statistics (2007)
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo (1988)
9. Schafer, J.B., Konstan, J.A., Riedl, J.: E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1–2), 115–153 (2001)
10. Webb, G.I., et al.: Machine learning for user modeling. User Modeling and User-Adapted Interaction 11(1–2), 19–29 (2001)
11. Zhang, Y., Burer, S., Nick Street, W.: Ensemble pruning via semi-definite programming. Journal of Machine Learning Research 7, 1315–1338 (2006)
12. Zukerman, I., Albrecht, D.: Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11(1) (2001)
Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes

Shohei Shimizu¹,² and Aapo Hyvärinen¹
¹ Helsinki Institute for Information Technology, Finland
² The Institute of Statistical Mathematics, Japan
http://www.hiit.fi/neuroinf
Abstract. An effective way to examine causality is to conduct an experiment with random assignment. However, in many cases it is impossible or too expensive to perform controlled experiments, and hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments. We have recently proposed such a discovery method based on independent component analysis (ICA) called LiNGAM and shown how to completely identify the data generating process under the assumptions of linearity, non-gaussianity, and no latent variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent classes (hidden groups) are present. The model identification can be accomplished using a method based on ICA mixtures. Simulations confirm the validity of the proposed method.
1 Introduction
An effective way to examine causality is to conduct an experiment with random assignment [1]. However, in many cases it is impossible or too expensive to perform controlled experiments. Hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments, though obviously one can never fully prove the validity of a causal model from such uncontrolled data alone. Thus, developing methods for causal inference from uncontrolled data is a fundamental problem with a very large number of potential applications such as social sciences [2], gene network estimation [3] and brain connectivity analysis [4]. Previous methods developed for statistical causal analysis of non-experimental data [2, 5, 6] generally work in one of two settings. In the case of discrete data, no functional form for the dependencies is usually assumed. On the other hand, when working with continuous variables, a linear-Gaussian approach is almost invariably taken and has hence been based solely on the covariance structure of the data. Because of this, additional information (such as the time-order of the variables and prior information) is usually required to obtain a full causal model of the variables. Without such information, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 752–761, 2008. c Springer-Verlag Berlin Heidelberg 2008
We have recently shown that when working with continuous-valued data, a significant advantage can be achieved by departing from the Gaussianity assumption [7,8,9]. The linear-Gaussian approach usually only leads to a set of possible models that are equivalent in their covariance structure. The simplest such case is that of two variables, x1 and x2 . A method based only on the covariance matrix has no way of preferring x1 → x2 over the reverse model x1 ← x2 [2, 7]. However, a linear-non-Gaussian setting actually allows the linear acyclic model to be uniquely identified [9]. In this paper, we extend our previous work to cases where latent classes (hidden groups) are present. The paper is structured as follows. In Section 2 we briefly describe the basics of LiNGAM and subsequently extend the framework in Section 3. Some illustrative examples are provided in Section 4, and the proposed method is empirically evaluated in Section 5. Section 6 concludes the paper.
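The covariance-equivalence just described can be checked numerically. The sketch below is our own illustration (the coefficient b = 0.8 is an arbitrary choice): it simulates both causal directions with Gaussian disturbances whose variances are matched so that both observed variables have unit variance, and the two sample covariance matrices come out the same, so no covariance-based method can prefer one direction over the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 200_000, 0.8

# Forward model x1 -> x2 (disturbance variances chosen so that
# both observed variables have unit variance).
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, np.sqrt(1 - b**2), n)
x1_f = e1
x2_f = b * x1_f + e2

# Reverse model x1 <- x2 with matched parameters.
d2 = rng.normal(0.0, 1.0, n)
d1 = rng.normal(0.0, np.sqrt(1 - b**2), n)
x2_r = d2
x1_r = b * x2_r + d1

cov_f = np.cov(np.vstack([x1_f, x2_f]))
cov_r = np.cov(np.vstack([x1_r, x2_r]))
# Both are approximately [[1, 0.8], [0.8, 1]]: second-order
# statistics cannot tell the two causal directions apart.
print(np.round(cov_f, 2))
print(np.round(cov_r, 2))
```

With non-Gaussian disturbances, in contrast, the two directions differ in their higher-order statistics, which is what the LiNGAM approach below exploits.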
2 LiNGAM
Here we provide a brief review of our previous work [9]. Assume that we observe data generated from a process with the following properties:

1. The observed variables xi, i = 1, . . . , n, can be arranged in a causal order k(i), defined to be an ordering of the variables such that no later variable in the order participates in generating the value of any earlier variable. That is, the generating process is recursive [2], meaning it can be represented graphically by a directed acyclic graph (DAG) [5, 6].
2. The value assigned to each variable xi is a linear function of the values already assigned to the earlier variables, plus a 'disturbance' (noise) term ei, and plus an optional constant term μi, that is

   x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + \mu_i.   (1)
3. The disturbances ei are all continuous random variables having non-gaussian distributions with zero means and non-zero variances, and the ei are independent of each other, i.e. p(e_1, . . . , e_n) = \prod_i p_i(e_i).

A model with these three properties we call a Linear, Non-Gaussian, Acyclic Model, abbreviated LiNGAM. We assume that we observe a large number of data vectors x (containing the components xi), each generated according to the process described above, with the same causal order k(i), the same coefficients bij, the same constants μi, and the disturbances ei sampled independently from the same distributions. Note that the above assumptions imply that there are no unobserved (latent) confounders [5] (hidden variables); Spirtes et al. [6] call this the causally sufficient case. To see how we can identify the parameters of the model from the set of data vectors x, we start by subtracting out the mean of each variable xi, leaving us with the following system of equations:

   x = Bx + e.   (2)
where B is a matrix that contains the coefficients bij and that could be permuted (by simultaneous equal row and column permutations) to strict lower triangularity if one knew a causal ordering k(i) of the variables. (Strict lower triangularity is here defined as lower triangular with all zeros on the diagonal.) Solving for x one obtains

   x = Ae,   (3)

where A = (I − B)^{−1}. Again, A could be permuted to lower triangularity (although not strict lower triangularity; in fact, in this case all diagonal elements will be non-zero) with an appropriate permutation k(i). Taken together, equation (3) and the independence and non-gaussianity of the components of e define the standard linear independent component analysis (ICA) model [10,11], which is known to be identifiable. While ICA is essentially able to estimate A (and W = A^{−1}), there are two important indeterminacies that ICA cannot solve. First and foremost, the order of the independent components is in no way defined or fixed. Thus, we could reorder the independent components and, correspondingly, the columns of A (and rows of W) and get an equivalent ICA model (the same probability density for the data). In most applications of ICA, this indeterminacy is of no significance and can be ignored, but in LiNGAM we can, and have to, find the correct permutation as described in [9]: the correct permutation is the only one which has no zeros in the diagonal. The second indeterminacy of ICA concerns the scaling of the independent components. In ICA, this is usually handled by assuming all independent components to have unit variance, and scaling W and A appropriately. On the other hand, in LiNGAM (as in structural equation modeling, SEM [2]) we allow the disturbance variables to have arbitrary (non-zero) variances, but fix their weight (connection strength) to their corresponding observed variable to unity. This requires us to re-normalize the rows of W so that all the diagonal elements equal unity in order to obtain B.
Our LiNGAM discovery algorithm [9] can thus be briefly summarized: first, use a standard ICA algorithm to obtain an estimate of the demixing matrix W; permute its rows such that there are no zeros on its diagonal; rescale each row by dividing by the element on the diagonal; and finally compute B̂ = I − W′, where W′ denotes the permuted and rescaled W. To find a causal order k(i) we must subsequently find a second permutation, to be applied equally to both the rows and columns of B̂, which yields strict lower triangularity.
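The summarized algorithm can be sketched numerically. The following is a toy illustration, not the authors' implementation: it replaces FastICA with a brute-force rotation search over whitened data that only works for two variables, and all data-generating parameters (b = 0.8, Laplace disturbances) are our own choices.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
n = 50_000

# Toy data from x1 = e1, x2 = 0.8*x1 + e2 with Laplace disturbances.
e = rng.laplace(0.0, 1.0, (2, n))
B_true = np.array([[0.0, 0.0], [0.8, 0.0]])
X = np.linalg.inv(np.eye(2) - B_true) @ e

# Step 1: ICA. Whiten, then grid-search the rotation maximizing total
# |excess kurtosis| (a brute-force two-variable stand-in for FastICA).
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Xc))
V = np.diag(d ** -0.5) @ E.T                 # whitening matrix
Z = V @ Xc

def rot(t):
    return np.array([[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]])

def contrast(t):
    S = rot(t) @ Z                           # candidate sources
    return np.abs(np.mean(S ** 4, axis=1) - 3.0).sum()

theta = max(np.linspace(0.0, np.pi / 2, 360, endpoint=False), key=contrast)
W = rot(theta) @ V                           # estimated demixing matrix

# Step 2: permute the rows so the diagonal contains no zeros
# (pick the permutation maximizing |prod(diag)|).
perm = max(permutations(range(2)),
           key=lambda p: abs(np.prod(np.diag(W[list(p)]))))
W_perm = W[list(perm)]

# Step 3: rescale each row to unit diagonal and read off B = I - W'.
W_norm = W_perm / np.diag(W_perm)[:, None]
B_est = np.eye(2) - W_norm
print(np.round(B_est, 2))                    # close to [[0, 0], [0.8, 0]]
```

The row rescaling also fixes the sign indeterminacy of ICA, which is why no separate sign correction appears in the sketch.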
3 LiNGAM in the Presence of Latent Classes
In this section, we extend the basic LiNGAM above to cases where latent (hidden) classes are present.

3.1 Motivation
Let us begin by an example. Regarding a child’s and a parent’s height, earlier studies (e.g., [12]) pointed out that there is a hereditary effect on height, which
is especially strong between a child and the same-sex parent. This implies that the connection strengths from the parent's height to the child's height (and possibly the network structures) could differ between the two classes (same-sex and different-sex children). The relation between child's and parent's height is then nonlinear overall, even if it is still linear within each class, and it cannot be found if the class membership is ignored (see Section 4 for some artificial examples). In cases where such class membership is observed, we only have to analyze each class separately. However, in many cases it is quite difficult to detect and observe class membership, especially before collecting data. Thus, we need a method that estimates latent classes in a data-driven way. In the following, we extend the basic LiNGAM so that it can estimate latent classes of samples that have similar network structures.

3.2 Model
Let us assume that the data are generated by the following mixture density:

   p(x | Θ) = \sum_{k=1}^{K} p(x | μ_k, B_k) p(C = k),   (4)
where Θ = [θ_1, · · · , θ_K], θ_k = [μ_k^T, vec(B_k)^T]^T, μ_k is a mean vector, B_k is a connection strength matrix for class k, and C is a discrete variable that indicates the class k = 1, · · · , K. (vec(·) denotes the vectorization operator, which creates a column vector from a matrix by stacking its columns.) Here, we do not assume that the number of latent classes K and the a priori probabilities p(C = k) are known. Moreover, the data within class k are assumed to be generated by the LiNGAM model:

   x = B_k x + (I − B_k) μ_k + e_k,   (5)
where e_k is the disturbance (error) vector for class k. Note that the means, connection strengths and network structure (μ_k and B_k) can be different between classes. See Section 4 for some illustrative examples.

3.3 Model Identification Using ICA Mixtures
We propose that the new model above can be estimated using ICA mixture models [13]. As in the basic LiNGAM, an ICA model holds within each class:

   x = μ_k + A_k e_k,   (6)
where Ak = (I − Bk )−1 . Then the mixture density is just the ICA mixture model [13]. After μk and Ak are estimated, we can obtain estimates of Bk and causal orderings k(i) for class k in the same manner as the basic LiNGAM (Section 2). Some estimation methods for the ICA mixtures have been proposed [13, 14]. Here we employ the minimum β-divergence method [14] since the β-divergence method does not require that the number of classes K and a priori probability
p(C = k) are known, which is a big advantage over [13]. Some drawbacks are that one has to tell the algorithm whether the disturbances ei are super- or sub-gaussian and select a tuning parameter β using a cross-validation technique [15]. Fortunately, the first problem can be solved by (possibly non-parametric) estimation of the source densities [16, 14].
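The generative side of the model, Eqs. (4)-(6), can be sketched as follows. The class parameters below are our own illustrative choices (loosely following Example 1 in the next section); once pooled, the samples hide their class membership, which is exactly what the ICA mixture estimation has to recover.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_class = 5_000

# Illustrative per-class parameters (our own choice, cf. Example 1):
mus = [np.array([0.0, 0.0]), np.array([4.0, 5.0])]
Bs = [np.array([[0.0, 0.0], [0.3, 0.0]]),
      np.array([[0.0, 0.0], [0.3, 0.0]])]

samples = []
for mu, B in zip(mus, Bs):
    A = np.linalg.inv(np.eye(2) - B)    # x = mu + A e, cf. Eq. (6)
    e = rng.laplace(0.0, 1.0, (2, n_per_class))
    samples.append(mu[:, None] + A @ e)

X = np.hstack(samples)   # pooled data; class membership is hidden
print(X.shape)           # (2, 10000)
for mu, S in zip(mus, samples):
    print(np.round(S.mean(axis=1), 1))  # close to the class mean mu
```

Since E[e_k] = 0, the mean of each class is μ_k, so well-separated class means (as here) give the mixture estimation an easy starting point; heavily overlapping classes are the hard case discussed in Section 6.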
4 Illustrative Examples
In this section, we provide two illustrative examples of the LiNGAM in the presence of latent classes (abbreviated LcLiNGAM) proposed above. We selected μk and Bk manually, as explained below. The disturbances followed the Laplace distribution with zero means, and the variances were selected so that the observed variables had unit variances. Moreover, the number of latent classes was 2, and 250 data points were generated for each class. Note that the number of latent classes was estimated as well by the β-divergence method [14]. The scatterplots of the observed variables are shown in Figure 1.

4.1 Example 1
We generated data using the following means, connection strengths and network structures:

   Class 1: μ1 = (0, 0)^T,  B1 = [[0, 0], [0.3, 0]]   (7)
   Class 2: μ2 = (4, 5)^T,  B2 = [[0, 0], [0.3, 0]]   (8)
Fig. 1. Left: Scatterplots of the observed variables in Example 1. Right: Scatterplots of the observed variables in Example 2. In the scatterplots, “.” denote members of class 1 and “+” those of class 2.
Both classes 1 and 2 had the same causal order x1 → x2, but different means (μ1 and μ2). The different mean structures created a strong correlation (0.88) in the pooled data, although the connection strength within each class was rather weak (0.3). The estimation results of the LcLiNGAM and the basic LiNGAM were as follows:

LcLiNGAM¹:

   Class 1: μ1 = (−0.02, 0.06)^T,  B1 = [[0, 0], [0.30, 0]]   (9)
   Class 2: μ2 = (4.01, 5.01)^T,  B2 = [[0, 0], [0.41, 0]]   (10)

LiNGAM:

   μ = (−0.09, 3.99)^T,  B = [[0, 0.81], [0, 0]]   (11)
The LcLiNGAM successfully recovered the means and structures of the networks and estimated the connection strengths fairly well for both latent classes. However, the basic LiNGAM failed to find the correct causal order and overestimated the connection strength.

4.2 Example 2
Next, we tried data whose means, connection strengths and network structures were as follows:

   Class 1: μ1 = (0, 0)^T,  B1 = [[0, 0], [0.3, 0]]   (12)
   Class 2: μ2 = (5, 4)^T,  B2 = [[0, 0.3], [0, 0]]   (13)

Now the two classes had different causal orders: x1 → x2 for class 1 and x1 ← x2 for class 2. The connection strengths were the same, but the mean structures were different between the classes. The estimation results were as follows:

LcLiNGAM:

   Class 1: μ1 = (−0.02, 0.07)^T,  B1 = [[0, 0], [0.39, 0]]   (14)
   Class 2: μ2 = (5.01, 4.01)^T,  B2 = [[0, 0.41], [0, 0]]   (15)

LiNGAM:

   μ = (3.88, 0.08)^T,  B = [[0, 0.78], [0, 0]]   (16)

¹ Obviously, the order of the latent classes cannot be recovered. In the examples, for clarity, we permuted the classes so that the differences between the estimates and the true values were minimized.
The LcLiNGAM estimated the connection strengths fairly well and found the correct causal order for each class. However, the basic LiNGAM could not detect that the two classes have different causal orders, because it cannot represent any difference between the classes; it estimated, rather arbitrarily, one single causal order x1 ← x2.
5 Simulation
To further verify the validity of our method, we performed experiments with simulated data. We repeatedly performed the following experiment:

1. First, we randomly constructed a strictly lower-triangular matrix B for each class, where the number of classes was 2 and the number of variables was 4. We also randomly selected variances of the disturbance variables. We further generated values for the constants μi, making the classes have small overlap.²
2. Next, we generated data with sample size 500 by independently drawing the disturbance variables ei from the uniform distribution with zero mean and unit variance for each class. The observed data X were generated according to the assumed recursive process and were combined into one whole dataset.
3. Finally, we fed the data to our discovery algorithm. The β-divergence method was employed to estimate the ICA mixtures. Here we told the algorithm that the disturbances were sub-gaussian.
4. We compared the estimated parameters to the generating parameters. In particular, we made a scatterplot of the entries in the estimates μ̂k and B̂k
Fig. 2. Left: Scatterplot of the estimated μi versus the original (generating) values. Right: Scatterplot of the estimated bij versus the original (generating) values. Five data sets were generated for the scatterplots.

² We first set μ1 = 0 and took as the elements of μ2 1.5 times the sum of the standard deviations of the corresponding observed variables of each class, multiplied by −1 with probability 50%.
against the corresponding ones in μk and Bk . (Note that the numbers of latent classes were estimated as well.) Figure 2 gives scatterplots of the elements of estimated μk and Bk versus the generating ones. The left is the scatterplot of the estimated means μi versus the original (generating) values. The right is the scatterplot of the estimated connection strengths bij versus the original (generating) values. We can see that the estimation works well, as evidenced by the grouping of the data points onto the main diagonal.
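Steps 1 and 2 of the experiment can be sketched as follows. The ranges for the connection strengths and constants are our own guesses, since the paper does not state them, but the disturbance distribution (uniform with zero mean and unit variance) and the strictly lower-triangular B follow the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vars, n_obs = 4, 500

# Strictly lower-triangular connection matrix with random strengths
# (the strength and mean ranges are assumptions for illustration).
B = np.tril(rng.uniform(-1.0, 1.0, (n_vars, n_vars)), k=-1)
mu = rng.uniform(-3.0, 3.0, n_vars)

# Uniform disturbances with zero mean and unit variance:
# Uniform(-sqrt(3), sqrt(3)) has variance exactly 1.
e = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (n_vars, n_obs))
X = mu[:, None] + np.linalg.inv(np.eye(n_vars) - B) @ e

print(X.shape)                        # (4, 500)
print(np.allclose(np.triu(B), 0.0))   # True: strictly lower triangular
```

Because B is strictly lower triangular, (I − B) is always invertible, so the recursive (acyclic) generating process is well defined for any sampled strengths.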
6 Conclusion
Developing statistical causal inference methods based on non-experimental data is a fundamental problem with a large number of potential applications. Previous methods developed for linear causal models [2, 6, 5] have been based on an explicit or implicit assumption of gaussianity, and have hence relied solely on the covariance structure of the data. Therefore, algorithms based on the gaussian assumption cannot in most cases distinguish between multiple equally possible causal models. In previous work, we have shown that an assumption of non-gaussianity of the disturbance variables, together with the assumptions of linearity and no latent variables, allows the linear acyclic model to be completely identified [9]. In this paper, we extended our previous work to cases where latent (hidden) classes are present. The new method can identify the DAG structures within latent classes and should enjoy a wider variety of applications. Although our method worked well in the artificial experiments, we obviously need to evaluate its empirical performance through more extensive simulations as well as on real-world data. For example, in many real situations latent classes would be much more overlapping than in our simulations; unfortunately, for such heavily overlapping cases, ICA estimation methods are still under development [14]. These are important topics for future research. As a further analysis, it is quite important to investigate what characterizes the latent classes in order to understand how the model can be applied, for example, in the design of practical interventions. The estimated means, connection strengths and network structures could provide an interpretation of the latent classes. For example, in Examples 1 and 2 above (Section 4), the differences of the means would be useful for interpreting the difference between the two classes (and probably for classifying new samples).
An additional or alternative approach is to analyze the samples classified into the latent classes using logistic regression, if covariates such as sex and age are available. One direction of future research would be to combine the latent class LiNGAM with logistic regression to improve the class distinction ability.³

³ Such a combination of mixture modeling and logistic regression has been proposed in the context of structural equation modeling (SEM) [17], although SEM based on gaussian mixtures requires that causal orders be prespecified, since non-gaussianity is not utilized for model identification.
A related topic is the case where hidden confounding (continuous) variables are present (latent variable LiNGAM) [18]. We would like to mention a useful connection between the two extensions of the basic LiNGAM: in the latent class LiNGAM discussed here, we basically have a binary (discrete) hidden confounding variable (the class membership) which determines the connection strengths when the structure of the network is the same for the different classes. In future work, we will consider a unifying framework that combines the two extensions.
Acknowledgment This work was partially carried out at Transdisciplinary Research Integration Center, Research Organization of Information and Systems. The authors would like to thank Patrik Hoyer for his valuable comments and Nurul Mollah, Mihoko Minami and Shinto Eguchi for providing access to their Matlab code for ICA mixtures. S.S. was supported by Grant-in-Aid for Scientific Research from Japan Society for the Promotion of Science.
References

1. Holland, P.W.: Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945–970 (1986)
2. Bollen, K.A.: Structural Equations with Latent Variables. John Wiley & Sons, Chichester (1989)
3. Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In: Proc. Pacific Symposium on Biocomputing, vol. 7, pp. 175–186 (2002)
4. Kim, J., Zhu, W., Chang, L., Bentler, P.M., Ernst, T.: Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping 28, 85–93 (2007)
5. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
6. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2000)
7. Shimizu, S., Kano, Y.: Use of non-normality in structural equation modeling: Application to direction of causation. Journal of Statistical Planning and Inference (in press, 2006)
8. Shimizu, S., Hyvärinen, A., Hoyer, P.O., Kano, Y.: Finding a causal ordering via independent component analysis. Computational Statistics & Data Analysis 50(11), 3278–3293 (2006)
9. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.: A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7, 2003–2030 (2006)
10. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 62–83 (1994)
11. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
12. Tanner, J., Israelsohn, W.: Parent-child correlation for body measurements of children between the ages one month and seven years. Ann. Hum. Genet. 26, 245–259 (1963)
13. Lee, T.W., Lewicki, M., Sejnowski, T.: ICA mixture models for unsupervised classification of non-gaussian sources and automatic context switching in blind signal separation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1–12 (2000)
14. Mollah, M.N.H., Minami, M., Eguchi, S.: Exploring latent structure of mixture ICA models by the minimum β-divergence method. Neural Computation 18, 166–190 (2006)
15. Minami, M., Eguchi, S.: Adaptive selection for minimum β-divergence method. In: Proc. Fourth Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan, pp. 475–480 (2003)
16. Pham, D.T., Garat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. Signal Processing 45, 1457–1482 (1997)
17. Muthén, B.O.: Beyond SEM: General latent variables modeling. Behaviormetrika 29, 81–117 (2002)
18. Hoyer, P.O., Shimizu, S., Kerminen, A.: Estimation of linear, non-gaussian causal models in the presence of confounding latent variables. In: Proc. Third European Workshop on Probabilistic Graphical Models (PGM 2006), pp. 155–162 (2006)
Efficient Incremental Learning Using Self-Organizing Neural Grove

Hirotaka Inoue¹ and Hiroyuki Narihisa²
¹ Department of Electrical Engineering and Information Science, Kure College of Technology, 2-2-11 Agaminami, Kure-shi, Hiroshima, 737-8506 Japan
[email protected]
² Department of Information and Computer Engineering, Okayama University of Science, 1-1 Ridai-cho, Okayama-shi, Okayama, 700-0005 Japan
[email protected]
Abstract. Multiple classifier systems (MCS) have become popular during the last decade. The self-generating neural tree (SGNT) is a suitable base classifier for MCS because of its simple setting and fast learning. In an earlier paper, we proposed a pruning method for the structure of the SGNT in the MCS to reduce the computational cost, and we called this model the self-organizing neural grove (SONG). In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem. The results show that the SONG enables rapid and efficient incremental learning.
1 Introduction
Recently, to improve the classification accuracy, multiple classifier systems (MCS) such as neural network ensembles, bagging, and boosting have been used for practical data mining applications [1]. When developing classifiers using learning methods, while more training data can reduce the prediction error, the learning process can itself get computationally intractable. This issue is becoming more evident today, because there are complex classification problems waiting to be solved in many domains, where large amounts of data are already available [2]. Ideally, it is desirable to be able to consider all the examples simultaneously, to get the best possible estimate of class distribution. On the other hand when the training set is large, all the examples cannot be loaded into memory at one go. One approach to overcome this constraint is to train the classifier using an incremental learning technique, whereby only subsets of the data are to be considered at any one time and results subsequently combined. Neural networks have great advantages of adaptability, flexibility, and universal nonlinear input-output mapping capability. However, to apply these neural networks, it is necessary to determine the network structure and some parameters by human experts, and it is quite difficult to choose the right network structure suitable for a particular application at hand. Moreover, they require M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 762–770, 2008. c Springer-Verlag Berlin Heidelberg 2008
a long training time to learn the input-output relation of the given data. These drawbacks prevent neural networks from being used as the base classifier of the MCS in practical applications. Self-generating neural networks (SGNN) [3] have a simple network design and high-speed learning. SGNN are an extension of the self-organizing maps (SOM) of Kohonen [4] and utilize competitive learning, which is implemented as a self-generating neural tree (SGNT). These abilities make SGNN suitable as the base classifier of the MCS. In order to improve the accuracy of SGNN, we proposed ensemble self-generating neural networks (ESGNN) for classification [5] as one form of MCS. Although the accuracy of ESGNN improves by using various SGNN, the computation cost, that is, the computation time and the memory capacity, increases in proportion to the number of SGNN in the MCS. Therefore, we proposed a pruning method for the structure of the SGNN in the MCS to reduce the computation time and the memory capacity, and we called this model the self-organizing neural grove (SONG) [6,7]. In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem from the UCI machine learning repository [8]. We use the letter recognition dataset as the classification problem and investigate the relation between the number of training data and the classification accuracy, the number of nodes, and the computation time. The results show that the SONG enables rapid and efficient incremental learning. The rest of the paper is organized as follows: the next section shows how to construct the SONG; Section 3 shows experimental results; finally, we present some conclusions and outline plans for future work.
2 Self-Organizing Neural Grove
In this section, we describe how to prune redundant leaves in the SONG. We implement the pruning method in two stages: the on-line pruning method and the off-line optimization method. First, we describe the on-line pruning applied while learning the SGNT. Second, we show the optimization method used in constructing the SONG. Finally, we show a simple example of the pruning method on a two-dimensional classification problem.

2.1 On-line Pruning of Self-Generating Neural Tree
The SGNT is based on SOM and implemented as competitive learning. The SGNT can be constructed directly from the given training data without any intervening human effort. The SGNT algorithm is defined as a tree construction problem: how to construct a tree structure from the given data, which consist of multiple attributes, under the condition that the final leaves correspond to the given data. Before we describe the SGNT algorithm, we introduce some notation:

– input data vector: ei ∈ IR^m,
– root, leaf, and node in the SGNT: nj,
– weight vector of nj: wj ∈ IR^m,
– number of leaves in nj: cj,
– distance measure: d(ei, wj),
– winner leaf for ei in the SGNT: nwin.

Input: A set of training examples E = {e_i}, i = 1, ..., N.
       A distance measure d(e_i, w_j).
Program Code:
  copy(n_1, e_1);
  for (i = 2, j = 2; i <= N; i++) {
    n_win = choose(e_i, n_1);
    if (leaf(n_win)) {
      copy(n_j, w_win);
      connect(n_j, n_win);
      j++;
    }
    copy(n_j, e_i);
    connect(n_j, n_win);
    j++;
    prune(n_win);
  }
Output: Constructed SGNT by E.

Fig. 1. SGNT algorithm

Table 1. Sub procedures of the SGNT algorithm

Sub procedure          Specification
copy(nj, ei/w_win)     Create nj, copy ei/w_win as wj in nj.
choose(ei, n1)         Decide nwin for ei.
leaf(nwin)             Check whether nwin is a leaf or not.
connect(nj, nwin)      Connect nj as a child leaf of nwin.
prune(nwin)            Prune leaves if the leaves have the same class.
The SGNT algorithm is a hierarchical clustering algorithm. The pseudo C code of the SGNT algorithm is given in Fig. 1, in which several sub procedures are used; Table 1 shows these sub procedures and their specifications. In order to decide the winner leaf nwin in the sub procedure choose(e_i, n_1), competitive learning is used. If a node nj includes the nwin as its descendant in the SGNT, the weight wjk (k = 1, 2, . . . , m) of nj is updated as follows:

   w_{jk} ← w_{jk} + (1 / c_j) · (e_{ik} − w_{jk}),   1 ≤ k ≤ m.   (1)
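A minimal sketch of the insertion step and the path update of Eq. (1) follows. This is our own simplification: it omits the on-line pruning and the full case handling of Fig. 1, but it preserves the invariant that each node's weight is the mean of the leaves beneath it.

```python
import numpy as np

class Node:
    def __init__(self, w):
        self.w = np.array(w, dtype=float)   # weight vector w_j
        self.kids = []
        self.c = 1                          # number of leaves below (c_j)

def insert(root, e):
    # choose(): greedy descent to the winner leaf n_win
    path, node = [], root
    while node.kids:
        path.append(node)
        node = min(node.kids, key=lambda k: np.linalg.norm(e - k.w))
    # the winner leaf is copied down as a child and e is attached
    # as a new leaf (cf. copy/connect in Fig. 1)
    node.kids = [Node(node.w), Node(e)]
    path.append(node)
    # Eq. (1): every node on the path to the new leaf updates its weight
    for n in path:
        n.c += 1
        n.w += (e - n.w) / n.c

data = np.random.default_rng(4).normal(size=(20, 2))
root = Node(data[0])
for e in data[1:]:
    insert(root, e)
print(np.allclose(root.w, data.mean(axis=0)))  # True: root weight = data mean
```

The update w ← w + (e − w)/c is the incremental-mean formula, which is why the root's weight ends up being exactly the mean of all inserted data.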
Fig. 2. An MCS which is constructed from K SGNTs. The test dataset T is entered each SGNT, the output oi is computed as the output of the winner leaf for the input data, and the MCS’s output is decided by voting outputs of K SGNTs.
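The combiner in Fig. 2 is plain majority voting over the K SGNT outputs; a minimal sketch (ties are broken by first occurrence, an arbitrary choice of ours):

```python
from collections import Counter

def mcs_predict(sgnt_outputs):
    """Majority vote over the K SGNT outputs o_1, ..., o_K (Fig. 2)."""
    return Counter(sgnt_outputs).most_common(1)[0][0]

print(mcs_predict([1, 0, 1, 1, 0]))  # -> 1
```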
After all training data are inserted into the SGNT as leaves, each leaf has its class label as the output, and the weight of each node is the average of the corresponding weights of all its leaves. The whole network of the SGNT reflects the given feature space by its topology. For more details concerning how to construct and run the SGNT, see [3]. Note that, to optimize the structure of the SGNT effectively, we remove the threshold value of the original SGNT algorithm in [3], which controlled the number of leaves based on the distance, because of the trade-off between memory capacity and classification accuracy. To compensate, we introduce a new pruning method in the sub procedure prune(n_win), which uses the class labels: for the leaves connected to n_win, if those leaves have the same class label, then their parent node is given that class label and the leaves are pruned. In the next sub-section, we describe how to optimize the structure of the SGNT in the MCS to improve the classification accuracy.

2.2 Optimization of the SONG
The SGNT has the capability of high-speed processing. However, the accuracy of the SGNT is inferior to conventional approaches, such as nearest neighbor, because the SGNT has no guarantee of reaching the nearest leaf for unknown data. Hence, we construct an MCS by taking the majority of plural SGNTs' outputs to improve the accuracy (Figure 2). Although the accuracy of the SONG is comparable to the accuracy of conventional approaches, the computational cost increases in proportion to the number of SGNTs in the SONG. In particular, the huge memory requirement prevents the use of the SONG for large datasets even with the latest computers. In order to improve the classification accuracy, we propose an optimization method of the SONG for classification. This method has two parts, the merge phase
1 begin initialize j = the height of the SGNT
2   do for each subtree's leaves in the height j
3     if the ratio of the most common class ≥ α,
4       then merge all leaves to parent node
5     if all subtrees are traversed in the height j,
6       then j ← j − 1
7   until j = 0
8 end.

Fig. 3. The merge phase
1 begin initialize α = 0.5
2   do for each α
3     evaluate the merge phase with 10-fold CV
4     if the best classification accuracy is obtained,
5       then record α as the optimal value
6     α ← α + 0.05
7   until α = 1
8 end.

Fig. 4. The evaluation phase
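As a concrete illustration, the merge phase of Fig. 3 can be sketched in Python. The `Node` class below is a hypothetical stand-in for an SGNT node (its weight vector is omitted), and a fixpoint loop replaces the explicit height-by-height traversal; the result is the same: pure-enough groups of sibling leaves are merged bottom-up.

```python
from collections import Counter

class Node:
    """Hypothetical SGNT node: a leaf carries a class label,
    an internal node carries children."""
    def __init__(self, label=None, children=None):
        self.label = label
        self.children = children or []

    @property
    def is_leaf(self):
        return not self.children

def merge_phase(root, alpha):
    """If the most common class among a node's (all-leaf) children has a
    ratio >= alpha, prune the leaves and give the parent that class."""
    changed = True
    while changed:                       # repeat until no merge applies
        changed = False
        stack = [root]
        while stack:
            node = stack.pop()
            if node.is_leaf:
                continue
            if all(c.is_leaf for c in node.children):
                counts = Counter(c.label for c in node.children)
                label, freq = counts.most_common(1)[0]
                if freq / len(node.children) >= alpha:
                    node.children = []   # prune the leaves
                    node.label = label   # parent takes the majority class
                    changed = True
            else:
                stack.extend(node.children)
    return root
```

With α = 1 only pure subtrees collapse; lowering α trades memory for a smoother decision boundary, as in the example of Fig. 5.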
and the evaluation phase. The merge phase is performed as a pruning algorithm to reduce dense leaves (Figure 3). It uses the class information and a threshold value α to decide whether each subtree's leaves should be pruned: for leaves that share the same parent node, if the proportion of the most common class is greater than or equal to α, these leaves are pruned and the parent node is assigned the most common class. The optimum threshold α differs from problem to problem; the evaluation phase therefore chooses the best threshold value by 10-fold cross-validation (Figure 4).

2.3 Simple Example of the Pruning Method
We show an example of the pruning algorithm in Figure 5. This is a two-dimensional classification problem with two equal circular Gaussian distributions that overlap. The shaded plane is the decision region of class 0 and the remaining plane is the decision region of class 1 by the SGNT; the dotted line is the ideal decision boundary. The number of training samples is 200 (class 0: 100, class 1: 100) (Figure 5(a)). The unpruned SGNT is given in Figure 5(b): 200 leaves and 120 nodes are automatically generated by the SGNT algorithm, its height is 7, and the number of units is 320. Here we define a unit as any element of the SGNT, i.e., the root, a node, or a leaf, where the root is the node at height 0; the unit count is used as a measure of the memory requirement in the next section. Figure 5(c) shows the pruned SGNT after the merge phase with α = 1. In this case, 159 leaves and 107
Efficient Incremental Learning Using SONG

Fig. 5. An example of the SGNT's pruning algorithm: (a) a two-dimensional classification problem with two equal circular Gaussian distributions, (b) the structure of the unpruned SGNT, (c) the structure of the pruned SGNT (α = 1), and (d) the structure of the pruned SGNT (α = 0.6). The shaded plane is the decision region of class 0 by the SGNT and the dotted line shows the ideal decision boundary.
nodes are pruned away and 54 units remain, and the decision boundary is the same as that of the unpruned SGNT. Figure 5(d) shows the pruned SGNT after the merge phase with α = 0.6. In this case, 182 leaves and 115 nodes are pruned away and only 23 units remain. Moreover, the decision boundary improves on that of the unpruned SGNT, because pruning reduces the effect of the overlapping classes. In the above example, we use all training data to construct the SGNT. The structure of the SGNT changes with the order of the training data, so we can construct an MCS from the same training data by changing the input order. We call this approach “shuffling”. To show how well the MCS is optimized by the pruning algorithm, we show an example of the MCS on the same problem used above. Figures 6(a) and 6(b) show the decision region of the MCS with α = 1 and α = 0.6, respectively, where we set the number of SGNTs K to 25. The result in Figure 6(b) is a better estimate of the ideal decision region than that in Figure 6(a).
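The shuffling-and-voting idea can be sketched generically: `train_fn` stands for any order-sensitive learner (for the SGNT, the induced tree depends on the insertion order), and the function names here are illustrative, not the authors' implementation.

```python
import random
from collections import Counter

def build_mcs(train_fn, data, k, seed=0):
    """Train k component models on independently shuffled copies of the
    same training data ("shuffling")."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)           # input order changes the model
        models.append(train_fn(shuffled))
    return models

def mcs_predict(models, x):
    """Classify x by majority vote over the component outputs."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Any classifier constructor can be plugged in for `train_fn`; the vote is taken exactly as in Figure 2 of the paper.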
Fig. 6. An example of the MCS's decision boundary (K = 25): (a) α = 1, and (b) α = 0.6. The shaded plane is the decision region of class 0 by the MCS and the dotted line shows the ideal decision boundary.
3 Experimental Results
We investigate the relation between the number of training data and the classification accuracy, the number of nodes, and the computation time of the SONG with bagging on a benchmark problem from the UCI repository [8]. In this experiment, we use a modified Euclidean distance measure defined as follows:

d(x, y) = \sqrt{ \sum_{i=1}^{m} a_i (x_i - y_i)^2 },   (2)

a_i = \frac{1}{\max_j - \min_j}, \quad (1 \le j \le N),   (3)

where \max_j and \min_j are the maximum and minimum values of the i-th attribute over the N training samples.
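Eqs. (2)–(3) normalize each attribute by its range before taking the usual Euclidean distance. A small sketch, assuming the max/min are taken per attribute over the training set (the zero-range guard is our addition):

```python
def attribute_weights(train):
    """a_i = 1 / (max_i - min_i) per attribute, as in Eq. (3);
    a constant attribute gets weight 0 to avoid division by zero."""
    m = len(train[0])
    weights = []
    for i in range(m):
        col = [row[i] for row in train]
        span = max(col) - min(col)
        weights.append(1.0 / span if span > 0 else 0.0)
    return weights

def modified_euclidean(x, y, weights):
    """Range-weighted distance of Eq. (2)."""
    return sum(a * (xi - yi) ** 2
               for a, xi, yi in zip(weights, x, y)) ** 0.5
```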
We use the letter recognition dataset in this experiment, since it contains large-scale data (number of input dimensions: 16, number of classes: 26, number of entries: 20,000). The objective of this dataset is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15. Results for other benchmark problems and a comparative study are given in [7,9]. First, we divide the letter recognition dataset into ten parts. Second, we select one of the ten parts as the testing data. Third, we enter one of the remaining nine parts into the SONG for training. Fourth, we test the SONG using the testing data. Finally, we continue the training and testing until all nine parts have been entered into the SONG. We set the number of SGNTs K in the SONG to 1, 3, 5, 9, 15, and 25. To select the optimum threshold value α, we use threshold values varied from 0.5 to 1: α = [0.5, 0.55, 0.6, . . . , 1].
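The ten-part incremental protocol above can be sketched generically; `train_step` and `evaluate` are hypothetical callbacks standing in for SONG training and testing.

```python
def incremental_evaluation(data, train_step, evaluate, n_parts=10):
    """Split data into n_parts; hold out one part for testing and feed the
    remaining parts to an incremental learner one at a time, testing after
    each step. Returns the accuracy (or any score) after each step."""
    size = len(data) // n_parts
    parts = [data[k * size:(k + 1) * size] for k in range(n_parts)]
    test = parts[0]                     # held-out testing part
    scores = []
    for part in parts[1:]:
        train_step(part)                # incremental training on one part
        scores.append(evaluate(test))   # test after each increment
    return scores
```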
Fig. 7. The relation between the number of training data and the classification accuracy

Table 2. The compression ratio of the SONG for the letter dataset

Number of N (×10³)     2     4     6     8     10    12    14    16    18
Compression ratio (%)  34.8  29.2  26.6  24.6  23.3  22.1  21.0  20.1  19.4
All computations of the SONG are performed on an IBM PC/AT machine (CPU: Intel Pentium 4, 2.26 GHz; memory: 512 MB). Figure 7 shows the relation between the number of training data and the classification accuracy. The classification accuracy improves as the number of training data increases, for every number of ensembles K, and the improvement is largest for small K over the whole range of N. As the memory requirement, we count the number of units, i.e., the sum of the root, nodes, and leaves of the SGNT. In this paper, we use the compression ratio defined below:

compression ratio = (number of remaining units) / (number of total units).   (4)
Table 2 shows the compression ratio of the memory requirement in the SONG. The compression ratio decreases gradually as the number of training data increases, which means that the SONG compresses large-scale data well and supports its use on large-scale datasets. Finally, Table 3 shows the computation time for training (total and interval) and for testing with K = 25 and α = 1.0. The training and testing times are the same at N = 2000. Beyond that, the interval training time is slightly larger than the testing time, because new leaves must be generated and redundant leaves pruned. The testing time increases only slightly with the number of training data, since the SONG searches for the nearest node within a subtree that has already been pruned. In conclusion, the SONG is practical for incremental learning and large-scale data mining.
Table 3. The computation time of training and testing for K = 25, α = 1.0

Number of N (×10³)          2     4     6     8     10    12    14    16    18
Total training time (s)     0.25  0.61  1.02  1.5   1.98  2.48  3.01  3.55  4.12
Interval training time (s)  0.25  0.36  0.41  0.48  0.48  0.5   0.53  0.54  0.57
Testing time (s)            0.25  0.31  0.36  0.34  0.39  0.43  0.48  0.45  0.44

4 Conclusions
In this paper, we investigated the performance of incremental learning with the SONG. Experimental results showed that the memory requirement is reduced effectively and that the accuracy increases as the number of training data grows. In conclusion, the SONG is a useful and practical incremental learning method for classifying large datasets. In future work, we will study a more effective pruning algorithm and parallel and distributed processing of the SONG for large-scale data mining.
References

1. Quinlan, J.R.: Bagging, Boosting, and C4.5. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, August 4–8, 1996, pp. 725–730. AAAI Press / MIT Press (1996)
2. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge (1996)
3. Wen, W.X., Jennings, A., Liu, H.: Learning a neural tree. In: Proc. of the International Joint Conference on Neural Networks, Beijing, China, November 3–6, 1992, vol. 2, pp. 751–756 (1992)
4. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
5. Inoue, H., Narihisa, H.: Improving generalization ability of self-generating neural networks through ensemble averaging. In: Terano, T., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 177–180. Springer, Berlin (2000)
6. Inoue, H., Narihisa, H.: Effective Pruning Method for a Multiple Classifier System Based on Self-Generating Neural Networks. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 11–18. Springer, Berlin (2003)
7. Inoue, H., Narihisa, H.: Self-organizing neural grove: Efficient multiple classifier system using pruned self-generating neural trees. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós, J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 1113–1122. Springer, Berlin (2004)
8. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Science (1998). Datasets are available at http://www.ics.uci.edu/~mlearn/MLRepository.html
9. Inoue, H.: Self-organizing neural grove. WSEAS Trans. on Computers 5(10), 2238–2244 (2006)
Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning

Masato Uchida¹, Yousuke Maehara², and Hiroyuki Shioya³

¹ Network Design Research Center, Kyushu Institute of Technology, 3–8–1 Asano, Kokurakita-ku, Kitakyushu-shi, Fukuoka 801–0001, Japan, [email protected]
² Graduate School of Computer Science and Systems Engineering, Muroran Institute of Technology, 27–1 Mizumoto-cho, Muroran-shi, Hokkaido 050–8585, Japan, [email protected]
³ Department of Computer Science and Systems Engineering, Muroran Institute of Technology, 27–1 Mizumoto-cho, Muroran-shi, Hokkaido 050–8585, Japan, [email protected]
Abstract. A learning method using an integration of multiple component predictors as an ultimate predictor is generically referred to as ensemble learning. The present paper proposes a weight parameter estimation method for ensemble learning under the constraint that we do not have any information on the desirable (true) output. The proposed method is naturally derived from a mathematical model of ensemble learning, which is based on an exponential mixture type probabilistic model and the Kullback divergence. The proposed method provides a legitimate strategy for weight parameter estimation under the abovementioned constraint if the accuracies of all the component predictors are assumed to be the same. We verify the effectiveness of the proposed method through numerical experiments.
1 Introduction
A learning method that builds an ultimate predictor, called an ensemble predictor, by integrating multiple component predictors is generically referred to as ensemble learning. The purpose of ensemble learning is to enhance accuracy and to reduce performance fluctuation through integration. Representative ensemble learning methods include Bagging [1], Boosting [2] and their derivatives. These methods perturb the data, which is composed of input information and corresponding desirable (true) output information, by resampling in order to induce diversity among the component predictors. For example, in a boosting method called AdaBoost [2], a weak learner/predictor (slightly better than random guessing) is trained iteratively while increasing the intensity of misclassified samples and decreasing the intensity of correctly classified samples. The ensemble predictor

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 771–780, 2008.
© Springer-Verlag Berlin Heidelberg 2008
is then built by integrating the multiple trained component predictors according to their performance with respect to the input/output data. On the other hand, the simple ensemble learning method proposed in [3] builds an ensemble predictor as the weighted average of multiple component predictors, where the weights are determined according to the squared loss of the ensemble predictor with respect to the input/output data. The above observation indicates that a number of existing studies concerning ensemble learning have considered how to use the given input/output data to obtain an efficient ensemble predictor. However, even when the input/output data are not given, the essential strategy of ensemble learning (i.e., the integration of multiple component predictors) remains promising, provided that multiple trained component predictors for building the ensemble predictor are on hand. That is, it is expected that the ensemble strategy will provide an effective solution, by using the multiple available trained component predictors, even for a prediction problem whose answer (i.e., true output) is unknown and will only be revealed in the future. Such prediction problems are not unusual; typical examples include “Who will win the next Academy Awards?”, “How long will the current cabinet be in power?”, and “Will a new item of a certain company make a big hit?”. Although these social, political, and economic examples are not from laboratory experiments, even for these kinds of problems it is expected that a convincing prediction can be obtained by aggregating opinions collected from various individuals. However, a number of previous studies on ensemble learning cannot be applied to solving the abovementioned problems, despite this expectation for the essential strategy of ensemble learning.
This is because these previous studies implicitly assumed the input/output data to be available for building the ensemble predictor, while the output information of the abovementioned problems cannot be obtained in principle. Thus the question arises of how to construct an ensemble predictor without referring to output information, where it is assumed that we can apply both knowledge of the problem to be solved (i.e., input information) and opinions on the problem (i.e., multiple trained component predictors). By answering this question, the application fields of ensemble learning can be expanded beyond laboratory experiments. Therefore, the present paper proposes a new ensemble learning method, called unsupervised ensemble learning, that builds an ensemble predictor without using output information. The proposed unsupervised ensemble learning method is based on an ensemble learning model proposed in [4,5], which is a generalization of the method proposed in [3]. Although the generalized model is designed under the assumption that output data can be used, as was the case in previous studies, the proposed unsupervised ensemble learning method is naturally derived from the generalized model by focusing on its structure, which is characterized by an exponential mixture type probabilistic model and the Kullback divergence. In addition, if the accuracies of the individual component predictors are assumed to be approximately the same, the proposed method becomes a legitimate strategy under the constraint that there is no output information.
2 Ensemble Learning with Supervised Weight Parameter Estimation

2.1 Fundamental Model
First, we briefly review the formulation of the ensemble learning method proposed in [3]. Consider a component predictor that outputs f_{\theta_i}(x) (\in \mathbb{R}) for the input x = (x_1, \dots, x_m)^T (\in \mathbb{R}^m), where \theta_i = (\theta_i^{(1)}, \dots, \theta_i^{(k_i)})^T (\in \mathbb{R}^{k_i}) is the set of modifiable parameters of the predictor f_{\theta_i}(x), for i = 1, \dots, M. In addition, assume that i.i.d. (independent and identically distributed) sample data composed of input x and corresponding desirable (true) output y are obtained from a certain probability density function (pdf) p^*(x, y) (= p^*(x)p^*(y|x)), where the set of i.i.d. sample data is denoted by D_{n_i}(x, y) = \{(x_1^{(i)}, y_1^{(i)}), \dots, (x_{n_i}^{(i)}, y_{n_i}^{(i)})\}. The ensemble learning method based on the squared loss function proposed in [3] involves finding

\hat{\theta}_i = \arg\min_{\theta_i} \sum_{(x,y) \in D_{n_i}(x,y)} (y - f_{\theta_i}(x))^2   (1)
and then using

\bar{f}_{\hat{\theta},\beta}(x) = \sum_{i=1}^{M} \beta_i f_{\hat{\theta}_i}(x)   (2)
as an ensemble predictor, where

\theta = (\theta_1^T, \dots, \theta_M^T)^T \in \mathbb{R}^{\sum_{i=1}^{M} k_i}, \quad \beta = (\beta_1, \dots, \beta_M)^T \in \mathbb{R}^M, \quad \sum_{i=1}^{M} \beta_i = 1, \quad \beta_i > 0.
Note that \beta_M is defined using \beta_1, \dots, \beta_{M-1} as \beta_M = 1 - \sum_{i=1}^{M-1} \beta_i, without loss of generality; this means that the essential dimension of \beta is M - 1. Hereafter, the parameter \beta is referred to as the weight parameter. Now, if we can use extra input/output data D_n(x', y') = \{(x'_1, y'_1), \dots, (x'_n, y'_n)\}, a set of i.i.d. samples obtained from p^*(x, y), the value of \beta can be estimated as

\hat{\beta}_S = \arg\min_{\beta} \sum_{(x,y) \in D_n(x',y')} (y - \bar{f}_{\hat{\theta},\beta}(x))^2,   (3)
where \hat{\beta}_S = (\hat{\beta}_{S,1}, \dots, \hat{\beta}_{S,M-1}) and \hat{\beta}_{S,M} is defined as \hat{\beta}_{S,M} = 1 - \sum_{i=1}^{M-1} \hat{\beta}_{S,i}. Note that the estimation in Eq. (3) is supervised in the sense that it is executed by using the output data y.
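For intuition, Eq. (3) has a closed form when M = 2: write β₁ = b, β₂ = 1 − b and set the derivative of the squared loss to zero. The sketch below is ours, not the paper's implementation, and it ignores the positivity constraint on β, so it is illustrative only.

```python
def supervised_weight_m2(f1, f2, labeled):
    """Closed-form minimizer of sum_(x,y) (y - b*f1(x) - (1-b)*f2(x))^2
    over b. `labeled` is an iterable of (x, y) pairs; f1, f2 are the two
    trained component predictors."""
    num = den = 0.0
    for x, y in labeled:
        diff = f1(x) - f2(x)
        num += (y - f2(x)) * diff
        den += diff * diff
    return num / den if den else 0.5  # equal weights if f1 == f2 on the data
```

If the true target really is a convex combination of the two predictors, the estimate recovers its coefficient exactly.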
The present paper proposes an unsupervised weight parameter estimation method, that is, a method that estimates the value of \beta without using the output data y. The proposed estimation method is based on the generalized model of ensemble learning [4,5], which includes the fundamental ensemble learning model of this section as a special case. A brief review of [4,5] is given in the next section.

2.2 General Model Based on the Exponential Mixture Model
Gaussian Exponential Mixture Model. Let us define a conditional Gaussian pdf of y given x using a function g(x) (\in \mathbb{R}) as follows:

p_{G,g}(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (y - g(x))^2 \right),

where \sigma is a positive constant. Using the above definition, we identify p_{G,g}(y|x) with g(x). In addition, Eq. (1) can be rewritten as

\hat{\theta}_i = \arg\min_{\theta_i} \sum_{(x,y) \in D_{n_i}(x,y)} \left( -\log p_{G,f_{\theta_i}}(y|x) \right).   (4)
On the other hand, we know the following relationship [6], which corresponds to Eq. (2):

p_{G,\bar{f}_{\hat{\theta},\beta}}(y|x) = \frac{\prod_{i=1}^{M} p_{G,f_{\hat{\theta}_i}}(y|x)^{\beta_i}}{\int_{\mathbb{R}} \prod_{i=1}^{M} p_{G,f_{\hat{\theta}_i}}(y|x)^{\beta_i} \, dy}.   (5)
We herein refer to p_{G,\bar{f}_{\hat{\theta},\beta}}(y|x) as a Gaussian exponential mixture model. In addition, Eq. (3) can be rewritten using Eq. (5) as

\hat{\beta}_S = \arg\min_{\beta} \sum_{(x,y) \in D_n(x',y')} \left( -\log p_{G,\bar{f}_{\hat{\theta},\beta}}(y|x) \right).   (6)
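Eq. (5) states that the normalized product of Gaussian densities raised to the powers βᵢ is again Gaussian, with mean Σᵢ βᵢ f_{θ̂ᵢ}(x) — which is why Eqs. (2) and (5) correspond. A quick numerical check of this fact (the grid bounds and resolution below are arbitrary choices of ours):

```python
import math

def gauss(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def exp_mixture_mean(mus, betas, lo=-20.0, hi=20.0, steps=20001):
    """Mean of prod_i N(y; mu_i, 1)^{beta_i} after normalization (Eq. (5)),
    estimated by Riemann summation on a uniform grid."""
    h = (hi - lo) / (steps - 1)
    ys = [lo + k * h for k in range(steps)]
    w = [math.prod(gauss(y, mu) ** b for mu, b in zip(mus, betas)) for y in ys]
    z = sum(w) * h                    # normalizing constant (denominator)
    return sum(y * wi for y, wi in zip(ys, w)) * h / z
```

For component means 0 and 2 with β = (0.25, 0.75), the mixture mean comes out as 0.25·0 + 0.75·2 = 1.5, matching Eq. (2).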
General Exponential Mixture Model. Let us denote the set of all pdfs on \mathcal{X} (\subset \mathbb{R}^m) by \mathcal{P}(\mathcal{X}) and the set of all conditional pdfs on \mathcal{Y} (\subset \mathbb{R}^l) given \mathcal{X} by \mathcal{P}(\mathcal{Y}|\mathcal{X}). Here, let us define a new conditional pdf, which can be regarded as a generalization of Eq. (5), using p_i(y|x) (\in \mathcal{P}_i(\mathcal{Y}|\mathcal{X}) \subset \mathcal{P}(\mathcal{Y}|\mathcal{X}), i = 1, \dots, M):

\bar{p}_\beta(y|x) \overset{\mathrm{def}}{=} \frac{\prod_{i=1}^{M} p_i(y|x)^{\beta_i}}{\int_{\mathcal{Y}} \prod_{i=1}^{M} p_i(y|x)^{\beta_i} \, dy},   (7)

where we assume

\beta = (\beta_1, \dots, \beta_{M-1})^T \in \mathbb{R}^{M-1}, \quad \sum_{i=1}^{M} \beta_i = 1, \quad \int_{\mathcal{Y}} \prod_{i=1}^{M} p_i(y|x)^{\beta_i} \, dy < \infty \ (\forall x \in \mathcal{X}).
We herein refer to p¯β (y|x) as an exponential mixture model.
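For a discrete output space, Eq. (7) reduces to a normalized geometric mean of the component pmfs; a minimal sketch of our own over a common finite support:

```python
import math

def exp_mixture(pmfs, betas):
    """Normalized geometric mixture prod_i p_i(y)^{beta_i} / Z over a
    common finite support (discrete analogue of Eq. (7))."""
    support = range(len(pmfs[0]))
    unnorm = [math.prod(p[y] ** b for p, b in zip(pmfs, betas)) for y in support]
    z = sum(unnorm)                   # normalizing constant
    return [u / z for u in unnorm]
```

With β = (1, 0) the mixture returns the first pmf unchanged; with β = (0.5, 0.5) it interpolates geometrically between the two components.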
On the other hand, for all p^*(x), q(x) \in \mathcal{P}(\mathcal{X}) and all p^*(y|x), q(y|x) \in \mathcal{P}(\mathcal{Y}|\mathcal{X}), the Kullback divergence D(\cdot \| \cdot) between p^*(y|x)p^*(x) and q(y|x)q(x) satisfies the following chain rule:

D(p^*(y|x)p^*(x) \| q(y|x)q(x)) = D(p^*(x) \| q(x)) + D(p^*(y|x) \| q(y|x)).   (8)
If p^*(x) \equiv q(x), the first term on the right-hand side of Eq. (8) is 0. In the context of learning, this condition means that the same information (problem) is input to both the teacher and the learner. In the present paper, we assume this condition and focus only on the second term on the right-hand side of Eq. (8). Under this assumption, the process of ensemble learning with supervised weight parameter estimation results in a sequence of three operations:

\hat{p}_i(y|x) \overset{\mathrm{def}}{=} \arg\min_{p(y|x) \in \mathcal{P}_i(\mathcal{Y}|\mathcal{X})} D(p^*(y|x) \| p(y|x)),   (9)

\hat{\bar{p}}_\beta(y|x) \overset{\mathrm{def}}{=} \arg\min_{p(y|x) \in \mathcal{P}(\mathcal{Y}|\mathcal{X})} \sum_{i=1}^{M} \beta_i D(p(y|x) \| \hat{p}_i(y|x)),   (10)

\hat{\beta}_S \overset{\mathrm{def}}{=} \arg\min_{\beta} D(p^*(y|x) \| \hat{\bar{p}}_\beta(y|x)).   (11)
Here, Eqs. (1) and (4) correspond to Eq. (9), Eqs. (2) and (5) correspond to Eq. (10), and Eqs. (3) and (6) correspond to Eq. (11). Note that it is easily confirmed, using Jensen's inequality, that Eq. (10) is well defined. As for the efficiency of \bar{p}_{\hat{\beta}_S}(y|x), the following is derived from the fact that \bar{p}_\beta(y|x) is a kind of exponential family [7]:

D(p^*(y|x) \| \bar{p}_{\hat{\beta}_S}(y|x)) = D(p^*(y|x) \| p_i(y|x)) - D(\bar{p}_{\hat{\beta}_S}(y|x) \| p_i(y|x)) \quad (i = 1, \dots, M)
  \le \min_{i=1,\dots,M} D(p^*(y|x) \| p_i(y|x)).   (12)

3 Ensemble Learning with Unsupervised Weight Parameter Estimation

3.1 Motivation
The efficiency shown in Eq. (12) indicates that the supervised weight parameter estimation is the recommended strategy. However, the efficiency can be enjoyed only when the output data (i.e., p∗ (y|x)) is available because the output data is needed in order to obtain βˆS through Eq. (11). This limitation narrows the application range of ensemble learning. For example, the supervised method cannot work for the prediction problems considered in Section 1, although not only are such problems quite common but also the essence of the ensemble strategy would be useful for solving such problems. This has motivated us to propose an unsupervised weight parameter estimation method that does not need the output data at all.
The underlying assumptions of the proposed method are two-fold. First, it is assumed that we have knowledge about the problem to be solved. This knowledge is vital input information for the predictors/solvers, because the predictors cannot do anything if the problem itself is unknown; the first assumption is therefore quite natural. Note that, in the context of learning, the input information is usually given as numerical data. Second, it is assumed that we have multiple opinions on the solution of the given problem. In the context of learning, this means that we have multiple component predictors that have been trained in some way in advance; in the context of the examples considered in Section 1, it means that we can collect opinions from various people. The second assumption is primary, because the purpose of the present paper is to provide a method for integrating the multiple trained component predictors, or for aggregating opinions from various people.

3.2 Proposed Method
General Case: Exponential Mixture Model. The following equation, derived in [4], yields the key idea of the proposed method:

\underbrace{D(p^*(y|x) \| \bar{p}_\beta(y|x))}_{\mathrm{I}} = \underbrace{\sum_{i=1}^{M} \beta_i D(p^*(y|x) \| p_i(y|x))}_{\mathrm{II}} - \underbrace{\sum_{i=1}^{M} \beta_i D(\bar{p}_\beta(y|x) \| p_i(y|x))}_{\mathrm{III}}.   (13)
Equation (13) reveals two important points. First, the minimization of I with respect to \beta is equivalent to the maximization of III with respect to \beta when D(p^*(y|x) \| p_i(y|x)) = D(p^*(y|x) \| p_j(y|x)) holds for i, j = 1, \dots, M, i \neq j, because II is then constant with respect to \beta. Second, the maximization of III with respect to \beta does not depend on p^*(y|x), which is required in the supervised weight parameter estimation method; therefore, the maximization of III can be realized in an unsupervised manner. Based on these two points, we propose a new weight parameter estimation method formulated as follows:

\hat{\beta}_U = \arg\max_{\beta} \sum_{i=1}^{M} \beta_i D(\bar{p}_\beta(y|x) \| p_i(y|x)),

where \hat{\beta}_U = (\hat{\beta}_{U,1}, \dots, \hat{\beta}_{U,M-1}) and \hat{\beta}_{U,M} is defined as \hat{\beta}_{U,M} = 1 - \sum_{i=1}^{M-1} \hat{\beta}_{U,i}. The weight \hat{\beta}_U is reasonable as long as the p_i(y|x) (i = 1, \dots, M) have similar
efficiency (i.e., D(p^*(y|x) \| p_i(y|x)) = D(p^*(y|x) \| p_j(y|x)) for i, j = 1, \dots, M, i \neq j). We hereafter use the following definition:

c(\beta) \overset{\mathrm{def}}{=} \sum_{i=1}^{M} \beta_i D(\bar{p}_\beta(y|x) \| p_i(y|x)).
The present paper assumes that input data D_n(x') = \{x'_1, \dots, x'_n\}, a set of i.i.d. samples obtained from p^*(x), are available. In this case, we replace c(\beta) by \hat{c}(\beta), defined as follows:

\hat{c}(\beta) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{x \in D_n(x')} \sum_{i=1}^{M} \beta_i \int_{\mathcal{Y}} \bar{p}_\beta(y|x) \log \frac{\bar{p}_\beta(y|x)}{p_i(y|x)} \, dy.

The value of \beta that maximizes \hat{c}(\beta) can be found by the gradient (ascent) iteration

\beta^{(t+1)} = \beta^{(t)} + \eta \nabla_\beta \hat{c}(\beta)\big|_{\beta = \beta^{(t)}},

where \beta^{(t)} = (\beta_1^{(t)}, \dots, \beta_{M-1}^{(t)}) is the value of \beta at update time t, \eta is a sufficiently small positive value, and \nabla_\beta = (\partial/\partial\beta_1, \dots, \partial/\partial\beta_{M-1}) is the gradient operator with respect to \beta.

Special Case: Gaussian Exponential Mixture Model. The proposed method has additional interesting properties if p_i(y|x) is modeled by a conditional Gaussian pdf. In the following, we consider the case in which p_i(y|x) = p_{G,f_{\hat{\theta}_i}}(y|x) and, for simplicity, set \sigma = 1. Then

\hat{c}(\beta) = \frac{1}{n} \sum_{x \in D_n(x')} \sum_{i=1}^{M} \beta_i \frac{1}{2} \left( \bar{f}_{\hat{\theta},\beta}(x) - f_{\hat{\theta}_i}(x) \right)^2 = \frac{1}{n} \sum_{x \in D_n(x')} \frac{1}{4} \sum_{i,j=1}^{M} \beta_i \beta_j \left( f_{\hat{\theta}_i}(x) - f_{\hat{\theta}_j}(x) \right)^2.

Here, using \beta_M = 1 - \sum_{i=1}^{M-1} \beta_i, we obtain

\nabla_\beta \hat{c}(\beta) = \delta - A\beta,

where A = (a_{i,j})_{(M-1) \times (M-1)} and \delta = (\delta_1, \dots, \delta_{M-1}) are defined as

a_{i,j} = \frac{1}{n} \sum_{x \in D_n(x')} \frac{1}{2} \left\{ (f_{\hat{\theta}_j}(x) - f_{\hat{\theta}_M}(x))^2 + (f_{\hat{\theta}_M}(x) - f_{\hat{\theta}_i}(x))^2 - (f_{\hat{\theta}_j}(x) - f_{\hat{\theta}_i}(x))^2 \right\},

\delta_i = \frac{1}{n} \sum_{x \in D_n(x')} \frac{1}{2} \left( f_{\hat{\theta}_M}(x) - f_{\hat{\theta}_i}(x) \right)^2.
As a result, if A is non-singular, \nabla_\beta \hat{c}(\beta) = 0 yields

\hat{\beta}_U = A^{-1} \delta.

As an interesting example, let us consider a case that satisfies

\frac{1}{n} \sum_{x \in D_n(x')} \frac{1}{2} \left( f_{\hat{\theta}_i}(x) - f_{\hat{\theta}_j}(x) \right)^2 = \varepsilon \quad (\forall i, j = 1, 2, \dots, M, \ i \neq j),   (14)

where \varepsilon is a positive constant. In this case, a_{i,i} = 2\varepsilon and a_{i,j} = \varepsilon (i \neq j), so we obtain

A^{-1} = \frac{1}{\varepsilon}
\begin{pmatrix}
\frac{M-1}{M} & -\frac{1}{M} & \cdots & -\frac{1}{M} \\
-\frac{1}{M} & \frac{M-1}{M} & \cdots & -\frac{1}{M} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{M} & -\frac{1}{M} & \cdots & \frac{M-1}{M}
\end{pmatrix}, \qquad \delta = (\varepsilon, \varepsilon, \dots, \varepsilon)^T.

Let \hat{\beta}_A be the value of \hat{\beta}_U when Eq. (14) is satisfied. We then obtain

\hat{\beta}_A = \left( \frac{1}{M}, \frac{1}{M}, \dots, \frac{1}{M} \right)^T.
The above property characterizes the meaning of the simple average in the context of unsupervised weight parameter estimation.
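Under the Gaussian model, the whole unsupervised estimate needs only the pairwise squared differences of the component predictions on unlabeled inputs. A self-contained sketch (the function names are ours, and `solve` is a plain Gauss-Jordan elimination standing in for any linear solver):

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A assumed non-singular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def half_msd(fi, fj):
    """(1/n) * sum_x (1/2)(f_i(x) - f_j(x))^2 over the unlabeled inputs."""
    return sum((a - b) ** 2 for a, b in zip(fi, fj)) / (2 * len(fi))

def unsupervised_weights(preds):
    """beta_U = A^{-1} delta for the Gaussian case; preds[i][k] is component
    i's output on the k-th unlabeled input. Returns (beta_1, ..., beta_M)."""
    m = len(preds)
    d = [[half_msd(preds[i], preds[j]) for j in range(m)] for i in range(m)]
    A = [[d[j][m - 1] + d[m - 1][i] - d[j][i] for j in range(m - 1)]
         for i in range(m - 1)]
    delta = [d[m - 1][i] for i in range(m - 1)]
    beta = solve(A, delta)
    return beta + [1.0 - sum(beta)]     # append beta_M = 1 - sum of the rest
```

In the equidistant case of Eq. (14), this returns the simple average (1/M, ..., 1/M), matching the property above.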
4 Numerical Examples

4.1 Experimental Conditions
In order to evaluate the practical performance of the proposed method, we conducted numerical experiments on pattern recognition tasks using the spam e-mail dataset and the diabetes dataset provided by the UCI repository of machine learning databases [8]. The experiments are based on the Gaussian exponential mixture model described in Sections 2.2 and 3.2. We generated the multiple trained component predictors, modeled by multi-layer perceptrons (MLPs), by training with the gradient descent method and the squared loss function on D_{n_i}(x, y). In addition, we executed the weight parameter estimation using D_n(x') for the proposed unsupervised method. In this way, we can emulate the situation described in Section 1: the output information of the problem is not given, while the knowledge about the problem to be solved (i.e., D_n(x')) and the opinions regarding its solution (i.e., f_{\hat{\theta}_i}(x)) are on hand. For reference, we executed the supervised weight parameter estimation using D_n(x', y'). Unknown samples are used to evaluate the generalization ability of the ensemble predictor. Detailed information on the experimental parameters is shown in Table 1.
Table 1. Parameters of Numerical Experiments

Parameter                                   Spam E-mail   Diabetes
Total number of instances                   4601          768
Number of known samples |D_{n_i}(x,y)|      100           50
Known samples |D_n(x',y')| or |D_n(x')|     100           50
Number of unknown samples                   4001          518
Number of input attributes: m               57            8
Number of output attributes: l              1             1
Number of hidden units in MLP               7             3
Number of component predictors (MLPs): M    5             5

Table 2. Rate of Correct Answers (%)

                                           Spam E-mail                  Diabetes
Sample    Predictor(s)                     Cond. 1        Cond. 2       Cond. 1        Cond. 2
                                           c=65   c=85    a=85, b=10    c=65   c=85    a=75, b=5
Known     f_{\hat\theta_i}(x) (ave.)       57.04  80.80   65.50         57.84  66.80   49.92
          \bar f_{\hat\theta,\hat\beta_A}  68.80  83.00   78.70         63.60  68.20   58.80
          \bar f_{\hat\theta,\hat\beta_U}  73.40  84.20   82.50         70.00  70.00   66.80
          \bar f_{\hat\theta,\hat\beta_S}  74.00  84.60   83.90         71.40  72.80   70.80
Unknown   f_{\hat\theta_i}(x) (ave.)       56.72  79.54   67.56         58.41  71.19   49.81
          \bar f_{\hat\theta,\hat\beta_A}  67.41  83.37   82.47         67.41  73.89   55.62
          \bar f_{\hat\theta,\hat\beta_U}  70.14  85.12   83.68         69.53  74.49   69.74
          \bar f_{\hat\theta,\hat\beta_S}  71.40  85.37   83.93         71.67  75.51   72.47
4.2 Results and Discussion
We used two conditions to terminate the training of the component predictors: one for making multiple component predictors with similar accuracy (Condition 1), and the other for making multiple component predictors with diverse accuracy (Condition 2). In Condition 1, we stop the training of all component predictors when the rate of correct answers with respect to D_{n_i}(x, y) reaches the same value c during training. In Condition 2, we stop the training of component predictor i when the rate reaches a − b(i − 1)% for i = 1, \dots, M (= 5). We executed numerical experiments using 10 different sets of samples for each experimental condition, where each set was made by random sampling from the original dataset. The results shown in Table 2 are the averages over these 10 sets. As shown in the table, the efficiency increases in the order f_{\hat\theta_i}(x) (ave.), \bar f_{\hat\theta,\hat\beta_A}(x), \bar f_{\hat\theta,\hat\beta_U}(x), \bar f_{\hat\theta,\hat\beta_S}(x) for all experimental conditions considered herein. In the following, we examine the experimental results in more detail. The degree of performance improvement in Condition 1 becomes better as the value of c becomes smaller. This means that the effect of the proposed method
becomes larger as the performance of the component predictors becomes worse; of course, the performance itself becomes better as c increases. On the other hand, the result for Condition 2 shows that the proposed method provides a reasonable weight parameter even when the assumption of the proposed method mentioned in Section 3.2 (i.e., that all component predictors have similar efficiency) does not hold exactly.
5 Conclusion
We proposed an unsupervised weight parameter estimation for ensemble learning using a mathematical model that is based on an exponential mixture model and Kullback divergence. The proposed method is feasible even when we do not know the information about the desired output in advance. In addition, the proposed method is a legitimate strategy for weight parameter estimation in the above-mentioned unsupervised situation if the performances of component predictors, which are used to build the ensemble predictor, are almost the same. The results of numerical experiments revealed that the performance of the proposed method is much better than that of the simple ensemble learning, even when the assumption on the performances of the component predictors is not satisfied.
Acknowledgment This study was supported in part by the Japan Society for the Promotion of Science through a Grant-in-Aid for Scientific Research (S) (18100001).
References

1. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
2. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
3. Ueda, N., Nakano, R.: Generalization error of ensemble estimators. In: Proceedings of the International Conference on Neural Networks 1996 (ICNN 1996), Washington, DC, USA, vol. 3, pp. 90–95 (1996)
4. Uchida, M., Shioya, H., Da-te, T.: Analysis and extension of ensemble learning (in Japanese). IEICE Transactions on Information and Systems, PT. 2 (Japanese Edition) J84-D-II, 1537–1542 (2001)
5. Uchida, M., Shioya, H.: A study on assignment of weight parameters in ensemble learning model (in Japanese). IEICE Transactions on Information and Systems, PT. 2 (Japanese Edition) J86-D-II, 1131–1134 (2003)
6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, Chichester (1991)
7. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2000)
8. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Sparse Super Symmetric Tensor Factorization

Andrzej Cichocki, Marko Jankovic, Rafal Zdunek, and Shun-ichi Amari

RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
[email protected]
Abstract. In this paper we derive and discuss a wide class of algorithms for 3D super-symmetric nonnegative tensor factorization (SNTF), or nonnegative symmetric PARAFAC, and, as a special case, symmetric nonnegative matrix factorization (SNMF). These have many potential applications, including multi-way clustering, feature extraction, multi-sensory or multi-dimensional data analysis, and nonnegative neural sparse coding. The main advantage of the derived algorithms is their relatively low complexity and, in the case of the multiplicative algorithms, the possibility of a straightforward extension to the factorization of L-order tensors, owing to a convenient symmetry property. We also propose to use a wide class of cost functions, such as the squared Euclidean distance, the Kullback–Leibler I-divergence, the alpha-divergence, and the beta-divergence. Preliminary experimental results confirm the validity and good performance of some of these algorithms, especially when the data have sparse representations.
1 Introduction – Problem Formulation
Tensors (also known as n-way arrays or multidimensional arrays) are used in a variety of applications ranging from neuroscience and psychometrics to chemometrics [1, 2, 3, 4, 5, 6, 7, 8]. Non-negative matrix factorization (NMF), non-negative tensor factorization (NTF), and parallel factor analysis (PARAFAC) models with non-negativity constraints have recently been proposed as sparse and quite efficient representations of signals, images, or general data [3, 4, 2, 5, 9, 10, 11, 12, 13, 14, 15]. From the viewpoint of data analysis, NTF is very attractive because it takes into account spatial and temporal correlations between variables more accurately than 2D matrix factorizations such as NMF, and it usually provides sparse common factors or hidden (latent) components with physiological meaning and interpretation [5]. In most applications, especially in neuroscience (EEG, fMRI), PARAFAC models have been used [12, 16, 17]. In this paper, we consider a special form of the PARAFAC model (referred to here as the SNTF model), but with additional nonnegativity and sparsity constraints [6, 7, 18, 19]. In the general case, the PARAFAC model can be described as a factorization of a
Also with Systems Research Institute PAN and Warsaw University of Technology; Dept. of EXE; Warsaw; Poland. Also with Institute of Telecommunications, Teleinformatics, and Acoustics; Wroclaw University of Technology; Poland.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 781–790, 2008. c Springer-Verlag Berlin Heidelberg 2008
given 3D tensor $\mathbf{Y} \in \mathbb{R}^{I \times T \times Q}$ into three unknown matrices: $\mathbf{A} \in \mathbb{R}^{I \times J}$, representing the common factors, basis matrix, dictionary matrix, or mixing matrix (depending on the application); $\mathbf{D} \in \mathbb{R}^{Q \times J}$, usually representing a scaling matrix; and $\mathbf{X} \in \mathbb{R}^{J \times T}$, representing the second set of common factors, hidden components, or sources (see Fig. 1).
Fig. 1. General 3D PARAFAC model described by the set of matrix equations $\mathbf{Y}_q = \mathbf{A} \mathbf{D}_q \mathbf{X} + \mathbf{N}_q$, $q = 1, 2, \ldots, Q$, where $\mathbf{D}_q$ is a diagonal matrix that holds on its main diagonal the $q$-th row of $\mathbf{D}$. In the special case of the SNTF, we impose nonnegativity and additional constraints, with $I = T = Q$, $\mathbf{A} = \mathbf{D} = \mathbf{G} \in \mathbb{R}^{I \times J}$, and $\mathbf{X} = \mathbf{G}^T$.
A super-symmetric tensor is a tensor whose entries are invariant under any permutation of the indices. For example, a third-order super-symmetric tensor $\mathbf{Y} \in \mathbb{R}^{I \times T \times Q}$ (with $I = T = Q$) has $y_{itq} = y_{iqt} = y_{tiq} = y_{tqi} = y_{qit} = y_{qti}$. Super-symmetric tensors arise naturally in multi-way clustering, where they represent generalized affinity tensors, in higher-order statistics, and in blind source separation. Pierre Comon [20] has shown a nice relationship between super-symmetric tensors and polynomials. Zass and Shashua applied them to multi-way clustering problems [6, 19, 7], and Hazan et al. developed some multiplicative algorithms for the NTF [2]. We formulate the SNTF decomposition of a third-order super-symmetric tensor $\mathbf{Y} \in \mathbb{R}^{I \times I \times I}$ in terms of three identical sparse nonnegative matrices $\mathbf{G} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_J] \in \mathbb{R}^{I \times J}$ with $J \ll I$, according to the factorization

\[ \mathbf{Y} = \sum_{j=1}^{J} \mathbf{g}_j \circ \mathbf{g}_j \circ \mathbf{g}_j + \mathbf{N}, \tag{1} \]
where $\mathbf{g}_j \in \mathbb{R}^{I}$ is the $j$-th column vector of the matrix $\mathbf{G}$, the operator $\circ$ denotes the outer product (if $\mathbf{u}, \mathbf{v}, \mathbf{w}$ are vectors, then $[\mathbf{u} \circ \mathbf{v} \circ \mathbf{w}]_{ijq} = u_i v_j w_q$), and $\mathbf{N}$ is a tensor representing the error. The SNTF model can be written in the equivalent matrix form

\[ \mathbf{Y}_q = \mathbf{G} \mathbf{D}_q \mathbf{G}^{T} + \mathbf{N}_q, \qquad (q = 1, 2, \ldots, I), \tag{2} \]
where $\mathbf{Y}_q = \mathbf{Y}_{:,:,q} = [y_{itq}] \in \mathbb{R}_{+}^{I \times I}$ are the (frontal) slices of the given tensor $\mathbf{Y} \in \mathbb{R}_{+}^{I \times I \times I}$, $I = Q = T$ is the number of (horizontal, vertical, frontal) slices, $\mathbf{G} = [g_{ij}] \in \mathbb{R}_{+}^{I \times J}$ is the unknown matrix (super-common factor) to be estimated, $\mathbf{D}_q \in \mathbb{R}_{+}^{J \times J}$ is a diagonal matrix that holds the $q$-th row of $\mathbf{G}$ on its main diagonal, and $\mathbf{N}_q = \mathbf{N}_{:,:,q} \in \mathbb{R}^{I \times I}$ is the $q$-th frontal slice of a tensor $\mathbf{N} \in \mathbb{R}^{I \times I \times I}$ (not necessarily super-symmetric) representing error or noise, depending upon the application. The above algebraic system can be represented in the equivalent scalar form

\[ y_{itq} = z_{itq} + n_{itq} = \sum_{j=1}^{J} g_{ij}\, g_{tj}\, g_{qj} + n_{itq}. \tag{3} \]
The objective is to estimate a sparse matrix $\mathbf{G}$, subject to constraints such as scaling to unit-length vectors and non-negativity, and possibly other natural constraints such as orthogonality, sparseness, and/or smoothness of all or some of the columns $\mathbf{g}_j$. Throughout this paper, we use the following notation: the $ij$-th element of the matrix $\mathbf{G}$ is denoted by $g_{ij}$ and its $j$-th column by $\mathbf{g}_j$; $y_{itq} = [\mathbf{Y}_q]_{it}$ denotes the $it$-th element of the $q$-th frontal slice $\mathbf{Y}_q$; and $z_{itq} = \sum_{j=1}^{J} g_{ij} g_{tj} g_{qj}$, with $i, t, q = 1, 2, \ldots, I$.
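To make the model concrete, the scalar form (3) can be realized in a few lines of numpy; this is an illustrative sketch (the function name and toy sizes are ours, not from the paper):

```python
import numpy as np

def sntf_reconstruct(G):
    """Rank-J super-symmetric reconstruction z_itq = sum_j g_ij g_tj g_qj
    (the noise-free part of models (1) and (3))."""
    return np.einsum('ij,tj,qj->itq', G, G, G)

rng = np.random.default_rng(0)
G = rng.random((5, 2))              # I = 5, J = 2
Z = sntf_reconstruct(G)

# super-symmetry: invariant under any permutation of the three indices
assert np.allclose(Z, Z.transpose(1, 0, 2))
assert np.allclose(Z, Z.transpose(2, 1, 0))
```

The `einsum` subscripts mirror the scalar sum over $j$ directly, so the same one-liner serves as the model evaluation step inside the iterative algorithms that follow.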
2 Multiplicative SNTF Algorithms

2.1 Generalized Alpha Divergence
The most widely known and often used adaptive algorithms for NTF/NMF, and also for SNTF, are based on alternating minimization of the squared Euclidean distance and the generalized Kullback–Leibler divergence [15, 13, 9]. In this paper, we propose to use a more general cost function: the alpha-divergence. The 3D generalized alpha-divergence can be defined for our purpose as follows [1]:

\[
D_A^{(\alpha)}(\mathbf{Y} \,\|\, \mathbf{Z}) =
\begin{cases}
\displaystyle \sum_{itq} \left[ \frac{y_{itq}\!\left( \left( y_{itq}/z_{itq} \right)^{\alpha-1} - 1 \right)}{\alpha(\alpha-1)} - \frac{y_{itq} - z_{itq}}{\alpha} \right], & \alpha \neq 0, 1, \\[3mm]
\displaystyle \sum_{itq} \left( y_{itq} \ln \frac{y_{itq}}{z_{itq}} - y_{itq} + z_{itq} \right), & \alpha = 1, \\[3mm]
\displaystyle \sum_{itq} \left( z_{itq} \ln \frac{z_{itq}}{y_{itq}} + y_{itq} - z_{itq} \right), & \alpha = 0,
\end{cases} \tag{4}
\]

where $y_{itq} = [\mathbf{Y}]_{itq}$ and $z_{itq} = [\mathbf{G} \mathbf{D}_q \mathbf{G}^T]_{it}$ for $i, t, q = 1, 2, \ldots, I$.
The choice of the parameter $\alpha \in \mathbb{R}$ depends on the statistical distribution of the noise and data. We recall that, as special cases of the alpha-divergence for $\alpha = 2, 0.5, -1$, we obtain the Pearson chi-square, Hellinger, and Neyman chi-square distances, respectively, while for $\alpha = 1$ and $\alpha = 0$ the divergence is defined by the limits of the first case of (4) as $\alpha \to 1$ and $\alpha \to 0$. Evaluating these limits yields the generalized Kullback–Leibler divergences given by the second and third cases of (4).
The gradient of the alpha-divergence (4), for $\alpha \neq 0$, can be expressed in the compact form

\[ \frac{\partial D_A^{(\alpha)}}{\partial g_{ij}} = \frac{1}{\alpha} \sum_{tq} g_{qj}\, g_{tj} \left[ 1 - \left( \frac{y_{itq}}{z_{itq}} \right)^{\alpha} \right], \qquad \alpha \neq 0. \tag{5} \]

However, instead of applying the standard gradient descent here, we use a projected (nonlinearly transformed) gradient approach, which can be considered as a generalization of the exponentiated gradient:

\[ \Phi(g_{ij}) \leftarrow \Phi(g_{ij}) - \eta_{ij} \frac{\partial D_A^{(\alpha)}}{\partial g_{ij}}, \tag{6} \]

where $\Phi(x)$ is a suitably chosen function. Hence, we have

\[ g_{ij} \leftarrow \Phi^{-1}\!\left( \Phi(g_{ij}) - \eta_{ij} \frac{\partial D_A^{(\alpha)}}{\partial g_{ij}} \right). \tag{7} \]
It can be shown that such a nonlinear scaling or transformation provides a stable solution, and the gradients are much better behaved in the $\Phi$ space. In our case, we employ $\Phi(x) = x^{\alpha}$ (for $\alpha = 0$, we use $\Phi(x) = \ln(x)$ instead) and choose the learning rates $\eta_{ij} = \alpha \Phi(g_{ij}) / \sum_{tq} g_{qj} g_{tj}$, which leads to the generalized multiplicative alpha algorithm:

\[ g_{ij} \leftarrow g_{ij} \left( \frac{\sum_{tq} g_{qj}\, g_{tj}\, (y_{itq}/z_{itq})^{\alpha}}{\sum_{tq} g_{qj}\, g_{tj}} \right)^{1/\alpha}, \tag{8} \]

with normalization of the columns of $\mathbf{G}$ to unit length at each iteration, i.e., $g_{ij} \leftarrow g_{ij} / \sum_{p=1}^{I} g_{pj}$. This SNTF algorithm can be considered as a generalization of the EMML algorithm (for $\alpha = 1$) proposed in [2, 6]. We may apply nonlinear projections or filtering via suitable nonlinear monotonic functions which increase or decrease the sparseness. In the simplest case, we can apply the very simple nonlinear transformation $g_{tj} \leftarrow (g_{tj})^{1+\alpha_{sp}}$, $\forall t$, where $\alpha_{sp}$ is a small coefficient, typically from 0.001 to 0.005; it is positive if we want to increase the sparseness of an estimated component and negative if we want to decrease it. Hence, the generalized alpha algorithm for the SNTF with sparsity control can take the following form:
\[ g_{ij} \leftarrow \left[\, g_{ij} \left( \frac{\sum_{tq} g_{qj}\, g_{tj}\, (y_{itq}/z_{itq})^{\alpha}}{\sum_{tq} g_{qj}\, g_{tj}} \right)^{\omega/\alpha} \right]^{1+\alpha_{sp}}, \tag{9} \]
where $\omega$ is an over-relaxation parameter (typically in the range $(0, 2)$) which controls the convergence speed, and $\alpha_{sp}$ is a small parameter which controls the sparsity of the estimated matrix $\mathbf{G}$.
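To make the update concrete, one sweep of the sparsity-controlled rule (9), with (8) recovered for $\omega = 1$ and $\alpha_{sp} = 0$, can be sketched in numpy; the function names, the small `eps` guard, and the toy sizes are our assumptions, not from the paper:

```python
import numpy as np

def sym_model(G):
    # z_itq = sum_j g_ij g_tj g_qj
    return np.einsum('ij,tj,qj->itq', G, G, G)

def alpha_update(G, Y, alpha=1.0, omega=1.0, a_sp=0.0, eps=1e-15):
    """One sweep of the multiplicative alpha update (9), alpha != 0 assumed;
    omega = 1 and a_sp = 0 recover (8) plus the column normalization."""
    Z = sym_model(G) + eps
    R = (Y / Z) ** alpha                       # (y_itq / z_itq)^alpha
    num = np.einsum('itq,tj,qj->ij', R, G, G)  # sum_tq g_qj g_tj (.)
    den = G.sum(axis=0) ** 2 + eps             # sum_tq g_qj g_tj, per column j
    G = (G * (num / den) ** (omega / alpha)) ** (1.0 + a_sp)
    return G / G.sum(axis=0)                   # g_ij <- g_ij / sum_p g_pj

rng = np.random.default_rng(0)
G_true = rng.random((4, 2))
G_true /= G_true.sum(axis=0)                   # unit column sums, as in (8)
Y = sym_model(G_true)

# at an exact factorization, y/z = 1 and the update is (numerically) a fixed point
assert np.allclose(alpha_update(G_true, Y), G_true)
```

The denominator uses the fact that $\sum_{tq} g_{qj} g_{tj} = (\sum_t g_{tj})^2$, which keeps the sweep at one `einsum` over the data tensor.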
2.2 SMART Algorithm
Alternative multiplicative SNTF algorithms can be derived using exponentiated gradient (EG) descent updates instead of the standard additive gradient descent. For example, using the alpha-divergence (4) for $\alpha = 0$, we have

\[ g_{ij} \leftarrow g_{ij} \exp\!\left( -\tilde{\eta}_{j} \frac{\partial D_A^{(0)}}{\partial g_{ij}} \right), \tag{10} \]

\[ \frac{\partial D_A^{(0)}}{\partial g_{ij}} = \sum_{tq} g_{qj}\, g_{tj} \left( \ln z_{itq} - \ln y_{itq} \right). \tag{11} \]

Hence, we obtain the simple multiplicative learning rules:

\[ g_{ij} \leftarrow g_{ij} \exp\!\left( \eta_{ij} \sum_{tq} g_{qj}\, g_{tj} \ln \frac{y_{itq}}{z_{itq}} \right) = g_{ij} \prod_{tq} \left( \frac{y_{itq}}{z_{itq}} \right)^{\eta_{ij}\, g_{qj}\, g_{tj}}. \tag{12} \]
The nonnegative learning rates $\eta_{ij}$ can take different forms. Typically, in order to guarantee the stability of the algorithm, we assume $\eta_{ij} = \tilde{\eta}_{j} = \omega \left( \sum_{t=1}^{T} g_{tj} \right)^{-2}$, where $\omega \in (0, 2)$ is an over-relaxation parameter. The above SNTF multiplicative algorithm can be considered as an alternating minimization/projection extension of the well-known SMART (Simultaneous Multiplicative Algebraic Reconstruction Technique) [11, 21].
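A minimal sketch of one sweep of (12) with the column-wise rates $\tilde{\eta}_j$ follows; the helper names and toy data are our assumptions:

```python
import numpy as np

def sym_model(G):
    # z_itq = sum_j g_ij g_tj g_qj
    return np.einsum('ij,tj,qj->itq', G, G, G)

def smart_update(G, Y, omega=1.0, eps=1e-15):
    """One sweep of the exponentiated-gradient (SMART-like) rule (12) with
    the learning rates eta_j = omega * (sum_t g_tj)^(-2)."""
    Z = sym_model(G) + eps
    L = np.log((Y + eps) / Z)                   # ln(y_itq / z_itq)
    g = np.einsum('itq,tj,qj->ij', L, G, G)     # sum_tq g_qj g_tj ln(.)
    eta = omega / (G.sum(axis=0) ** 2)          # one rate per column j
    return G * np.exp(eta * g)

rng = np.random.default_rng(0)
G_true = rng.random((4, 2))
Y = sym_model(G_true)

# y = z at the solution, so the log-ratio vanishes and the factors stay put
assert np.allclose(smart_update(G_true, Y), G_true)
```

Because the update is multiplicative through an exponential, strictly positive factors remain strictly positive, which is the practical appeal of the EG form.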
2.3 Generalized Beta Divergence
The generalized beta divergence can be considered as a complementary cost function of the generalized alpha divergence and can be defined as follows:
\[
D_B^{(\beta)}(\mathbf{Y} \,\|\, \mathbf{Z}) =
\begin{cases}
\displaystyle \sum_{itq} \left[ y_{itq}\, \frac{y_{itq}^{\beta} - z_{itq}^{\beta}}{\beta} - \frac{y_{itq}^{\beta+1} - z_{itq}^{\beta+1}}{\beta+1} \right], & \beta > 0, \\[3mm]
\displaystyle \sum_{itq} \left[ y_{itq} \ln \frac{y_{itq}}{z_{itq}} - y_{itq} + z_{itq} \right], & \beta = 0, \\[3mm]
\displaystyle \sum_{itq} \left[ \ln \frac{z_{itq}}{y_{itq}} + \frac{y_{itq}}{z_{itq}} - 1 \right], & \beta = -1.
\end{cases} \tag{13}
\]
The choice of the parameter $\beta$ depends on the statistical distribution of the data; the beta divergence corresponds to the Tweedie models [22]. For example, the optimal choice of $\beta$ is $\beta = 1$ for a normal distribution, $\beta = -1$ for a gamma distribution, $\beta = 0$ for a Poisson distribution, and $\beta \in (-1, 0)$ for the compound Poisson distribution. From the generalized beta divergence, we can derive various kinds of SNTF algorithms: multiplicative algorithms based on the standard gradient descent or exponentiated gradient (EG), additive algorithms using the projected gradient (PG) or interior point gradient (IPG), and quasi-Newton and fixed point (FP) ALS algorithms [23, 24, 25, 26, 27, 28, 9, 13]. In order to derive a multiplicative SNTF learning algorithm for sparse factorization, we compute the gradient of the regularized beta divergence (13), with the additional regularization (sparsification) term $J(\mathbf{G}) = \alpha_G \|\mathbf{G}\|_1 = \alpha_G \sum_{ij} g_{ij}$, as

\[ \frac{\partial D_{B\,\mathrm{reg}}^{(\beta)}}{\partial g_{ij}} = \sum_{tq} \left( z_{itq}^{\beta} - y_{itq}\, z_{itq}^{\beta-1} \right) g_{qj}\, g_{tj} + \alpha_G. \tag{14} \]
gij ← gij − ηij
∂DBreg ∂gij
,
(15)
β and by choosing suitable learning rates: ηij = gij / tq zitq gqj gtj , we obtain a generalized SNTF beta algorithm: 1−β gqj gtj (yjtq /zitq ) − αG gij ← gij
tq
+ β tq zitq gqj gtj
,
(16)
where [x]ε = max{ε, x} with a small ε = 10−16 introduced to avoid zero and negative values. In the special case, for β = 0 the above algorithm simplifies to the generalized alternating EMML algorithm that is similar to the algorithm derived by Hazan et al. [2, 29]: gqj gtj (yjtq /zitq ) − αG gij ← gij
tq
tq
3 3.1
+
gqj gtj
.
(17)
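One sweep of (16), and of its EMML-like special case (17) for $\beta = 0$, can be sketched as follows; the function names and toy data are our assumptions:

```python
import numpy as np

def sym_model(G):
    # z_itq = sum_j g_ij g_tj g_qj
    return np.einsum('ij,tj,qj->itq', G, G, G)

def beta_update(G, Y, beta=0.0, alpha_G=0.0, eps=1e-16):
    """One sweep of the multiplicative beta rule (16); beta = 0 gives the
    EMML-like special case (17). alpha_G is the L1 sparsity weight."""
    Z = sym_model(G) + eps
    # numerator: sum_tq g_qj g_tj y_itq z_itq^(beta-1), minus the sparsity term
    num = np.einsum('itq,tj,qj->ij', Y * Z ** (beta - 1.0), G, G) - alpha_G
    num = np.maximum(num, eps)                 # [x]_eps = max{eps, x}
    den = np.einsum('itq,tj,qj->ij', Z ** beta, G, G)
    return G * num / den

rng = np.random.default_rng(0)
G_true = rng.random((4, 2))
Y = sym_model(G_true)

# with alpha_G = 0, an exact factorization is (numerically) a fixed point
assert np.allclose(beta_update(G_true, Y), G_true)
```

Increasing `alpha_G` shrinks every factor entry, which is how the $\ell_1$ term promotes sparsity in the multiplicative form.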
3 Simple Alternative Approaches for Super-Symmetric Tensor Decomposition

3.1 Averaging Approach
For large tensor dimensions ($I \gg 1$), the local algorithms derived above can be computationally very expensive.
In this section, we propose an alternative simple approach which converts the problem to a simple tri-NMF model:

\[ \bar{\mathbf{Y}} = \mathbf{G} \bar{\mathbf{D}} \mathbf{G}^{T} + \bar{\mathbf{N}}, \tag{18} \]

where $\bar{\mathbf{Y}} = \sum_{q=1}^{Q} \mathbf{Y}_q \in \mathbb{R}^{I \times I}$, $\bar{\mathbf{D}} = \sum_{q=1}^{Q} \mathbf{D}_q = \mathrm{diag}\{\bar{d}_1, \bar{d}_2, \ldots, \bar{d}_J\}$, and $\bar{\mathbf{N}} = \sum_{q=1}^{Q} \mathbf{N}_q \in \mathbb{R}^{I \times I}$. The above system of linear algebraic equations can be represented in the equivalent scalar form $\bar{y}_{it} = \sum_j g_{ij} g_{tj} \bar{d}_j + \bar{n}_{it}$, or equivalently in the vector form $\bar{\mathbf{Y}} = \sum_j \bar{d}_j\, \mathbf{g}_j \mathbf{g}_j^{T} + \bar{\mathbf{N}}$, where the $\mathbf{g}_j$ are the columns of $\mathbf{G}$. Such a simple model is justified if the noise in the frontal slices is uncorrelated. It is interesting to note that the model can be written in the equivalent form

\[ \bar{\mathbf{Y}} = \tilde{\mathbf{G}} \tilde{\mathbf{G}}^{T} + \bar{\mathbf{N}}, \tag{19} \]

where $\tilde{\mathbf{G}} = \mathbf{G} \bar{\mathbf{D}}^{1/2}$, assuming that $\bar{\mathbf{D}} \in \mathbb{R}^{J \times J}$ is non-singular. Thus, the problem can be converted to a standard symmetric NMF problem of estimating the matrix $\tilde{\mathbf{G}}$, using any available NMF algorithm: multiplicative, FP-ALS, or PG. For example, by minimizing the regularized cost function

\[ D(\bar{\mathbf{Y}} \,\|\, \tilde{\mathbf{G}} \tilde{\mathbf{G}}^{T}) = \frac{1}{2} \| \bar{\mathbf{Y}} - \tilde{\mathbf{G}} \tilde{\mathbf{G}}^{T} \|_F^2 + \alpha_G \| \tilde{\mathbf{G}} \|_1 \tag{20} \]

and applying the FP-ALS approach, we obtain the simple algorithm

\[ \tilde{\mathbf{G}} \leftarrow \left[ (\bar{\mathbf{Y}}^{T} \tilde{\mathbf{G}} - \alpha_G \mathbf{E}) (\tilde{\mathbf{G}}^{T} \tilde{\mathbf{G}})^{-1} \right]_{+}, \tag{21} \]

subject to normalization of the columns of $\tilde{\mathbf{G}}$ to unit length at each iteration step, where $\mathbf{E}$ is the matrix of all ones of appropriate size.
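The averaging scheme (18)–(21) can be sketched end to end as follows. This is an illustrative sketch under our own assumptions: the function names, the random initialization, and the use of a pseudo-inverse (instead of a plain inverse, for numerical safety) are ours:

```python
import numpy as np

def averaging_symnmf(Y, J, alpha_G=0.0, iters=50, seed=0):
    """Averaging approach of Sec. 3.1: collapse the frontal slices to
    Ybar = sum_q Y_q and fit Ybar ~ Gt Gt^T with the FP-ALS step (21)."""
    rng = np.random.default_rng(seed)
    I = Y.shape[0]
    Ybar = Y.sum(axis=2)                      # sum over the frontal slices
    Gt = rng.random((I, J))
    E = np.ones((I, J))                       # matrix of all ones in (21)
    for _ in range(iters):
        # pseudo-inverse of the small J x J Gram matrix, for numerical safety
        Gt = (Ybar.T @ Gt - alpha_G * E) @ np.linalg.pinv(Gt.T @ Gt)
        Gt = np.maximum(Gt, 1e-12)            # [.]_+ half-wave rectification
        Gt /= np.linalg.norm(Gt, axis=0)      # unit-length columns
    return Gt

rng = np.random.default_rng(1)
G0 = rng.random((6, 2))
Y = np.einsum('ij,tj,qj->itq', G0, G0, G0)    # noise-free SNTF data
Gt = averaging_symnmf(Y, J=2)
assert Gt.shape == (6, 2) and np.all(Gt >= 0)
```

Note that each iteration works only with $I \times I$ and $J \times J$ matrices, which is the source of the speedup over the slice-by-slice multiplicative updates.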
3.2 Row-Wise and Column-Wise Unfolding Approach
It is worth noting that the diagonal matrices $\mathbf{D}_q$ are scaling matrices that can be absorbed into the matrix $\mathbf{G}$. By defining the column-normalized matrices $\mathbf{G}_q = \mathbf{G} \mathbf{D}_q$, we can use the simplified models

\[ \mathbf{Y}_q = \mathbf{G}_q \mathbf{G}^{T} + \mathbf{N}_q, \qquad (q = 1, \ldots, Q), \tag{22} \]

or equivalently

\[ \mathbf{Y}_q = \mathbf{G} \mathbf{G}_q^{T} + \mathbf{N}_q, \qquad (q = 1, \ldots, Q). \tag{23} \]

These simplified models can be described by a single compact matrix equation using column-wise or row-wise unfolding as

\[ \mathbf{Y}_c = \mathbf{G}_c \mathbf{G}^{T}, \tag{24} \]
or

\[ \mathbf{Y}_r = \mathbf{G} \mathbf{G}_r^{T}, \tag{25} \]

where $\mathbf{Y}_c = \mathbf{Y}_r^{T} = [\mathbf{Y}_1; \mathbf{Y}_2; \ldots; \mathbf{Y}_Q] \in \mathbb{R}^{QI \times I}$ is the column-wise unfolded matrix of the slices $\mathbf{Y}_q$, and $\mathbf{G}_c = \mathbf{G}_r^{T} = [\mathbf{G}_1; \mathbf{G}_2; \ldots; \mathbf{G}_Q] \in \mathbb{R}^{QI \times J}$ is the column-wise unfolded matrix of the matrices $\mathbf{G}_q = \mathbf{G} \mathbf{D}_q$ $(q = 1, 2, \ldots, Q)$. Using any efficient NMF algorithm (multiplicative, IPG, quasi-Newton, or FP-ALS) [23, 24, 25, 26, 27, 28, 9, 13], we can estimate the matrix $\mathbf{G}$. For example, by minimizing the cost function

\[ D(\mathbf{Y}_c \,\|\, \mathbf{G}_c \mathbf{G}^{T}) = \frac{1}{2} \| \mathbf{Y}_c - \mathbf{G}_c \mathbf{G}^{T} \|_F^2 + \alpha_G \| \mathbf{G} \|_1 \tag{26} \]

and applying the FP-ALS approach, we obtain the iterative algorithm

\[ \mathbf{G} \leftarrow \left[ (\mathbf{Y}_c^{T} \mathbf{G}_c - \alpha_G \mathbf{E}) (\mathbf{G}_c^{T} \mathbf{G}_c)^{-1} \right]_{+}, \tag{27} \]

or equivalently

\[ \mathbf{G} \leftarrow \left[ (\mathbf{Y}_r \mathbf{G}_r - \alpha_G \mathbf{E}) (\mathbf{G}_r^{T} \mathbf{G}_r)^{-1} \right]_{+}, \tag{28} \]

where $\mathbf{G}_c = \mathbf{G}_r^{T} = [\mathbf{G} \mathbf{D}_1; \mathbf{G} \mathbf{D}_2; \ldots; \mathbf{G} \mathbf{D}_Q]$, $\mathbf{D}_q = \mathrm{diag}\{\mathbf{g}^{q}\}$, and $\mathbf{g}^{q}$ denotes the $q$-th row of $\mathbf{G}$.
3.3 Semi-orthogonality Constraint
The matrix $\mathbf{G}$ is usually very sparse and may additionally satisfy orthogonality constraints. We can easily impose a semi-orthogonality constraint by additionally incorporating the iteration

\[ \mathbf{G} \leftarrow \mathbf{G} \left( \mathbf{G}^{T} \mathbf{G} \right)^{-1/2}. \tag{29} \]
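The inverse square root in (29) involves only the small $J \times J$ Gram matrix and can be computed from its eigendecomposition; a minimal sketch (names ours):

```python
import numpy as np

def semi_orthogonalize(G):
    """One step of (29): G <- G (G^T G)^(-1/2), computed from the
    eigendecomposition of the small J x J Gram matrix."""
    w, V = np.linalg.eigh(G.T @ G)        # Gram matrix is SPD for full-rank G
    inv_sqrt = (V / np.sqrt(w)) @ V.T     # (G^T G)^(-1/2)
    return G @ inv_sqrt

rng = np.random.default_rng(0)
G = rng.random((8, 3))
Q = semi_orthogonalize(G)
assert np.allclose(Q.T @ Q, np.eye(3))    # columns are now orthonormal
```

Note that this projection does not preserve nonnegativity in general (hence "semi-orthogonality"); in practice it is interleaved with the nonnegative updates above.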
3.4 Simulation Results
All the NTF algorithms presented in this paper were tested on many difficult benchmarks of signals and images with various statistical distributions of signals and additive noise. Comparisons and simulation results will be presented at ICONIP 2007.
4 Conclusions and Discussion
We have proposed generalized and flexible cost functions (controlled by sparsity penalty/regularization terms) that allow us to derive a family of SNTF algorithms. The main objective of this paper is to derive simple multiplicative algorithms which are especially suitable both for very sparse representations and for highly over-determined cases. The basic advantage of the multiplicative algorithms is their simplicity and relatively straightforward generalization to L-order tensors (L > 3). However, the multiplicative algorithms are relatively slow. We found that simple approaches which convert an SNTF problem to a symmetric NMF (SNMF) or symmetric tri-NMF problem provide more efficient and faster algorithms, especially for large-scale problems. Moreover, by imposing orthogonality constraints, we can drastically improve performance, especially for noisy data. Obviously, many challenging open issues remain, such as global convergence and the optimal choice of the associated parameters.
References

1. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, Heidelberg (1985)
2. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference on Computer Vision (ICCV), pp. 50–57 (2005)
3. Workshop on Tensor Decompositions and Applications, CIRM, Marseille, France (2005)
4. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56–67. Springer, Heidelberg (2006)
5. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004)
6. Shashua, A., Zass, R., Hazan, T.: Multi-way clustering using super-symmetric non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 595–608. Springer, Heidelberg (2006)
7. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In: International Conference on Computer Vision (ICCV), Beijing, China (2005)
8. Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
9. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis (in press, 2006)
10. Cichocki, A., Zdunek, R., Amari, S.: Csiszar's divergences for non-negative matrix factorization: Family of new algorithms. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 32–39. Springer, Heidelberg (2006)
11. Cichocki, A., Amari, S., Zdunek, R., Kompass, R., Hori, G., He, Z.: Extended SMART algorithms for non-negative matrix factorization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 548–562. Springer, Heidelberg (2006)
12. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
13. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Processing Systems, Vancouver, Canada, pp. 283–290 (2005)
14. Kim, M., Choi, S.: Monaural music source separation: Nonnegativity, sparseness, and shift-invariance. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 617–624. Springer, Heidelberg (2006)
15. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)
16. Mørup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938–947 (2006)
17. Miwakeichi, F., Martinez-Montes, E., Valdes-Sosa, P., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space-time-frequency components using parallel factor analysis. NeuroImage 22, 1035–1045 (2004)
18. Zass, R., Shashua, A.: Nonnegative sparse PCA. In: Neural Information Processing Systems (NIPS), Vancouver, Canada (2006)
19. Zass, R., Shashua, A.: Doubly stochastic normalization for spectral clustering. In: Neural Information Processing Systems (NIPS), Vancouver, Canada (2006)
20. Comon, P.: Tensor decompositions: state of the art and applications. In: McWhirter, J.G., Proudler, I.K. (eds.) Institute of Mathematics and its Applications Conference on Mathematics in Signal Processing, pp. 18–20. Clarendon Press, Oxford, UK (2001)
21. Byrne, C.L.: Choosing parameters in block-iterative or ordered-subset reconstruction algorithms. IEEE Transactions on Image Processing 14, 321–327 (2005)
22. Minami, M., Eguchi, S.: Robust blind source separation by Beta-divergence. Neural Computation 14, 1859–1886 (2002)
23. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.: Novel multi-layer nonnegative tensor factorization with sparsity constraints. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4432, pp. 271–280. Springer, Heidelberg (2007)
24. Cichocki, A., Zdunek, R.: Regularized alternating least squares algorithms for non-negative matrix/tensor factorizations. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 793–802. Springer, Heidelberg (2007)
25. Cichocki, A., Zdunek, R., Amari, S.: New algorithms for non-negative matrix factorization in applications to blind source separation. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), Toulouse, France, vol. 5, pp. 621–624 (2006)
26. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.: Nonnegative tensor factorization using Alpha and Beta divergences. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, vol. III, pp. 1393–1396 (2007)
27. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with constrained second-order optimization. Signal Processing 87, 1904–1916 (2007)
28. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with quadratic programming. Neurocomputing (accepted, 2007)
29. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proc. of the 22nd International Conference on Machine Learning, Bonn, Germany (2005)
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria

Dacheng Tao¹, Jimeng Sun², Xindong Wu³, Xuelong Li⁴, Jialie Shen⁵, Stephen J. Maybank⁴, and Christos Faloutsos²

¹ Department of Computing, Hong Kong Polytechnic University, Hong Kong [email protected]
² Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA [email protected], [email protected]
³ Department of Computer Science, University of Vermont, Burlington, USA [email protected]
⁴ Sch. Computer Science & Info Systems, Birkbeck, University of London, London, UK [email protected], [email protected]
⁵ School of Information Systems, Singapore Management University, Singapore [email protected]
Abstract. From data mining to computer vision, from visual surveillance to biometrics research, from biomedical imaging to bioinformatics, and from multimedia retrieval to information management, a large amount of data is naturally represented by multidimensional arrays, i.e., tensors. However, conventional probabilistic graphical models with probabilistic inference only model data in vector format, although such models are very important in many statistical problems, e.g., model selection. Is it possible to construct multilinear probabilistic graphical models for tensor-format data to conduct probabilistic inference, e.g., model selection? This paper provides a positive answer based on the proposed decoupled probabilistic model, by developing probabilistic tensor analysis (PTA), which selects a suitable model for tensor-format data modeling based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Empirical studies demonstrate that PTA associated with AIC and BIC selects the correct number of models. Keywords: Probabilistic Inference, Akaike Information Criterion, Bayesian Information Criterion, Probabilistic Principal Component Analysis, and Tensor.
1 Introduction

In computer vision, data mining, and other applications, objects are naturally represented by tensors or multidimensional arrays, e.g., gray-level face images in biometrics, colour images in scene classification, colour video shots (as shown in Figure 1) in multimedia information management, TCP flow records in computer networks, and the DBLP bibliography in data mining [6]. Therefore, tensor-based data modeling has become very popular, and a large number of learning models have been developed, from unsupervised to supervised learning, e.g., high order singular value decomposition [5] [1] [2], n-mode component analysis or tensor principal component
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 791–801, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. A colour video shot is a fourth order tensor. Four indices are used to locate elements: two indices for pixel locations, one index for the colour information, and one index for time. The video shot comes from http://www-nlpir.nist.gov/projects/trecvid/.
analysis (TPCA) [5] [10] [6], three-mode data principal component analysis [4], Tucker decomposition [9], tensorfaces [10], generalized low rank approximations of matrices [12], two-dimensional linear discriminant analysis [11], dynamic tensor analysis [6], and general tensor discriminant analysis [7]. However, all these tensor-based learning models lack systematic justification at the probabilistic level. Therefore, it is impossible to conduct conventional statistical tasks, e.g., model selection, over existing tensor-based learning models. The difficulty of applying probability theory, probabilistic inference, and probabilistic graphical modeling to tensor-based learning models comes from the gap between the tensor-based datum representation and the vector-based input requirement of traditional probabilistic models. Is it possible to apply probability theory and to develop probabilistic graphical models for tensors? Is it possible to apply statistical models over specific tensor-based learning models? To answer the above questions, we narrow our focus to generalizing probabilistic principal component analysis (PPCA) [8], which is an important generative model [3] for subspace selection or dimension reduction for vectors. This generalization forms probabilistic tensor analysis (PTA), a generative model for tensors. The significance of PTA is as follows: 1) Providing a probabilistic analysis for tensors. Based on PTA, a probabilistic graphical model can be constructed, the dimension reduction procedure for tensors can be understood at the probabilistic level, and the parameter estimation can be formulated by utilizing the expectation maximization (EM) algorithm under the maximum likelihood framework [3]; 2) Providing statistical methods for model selection based on the Akaike and Bayesian information criteria (AIC and BIC) [3].
It is impossible for conventional tensor-based subspace analysis to find a criterion for model selection, i.e., for determining the appropriate number of retained dimensions to model the original tensors. By
providing specific utilizations of AIC and BIC, model selection for tensor subspace analysis becomes possible. In PTA associated with AIC/BIC, the number of retained dimensions can be chosen by an iterative procedure; and 3) Providing a flexible framework for tensor data modeling. PTA assumes that the entries in tensor measurements are drawn from multivariate normal distributions. This assumption can be changed for different applications, e.g., to multinomial/binomial distributions in sparse tensor modeling. With different assumptions, different dimension reduction algorithms can be developed for different applications.
2 Probabilistic Tensor Analysis

In this section, we first construct the latent tensor model, which uses a multilinear mapping to relate the observed tensors to unobserved latent tensors. Based on the proposed latent tensor model, we then develop probabilistic tensor analysis with dimension reduction and data reconstruction. A detailed description of tensor terminology can be found in [9].

2.1 Latent Tensor Model

Similar to the latent variable model [8], the latent tensor model multilinearly relates the high dimensional observation tensors $\mathcal{T}_i \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_{M-1} \times l_M}$ for $1 \le i \le n$ to the corresponding latent tensors $\mathcal{X}_i \in \mathbb{R}^{l'_1 \times l'_2 \times \cdots \times l'_{M-1} \times l'_M}$ in the low dimensional space, i.e.,

\[ \mathcal{T}_i = \mathcal{X}_i \prod_{d=1}^{M} \times_d U_d^{T} + \mathcal{M} + \mathcal{E}_i, \tag{1} \]

where $U_d|_{d=1}^{M} \in \mathbb{R}^{l'_d \times l_d}$ is the $d$th projection matrix and $1 \le d \le M$; $\mathcal{M} = (1/n) \sum_{i=1}^{n} \mathcal{T}_i$ is the mean tensor of all observed tensors $\mathcal{T}_i$; and $\mathcal{E}_i$ is the $i$th residual tensor, every entry of which follows $N(0, \sigma^2)$. Moreover, the number of effective dimensions of the latent tensors is upper bounded by $l_d - 1$ for the $d$th mode, i.e., $l'_d \le l_d - 1$ for the $d$th projection matrix $U_d$. Here $1 \le i \le n$, and $n$ is the number of observed tensors. The projection matrices $U_d$ $(1 \le d \le M)$ construct the multilinear mapping between the observations and the latent tensors.
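To fix ideas, the multilinear map in (1) can be sketched with numpy mode products. This is an illustrative sketch: the helper names and the second-order ($M = 2$) example are our assumptions:

```python
import numpy as np

def mode_product(X, A, d):
    """Mode-d product X x_d A: contracts mode d of X with the rows of A,
    i.e. (X x_d A)[..., i, ...] = sum_k A[i, k] X[..., k, ...]."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, d)), 0, d)

def latent_tensor_model(X, Us, M, sigma=0.0, rng=None):
    """Generate one observation T = X prod_d x_d U_d^T + M + E  (model (1)).
    Us[d] has shape (l'_d, l_d), so multiplying by U_d^T maps mode d from
    the latent size l'_d up to the observed size l_d."""
    T = X
    for d, U in enumerate(Us):
        T = mode_product(T, U.T, d)
    E = sigma * rng.standard_normal(M.shape) if sigma > 0 else 0.0
    return T + M + E

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3))       # latent tensor, l'_1 = 2, l'_2 = 3
U1 = rng.standard_normal((2, 4))      # U_1 in R^{l'_1 x l_1}
U2 = rng.standard_normal((3, 5))      # U_2 in R^{l'_2 x l_2}
M = np.zeros((4, 5))                  # mean tensor
T = latent_tensor_model(X, [U1, U2], M)

# for a second-order tensor the model reduces to T = U_1^T X U_2 + M
assert np.allclose(T, U1.T @ X @ U2)
```

The same loop applies unchanged for any order $M$, since `tensordot` contracts one mode at a time.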
2.2 Probabilistic Tensor Analysis

Armed with the latent tensor model and probabilistic principal component analysis (PPCA), a probabilistic treatment of TPCA is constructed by introducing hyperparameters with prior distribution $p(\mathcal{M}, U_d|_{d=1}^{M}, \sigma^2 \mid D)$, where $D = \{\mathcal{T}_i|_{i=1}^{n}\}$ is the set containing all observed tensors. According to the Bayes theorem, given $D$ the probabilistic modeling of TPCA or PTA is defined by the predictive density, i.e.,

\[ p(\mathcal{T} \mid D) = p(\mathcal{T} \mid \mathcal{M}, U_d|_{d=1}^{M}, \sigma^2)\; p(\mathcal{M}, U_d|_{d=1}^{M}, \sigma^2 \mid D), \tag{2} \]
where $U_d|_{d=1}^{M} \in \mathbb{R}^{l'_d \times l_d}$, $p(\mathcal{T} \mid \mathcal{M}, U_d|_{d=1}^{M}, \sigma^2)$ is the predictive density for the given full probabilistic model, and $p(\mathcal{M}, U_d|_{d=1}^{M}, \sigma^2 \mid D)$ is the posterior probability. The model parameters $(\mathcal{M}, U_d|_{d=1}^{M}, \sigma^2)$ can be obtained by applying maximum a posteriori (MAP) estimation. In PTA, to obtain $(\mathcal{M}, U_d|_{d=1}^{M}, \sigma^2)$, we have the following concerns:
1) The probabilistic distributions are defined over vectors, not tensors. Although the vectorization operation makes it possible to apply conventional probability theory and inference, the computational cost would be very high and almost intractable for practical applications. From this perspective, it is important to obtain the probabilistic model in a computationally tractable way; below, the decoupled probabilistic model is developed to significantly reduce the computational complexity. 2) How should the number of retained dimensions used to model the observed tensors be determined? In probability theory and statistics, the Akaike information criterion (AIC) [3] and the Bayesian information criterion (BIC) [3] are popular for model selection. However, both AIC and BIC were developed for data represented by vectors; therefore, it is important to generalize AIC and BIC to tensor data. Based on the above discussion, obtaining the projection matrices $U_d|_{d=1}^{M}$ directly is computationally intractable, because the model requires obtaining all projection matrices simultaneously. To construct a computationally tractable algorithm for $U_d|_{d=1}^{M}$, we construct a decoupled probabilistic model for (1), i.e., obtain each
Fig. 2. Decoupled probabilistic graphical model for probabilistic tensor analysis
projection matrix $U_d$ separately, forming an alternating optimization procedure. The decoupled predictive density $p(\mathcal{T} \mid \mathcal{M}, U_d|_{d=1}^{M}, \sigma^2)$ is defined as

\[ p(\mathcal{T} \mid \mathcal{M}, U_d|_{d=1}^{M}, \sigma^2) \propto \prod_{d=1}^{M} p(\mathcal{T} \times_d U_d \mid \vec{\mu}_d, \sigma_d^2), \tag{3} \]
G where μd ∈ Rld is the mean vector for the dth mode; σ d2 is the variance of the noise to model the residue on the dth mode. The decoupled posterior distribution with the given observed tensors D is M G p ( M,U d |dM=1 , σ 2 | D ) ∝ ∏ p ( μ d ,U d , σ d2 | D, U k |1k≤≠kd≤ M ) .
(4)
d =1
Therefore, based on (3) and (4), the decoupled predictive density is

$p(\mathcal{T} \mid D) \propto \prod_{d=1}^{M} p(\mathcal{T} \times_d U_d \mid \mu_d, \sigma_d^2) \times p(D, U_k|_{1 \le k \ne d \le M} \mid \mu_d, U_d, \sigma_d^2)$.  (5)

With (4) and (5), we have the decoupled probabilistic graphical model shown in Figure 2.

Consider the column space of the projected data set $D_d = \{t_{d;j}\}$, where the sample $t_{d;j} \in \mathbb{R}^{l_d}$ is the $j$th column of $T_d = \mathrm{mat}_d(\mathcal{T} \times_d U_d)$ and $T_d \in \mathbb{R}^{l_d \times \bar{l}_d}$, with $\bar{l}_d = \prod_{k \ne d} l_k'$. Therefore, $1 \le j \le n\bar{l}_d$. Let $X_d = \mathrm{mat}_d\big(\mathcal{T} \prod_{k=1}^{M} \times_k U_k\big)$ and let $x_{d;j} \in \mathbb{R}^{l_d'}$ be the $j$th column of $X_d$. Suppose $t_d \mid x_d \sim N(U_d^{T} x_d + \mu_d,\ \sigma_d^2 I)$ and the marginal distribution of the latent vectors is $x_d \sim N(0, I)$; then the marginal distribution of the observed projected mode-$d$ vector $t_d$ is also Gaussian, i.e., $t_d \sim N(\mu_d, C_d)$. The mean vector and the corresponding covariance matrix are $\mu_d = \sum_{j=1}^{n\bar{l}_d} t_{d;j} / (n\bar{l}_d)$ and $C_d = U_d^{T} U_d + \sigma_d^2 I$, respectively. From the properties of the Gaussian distribution, we have

$x_d \mid t_d \sim N\big(M_d^{-1} U_d (t_d - \mu_d),\ \sigma_d^2 M_d^{-1}\big)$,  (6)

where $M_d = U_d U_d^{T} + \sigma_d^2 I$. In this decoupled model, the objective is to find the $d$th projection matrix $U_d$ by MAP given $U_k|_{1 \le k \ne d \le M}$, i.e., to calculate $U_d$ by maximizing

$\log p(U_d \mid D \times_d U_d) \propto -\dfrac{n\bar{l}_d}{2}\big[\log\det(C_d) + \mathrm{tr}(C_d^{-1} S_d)\big]$,  (7)

where $\log(a)$ is the natural logarithm of $a$ and $S_d$ is the sample covariance matrix of $t_d$. Eq. (7) benefits from the decoupled definition of the probabilistic model in (5). Based on (7), the total log posterior is
$L = \sum_{d=1}^{M} \log p(U_d \mid D \times_d U_d) \propto -\sum_{d=1}^{M} \dfrac{n\bar{l}_d}{2}\big[\log\det(C_d) + \mathrm{tr}(C_d^{-1} S_d)\big]$.  (8)
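For concreteness, the per-mode objective of Eq. (7) and the total log posterior of Eq. (8) can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation; the function and argument names are our own.

```python
import numpy as np

def mode_log_posterior(U_d, sigma2_d, S_d, n_cols):
    """Per-mode log posterior of Eq. (7), up to an additive constant.

    U_d:      (l_d' x l_d) projection matrix for mode d
    sigma2_d: noise variance sigma_d^2 on mode d
    S_d:      (l_d x l_d) sample covariance of the projected mode-d vectors
    n_cols:   number of mode-d column samples (n * l_d-bar)
    """
    l_d = U_d.shape[1]
    C_d = U_d.T @ U_d + sigma2_d * np.eye(l_d)     # covariance of t_d
    _, logdet = np.linalg.slogdet(C_d)
    return -0.5 * n_cols * (logdet + np.trace(np.linalg.solve(C_d, S_d)))

def total_log_posterior(Us, sigma2s, Ss, n_cols_list):
    """Eq. (8): sum of the per-mode objectives over all M modes."""
    return sum(mode_log_posterior(U, s2, S, n)
               for U, s2, S, n in zip(Us, sigma2s, Ss, n_cols_list))
```

Using `slogdet` and `solve` instead of explicit determinants and inverses keeps the evaluation numerically stable.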
To implement MAP estimation of the $d$th projection matrix $U_d$, the EM algorithm is applied. The expectation of the complete-data log likelihood with respect to $p(x_{d;j} \mid t_{d;j}, \mu_d, U_d, \sigma_d^2)$ is given by

$E(L_c) = \sum_{d=1}^{M}\sum_{i=1}^{n} E\big[\log p(\mathcal{T}_i, \mathcal{X}_i \mid U_{j \ne d})\big] = \sum_{d=1}^{M}\sum_{j=1}^{n\bar{l}_d} E\big[\log p(t_{d;j}, x_{d;j})\big]$.  (9)

Here, $\log p(t_{d;j}, x_{d;j})$ with the given $U_k|_{1 \le k \ne d \le M}$ is

$\log p(t_{d;j}, x_{d;j}) \propto -\|x_{d;j}\|^2 - l_d \log(\sigma_d^2) - \dfrac{1}{\sigma_d^2}\big\|t_{d;j} - U_d^{T} x_{d;j} - \mu_d\big\|^2$.  (10)
It is impossible to maximize $E(L_c)$ with respect to all projection matrices $U_d|_{d=1}^{M}$ at once, because different projection matrices are inter-related [7] during the optimization, i.e., $U_{j \ne d}$ must be known to optimize $U_d$. Therefore, an alternating optimization procedure [7] is applied. To optimize the $d$th projection matrix $U_d$ together with $\sigma_d^2$, we need the decoupled expectation of the log likelihood on the $d$th mode:

$\sum_{j=1}^{n\bar{l}_d} E\big[\log p(t_{d;j}, x_{d;j})\big] \propto -\sum_{j=1}^{n\bar{l}_d}\Big( \mathrm{tr}\big(E[x_{d;j} x_{d;j}^{T}]\big) + l_d \log(\sigma_d^2) + \dfrac{1}{\sigma_d^2}(t_{d;j}-\mu_d)^{T}(t_{d;j}-\mu_d) - \dfrac{2}{\sigma_d^2}\, E[x_{d;j}^{T}]\, U_d (t_{d;j}-\mu_d) + \dfrac{1}{\sigma_d^2}\,\mathrm{tr}\big(U_d U_d^{T} E[x_{d;j} x_{d;j}^{T}]\big) \Big)$.  (11)

Based on (6), we then have

$E[x_{d;j}] = M_d^{-1} U_d (t_{d;j} - \mu_d)$  (12)

and

$E[x_{d;j} x_{d;j}^{T}] = \sigma_d^2 M_d^{-1} + E[x_{d;j}]\, E[x_{d;j}^{T}]$.  (13)
Eqs. (12) and (13) form the expectation step (E-step). The maximization step (M-step) is obtained by maximizing $\sum_{j=1}^{n\bar{l}_d} E\big[\log p(t_{d;j}, x_{d;j})\big]$ with respect to $U_d$ and $\sigma_d^2$. In detail, setting $\partial_{U_d}\Big[\sum_{j=1}^{n\bar{l}_d} E\big[\log p(t_{d;j}, x_{d;j})\big]\Big] = 0$ gives

$U_d = \Big[\sum_{j=1}^{n\bar{l}_d} E[x_{d;j} x_{d;j}^{T}]\Big]^{-1} \Big[\sum_{j=1}^{n\bar{l}_d} E[x_{d;j}]\,(t_{d;j}-\mu_d)^{T}\Big]$;  (14)

and setting $\partial_{\sigma_d^2}\Big[\sum_{j=1}^{n\bar{l}_d} E\big[\log p(t_{d;j}, x_{d;j})\big]\Big] = 0$ gives

$\sigma_d^2 = \dfrac{1}{n\bar{l}_d\, l_d} \sum_{j=1}^{n\bar{l}_d} \Big\{ \|t_{d;j}-\mu_d\|^2 - 2\, E[x_{d;j}^{T}]\, U_d (t_{d;j}-\mu_d) + \mathrm{tr}\big(E[x_{d;j} x_{d;j}^{T}]\, U_d U_d^{T}\big) \Big\}$.  (15)
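One EM iteration for a single mode, combining the E-step (12)-(13) with the M-step (14)-(15), can be sketched as follows. This is our own illustrative sketch under the decoupled model, with assumed names and data layout (mode-$d$ samples stored as columns), not the authors' code.

```python
import numpy as np

def em_step_mode_d(T_d, U_d, sigma2_d, mu_d):
    """One EM iteration for mode d (Eqs. 12-15).

    T_d: (l_d x n_cols) matrix whose columns are the mode-d samples t_{d;j}
    U_d: (r x l_d) current projection matrix, r = l_d'
    """
    r, l_d = U_d.shape
    n_cols = T_d.shape[1]
    Tc = T_d - mu_d[:, None]                       # centered samples

    # E-step, Eqs. (12)-(13): posterior moments of the latent vectors
    M_d = U_d @ U_d.T + sigma2_d * np.eye(r)
    Ex = np.linalg.solve(M_d, U_d @ Tc)            # columns are E[x_{d;j}]
    sum_Exx = n_cols * sigma2_d * np.linalg.inv(M_d) + Ex @ Ex.T

    # M-step, Eq. (14): U_d = [sum E[x x^T]]^{-1} [sum E[x] (t - mu)^T]
    U_new = np.linalg.solve(sum_Exx, Ex @ Tc.T)

    # M-step, Eq. (15): noise variance update
    cross = np.sum(Ex * (U_new @ Tc))              # sum_j E[x]^T U (t - mu)
    s2_new = (np.sum(Tc ** 2) - 2.0 * cross
              + np.trace(sum_Exx @ U_new @ U_new.T)) / (n_cols * l_d)
    return U_new, s2_new
```

Iterating `em_step_mode_d` over the modes in turn realizes the alternating optimization described above.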
2.3 Dimension Reduction and Data Reconstruction
After obtaining the projection matrices $U_d|_{d=1}^{M}$, the following operations are important for different applications:

Dimension Reduction: Given the projection matrices $U_d|_{d=1}^{M}$ and an observed tensor $\mathcal{T} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_M}$ in the high-dimensional space, how do we find the corresponding latent tensor $\mathcal{X} \in \mathbb{R}^{l_1' \times l_2' \times \cdots \times l_M'}$ in the low-dimensional space? From tensor algebra, the dimension reduction is given by $\mathcal{X} = \mathcal{T} \prod_{d=1}^{M} \times_d U_d$. However, this rule lacks a probabilistic interpretation. Under the proposed decoupled probabilistic model, $\mathcal{X}$ is obtained by maximizing $p(\mathcal{X} \mid \mathcal{T}) \propto \prod_{d=1}^{M} p(x_d \mid t_d)$. The dimension reduction is

$\mathcal{X} = (\mathcal{T} - \mathcal{M}) \prod_{d=1}^{M} \times_d \big(M_d^{-1} U_d\big)$.  (16)
Data Reconstruction: Given the projection matrices $U_d|_{d=1}^{M}$ and the latent tensor $\mathcal{X} \in \mathbb{R}^{l_1' \times l_2' \times \cdots \times l_M'}$ in the low-dimensional space, how do we approximate the corresponding observed tensor $\mathcal{T} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_M}$ in the high-dimensional space? Based on (16), the data reconstruction is given by

$\hat{\mathcal{T}} = \mathcal{X} \prod_{d=1}^{M} \times_d \big(U_d^{T} (U_d U_d^{T})^{-1} M_d\big) + \mathcal{M}$.  (17)

The reconstruction error is given by $\|\mathcal{T} - \hat{\mathcal{T}}\|_{\mathrm{Fro}}$.
2.4 Akaike and Bayesian Information Criteria for PTA

AIC and BIC are popular methods for model selection in statistics. However, they were developed for vector data, while in the proposed PTA data are in tensor form. It is therefore important to find a suitable way to utilize AIC and BIC for tensor-based learning models. In PTA, the conventional AIC and BIC can be applied to determine the size of $U_d|_{d=1}^{M}$, using an exhaustive search based on AIC (or BIC) for model selection. In detail, for AIC-based model selection, we need to calculate the AIC score

$J_d^{\mathrm{AIC}}(U_d, \sigma_d^2, l_d') = \big(2 l_d l_d' + 2 - l_d'(l_d'-1)\big) + n\bar{l}_d \Big[\log\det\big(U_d^{T} U_d + \sigma_d^2 I\big) + \mathrm{tr}\big((U_d^{T} U_d + \sigma_d^2 I)^{-1} S_d\big)\Big]$  (18)

for each mode, $\prod_{d=1}^{M}(l_d - 1)$ times in total, because the number of rows $l_d'$ of each projection matrix $U_d$ ranges from $1$ to $l_d - 1$. In the determination stage, the optimal $l_d'^{*}$ is

$l_d'^{*} = \arg\min_{l_d'} J_d^{\mathrm{AIC}}(U_d, \sigma_d^2, l_d')$,  (19)
where $1 \le l_d' \le l_d - 1$. For BIC-based model selection in PTA, we have a definition similar to AIC:

$J_d^{\mathrm{BIC}}(U_d, \sigma_d^2, l_d') = \log\big(n\bar{l}_d\big)\Big(l_d l_d' + 1 - \dfrac{l_d'(l_d'-1)}{2}\Big) + n\bar{l}_d \Big[\log\det\big(U_d^{T} U_d + \sigma_d^2 I\big) + \mathrm{tr}\big((U_d^{T} U_d + \sigma_d^2 I)^{-1} S_d\big)\Big]$,  (20)

computed for each mode, $\prod_{d=1}^{M}(l_d - 1)$ times in total. In the determination stage, the optimal $l_d'^{*}$ is

$l_d'^{*} = \arg\min_{l_d'} J_d^{\mathrm{BIC}}(U_d, \sigma_d^2, l_d')$,  (21)

where $1 \le l_d' \le l_d - 1$.
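The BIC score of Eq. (20) and the exhaustive search of Eq. (21) can be sketched as follows. The names, and the PPCA-style `fit_mode` callback used below, are our own assumptions; the sketch only illustrates the selection loop, not the paper's full fitting pipeline.

```python
import numpy as np

def bic_score(U_d, sigma2_d, S_d, n_cols):
    """Eq. (20): BIC score for mode d with l_d' = U_d.shape[0] retained dims."""
    r, l_d = U_d.shape
    k = l_d * r + 1 - r * (r - 1) / 2.0            # free-parameter count
    C_d = U_d.T @ U_d + sigma2_d * np.eye(l_d)
    _, logdet = np.linalg.slogdet(C_d)
    fit = n_cols * (logdet + np.trace(np.linalg.solve(C_d, S_d)))
    return np.log(n_cols) * k + fit

def select_dim(fit_mode, S_d, n_cols, l_d):
    """Eq. (21): exhaustive search over candidate dimensions 1..l_d-1.

    fit_mode(r) must return (U_d, sigma2_d) fitted with r retained dims.
    """
    scores = {r: bic_score(*fit_mode(r), S_d, n_cols) for r in range(1, l_d)}
    return min(scores, key=scores.get)
```

Swapping the `np.log(n_cols) * k` penalty for `2 * k + 2 - ...` recovers the AIC variant of Eq. (18).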
3 Empirical Study

In this section, we use a synthetic data model to evaluate BIC PTA in terms of model selection accuracy. For AIC PTA, we obtained very similar experimental results. The accuracy is measured by the model selection error $\sum_{d=1}^{M} |l_d' - l_d'^{*}|$. Here, $l_d'$ is the real model, i.e., the real dimension of the $d$th mode of the unobserved latent tensor, and $l_d'^{*}$ is the selected model, i.e., the dimension of the $d$th mode of the unobserved latent tensor selected by BIC PTA. A multilinear transformation maps each tensor from the low-dimensional space $\mathbb{R}^{l_1' \times l_2' \times \cdots \times l_M'}$ to the high-dimensional space $\mathbb{R}^{l_1 \times l_2 \times \cdots \times l_M}$ by $\mathcal{T}_i = \mathcal{X}_i \prod_{d=1}^{M} \times_d U_d^{T} + \mathcal{M} + \varsigma \mathcal{E}_i$, where every entry of every unobserved latent tensor $\mathcal{X}_i \in \mathbb{R}^{l_1' \times l_2' \times \cdots \times l_M'}$ is generated from a standard Gaussian $N(0,1)$; $\mathcal{E}_i$ is a noise tensor whose entries $e_j$ are drawn from $N(0,1)$; $\varsigma$ is a scalar, set here to $0.01$; the mean tensor $\mathcal{M} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_M}$ is a random tensor whose entries are drawn from the uniform distribution on $[0,1]$; the projection matrices $U_d|_{d=1}^{M} \in \mathbb{R}^{l_d' \times l_d}$ are random matrices whose entries are drawn from the uniform distribution on $[0,1]$; and $i$ indexes the $i$th tensor measurement.
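The synthetic generator just described can be sketched directly; the function signature and default seed below are our own illustrative choices.

```python
import numpy as np

def mode_product(T, A, d):
    """Mode-d product T x_d A."""
    return np.moveaxis(np.tensordot(A, T, axes=(1, d)), 0, d)

def generate_measurements(n, dims, latent_dims, noise=0.01, seed=0):
    """Synthetic model: T_i = X_i prod_d x_d U_d^T + M + noise * E_i."""
    rng = np.random.default_rng(seed)
    Us = [rng.uniform(0, 1, size=(r, l)) for r, l in zip(latent_dims, dims)]
    M = rng.uniform(0, 1, size=dims)               # mean tensor, entries in [0, 1]
    Ts = []
    for _ in range(n):
        X = rng.normal(size=latent_dims)           # latent tensor, N(0, 1) entries
        T = X
        for d, U in enumerate(Us):
            T = mode_product(T, U.T, d)            # U_d^T maps l_d' -> l_d
        Ts.append(T + M + noise * rng.normal(size=dims))
    return Ts, Us, M
```

With `dims=(8, 6)` and `latent_dims=(3, 2)` this reproduces the setting of the first experiment below.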
Fig. 3. BIC score matrices for the first and the second projection matrices. Each block corresponds to a BIC score; the darker the block, the smaller the BIC score. Based on this figure, we determine $l_1'^{*} = 3$ and $l_2'^{*} = 2$ from the BIC scores obtained by PTA, and the model selection error is 0. Figure 4 shows the Hinton diagrams of the first and the second projection matrices in the left and the right sub-figures, respectively. Those projection matrices are obtained from PTA by setting $l_1'^{*} = 7$ and $l_2'^{*} = 5$.
In the first experiment, the data generator produces 10 measurements with $M = 2$, $l_1 = 8$, $l_1' = 3$, $l_2 = 6$, and $l_2' = 2$. To determine $l_1'^{*}$ and $l_2'^{*}$ by BIC for PTA, we need to run PTA $(l_1 - 1)(l_2 - 1)$ times and obtain two BIC score matrices, one for the first-mode projection matrix $U_1$ and one for the second-mode projection matrix $U_2$, as shown in Figure 3. In this figure, every block corresponds to a BIC score, and the darker the block, the smaller the corresponding BIC score. A light rectangle marks the darkest block in each BIC score matrix, i.e., the block corresponding to the smallest value. In the first BIC score matrix, shown in the left sub-figure of Figure 3, the smallest value is located at $(3, 5)$. Because this BIC score matrix is calculated for the first-mode projection matrix based on (20), we set $l_1'^{*} = 3$ according to (21). Similarly, we determine $l_2'^{*} = 2$ from the second BIC score matrix, shown in the right of Figure 3, because its smallest value is located at $(7, 2)$. For this example, the model selection error is $\sum_{d=1}^{2} |l_d' - l_d'^{*}| = 0$.
We repeat the experiment 30 times with a setting similar to the first experiment in this section, but with $l_1$, $l_1'$, $l_2$, and $l_2'$ randomly set subject to the requirements $6 \le l_1, l_2 \le 10$, $2 \le l_1', l_2' \le 5$, $l_1' < l_1$, and $l_2' < l_2$. The total model selection error for BIC PTA is 0. We also conduct 30 experiments for third-order tensors, with a similar setting and with $l_1$, $l_1'$, $l_2$, $l_2'$, $l_3$, and $l_3'$ set subject to the requirements $6 \le l_1, l_2, l_3 \le 10$, $2 \le l_1', l_2', l_3' \le 5$, $l_1' < l_1$, $l_2' < l_2$, and $l_3' < l_3$. The total model selection error is also 0. In every experiment, the reconstruction error is very small.

Fig. 4. The Hinton diagrams of the first and the second projection matrices obtained by PTA, shown in the left and the right sub-figures, respectively
4 Conclusion

Vector data are normally used in probabilistic graphical models with probabilistic inference. However, tensor data, i.e., multidimensional arrays, are natural representations of a large amount of data in data mining, computer vision, and many other applications. Aiming at bridging the gap between vectors and tensors in conventional statistical tasks, e.g., model selection, this paper proposes a decoupled probabilistic algorithm, named probabilistic tensor analysis (PTA), with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). PTA associated with AIC and BIC can select suitable models for tensor data, as demonstrated by the empirical studies.
Acknowledgment

The authors would like to thank Professor Andrew Blake of Microsoft Research Cambridge for his encouragement in developing the tensor probabilistic graphical model. This research was supported by the Competitive Research Grants at the Hong Kong Polytechnic University (project numbers A-PH42 and A-PC0A) and the National Natural Science Foundation of China (grant number 60703037).
References [1] Bader, B.W., Kolda, T.G.: Efficient MATLAB Computations with Sparse and Factored Tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, NM and Livermore, CA (2006) [2] Bader, B.W., Kolda, T.G.: MATLAB Tensor Classes for Fast Algorithm Prototyping. ACM Transactions on Mathematical Software 32(4) (2006)
[3] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
[4] Kroonenberg, P., Leeuw, J.D.: Principal Component Analysis of Three-Mode Data by Means of Alternating Least Squares Algorithms. Psychometrika 45 (1980)
[5] Lathauwer, L.D.: Signal Processing Based on Multilinear Algebra. Ph.D. Thesis, Katholieke Universiteit Leuven (1997)
[6] Sun, J., Tao, D., Faloutsos, C.: Beyond Streams and Graphs: Dynamic Tensor Analysis. In: The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp. 374–383 (2006)
[7] Tao, D., Li, X., Wu, X., Maybank, S.J.: General Tensor Discriminant Analysis and Gabor Features for Gait Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(10) (2007)
[8] Tipping, M.E., Bishop, C.M.: Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622 (1999)
[9] Tucker, L.R.: Some Mathematical Notes on Three-mode Factor Analysis. Psychometrika 31(3) (1966)
[10] Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Subspace Analysis of Image Ensembles. In: IEEE Proc. International Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, vol. 2, pp. 93–99 (2003)
[11] Ye, J., Janardan, R., Li, Q.: Two-Dimensional Linear Discriminant Analysis. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, vol. 17 (2004)
[12] Ye, J.: Generalized Low Rank Approximations of Matrices. Machine Learning 61(1-3), 167–191 (2005)
Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow

Fumikazu Miwakeichi1, Pedro A. Valdes-Sosa3, Eduardo Aubert-Vazquez3, Jorge Bosch Bayard3, Jobu Watanabe4, Hiroaki Mizuhara2, and Yoko Yamaguchi2

1 Medical System Course, Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba 263-8522, Japan, [email protected]
2 Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Japan
3 Cuban Neuroscience Center, Cuba
4 Waseda Institute for Advance Study, Waseda University, Japan
Abstract. Finding the means to efficiently summarize electroencephalographic data has been a long-standing problem in electrophysiology. Our previous work showed that Parallel Factor Analysis (PARAFAC) can effectively perform an atomic decomposition of the time-varying EEG spectrum in the space/frequency/time domain. In this study, we propose to use PARAFAC to extract significant activities from EEG data recorded concurrently with functional Magnetic Resonance Imaging (fMRI), and to employ the temporal signature of each atom to investigate the relation between brain electrical activity and changes in the BOLD signal, which reflects cerebral blood flow. We evaluated the statistical significance of the dynamical effect of the BOLD signal with respect to the EEG based on modeling the BOLD signal by a plain autoregressive model (AR), an AR model with exogenous EEG input (ARX), and an ARX model with a nonlinear term (ARNX).

Keywords: Parallel Factor Analysis, EEG space/frequency/time decomposition, Nonlinear time series analysis, Concurrent fMRI/EEG data.
1 Introduction

The electroencephalogram (EEG) is recorded as multi-channel time-varying data. Over the history of EEG research, many types of oscillatory phenomena in spontaneous and evoked EEG have been observed and reported. A statistical description of the oscillatory phenomena of the EEG was first carried out in the frequency domain by estimating the power spectrum for quasi-stationary segments of data [1]. More recent characterizations of transient oscillations are carried out by estimating the time-varying (or evolutionary) spectrum in the frequency/time domain [2]. These evolutionary spectra of EEG oscillations have a topographic distribution on the sensors that is contingent on the spatial configuration of the neural sources that generate them, as well as on the properties of the head as a volume conductor [3].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 802–810, 2008. © Springer-Verlag Berlin Heidelberg 2008
There is a long history of atomic decompositions for the EEG. However, to date, atoms have not been defined by the triplet of spatial, spectral and temporal signatures, but rather by pairwise combinations of these components. Space/time atoms are the basis of both Principal Component Analysis (PCA) and Independent Component Analysis (ICA) as applied to multi-channel EEG. PCA has been used for artifact removal and to extract significant activities in the EEG [4,5]. A basic problem is that atoms defined by only two signatures (space and time) are not determined uniquely. In PCA, orthogonality is therefore imposed between the corresponding signatures of different atoms, and there remains the well-known non-uniqueness of PCA that allows an arbitrary choice of rotation of axes (e.g. Varimax and Quartimax rotations). More recently, ICA has become a popular tool for space/time atomic decomposition [6,7]. It avoids the arbitrary choice of rotation (Jung et al. 2001). Uniqueness, however, is achieved at the price of imposing a constraint even stronger than orthogonality, namely statistical independence. In both PCA and ICA the frequency information may be obtained from the temporal signatures of the extracted atoms in a separate step. For decomposing single-channel EEG into frequency/time atoms, the Fast Fourier Transform (FFT) with a sliding window [8] or the wavelet transform [9,10] has been employed. In fact, any of the frequency/time atomic decompositions currently available [11] could, in principle, be used for the EEG. However, these methods do not address the topographic aspects of EEG time/frequency analysis. It has long been known, especially in the chemometrics literature, that unique multilinear decompositions of multi-way arrays of data (more than two dimensions) are possible under very weak conditions [12]. In fact, this is the basic argument for Parallel Factor Analysis (PARAFAC), a technique that has recently been improved by Bro [13].
The important difference between PARAFAC and techniques such as PCA or ICA is that the decomposition of multi-way data is unique even without additional orthogonality or independence constraints. Thus, PARAFAC can be employed for a space/frequency/time atomic decomposition of the EEG. This makes use of the fact that multi-channel evolutionary spectra are multi-way arrays, indexed by electrode, frequency and time. The inherent uniqueness of the PARAFAC solution leads to a topographic time/frequency decomposition with a minimum of a priori assumptions. It has been shown that PARAFAC can effectively perform a time/frequency/spatial (T/F/S) atomic decomposition, which is suitable for identifying fundamental modes of oscillatory activity in the EEG [14,15]. In this paper, the theory of PARAFAC and its application to EEG analysis are presented. Moreover, the possibility of investigating the relation between the extracted EEG atoms and cerebral blood flow is discussed.
2 Theory

For the purpose of EEG analysis, we define the $N_d \times N_f \times N_t$ data array $S$ as the three-way time-varying EEG spectrum obtained by applying a wavelet transformation, where $N_d$, $N_f$ and $N_t$ are the numbers of channels, frequency steps and
time points, respectively. For the wavelet transformation a complex Morlet mother function was used. The energy $S_{dft}$ of channel $d$ at frequency $f$ and time $t$ is given by the squared norm of the convolution of a Morlet wavelet with the EEG signal $v(d,t)$:

$S_{dft} = \big\| w(t,f) * v(d,t) \big\|^2$,  (1)

where the complex Morlet wavelet $w(t,f)$ is defined by $w(t,f) = (\pi\sigma_b)^{-1/2}\, e^{-(t/\sigma_b)^2}\, e^{i 2\pi f t}$, with $\sigma_b$ denoting the bandwidth parameter. The width of the wavelet, $m = 2\pi\sigma_b f$, is set to 7 in this study.

The basic structural model for a PARAFAC decomposition of the data array $S$ ($N_d \times N_f \times N_t$) with elements $S_{dft}$ is defined by

$S_{dft} = \sum_{k=1}^{N_k} a_{dk}\, b_{fk}\, c_{tk} + e_{dft} = \hat{S}_{dft} + e_{dft}$.  (2)

The problem is to find the loading matrices $A$, $B$ and $C$, whose elements are denoted by $a_{dk}$, $b_{fk}$ and $c_{tk}$ in Eq. (2). Here we refer to the components $k$ as "atoms", and the corresponding loading vectors $a_k = \{a_{dk}\}$, $b_k = \{b_{fk}\}$ and $c_k = \{c_{tk}\}$ are said to represent the spatial, spectral and temporal signatures of the atoms (Fig. 1). The uniqueness of the decomposition (2) is guaranteed if $\mathrm{rank}(A) + \mathrm{rank}(B) + \mathrm{rank}(C) \ge 2(N_k + 1)$. As can be seen, this is a less stringent condition than either orthogonality or statistical independence [12].

Fig. 1. Graphical explanation of the PARAFAC model. The multi-channel EEG evolutionary spectrum $S$ is obtained from a channel-by-channel wavelet transform. $S$ is a three-way data array indexed by channel, frequency and time. PARAFAC decomposes this array into a sum of "atoms". The $k$-th atom is the trilinear product of loading vectors representing spatial ($a_k$), spectral ($b_k$) and temporal ($c_k$) "signatures". Under these conditions PARAFAC can be summarized as finding the matrices $A = \{a_k\}$, $B = \{b_k\}$ and $C = \{c_k\}$ that explain $S$ with minimal residual error.
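As a concrete illustration, the trilinear model of Eq. (2) is a one-liner in NumPy; the function name is our own, and this is a sketch rather than the authors' implementation:

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """Eq. (2): S_hat[d, f, t] = sum_k A[d, k] * B[f, k] * C[t, k]."""
    return np.einsum('dk,fk,tk->dft', A, B, C)
```

Each column triple `(A[:, k], B[:, k], C[:, k])` is one atom's spatial, spectral and temporal signature; summing the outer products over k rebuilds the three-way spectrum array.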
The decomposition (2) can be obtained by evaluating

$\min_{a_{dk},\, b_{fk},\, c_{tk}} \big\| S_{dft} - \hat{S}_{dft} \big\|^2$.

Since the $S_{dft}$ can be regarded as representing spectra, this minimization should be carried out under a non-negativity constraint on the loading vectors. The result of the PARAFAC decomposition is given by the $(N_d \times 1)$ vector $a_k$, representing the topographical map of the $k$-th component, the $(N_f \times 1)$ vector $b_k$, representing the frequency spectrum of the $k$-th component, and the $(N_t \times 1)$ vector $c_k$, representing the temporal signature of the $k$-th component.
3 Results

3.1 Extracting Relevant Components from EEG

The experiment consisted of two conditions, an eyes-closed resting condition and a mental arithmetic condition; for each condition five epochs, each lasting 30 s, were recorded. During the mental arithmetic epochs, the subjects were asked to count backwards from 1000 in steps of a constant single-digit natural number, while keeping their eyes closed. At the beginning of each mental arithmetic epoch this number was randomly chosen by a computer and presented to the subjects through headphones. The end of each mental arithmetic epoch was announced by a beeping sound (4 kHz, 20 ms). Mental arithmetic and resting epochs occurred five times within each trial, arranged in alternating order; each subject was examined in two trials. During the experiment, fMRI and EEG were recorded simultaneously. The electrode set consisted of 61 EEG channels, two ECG channels and one EOG channel. The reference electrode was at FCz. Raw EEG signals were sampled at 5 kHz, using a 1 Hz high-pass filter provided by the Brain Vision Recorder software (Brain Products, Munich, Germany). The fMRI was acquired as blood-oxygenation-sensitive (T2*-weighted) echo-planar images, using a 1.5 T MR scanner (Staratis II, Hitachi Medico, Japan) and a standard head coil. Thirty slices (4 mm in thickness, gapless), covering almost the entire cerebrum, were measured under the following conditions: TR 5 s, TA 3.3 s, TE 47.2 ms, FA 90°, FoV 240 mm, matrix size 64×64. In order to minimize head motion, the heads of the subjects were immobilized by a vacuum pad. SPM99 was used for preprocessing of the fMRI images (motion correction, slice timing correction). Using bilinear interpolation, the images of each subject were normalized to the standard brain defined by the Montreal Neurological Institute; the normalized images were subsequently smoothed using a Gaussian kernel (full-width half-maximum: 10 mm).
For each voxel, the BOLD signal time series were high-pass filtered (>0.01Hz), in order to remove fluctuations with frequencies lower than the frequency defined by the switching between resting and task. The EEG signal v(d , t ) was subsampled to 500Hz, and, in order to construct a three-way (channel/frequency/time) time-varying EEG N d × N f × N t data matrix S , a
Fig. 2. (a) Spectral signatures of atoms of Parallel Factor Analysis (PARAFAC) for a typical subject. Note the recurrent appearance of frequency peaks in the theta and alpha bands. The horizontal axis represents frequency in Hz, the vertical axis the normalized amplitude. (b) Spatial signatures of atoms displayed as topographic maps, for the theta, alpha and high alpha atoms (from above), for the same subject. (c) Temporal signatures of atoms, in the same order as in (a). The horizontal axis represents time in units of scans (the time between scans is five seconds), the vertical axis normalized intensity. The black and red lines correspond to the resting and task stages, respectively. (See colored figure in CD-R.)
wavelet transform was applied in the frequency range from 0.5 to 40.0 Hz in 0.5 Hz steps, using the complex Morlet wavelet as the mother function. To adjust the temporal resolution to that of the fMRI data, the wavelet-transformed EEG was averaged over consecutive 5-second intervals. Then, by applying PARAFAC to this three-way data set, major signal components in the frequency range from 3.0 to 30.0 Hz were
extracted. The reason for choosing this frequency band was that at lower frequencies EEG data typically contain eye movement artifacts, while at higher frequencies there is no relevant activity in the data set analyzed in this study. In Fig. 2(a) the spectral signatures of the identified atoms are shown: theta (around 6 Hz), alpha (around 9 Hz), and high alpha (around 12 Hz) atoms were found; in this study we denote the frequency of the last atom as "high alpha". We find that if the order of the decomposition is increased beyond three, two atoms with overlapping frequency peaks appear. Since this indicates overfitting, we select three as the order of the decomposition. In Fig. 2(b) the spatial signatures of the atoms are shown. It can be seen that the main power of the theta and alpha atoms is focused in frontal and occipital areas, respectively. In Fig. 2(c) the temporal signatures of the atoms are shown; red refers to the task condition, black to the resting condition. Comparing the amplitudes of the temporal signatures under both conditions with a standard T-test, we find that for the theta atom the amplitudes of the temporal signature are higher during the task condition, while for the alpha atom they are higher during the resting condition. For the high alpha atom no significant result is found with this test; therefore this atom is omitted from further study.

3.2 Hemodynamic Response of Cerebral Blood Flow with Respect to EEG

In order to investigate the dynamical relation between EEG and cerebral blood flow, we considered the linear autoregressive model (AR), the linear autoregressive model with exogenous input (ARX), and the ARX model with an additional nonlinear nonparametric term (ARNX). These models are defined as follows:

AR: $B_t = \alpha B_{t-1} + e_t$,  (3)

ARX: $B_t = \beta_0 B_{t-1} + \sum_{k \in U_p} \beta_k x_{t-k} + e_t$,  (4)

ARNX: $B_t = \beta_0 B_{t-1} + \sum_{k \in U_p} \beta_k x_{t-k} + f\big(x_{t-j_1}, x_{t-j_2}, \ldots\big) + e_t$, with $\{j_1, j_2, \ldots\} \subseteq U_{np}$,  (5)
where $f(\cdot)$ is a nonparametric Nadaraya-Watson type kernel estimator and the sequence $e_t$ is the prediction error. The indices of those columns of the design matrix that belong to the parametric or nonparametric parts were fixed for ARX and ARNX at the same set of values, $U_p = U_{np} = \{1, 2, 3, 4, 5, 6\}$. The reason for choosing a maximum lag of 6 is that the effect of neural electric activity on the BOLD signal has been experimentally found to decay to zero in approximately 30 seconds. In ARX and ARNX there is an autoregressive (AR) term, with parameter $\beta_0$, describing the intrinsic dynamics of the BOLD signal, in addition to the terms representing the influence of neural electric activity. The parameters of the parametric and nonparametric model parts, as well as the kernel width parameter, can be estimated by minimizing the modified Generalized Cross Validation (GCV) criterion (Hong, 1999), which can also be used for model comparison. Consider the difference between the values of GCV for two models, A and B:
$D = \mathrm{GCV}(A) - \mathrm{GCV}(B)$.  (6)
If for a particular voxel D assumes a positive value, model B is superior to model A. If for a particular voxel the GCV value for ARX or ARNX is smaller than for AR, it can be said that this voxel is influenced by the EEG. From now on, we will focus only on the set of these voxels. We expect that among them there is a subset of voxels for which the nonlinear term of ARNX improves the model, as compared to ARX. For extracting these voxels, $D = \mathrm{GCV(ARX)} - \mathrm{GCV(ARNX)}$ was computed. If for a particular voxel D assumes a positive value, this indicates the necessity of employing a nonlinear response to the EEG for modeling the BOLD signal of this voxel; if D assumes a negative value, it is sufficient to model the BOLD signal of this voxel by ARX. Regions where ARX or ARNX outperforms AR are shown in Fig. 3(a) for the alpha atom and in Fig. 3(b) for the theta atom; note that these regions contain only 7% (alpha atom) and 4% (theta atom) of all gray-matter voxels. Regions where ARNX outperforms ARX (as measured by GCV) are shown in green; these regions contain about 3% (alpha atom) and 2% (theta atom) of all gray-matter voxels. The opposite case, regions where ARNX fails to outperform ARX, is shown in orange; these regions contain about 4% (alpha atom) and 2% (theta atom) of all gray-matter voxels.
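The AR-versus-ARX comparison of Eqs. (3), (4) and (6) can be sketched as follows. For simplicity this sketch uses ordinary least squares and the residual variance as a stand-in for the authors' modified GCV criterion; the names and defaults are our own assumptions.

```python
import numpy as np

def fit_ar_ls(B, x=None, lags=(1, 2, 3, 4, 5, 6)):
    """Least-squares fit of B_t = beta0*B_{t-1} [+ sum_k beta_k x_{t-k}] + e_t.

    B: BOLD time series at one voxel; x: optional exogenous EEG regressor,
    e.g. the temporal signature of a PARAFAC atom.
    Returns (coefficients, residual variance).
    """
    p = max(lags)
    rows = [B[p - 1:-1]]                             # autoregressive term B_{t-1}
    if x is not None:
        rows += [x[p - k:len(x) - k] for k in lags]  # exogenous lags x_{t-k}
    D = np.column_stack(rows)
    y = B[p:]
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    return coef, resid.var()
```

A voxel would then be flagged as EEG-influenced when the ARX residual criterion is smaller than the AR one, mirroring the sign convention of Eq. (6).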
Fig. 3. Regions where the ARX and ARNX models outperform the AR model, for the alpha atom (a) and the theta atom (b). Green color denotes regions where ARNX outperforms ARX, while orange color denotes the opposite case (thresholded at p<0.05 FDR). (See colored figure in CD-R).
4 Discussion

For the analysis of a concurrently recorded EEG/fMRI data set, a multivariate-versus-multivariate analysis is essentially needed. However, no effective method has been established for this purpose, so the problem has usually been reduced to a multivariate-versus-univariate analysis using a representative single electrode or the averaged EEG of all or selected electrodes. Although the temporal signature of an atom is univariate, it is a summary drawing on information across all electrodes. In this study, it was shown that multivariate EEG data can be effectively summarized using PARAFAC, and that the temporal signatures of atoms can be employed as reference functions for investigating the dynamical response of cerebral blood flow with respect to brain electrical activity. The underlying theoretical requirement is a moderate amount of linear independence among the atom topographies, spectra and time courses; this is a much milder requirement than those of previous models underlying space/time atomic decompositions (PCA or ICA). From the physiological point of view, we note that the regions where the ARX or ARNX models outperformed the AR model for the alpha atom approximately correspond to areas where the BOLD response was found to be negatively correlated with the alpha rhythm [16,17,18]; this indicates a decrease of cortical hemodynamic activity during the resting state. Previous studies reported bilateral premotor, posterior parietal, and prefrontal activation during serial subtraction [19,20]. This agrees well with the finding of the present study, according to which the voxels in these areas were correlated with the theta atom. In summary, we conclude that the inclusion of linear and nonlinear EEG input terms improves the modeling of brain activation in these regions during both rest and mental arithmetic.
References
1. Lopes da Silva, F.: EEG Analysis: Theory and Practice. In: Niedermeyer, E., Lopes da Silva, F. (eds.) Electroencephalography. Urban and Schwarzenberg (1987)
2. Dahlhaus, R.: Fitting time series models to non-stationary processes. Annals of Statistics 25, 1–37 (1997)
3. Nunez, P.L.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, Oxford (1993)
4. Soong, A.C., Koles, Z.J.: Principal-component localization of the sources of the background EEG. IEEE Transactions on Biomedical Engineering 42, 59–67 (1995)
5. Lagerlund, T.D., Sharbrough, F.W., Busacker, N.E.: Spatial Filtering of Multichannel Electroencephalographic Recordings Through Principal Component Analysis by Singular Value Decomposition. Journal of Clinical Neurophysiology 14, 73–83 (1997)
6. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Inc., Chichester (2001)
7. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons, Ltd., Chichester (2002)
8. Makeig, S.: Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones. Electroencephalography and Clinical Neurophysiology 86, 283–293 (1993)
9. Bertrand, O., Bohorquez, J., Pernier, J.: Time-Frequency Digital Filtering Based on an Invertible Wavelet Transform: An Application to Evoked Potentials. IEEE Transactions on Biomedical Engineering 41, 77–88 (1994)
10. Tallon-Baudry, C., Bertrand, O., Delpuech, C., Pernier, J.: Oscillatory γ-Band (30–70 Hz) Activity Induced by a Visual Search Task in Humans. The Journal of Neuroscience 17, 722–734 (1997)
11. Chen, S., Donoho, D.: Atomic decomposition by basis pursuit. SIAM Review 43, 129–159 (2001)
12. Sidiropoulos, N.D., Bro, R.: On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics 14, 229–239 (2000)
13. Bro, R.: Multi-way Analysis in the Food Industry: Models, Algorithms and Applications. Ph.D. Thesis. University of Amsterdam (NL) & Royal Veterinary and Agricultural University (DK) (1998)
14. Miwakeichi, F., Martinez-Montes, E., Valdes-Sosa, P.A., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space/time/frequency components using Parallel Factor Analysis. NeuroImage 22, 1035–1045 (2004)
15. Martinez-Montes, E., Valdes-Sosa, P.A., Miwakeichi, F., Goldman, R.I., Cohen, M.S.: Concurrent EEG/fMRI analysis by multiway Partial Least Squares. NeuroImage 22, 1023–1034 (2004)
16. Goldman, R.I., Stern, J.M., Engel Jr., J., Cohen, M.S.: Simultaneous EEG and fMRI of the alpha rhythm. NeuroReport 13, 2487–2492 (2002)
17. Laufs, H., Kleinschmidt, A., Beyerle, A., Eger, E., Salek-Haddadi, A., Preibisch, C., Krakow, K.: EEG-correlated fMRI of human alpha activity. NeuroImage 19, 1463–1476 (2003)
18. Laufs, H., Krakow, K., Sterzer, P., Eger, E., Beyerle, A., Salek-Haddadi, A., Kleinschmidt, A.: Electroencephalographic signatures of attentional and cognitive default modes in spontaneous brain activity fluctuations at rest. Proc. Natl. Acad. Sci. 100, 11053–11058 (2003)
19. Roland, P.E., Friberg, L.: Localization of Cortical Areas Activated By Thinking. J. Neurophysiol. 53 (1985)
20. Rueckert, L., Lange, N., Partiot, A., Appollonio, I., Litvan, I., Bihan, D.L., Grafman, J.: Visualizing Cortical Activation during Mental Calculation with Functional MRI. NeuroImage 3, 97–103 (1996)
Flexible Component Analysis for Sparse, Smooth, Nonnegative Coding or Representation

Andrzej Cichocki, Anh Huy Phan, Rafal Zdunek, and Li-Qing Zhang

RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
[email protected]

Abstract. In the paper, we present a new approach to multi-way Blind Source Separation (BSS) and corresponding 3D tensor factorization that has many potential applications in neuroscience, multi-sensory or multidimensional data analysis, and neural sparse coding. We propose to use a set of local cost functions with flexible penalty and regularization terms whose simultaneous or sequential (one-by-one) minimization via a projected gradient technique leads to simple Hebbian-like local algorithms that work well not only for the over-determined case but also (under some weak conditions) for the under-determined case (i.e., a system which has fewer sensors than sources). The experimental results confirm the validity and high performance of the developed algorithms, especially when used with the multi-layer hierarchical approach.
1 Introduction – Problem Formulation
Parallel Factor Analysis (PARAFAC) or multi-way factorization models with sparsity and/or nonnegativity constraints have been proposed as promising and quite efficient tools for processing signals, images, or general data [1,2,3,4,5,6,7,8,9]. In this paper, we propose new hierarchical alternating algorithms, referred to as Flexible Component Analysis (FCA), for BSS, including as special cases Nonnegative Matrix/Tensor Factorization (NMF/NTF), Sparse Component Analysis (SCA), and Smooth Component Analysis (SmoCA). The proposed approach can also be considered an extension of Morphological Component Analysis (MoCA) [10]. By incorporating nonlinear projection or filtering and/or by adding regularization and penalty terms to the local squared Euclidean distances, we are able to achieve nonnegative and/or sparse and/or smooth representations of the desired solution, and to alleviate the problem of getting stuck in local minima. In this paper, we consider quite a general factorization related to the 3D PARAFAC2 model [1,5] (see Fig. 1):

    Y_q = A D_q X̃_q + N_q = A X_q + N_q,   (q = 1, 2, . . . , Q),   (1)
Dr. A. Cichocki is also with Systems Research Institute (SRI), Polish Academy of Science (PAN), and Warsaw University of Technology, Dept. of EE, Warsaw, Poland. Dr. R. Zdunek is also with Institute of Telecommunications, Teleinformatics and Acoustics, Wroclaw University of Technology, Poland. Dr. L.-Q. Zhang is with the Shanghai Jiaotong University, China.
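As a concrete illustration, the slice-wise model (1) is easy to reproduce numerically. The following toy sketch (our illustration with dimensions of our own choosing, not the authors' code) generates synthetic data Y_q = A D_q X̃_q + N_q for Q frontal slices:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, T, Q = 4, 3, 100, 5           # sensors, components, samples, slices

A = rng.random((I, J))              # common mixing/basis matrix
D = rng.random((Q, J))              # q-th row holds the diagonal of D_q
X_tilde = rng.random((Q, J, T))     # normalized hidden sources per slice
N = 0.01 * rng.standard_normal((Q, I, T))  # noise tensor

# Eq. (1): Y_q = A D_q X~_q + N_q for q = 1, ..., Q
Y = np.stack([A @ np.diag(D[q]) @ X_tilde[q] + N[q] for q in range(Q)])
assert Y.shape == (Q, I, T)         # frontal slices of the data tensor
```

Each frontal slice shares the same basis A; only the diagonal scaling D_q and the source slice X̃_q vary with q.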
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 811–820, 2008. c Springer-Verlag Berlin Heidelberg 2008
where Y_q = [y_itq] ∈ R^{I×T} is the q-th frontal slice (matrix) of the observed (known) 3D tensor of data or signals Y ∈ R^{I×T×Q}, D_q ∈ R^{J×J} is a diagonal scaling matrix that holds the q-th row of the matrix D ∈ R_+^{Q×J}, A = [a_ij] = [a_1, a_2, . . . , a_J] ∈ R^{I×J} is a mixing or basis matrix, X̃_q = [x̃_jtq] ∈ R^{J×T} represents the unknown normalized sources or hidden components in the q-th slice, X_q = D_q X̃_q = [x_jtq] ∈ R^{J×T} represents the re-normalized (differently scaled) sources, and N_q = [n_itq] ∈ R^{I×T} is the q-th frontal slice of the tensor N ∈ R^{I×T×Q} representing noise or errors. Our objective is to estimate the set of all matrices A, D_q, X̃_q, subject to some natural constraints such as nonnegativity, sparsity, or smoothness. Usually, the common factors, i.e., the matrices A and X̃_q, are normalized to unit-length column vectors and rows, respectively, and are often enforced to be as independent and/or as sparse as possible.

The above system of linear equations can be represented in the equivalent scalar form y_itq = Σ_j a_ij x_jtq + n_itq, or in the vector form Y_q = Σ_j a_j x_jq + N_q, where x_jq = [x_j1q, x_j2q, . . . , x_jTq] are the rows of X_q and a_j are the columns of A (j = 1, 2, . . . , J). Moreover, using row-wise unfolding, the model (1) can be represented by one single matrix equation:

    Y = A X + N,   (2)
where Y = [Y_1, Y_2, . . . , Y_Q] ∈ R^{I×QT}, X = [X_1, X_2, . . . , X_Q] ∈ R^{J×QT}, and N = [N_1, N_2, . . . , N_Q] ∈ R^{I×QT} are block row-wise unfolded matrices.¹ In the special case Q = 1, the model simplifies to the standard BSS model used in ICA, NMF, and SCA. The majority of the well-known algorithms for the PARAFAC models work only under the assumption T >> I ≥ J, where J is known or can be estimated using PCA/SVD. In this paper, we propose a family of algorithms that can also work in the under-determined case, i.e., for T >> J > I, if the sources are sufficiently sparse and/or smooth. Our primary objective is to estimate the mixing (basis) matrix A and the sources X_q, subject to additional natural constraints such as nonnegativity, sparsity, and/or smoothness. To deal with the factorization problem (1) efficiently, we adopt several approaches from constrained optimization, regularization theory, multi-criteria optimization, and projected gradient techniques. We minimize, simultaneously or sequentially, several cost functions with the desired constraints, switching between the two sets of parameters {A} and {X_q}.
¹ It should be noted that the 2D unfolded model is, in general, not exactly equivalent to the PARAFAC2 model (with respect to the sources X_q), since we usually need to impose different additional constraints for each slice q. In other words, the PARAFAC2 model should not be considered a 2D model with a single 2D unfolded matrix X. Profiles of the augmented (row-wise unfolded) X can only be treated as a single profile, while we need to impose individual constraints independently on each slice X_q, or even on each row of X_q. Moreover, the 3D tensor factorization is considered a dynamical or multi-stage process, where the data analysis is performed several times under different conditions (initial estimates, selected natural constraints, post-processing) in order to obtain full information about the available data, discover inner structures in the data, or extract physically meaningful components.
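The row-wise unfolding in (2) simply concatenates the frontal slices along the time axis. A minimal numpy sketch (our illustration, with arbitrary toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
I, T, Q = 4, 50, 6
Y_slices = rng.random((Q, I, T))    # frontal slices Y_q

# Eq. (2): Y = [Y_1, Y_2, ..., Y_Q] of size I x (Q*T)
Y = np.concatenate(list(Y_slices), axis=1)
assert Y.shape == (I, Q * T)
# each block of T columns is exactly one frontal slice
assert np.array_equal(Y[:, T:2 * T], Y_slices[1])
```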
Fig. 1. Modified PARAFAC2 model illustrating the factorization of a 3D tensor into a set of fundamental matrices A, D, {X̃_q}. In the special case X_q = X_1, ∀q, the model reduces to standard PARAFAC, and for Q = 1 to the tri-NMF model.
2 Projected Gradient Local Least Squares Regularized Algorithm
Many algorithms for the PARAFAC model are based on Alternating Least Squares (ALS) minimization of the squared Euclidean distance [1,4,5]. In particular, we can attempt to minimize a set of cost functions of the form

    D_Fq(Y_q || A X_q) = (1/2) ||Y_q − A X_q||_F^2 + α_A J_A(A) + α_X J_Xq(X_q),   (3)
subject to some additional constraints, where J_A(A) and J_Xq(X_q) are penalty or regularization functions, and α_A and α_X are nonnegative coefficients controlling the tradeoff between data fidelity and a priori knowledge about the components to be estimated. The choice of the regularization terms can be critical for attaining a desired solution and for noise suppression. Candidates include the entropy, the l_p quasi-norm, and more complex, possibly non-convex or non-smooth regularization terms [11]. In such a case, a basic approach to the above optimization problem is alternating minimization or alternating projection: the specified cost function is alternately minimized with respect to the two sets of parameters {x_jtq} and {a_ij}, each time optimizing one set of arguments while keeping the other one fixed [6,7]. In this paper, we consider a different approach: instead of minimizing only one global cost function, we perform sequential minimization of a set of local cost functions composed of squared Euclidean terms and regularization terms:

    D_Fq^{(j)}(Y_q^{(j)} || a_j x_jq) = (1/2) ||Y_q^{(j)} − a_j x_jq||_F^2 + α_a^{(j)} J_a(a_j) + α_Xq^{(j)} J_x(x_jq),   (4)

for j = 1, 2, . . . , J and q = 1, 2, . . . , Q, subject to additional constraints, where

    Y_q^{(j)} = Y_q − Σ_{r≠j} a_r x_rq = Y_q − A X_q + a_j x_jq,   (5)
a_j ∈ R^{I×1} are the columns of the basis (mixing) matrix A, x_jq ∈ R^{1×T} are the rows of X_q, which represent the unknown source signals, J_a(a_j) and J_x(x_jq) are local penalty/regularization terms which impose specific constraints on the estimated parameters, and α_a^{(j)} ≥ 0 and α_Xq^{(j)} ≥ 0 are nonnegative regularization parameters that control the tradeoff between data fidelity and the imposed constraints. The construction of such a set of local cost functions follows from the simple observation that the observed data can be decomposed as Y_q = Σ_{j=1}^{J} a_j x_jq + N_q, ∀q. We are motivated to use such a representation and decomposition because the x_jq have a physically meaningful interpretation as sources with specific temporal and morphological properties.

The penalty terms may take different forms depending on the properties of the estimated sources. For example, if the sources are sparse, we can apply the l_p quasi-norm J_x(x_jq) = ||x_jq||_p^p = Σ_t |x_jtq|^p with 0 < p ≤ 1, or alternatively the smooth approximation J_x(x_jq) = (Σ_t |x_jtq|^2 + ε)^{p/2}, where ε ≥ 0 is a small constant. In order to impose local smoothing of the signals, we can apply the total variation (TV) penalty J_x(x_jq) = Σ_{t=1}^{T−1} |x_{j,t+1,q} − x_{j,t,q}|, or, if we wish to achieve a smoother solution, J_x(x_jq) = Σ_{t=1}^{T−1} sqrt(|x_{j,t+1,q} − x_{j,t,q}|^2 + ε) [12].

The gradients of the local cost function (4) with respect to the unknown vectors a_j and x_jq are expressed by

    ∂D_Fq^{(j)}(Y_q^{(j)} || a_j x_jq) / ∂x_jq = a_j^T a_j x_jq − a_j^T Y_q^{(j)} + α_Xq^{(j)} Ψ_x(x_jq),   (6)

    ∂D_Fq^{(j)}(Y_q^{(j)} || a_j x_jq) / ∂a_j = a_j x_jq x_jq^T − Y_q^{(j)} x_jq^T + α_a^{(j)} Ψ_a(a_j),   (7)

where the matrix functions Ψ_a(a_j) and Ψ_x(x_jq) are defined as²

    Ψ_a(a_j) = ∂J_a(a_j)/∂a_j,   Ψ_x(x_jq) = ∂J_x(x_jq)/∂x_jq.   (8)

By equating the gradient components to zero, we obtain a new set of local learning rules:

    x_jq ← (1 / (a_j^T a_j)) (a_j^T Y_q^{(j)} − α_Xq^{(j)} Ψ_x(x_jq)),   (9)

    a_j ← (1 / (x_jq x_jq^T)) (Y_q^{(j)} x_jq^T − α_a^{(j)} Ψ_a(a_j)),   (10)
for j = 1, 2, . . . , J and q = 1, 2, . . . , Q.

² If the penalty functions are non-smooth, we can use a sub-gradient instead of the gradient.

However, it should be noted that the above algorithm provides only a regularized least squares solution, and this is not sufficient to extract the desired sources, especially in the under-determined case, since the problem may have many solutions. To solve this problem, we additionally need to impose nonlinear projections P_Ωj(x_jq) or filtering after each iteration or each epoch, in order to enforce that the individual estimated sources x_jq satisfy the desired constraints. All such projections or filtering can be imposed individually for each source x_jq, depending on the morphological properties of the source signals. A similar nonlinear projection P̄_Ωj(a_j) can be applied, if necessary, individually for each vector a_j of the mixing matrix A. Hence, using the projected gradient approach, our algorithm takes the following more general and flexible form:

    x_jq ← (1 / (a_j^T a_j)) (a_j^T Y_q^{(j)} − α_Xq^{(j)} Ψ_x(x_jq)),   x_jq ← P_Ωj{x_jq},   (11)

    a_j ← (1 / (x_jq x_jq^T)) (Y_q^{(j)} x_jq^T − α_a^{(j)} Ψ_a(a_j)),   a_j ← P̄_Ωj{a_j},   (12)
where P_Ωj{x_jq} denotes, in general, a nonlinear projection, filtering, transformation, local interpolation/extrapolation, inpainting, or smoothing of the row vector x_jq. Such projections or transformations can take many different forms depending on the required properties of the estimated sources (see the next section for more details).

Remark 1. In practice, it is necessary to normalize the column vectors a_j or the row vectors x_jq to unit-length vectors (in the sense of the l_p norm, p = 1, 2, . . . , ∞) in each iterative step. In the special case of the l_2 norm, the above algorithm can be further simplified by neglecting the denominator in (11) or (12), respectively. After estimating the normalized matrices A and X̃_q (i.e., X_q normalized to unit-length rows), we can estimate the diagonal matrices, if necessary, as

    D_q = diag{A^+ Y_q X̃_q^+},   (q = 1, 2, . . . , Q),   (13)

where A^+ denotes the pseudo-inverse of A.

3 Flexible Component Analysis (FCA) – Possible Extensions and Practical Implementations
The above simple algorithm can be further extended or improved (with respect to convergence rate and performance). First of all, different cost functions can be used for estimating the rows of the matrices X_q (q = 1, 2, . . . , Q) and the columns of the matrix A. Furthermore, the columns of A can be estimated simultaneously, instead of one by one. For example, by minimizing the set of cost functions (4) with respect to x_jq, and simultaneously the cost function (3) with normalization of the columns a_j to unit l_2-norm, we obtain a new FCA learning algorithm in which the individual rows of X_q are updated locally (row by row) and the matrix A is updated globally (all columns a_j simultaneously):

    x_jq ← a_j^T Y_q^{(j)} − α_Xq^{(j)} Ψ_x(x_jq),   x_jq ← P_Ωj{x_jq},   (j = 1, . . . , J),   (14)

    A ← (Y_q X_q^T − α_A Ψ_A(A)) (X_q X_q^T)^{−1},   A ← P̄_Ω(A),   (q = 1, . . . , Q),   (15)
with normalization (scaling) of the columns of A to unit length in the sense of the l_2 norm, where Ψ_A(A) = ∂J_A(A)/∂A. In order to estimate the basis matrix A, we can alternatively use the following global cost function (see Eq. (2)): D_F(Y || A X) = (1/2) ||Y − A X||_F^2 + α_A J_A(A). The minimization of this cost function for a fixed X leads to the updating rule

    A ← (Y X^T − α_A Ψ_A(A)) (X X^T)^{−1}.   (16)

3.1 Nonnegative Matrix/Tensor Factorization
In order to enforce sparsity and nonnegativity constraints for all the parameters, a_ij ≥ 0, x_jtq ≥ 0, ∀i, t, q, we can apply the "half-wave rectifying" element-wise projection [x]_+ = max{ε, x}, where ε is a small constant used to avoid numerical instabilities and remove background noise (typically, ε = 10^{−9} to 10^{−2}). Simultaneously, we can impose weak sparsity constraints by using the l_1-norm penalty functions J_A(A) = ||A||_1 = Σ_ij a_ij and J_x(x_jq) = ||x_jq||_1 = Σ_t x_jtq. In such a case, the FCA algorithm for the 3D NTF2 model (i.e., PARAFAC2 with nonnegativity constraints) takes the following form:

    x_jq ← [a_j^T Y_q^{(j)} − α_Xq^{(j)} 1]_+,   (j = 1, . . . , J),   (q = 1, . . . , Q),   (17)

    A ← [(Y X^T − α_A 1) (X X^T)^{−1}]_+,   (18)
with normalization of the columns of A in each iterative step to unit length in the l_2 norm, where 1 denotes a matrix of all ones of appropriate size. It should be noted that the above algorithm can easily be extended to semi-NMF or semi-NTF, in which only some sources x_jq are nonnegative and/or the mixing matrix A is bipolar, by simply removing the corresponding "half-wave rectifying" projections. Moreover, a similar algorithm can be used for arbitrary bounded sources with known lower and/or upper bounds (or supports), i.e., l_jq ≤ x_jtq ≤ u_jq, ∀t, rather than x_jtq ≥ 0, by using suitably chosen nonlinear projections which bound the solutions.

3.2 Smooth Component Analysis (SmoCA)
In order to enforce smooth estimation of the sources x_jq, for all or some preselected indices j and q, we may apply after each iteration (epoch) local smoothing or filtering of the estimated sources, such as MA (moving average), EMA, SAR, or ARMA models. A quite efficient way of smoothing and denoising can be achieved by minimizing the following cost function (which satisfies a multi-resolution criterion):

    J(x_jq) = Σ_{t=1}^{T} (x_jtq − x̂_jtq)^2 + Σ_{t=1}^{T−1} λ_jtq g_t(x̂_{j,t+1,q} − x̂_jtq),   (19)

where x̂_jtq is a smoothed version of the actually estimated (noisy) x_jtq, g_t(u) is a convex, continuously differentiable function with a global minimum at u = 0, and the λ_jtq are parameters that are data-driven and chosen automatically.
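For the particular quadratic choice g_t(u) = u² with a constant λ (a simplification of our own; the paper's λ_jtq are data-driven), the cost (19) is minimized in closed form by solving the linear system (I + λ DᵀD) x̂ = x, where D is the first-difference operator. A minimal sketch:

```python
import numpy as np

def smooth_row(x, lam=5.0):
    """Minimize sum_t (x_t - xhat_t)^2 + lam * sum_t (xhat_{t+1} - xhat_t)^2
    in closed form: (I + lam * D^T D) xhat = x, D = first-difference matrix."""
    T = len(x)
    D = np.diff(np.eye(T), axis=0)        # (T-1) x T first-difference operator
    return np.linalg.solve(np.eye(T) + lam * D.T @ D, x)

t = np.linspace(0, 1, 200)
noisy = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.default_rng(2).standard_normal(200)
smoothed = smooth_row(noisy)
# the penalty strictly reduces the roughness of the estimate
assert np.sum(np.diff(smoothed) ** 2) < np.sum(np.diff(noisy) ** 2)
```

A general convex g_t without a closed form would instead require an iterative solver, but the quadratic case already conveys the smoothing behavior.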
3.3 Multi-way Sparse Component Analysis (MSCA)
In sparse component analysis, the objective is to estimate sources x_jq which are sparse, usually with a prescribed or specified sparsity profile, possibly with additional constraints such as local smoothness. In order to enforce that the estimated sources are sufficiently sparse, we need to apply a suitable nonlinear projection or filtering which allows us to adaptively sparsify the data. The simplest nonlinear projection which enforces some sparsity on the normalized data is the following weakly nonlinear element-wise projection:

    P_Ωj(x_jtq) = sign(x_jtq) |x_jtq|^{1+α_jq},   (20)
where α_jq is a small parameter which controls sparsity. Such a nonlinear projection can be considered a simple (trivial) shrinking. Alternatively, we may use more sophisticated adaptive local soft or hard shrinking in order to sparsify the individual sources. Usually, we have a three-step procedure: first, we perform the linear transformation x_w = x W; then the nonlinear shrinking (adaptive thresholding), e.g., the soft element-wise shrinking P_Ω(x_w) = sign(x_w) [|x_w| − δ]_+; and finally the inverse transform x̂ = P_Ω(x_w) W^{−1}. The threshold δ > 0 is usually not fixed but is adaptively (data-driven) selected, or it gradually decreases to zero with the iterations. The optimal choice of shrinkage function depends on the distribution of the data. We have tested various shrinkage functions with gradually decreasing δ: the hard thresholding rule, the soft thresholding rule, the nonnegative garrote rule, the n-degree garrote, and the posterior median shrinkage rule [13]. For all of them, we have obtained promising results, and usually the best performance is achieved by the simple hard rule.

Our method is somewhat related to the MoCA and SCA algorithms proposed recently by Bobin et al., Daubechies et al., Elad et al., and many others [10,14,11]. However, in contrast to these approaches, our algorithms are local and more flexible. Moreover, the proposed FCA is more general than SCA, since it is not limited to a sparse representation via shrinking and linear transformation, but allows us to impose general and flexible (soft and hard) constraints, nonlinear projections, transformations, and filtering.³ Furthermore, in contrast to many alternative algorithms which process the columns of X_q, we process their rows, which represent the source signals directly.

³ In this paper, in fact, we use two kinds of constraints: soft (or weak) constraints via penalty and regularization terms in the local cost functions, and hard (strong) constraints via iteratively adaptive post-processing using nonlinear projections or filtering.

We can outline the FCA algorithm as follows:

1. Set the initial values of the matrix A and the matrices X_q, and normalize the vectors a_j to unit l_2-norm length.
2. Calculate the new estimates x_jq of the matrices X_q using the iterative formula (14).
3. If necessary, enforce nonlinear projection or filtering by imposing natural constraints on the individual sources (the rows of X_q, q = 1, 2, . . . , Q), such as nonnegativity, boundedness, smoothness, and/or sparsity.
4. Calculate the new estimate of A from (16), normalize each column of A to unit length, and impose additional constraints on A, if necessary.
5. Repeat steps (2)–(4) until the convergence criterion is satisfied.
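The outline above can be sketched for the single-slice (Q = 1), nonnegative case, using the local updates (11)–(12) with the "half-wave rectifying" projection of Sect. 3.1 as P_Ω and small l1 penalties. This is a hedged toy reconstruction of ours (not the authors' NTFLAB code), and all parameter values are arbitrary choices:

```python
import numpy as np

def fca_local(Y, J, n_iter=500, alpha_x=1e-3, alpha_a=1e-3, eps=1e-9):
    """Local alternating updates with half-wave rectifying projection
    [x]_+ = max(eps, x), cf. Eqs. (11)-(12) with l1 penalties, Q = 1."""
    rng = np.random.default_rng(0)
    I, T = Y.shape
    A = np.abs(rng.standard_normal((I, J)))
    X = np.abs(rng.standard_normal((J, T)))
    for _ in range(n_iter):
        E = Y - A @ X                                # current residual
        for j in range(J):
            Yj = E + np.outer(A[:, j], X[j])         # Y^(j) of Eq. (5)
            # regularized LS step + nonnegativity projection (Eq. (11))
            X[j] = np.maximum(eps, (A[:, j] @ Yj - alpha_x) / (A[:, j] @ A[:, j] + eps))
            # update, project, and renormalize the j-th basis vector (Eq. (12))
            A[:, j] = np.maximum(eps, (Yj @ X[j] - alpha_a) / (X[j] @ X[j] + eps))
            A[:, j] /= np.linalg.norm(A[:, j])
            E = Yj - np.outer(A[:, j], X[j])         # refresh residual
    return A, X

# under-determined toy problem: I = 4 sensors, J = 5 sparse nonnegative sources
rng = np.random.default_rng(3)
X_true = rng.random((5, 300)) * (rng.random((5, 300)) > 0.7)
A_true = rng.random((4, 5))
A_est, X_est = fca_local(A_true @ X_true, J=5)
rel_err = np.linalg.norm(A_true @ X_true - A_est @ X_est) / np.linalg.norm(A_true @ X_true)
print(rel_err)      # reconstruction quality of the fit
```

Column normalization after every update implements Remark 1 and keeps the scaling ambiguity out of A.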
3.4 Multi-layer Blind Identification
In order to improve the performance of the FCA algorithms proposed in this paper, especially for ill-conditioned and badly scaled data, and also to reduce the risk of getting stuck in local minima of the non-convex alternating minimization, we have developed a simple hierarchical multi-stage procedure [15], combined with multi-start initializations, in which we perform a sequential decomposition of matrices as follows. In the first step, we perform the basic decomposition (factorization) Y_q ≈ A^{(1)} X_q^{(1)} using any suitable FCA algorithm presented in this paper. In the second stage, the results obtained from the first stage are used to build up a new tensor X̃^{(1)} from the estimated frontal slices, defined as Y_q^{(1)} = X_q^{(1)}, (q = 1, 2, . . . , Q). In the next step, we perform a similar decomposition for the newly available frontal slices: Y_q^{(1)} = X_q^{(1)} ≈ A^{(2)} X_q^{(2)}, using the same or different update rules. We continue the decomposition taking into account only the last achieved components. The process can be repeated arbitrarily many times until some stopping criteria are satisfied. In each step, we usually obtain gradual improvements of the performance. Thus, our FCA model has the following form: Y_q ≈ A^{(1)} A^{(2)} · · · A^{(L)} X_q^{(L)}, (q = 1, 2, . . . , Q), with the final components A = A^{(1)} A^{(2)} · · · A^{(L)} and X_q = X_q^{(L)}. Physically, this means that we build up a system that has many layers or cascade connections of L mixing subsystems. The key point in our approach is that the learning (update) process to find the matrices X_q^{(l)} and A^{(l)} is performed sequentially, i.e., layer by layer, where each layer is initialized randomly. In fact, we found that the hierarchical multi-layer approach plays a key role, and that it must be applied in order to achieve satisfactory performance of the proposed algorithms.
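The multi-layer procedure can be sketched as a wrapper around any single-stage factorization; below we use a plain multiplicative-update NMF as a stand-in for one FCA layer (a hedged toy sketch of ours, not the authors' implementation):

```python
import numpy as np

def nmf_layer(Y, J, n_iter=200):
    """Stand-in single-layer factorization (Lee-Seung multiplicative NMF)."""
    rng = np.random.default_rng(0)
    A = rng.random((Y.shape[0], J)) + 0.1
    X = rng.random((J, Y.shape[1])) + 0.1
    for _ in range(n_iter):
        X *= (A.T @ Y) / (A.T @ A @ X + 1e-12)
        A *= (Y @ X.T) / (A @ X @ X.T + 1e-12)
    return A, X

def multilayer(Y, J, L=3):
    """Sequential decomposition Y ~ A(1) A(2) ... A(L) X(L), cf. Sect. 3.4."""
    A_total = np.eye(Y.shape[0])
    X = Y
    for _ in range(L):
        A_l, X = nmf_layer(X, J)     # re-factorize the previous layer's output
        A_total = A_total @ A_l      # accumulate A = A(1) A(2) ... A(L)
    return A_total, X

Y = np.random.default_rng(4).random((6, 50))
A_hat, X_hat = multilayer(Y, J=4, L=2)
rel_err = np.linalg.norm(Y - A_hat @ X_hat) / np.linalg.norm(Y)
print(A_hat.shape, rel_err)
```

Each layer re-factorizes the previous layer's source matrix, and the mixing matrices are accumulated by multiplication into the final A.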
4 Simulation Results
The algorithms presented in this paper have been tested on many difficult benchmarks of signals and images with various temporal and morphological properties and additive noise. Due to space limitations, we present here only one illustrative example. Sparse nonnegative signals with different sparsity and smoothness profiles are collected in 10 slices X_q (Q = 10), forming the tensor X ∈ R^{5×1000×10}. The observed (mixed) 3D data Y ∈ R^{4×1000×10} are obtained by multiplying the randomly generated mixing matrix A ∈ R^{4×5} by X. The simulation results are illustrated in Fig. 2 (only for the frontal slice q = 1).
Fig. 2. (left) Original 5 spectra (representing X_1); (middle) observed 4 mixed spectra Y_1 generated by a random matrix A ∈ R^{4×5} (under-determined case); (right) 5 spectra X_1 estimated with our algorithm given by (17)–(18), using 10 layers and α_A = α_X1^{(j)} = 0.05. The Signal-to-Interference Ratios (SIR) for A and X_1 are SIR_A = 31.6, 34, 31.5, 29.9, 23.1 [dB] and SIR_X1 = 28.5, 19.1, 29.3, 20.3, 23.2 [dB], respectively.
5 Conclusions and Discussion
The main objective and motivation of this paper is to derive simple algorithms which are suitable for both the under-determined (over-complete) and the over-determined case. We have applied simple local cost functions with flexible penalty or regularization terms, which allows us to derive a family of robust FCA algorithms in which the sources may have different temporal and morphological properties or different sparsity profiles. Exploiting these properties and a priori knowledge about the character of the sources, we have proposed a family of efficient algorithms for estimating sparse, smooth, and/or nonnegative sources, even if the number of sensors is smaller than the number of hidden components, under the assumption that some information about the morphological or desired properties of the sources is available. This is an original extension of the standard MoCA and NMF/NTF algorithms and, to the authors' best knowledge, the first time such algorithms have been applied to multi-way PARAFAC models. In comparison to ordinary BSS algorithms, the proposed algorithms are shown to be superior in terms of performance, speed, and convergence properties. We implemented the discussed algorithms in MATLAB [16]. The approach can be extended to other applications, such as dynamic MRI imaging, and it can be used as an alternative or improved reconstruction method for k-t BLAST, k-t SENSE, or k-t SPARSE, because our approach alleviates the problem of getting stuck in local minima and usually provides better performance than the standard FOCUSS algorithms. This research is motivated by potential applications of the proposed models and algorithms in three areas of data analysis (especially EEG and fMRI) and signal/image processing: (i) multi-way blind source separation, (ii) model
reduction and selection, and (iii) sparse image coding. Our preliminary experimental results are promising. The models can be further extended by imposing additional natural constraints such as orthogonality, continuity, closure, unimodality, local rank-selectivity, and/or by taking into account a priori knowledge about the specific components.
References
1. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004)
2. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference on Computer Vision (ICCV), pp. 50–57 (2005)
3. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56–67. Springer, Heidelberg (2006)
4. Miwakeichi, F., Martinez-Montes, E., Valdes-Sosa, P., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space-time-frequency components using parallel factor analysis. NeuroImage 22, 1035–1045 (2004)
5. Mørup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938–947 (2006)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)
7. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing (new revised and improved edition). John Wiley, New York (2003)
8. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Processing Systems, Vancouver, Canada, pp. 283–290 (2005)
9. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis (in press, 2006)
10. Bobin, J., Starck, J.L., Fadili, J., Moudden, Y., Donoho, D.L.: Morphological component analysis: an adaptive thresholding strategy. IEEE Transactions on Image Processing (in print, 2007)
11. Elad, M.: Why simple shrinkage is still relevant for redundant representations? IEEE Transactions on Information Theory 52, 5559–5569 (2006)
12. Kovac, A.: Smooth functions and local extreme values. Computational Statistics and Data Analysis 51, 5155–5171 (2007)
13. Tao, T., Vidakovic, B.: Almost everywhere behavior of general wavelet shrinkage operators. Applied and Computational Harmonic Analysis 9, 72–82 (2000)
14. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57, 1413–1457 (2004)
15. Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorization. Electronics Letters 42, 947–948 (2006)
16. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA

Rui Xu¹ and Yen-Wei Chen²,³

¹ Graduate School of Engineering and Science, Ritsumeikan University, Japan, [email protected]
² College of Information Science and Engineering, Ritsumeikan University, Japan, [email protected]
³ School of Electronic and Information Engineering, Dalian University of Technology, China
Abstract. Appearance models are important for medical image analysis tasks such as segmentation. Principal component analysis (PCA) is an efficient method to build appearance models; however, the 3D medical volumes must first be unfolded into long 1D vectors before PCA can be applied. For large medical volumes, such an unfolding preprocessing causes two problems: a huge computational cost and poor generalization performance. In this paper, a method named generalized 3D-PCA is proposed to build appearance models for medical volumes. In our method, the volumes are treated directly as third-order tensors when building the model, without the unfolding preprocessing. The output of our method is three matrices whose columns are formed by the orthogonal bases of the three mode subspaces. With the help of these matrices, bases in the third-order tensor space can be constructed. Owing to these properties, our method does not suffer from the two problems of the PCA-based methods. Eighteen 256 × 256 × 26 MR brain volumes are used in the experiments on building appearance models. Leave-one-out testing shows that our method performs well in building appearance models for medical volumes even when few samples are used for training.
1 Introduction
In recent years, much research has focused on how to extract statistical information from 3D medical volumes. Inspired by 2D active shape models (ASMs) [1], 3D ASMs [2][3] have been proposed to extract statistical shape information from 3D medical volumes. 3D ASMs are an important class of segmentation methods, but they do not consider the
This work was supported in part by the Grant-in-Aid for Scientific Research from the Japanese Ministry of Education, Science, Culture and Sports under Grant No. 19500161, and by the Strategic Information and Communications R&D Promotion Programme (SCOPE) under Grant No. 051307017.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 821–830, 2008. c Springer-Verlag Berlin Heidelberg 2008
information of the voxel's intensity. Therefore, the authors of [4] extended the work on 2D active appearance models (AAMs) [5] and proposed 3D AAMs, which combine shape models and appearance models for 3D medical data. In 3D AAMs, the appearance model for the volumes is built by a method based on principal component analysis (PCA). PCA is a widely used method for extracting statistical information and has been successfully applied in many fields, such as image representation and face recognition. When PCA-based methods are applied to build appearance models for medical volumes, the 3D volumes must first be unfolded into long 1D vectors so that the data can be processed by PCA. There are two ways to implement PCA. In the first, the bases of the unfolded vector space are obtained by calculating eigenvectors directly from the covariance matrix Cov = A · A^T, where the columns of the matrix A are the unfolded vectors. The dimension of this covariance matrix depends on the dimension of the unfolded vectors; if the 3D volumes are large, the covariance matrix is also very large, so calculating its eigenvectors incurs a huge computing cost. The other way to implement PCA is similar to the one used in the eigenfaces work [6]. Instead of calculating the bases directly from Cov, the eigenvectors u_i are first calculated from the matrix Cov' = A^T · A; the bases of the unfolded vector space are then obtained as v_i = A · u_i. In the second way, the dimension of Cov' depends on the number of training samples, which is much smaller than the dimension of the unfolded vectors, so this implementation does not suffer from the huge computing cost. However, only M − 1 bases are available in the second way, where M is the number of training samples.
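The second implementation, the Gram-matrix trick of eigenfaces, can be sketched in NumPy as follows; the function name and the toy dimensions are illustrative assumptions, not from the paper:

```python
import numpy as np

def pca_gram_trick(A, n_components=None):
    """Eigenfaces-style PCA: eigendecompose the small M x M matrix A^T A
    instead of the huge D x D covariance A A^T.
    A: (D, M) matrix whose columns are the M zero-mean unfolded samples."""
    D, M = A.shape
    gram = A.T @ A                        # (M, M), cheap when M << D
    eigvals, U = np.linalg.eigh(gram)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]     # sort descending
    eigvals, U = eigvals[order], U[:, order]
    V = A @ U                             # map back: v_i = A u_i
    V /= np.linalg.norm(V, axis=0, keepdims=True) + 1e-12
    k = n_components or (M - 1)           # at most M-1 informative bases
    return V[:, :k], eigvals[:k]

# Toy example: 10 zero-mean samples of dimension 1000.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
A -= A.mean(axis=1, keepdims=True)
V, lam = pca_gram_trick(A)                # V: (1000, 9) orthonormal columns
```

Note that `v_i^T v_j = u_i^T A^T A u_j = λ_j u_i^T u_j`, so the mapped-back bases are orthogonal and only need rescaling, which is why the trick works.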
In the medical field, the typical dimension of the unfolded vectors is several million, whereas the typical number of available medical data sets is only several tens. The obtained bases are therefore not enough to represent an untrained vector in the huge vector space, so the second way suffers from poor generalization when large 3D volumes are used. The generalization problem might be overcome by increasing the number of training samples; however, since the dimension of the unfolded vectors is huge, it is difficult to collect enough medical data for training. Additionally, with the development of medical imaging techniques, medical volumes are becoming larger and larger, so the conflict between the large dimension of the volumes and the small number of training samples will become more prominent. It is therefore desirable to study how to build appearance models for medical volumes when the training samples are relatively few. In this paper, we propose a method called generalized 3D-PCA (G3D-PCA) to build appearance models for medical volumes. In our method, the 3D volumes are treated as third-order tensors instead of being unfolded into long 1D vectors. The bases of the tensor space are constructed from three matrices whose columns are the leading eigenvectors calculated from the covariance matrices of the three mode subspaces. Since the dimensions of these covariance matrices depend only on the dimensions of the corresponding
mode subspaces, our method overcomes the problem of huge computing cost. Additionally, the number of bases is not limited by the number of training samples, so our method also overcomes the generalization problem. The proposed method is based on multilinear algebra [7][8], the mathematical tool for higher-order tensors. Recently, multilinear-algebra-based methods have been applied to computer vision and pattern recognition with good results. The authors of [9] proposed tensorfaces to build statistical models of faces under different conditions, such as illumination and viewpoint. A method called concurrent subspaces analysis (CSA) [10] was proposed for video sequence reconstruction and digit recognition. However, we have not found research applying multilinear-algebra-based methods to medical image analysis.
2 Background Knowledge of Multilinear Algebra
We introduce some background knowledge of multilinear algebra in this section; for more detail, readers can refer to [7][8]. The notation used in this paper is as follows. Scalars are denoted by italic letters, i.e. (a, b, ...) or (A, B, ...). Bold lower-case letters, i.e. (a, b, ...), represent vectors. Matrices are denoted by bold upper-case letters, i.e. (A, B, ...), and higher-order tensors by calligraphic upper-case letters, i.e. (A, B, ...). An N-th order tensor A is defined as a multiway array with N indices, where A ∈ R^{I1×I2×...×IN}. Elements of the tensor A are denoted a_{i1...in...iN}, where 1 ≤ in ≤ In. The space of N-th order tensors is composed of N mode subspaces. From the tensor viewpoint, scalars, vectors and matrices can be seen as zeroth-order, first-order and second-order tensors respectively.

Varying the n-th index in while keeping the other indices fixed yields the "mode-n vectors" of the tensor A. The "mode-n matrix" A(n) ∈ R^{In×(I1·...·In−1·In+1·...·IN)} is formed by arranging all the mode-n vectors sequentially as its column vectors. The procedure of forming the mode-n matrix is called unfolding of a tensor. Fig. 1 shows how to unfold a third-order tensor into its three mode-n matrices.

Fig. 1. Example of unfolding the third-order tensor to the three mode-n matrices

The mode-n product of a tensor A ∈ R^{I1×I2×...×IN} and a matrix U ∈ R^{Jn×In}, denoted A ×n U, is an (I1 × ... × In−1 × Jn × In+1 × ... × IN) tensor whose entries are defined by

(A ×n U)_{i1...in−1 jn in+1...iN} = Σ_{in} a_{i1...in...iN} u_{jn,in}.

These entries can also be calculated by a matrix product, B(n) = U · A(n), where B(n) is the mode-n matrix of the tensor B = A ×n U. The mode-n product has two properties: for m ≠ n, (A ×n U) ×m V = (A ×m V) ×n U = A ×n U ×m V; and (A ×n U) ×n V = A ×n (V · U). The inner product of two tensors A, B ∈ R^{I1×I2×...×IN} is defined by ⟨A, B⟩ = Σ_{i1} Σ_{i2} ... Σ_{iN} a_{i1i2...iN} b_{i1i2...iN}. The Frobenius norm of a tensor A is defined by ||A|| = √⟨A, A⟩; it can also be calculated from its mode-n matrix, ||A|| = ||A(n)|| = √tr(A(n) · A(n)^T), where tr(·) is the trace of a matrix.
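The unfolding and mode-n product operations defined above can be sketched in NumPy; the helper names and the toy shapes are our own illustration (the column ordering within the mode-n matrix is one of several equivalent conventions):

```python
import numpy as np

def unfold(A, n):
    """Mode-n matrix A_(n): the mode-n fibers of A arranged as columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold, for a target tensor of the given shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(A, U, n):
    """A x_n U, computed via the identity B_(n) = U @ A_(n)."""
    shape = list(A.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(A, n), n, tuple(shape))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((3, 5))
B = mode_n_product(A, U, 1)   # shape (4, 3, 6)
```

The commutation property (A ×1 U) ×2 V = (A ×2 V) ×1 U holds for this implementation because the mode-n product is a contraction along one mode, independent of the unfolding convention.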
3 Appearance Models Built by Generalized 3D-PCA
A medical volume, such as an MR volume, can be seen as a third-order tensor. The three modes have their own physical meanings (the coronal, sagittal and transversal axes respectively). Supposing there is a series of medical volumes collected from the same organ but different patients, there must be some common statistical information in the appearance (the voxels' intensities) of these volumes. In this paper, we propose a multilinear-algebra-based algorithm, named G3D-PCA, to build such appearance models for 3D medical volumes. We formulate the model building in G3D-PCA as follows. Given a series of third-order tensors with zero mean, Ai ∈ R^{I1×I2×I3}, i = 1, 2, ..., M, with Σ_{i=1}^{M} Ai = 0, we hope to find three matrices with orthogonal columns, U ∈ R^{I1×J1}, V ∈ R^{I2×J2}, W ∈ R^{I3×J3} (J1 < I1, J2 < I2, J3 < I3), which can reconstruct each Ai from the corresponding dimension-reduced tensor Bi ∈ R^{J1×J2×J3} with least error. Each reconstructed tensor Âi can be expressed by Eq. 1. The matrices U, V, W contain the statistical intensity information of the volume set {Ai}, and their columns are respectively the orthogonal bases of the three mode subspaces.

Âi = Bi ×1 U ×2 V ×3 W    (1)

We need to minimize the cost function S, shown in Eq. 2, to find the three matrices.

S = Σ_{i=1}^{M} ||Ai − Âi||² = Σ_{i=1}^{M} ||Ai − Bi ×1 U ×2 V ×3 W||²    (2)
If the tensors do not have zero mean, we can subtract the mean from each tensor to obtain a new series of tensors A′i with zero mean, i.e. A′i = Ai − (1/M) Σ_{i=1}^{M} Ai.
Algorithm 1. Iteration Algorithm to Compute the Matrices Uopt, Vopt, Wopt
IN: a series of third-order tensors Ai ∈ R^{I1×I2×I3}, i = 1, 2, ..., M.
OUT: matrices Uopt ∈ R^{I1×J1}, Vopt ∈ R^{I2×J2}, Wopt ∈ R^{I3×J3} (J1 < I1, J2 < I2, J3 < I3) with orthogonal column vectors.

1. Initial values: k = 0; the columns of U0, V0, W0 are the first J1, J2, J3 leading eigenvectors of the matrices Σ_{i=1}^{M} (Ai(1) · Ai(1)^T), Σ_{i=1}^{M} (Ai(2) · Ai(2)^T), Σ_{i=1}^{M} (Ai(3) · Ai(3)^T) respectively.
2. Iterate until convergence:
   – Maximize S′ = Σ_{i=1}^{M} ||Ci ×1 U^T||², with Ci = Ai ×2 Vk^T ×3 Wk^T.
     Solution: U whose columns are the first J1 leading eigenvectors of Σ_{i=1}^{M} (Ci(1) · Ci(1)^T). Set Uk+1 = U.
   – Maximize S′ = Σ_{i=1}^{M} ||Ci ×2 V^T||², with Ci = Ai ×1 Uk+1^T ×3 Wk^T.
     Solution: V whose columns are the first J2 leading eigenvectors of Σ_{i=1}^{M} (Ci(2) · Ci(2)^T). Set Vk+1 = V.
   – Maximize S′ = Σ_{i=1}^{M} ||Ci ×3 W^T||², with Ci = Ai ×1 Uk+1^T ×2 Vk+1^T.
     Solution: W whose columns are the first J3 leading eigenvectors of Σ_{i=1}^{M} (Ci(3) · Ci(3)^T). Set Wk+1 = W.
   – k = k + 1
3. Set Uopt = Uk, Vopt = Vk, Wopt = Wk.
In Eq. 2, only the tensors Ai are known. However, supposing the three matrices are known, finding the Bi that minimize Eq. 2 is merely a traditional linear least-squares problem, which gives Theorem 1.

Theorem 1. Given fixed matrices U, V, W, the tensors Bi that minimize the cost function Eq. 2 are given by Bi = Ai ×1 U^T ×2 V^T ×3 W^T.

With the help of Theorem 1, Theorem 2 can be obtained.

Theorem 2. If the tensors Bi are chosen as Bi = Ai ×1 U^T ×2 V^T ×3 W^T, minimization of the cost function Eq. 2 is equivalent to maximization of the cost function S′ = Σ_{i=1}^{M} ||Ai ×1 U^T ×2 V^T ×3 W^T||².

There is no closed-form solution that simultaneously resolves the matrices U, V, W for the cost function S′; however, if two of them are fixed, there is an explicit solution for the third that maximizes S′.

Lemma 1. Given fixed matrices V, W, the cost function S′ is maximized if the columns of the matrix U are selected as the first J1 leading eigenvectors of the matrix Σ_{i=1}^{M} (Ci(1) · Ci(1)^T), where Ci(1) is the mode-1 matrix of the tensor Ci = Ai ×2 V^T ×3 W^T.
Fig. 2. Illustration of reconstructing a volume by the principal components Bi and the three orthogonal bases of mode subspaces Uopt , Vopt , Wopt
Lemma 2. Given fixed matrices U, W, the cost function S′ is maximized if the columns of the matrix V are selected as the first J2 leading eigenvectors of the matrix Σ_{i=1}^{M} (Ci(2) · Ci(2)^T), where Ci(2) is the mode-2 matrix of the tensor Ci = Ai ×1 U^T ×3 W^T.

Lemma 3. Given fixed matrices U, V, the cost function S′ is maximized if the columns of the matrix W are selected as the first J3 leading eigenvectors of the matrix Σ_{i=1}^{M} (Ci(3) · Ci(3)^T), where Ci(3) is the mode-3 matrix of the tensor Ci = Ai ×1 U^T ×2 V^T.

According to Lemmas 1, 2 and 3, we can use an iterative algorithm to obtain the optimal matrices Uopt, Vopt, Wopt that maximize the cost function S′; it is summarized as Algorithm 1. We terminate the iteration when the value of the cost function does not change significantly between two consecutive iterations. Convergence is usually fast; in our experience, two or three iterations are enough. Using the calculated matrices Uopt, Vopt, Wopt, each volume Ai can be approximated with least error by Âi = Bi ×1 Uopt ×2 Vopt ×3 Wopt, where Bi = Ai ×1 Uopt^T ×2 Vopt^T ×3 Wopt^T. The approximation is illustrated in Fig. 2. In G3D-PCA, the three matrices Uopt, Vopt, Wopt construct the bases of the third-order tensor space, and the components of the tensors Bi are the principal components.
Table 1. Comparison of G3D-PCA and the two PCA-based methods, given the training volumes Ai ∈ R^{I1×I2×I3}, i = 1, 2, ..., M

Methods           | Dimension of Covariance Matrices                    | Maximal Number of Available Bases
PCA (first way)   | (I1·I2·I3) × (I1·I2·I3)                             | I1·I2·I3
PCA (second way)  | M × M                                               | M − 1
G3D-PCA           | mode-1: I1 × I1; mode-2: I2 × I2; mode-3: I3 × I3   | I1·I2·I3
Compared to the PCA-based methods, G3D-PCA has advantages in two respects. Given M training volumes of dimension I1 × I2 × I3, the covariance matrix in the first PCA implementation has a huge dimension of (I1·I2·I3) × (I1·I2·I3), so that method suffers from a huge computing cost. Unlike the PCA-based methods, G3D-PCA finds three orthogonal bases in the three mode subspaces rather than one orthogonal basis in the very long 1D vector space. The three bases are obtained by calculating the leading eigenvectors of the three covariance matrices of the mode subspaces. Since these matrices have dimensions of only I1 × I1, I2 × I2 and I3 × I3 respectively, the eigenvector calculation is efficient, so G3D-PCA overcomes the problem of huge computing cost. In the other respect, G3D-PCA can obtain enough bases to represent untrained volumes, in contrast to the second PCA implementation. In theory, the maximal number of tensor bases constructed from the bases of the three mode subspaces is I1·I2·I3. The number of available bases is thus not limited by the number of training samples, which gives G3D-PCA good generalization even when few samples are used for training. Table 1 compares G3D-PCA with the two PCA-based methods.
4 Experiments
We use eighteen MR T1-weighted volumes from the Vanderbilt database [11] to build appearance models of the human brain. These eighteen volumes were collected from different patients, and their dimensions are 256 × 256 × 26. We chose one volume as the template and aligned the other seventeen onto it by similarity-transformation-based registration. A 3D similarity transformation has seven parameters: three translations, three rotation angles and one scaling factor. Fig. 3 gives examples of three of the registered volumes.
Fig. 3. Example of registered volumes
For these volumes, the first way to implement PCA suffers from a huge computing cost. A 256 × 256 × 26 volume unfolds to a vector of the huge dimension 1703936, so the covariance matrix in the unfolded vector space has an extremely huge dimension of 1703936 × 1703936. Supposing the float type is used to store this covariance matrix, we would need to allocate 10816 GB of memory, which is impossible for current desktop PCs.
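The memory figure can be checked with a quick back-of-the-envelope computation (assuming 4-byte floats, as in the text):

```python
# Storage for the direct (I1*I2*I3) x (I1*I2*I3) covariance matrix:
# a 256 x 256 x 26 volume unfolds to a 1,703,936-dimensional vector.
dim = 256 * 256 * 26
assert dim == 1703936
mem_gb = dim * dim * 4 / 1024**3   # 4-byte float entries, in GiB
print(round(mem_gb))               # → 10816
```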
The second way to implement PCA suffers from poor generalization. This can be demonstrated by a leave-one-out experiment, in which seventeen volumes are used to train the appearance model and the left-out one is reconstructed by the trained model for testing. Fig. 4 gives the result of the leave-one-out experiment when PCA is implemented in the second way. Note that all 16 available bases were used in the reconstruction. The quality of the reconstructed slices is not satisfactory: they are blurred, and the tumor in the 17th slice cannot be seen in the reconstructed result. This is because the available bases are too few to represent an untrained vector in the high-dimensional unfolded vector space.
Fig. 4. Result of leave-one-out testing for the PCA-based method implemented in the second way: original and reconstructed 7th, 13th and 17th slices (RMSE: 0.197)
G3D-PCA does not suffer from these two problems. The covariance matrices for the three mode subspaces have dimensions of 256 × 256, 256 × 256 and 26 × 26 respectively, so their eigenvectors can be calculated efficiently. Moreover, G3D-PCA can obtain enough bases to represent an untrained volume in the third-order tensor space. This can also be demonstrated by the leave-one-out experiment. Fig. 5 gives the result of the leave-one-out experiment for G3D-PCA; the training samples and testing volume are the same as in the PCA experiment. The untrained volume is represented by tensor bases of three different dimensions: 50·50·15 = 37500, 75·75·20 = 112500 and 100·100·20 = 200000. Since the maximal number of available bases is 256·256·26 = 1703936, the compression rates for the three cases are 50·50·15 / (256·256·26) ≈ 2.2%, 75·75·20 / (256·256·26) ≈ 6.6% and 100·100·20 / (256·256·26) ≈ 11.7% respectively. It can be seen that the quality of the reconstructed results becomes better and
Fig. 5. Result of leave-one-out testing for the proposed G3D-PCA method: original and reconstructed 7th, 13th and 17th slices for bases of dimension 50 × 50 × 15 (RMSE: 7.69E-2), 75 × 75 × 20 (RMSE: 5.36E-2) and 100 × 100 × 20 (RMSE: 4.23E-2)
better with the increase of the dimensions of the bases. The root mean square error (RMSE) between the reconstructed and original volumes is also calculated. Compared to the PCA-based result, the reconstructions based on G3D-PCA are much better: they are clearer, and the tumor region is also reconstructed well. This experiment illustrates that appearance models for medical volumes based on G3D-PCA generalize well even when the models are trained from few samples.
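The projection-and-reconstruction step behind these leave-one-out numbers can be sketched as follows; the unfold/fold helpers and the identity-basis sanity check are our own illustration, not code from the paper:

```python
import numpy as np

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(A, U, n):
    shape = list(A.shape); shape[n] = U.shape[0]
    return fold(U @ unfold(A, n), n, tuple(shape))

def reconstruct(A, U, V, W):
    """B = A x1 U^T x2 V^T x3 W^T, then A_hat = B x1 U x2 V x3 W."""
    B = mode_n_product(mode_n_product(mode_n_product(A, U.T, 0), V.T, 1), W.T, 2)
    return mode_n_product(mode_n_product(mode_n_product(B, U, 0), V, 1), W, 2)

def rmse(A, A_hat):
    """Root mean square error between original and reconstruction."""
    return float(np.sqrt(np.mean((A - A_hat) ** 2)))

# Sanity check: with full orthonormal bases (Jn = In) the reconstruction
# is exact, so the RMSE vanishes.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5, 4))
err = rmse(A, reconstruct(A, np.eye(6), np.eye(5), np.eye(4)))
```

With truncated bases (Jn < In) the same `reconstruct`/`rmse` pair yields the kind of error values reported in Figs. 4 and 5.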
5 Conclusion
We have proposed a method named G3D-PCA, based on multilinear algebra, to build appearance models for 3D medical volumes with large dimensions in this
paper. In this method, 3D volumes are treated directly as third-order tensors, without being unfolded into 1D vectors. The output of G3D-PCA is three matrices whose columns are the orthogonal bases of the three mode subspaces. The bases of the third-order tensor space can be constructed from these matrices, and the maximal number of available tensor bases is not limited by the number of training samples. Owing to these properties, our method does not suffer from a huge computing cost; moreover, it generalizes well even when few samples are used in training.
References

1. Cootes, T.F., Cooper, D., Taylor, C.J., Graham, J.: Active shape models – their training and application. Comput. Vision Image Understanding 61(1), 38–59 (1995)
2. van Assen, H.C., Danilouchkine, M.G., Behloul, F., van der Geest, R.J., Reiber, J.H.C., Lelieveldt, B.P.F.: Cardiac LV segmentation using a 3D active shape model driven by fuzzy inference. In: Proceedings of Medical Image Computing and Computer-Assisted Intervention, LNCS, pp. 533–540 (2003)
3. Kaus, M.R., von Berg, J., Weese, J., Niessen, W.J., Pekar, V.: Automated segmentation of the left ventricle in cardiac MRI. Medical Image Analysis 8(3), 245–254 (2004)
4. Mitchell, S.C., Bosch, J.G., Lelieveldt, B.P.F., van der Geest, R.J., Reiber, J.H.C., Sonka, M.: 3D active appearance models: segmentation of cardiac MR and ultrasound images. IEEE Transactions on Medical Imaging 21(9), 1167–1178 (2002)
5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
6. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
7. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A multilinear singular value decomposition. SIAM Journal of Matrix Analysis and Applications 21(4), 1253–1278 (2000)
8. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal of Matrix Analysis and Applications 21(4), 1324–1342 (2000)
9. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Proceedings of the European Conference on Computer Vision, pp. 447–460 (2002)
10. Xu, D., Yan, S.C., Zhang, L., Zhang, H.J., Liu, Z.L., Shum, H.Y.: Concurrent subspaces analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 203–208 (2005)
11. West, J., Fitzpatrick, J., et al.: Comparison and evaluation of retrospective intermodality brain image registration techniques. J. Comput. Assist. Tomogr. 21, 554–566 (1997)
Head Pose Estimation Based on Tensor Factorization

Wenlu Yang1,2, Liqing Zhang1,⋆, and Wenjun Zhu1

1 Department of Computer Science and Engineering, Shanghai Jiaotong University, 800 Dongchuan Rd., Shanghai 200240, China
2 Department of Electronic Engineering, Shanghai Maritime University, 1550 Pudong Rd., Shanghai 200135, China
[email protected], [email protected], [email protected]
Abstract. This paper investigates the head pose estimation problem, which is considered front-end preprocessing for improving multi-view human face recognition. We propose a computational model for perceiving head poses based on Non-negative Multi-way Factorization (NMWF). The model consists of three components: tensor representation of multi-view faces, feature selection, and head pose perception. To find the facial representation basis, the NMWF algorithm is applied to the training facial images. The face tensor comprises three factors: facial images, poses, and people; the former two are used to construct the computational model for pose estimation. The discriminative measure for perceiving the head pose is defined by the similarity between the tensor basis and the representation of the testing facial image, i.e., its projection onto the subspace spanned by the basis "TensorFaces". Computer simulation results show that the proposed model achieves satisfactory accuracy in estimating head poses of facial images in the CAS-PEAL face database.

Keywords: Head pose estimation, Face recognition, Tensor decomposition, Non-negative multi-way factorization, TensorFaces.
1 Introduction
Humans possess a remarkable ability to recognize faces regardless of facial geometry, expression, head pose, lighting conditions, distance, and age. Modelling the functions of face recognition and identification remains a difficult open problem in the fields of computer vision and neuroscience. In particular, face recognition robust to head pose variation is still difficult in complex natural environments; head pose estimation is therefore a very useful front-end processing step for multi-view human face recognition. Head pose variation covers three free rotation parameters: yaw (rotation around the neck), tilt (rotation up and down), and roll (rotation from left profile to right profile). In this paper, we focus on yawing rotation, which has many important applications. Previous methods for head pose estimation from 2D images can be roughly divided into two categories: template-based approaches and appearance-based approaches.

⋆ Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 831–840, 2008. c Springer-Verlag Berlin Heidelberg 2008

The template-based approaches are based on nearest-neighbor classification with texture templates [1][2], or are deduced from geometric information such as configurations of facial landmarks [3][5][6]. Appearance-based approaches generally regard head pose estimation as a parameter estimation problem, handled by regression, multi-class classification, or nonlinear dimension reduction. Typical appearance-based approaches include Support Vector Machines (SVM) [7][8], tree-structured AdaBoost pose classifiers [9], soft-margin AdaBoost [10], Neural Networks (NN) [11][12], Gabor Wavelet Networks [13], MDS [14], Principal Component Analysis (PCA) [15][16], Kernel PCA [17], Independent Component Analysis (ICA) [18][24], and linear or nonlinear embedding and mapping [19][20] (ISOMAP [21], Local Linear Embedding (LLE) [22][23]).

In this paper, we propose a computational model for perceiving head poses from facial images using tensor decomposition. The rest of this paper is organized as follows. In section 2, we propose a computational model for pose estimation and introduce the tensor decomposition algorithm, the Non-negative Multi-Way Factorization (NMWF). In section 4, the tensor bases for different head poses are obtained from training data, and computer simulations show the performance of our proposed pose estimation method. Finally, discussions and conclusions are given in section 5.
2 Tensor Representation and Learning Algorithm
In this section, we first review related linear models for tensor decomposition and propose a computational model for perceiving multiple views, i.e., head poses, of facial images. We then introduce the Non-negative Multi-way Factorization (NMWF) algorithm for learning the representation basis functions of facial images, define the representation bases for different views, and propose a method for estimating head poses from facial images.

2.1 Related Models for Tensor Decomposition
PCA has been a popular technique in facial image recognition, and its generalization, ICA, has also been employed in face recognition [24]. The PCA-based face recognition method Eigenfaces works well when face images are well aligned, but when facial images contain other factors, such as lighting, pose, and expression, its recognition performance degenerates dramatically. Tensor representation can formulate more than two factors in one framework, so tensor models have richer representational capability than the commonly used matrix representation. Grimes and Rao [25] proposed a bilinear sparse model to learn styles and contents from natural images. When facial images contain more than two factors, such linear and bilinear models do not work well, so some researchers have turned to multilinear algebra to represent multiple factors of image ensembles. Vasilescu and Terzopoulos [26] applied the higher-order tensor approach to factorize facial images, resulting in the "TensorFaces" representation of facial images, and later they proposed a multilinear
ICA model to learn the statistically independent components of multiple factors. Mørup et al. proposed the Non-negative Multi-way Factorization (NMWF) algorithm, a generalization of Lee and Seung's non-negative matrix factorization (NMF) algorithm [28], to decompose time-frequency representations of EEG [27]. Exploiting the multiple-factor representation of the tensor model, we use the tensor factorization approach to find the tensor basis for facial images and further propose a computational model for perceiving head poses from facial images. The model consists of two parts, a training module and a perception module, as shown in Fig. 1. The training module finds the tensor representation using the non-negative multi-way algorithm; using the coefficients of the pose factor, we define the representation basis for different head poses.
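The project-then-correlate pipeline of the model can be sketched as follows; the function name, array shapes, and toy data are illustrative assumptions, not from the paper:

```python
import numpy as np

def estimate_pose(x, U1, pose_coeffs):
    """Project image x onto the basis U1 (least squares), then pick the
    pose whose coefficient vector correlates best with the projection.
    U1: (D, N) basis matrix; pose_coeffs: (I2, N), row k = pose-k coefficients."""
    s, *_ = np.linalg.lstsq(U1, x, rcond=None)   # s = (U1^T U1)^-1 U1^T x
    sims = [np.corrcoef(s, p)[0, 1] for p in pose_coeffs]
    return int(np.argmax(sims))

# Toy check: an image generated exactly from the basis with the pose-2
# coefficients should be assigned pose 2.
rng = np.random.default_rng(0)
U1 = rng.random((50, 4))
pose_coeffs = rng.random((3, 4))
x = U1 @ pose_coeffs[2]
```

Here the projection recovers the generating coefficients exactly, so the correlation with the true pose row is 1.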
Fig. 1. Model for learning the “TensorFaces” and estimating poses
The second module, the perception module, estimates the head pose of an input facial image. It is composed of the testing set, the basis functions, the pose coefficients, similarity analysis, and the output of the resulting pose. Given a facial image randomly selected from the testing set, the corresponding pose coefficients are obtained by projecting the image onto the subspace spanned by the basis functions. The similarity between these coefficients and the pose factors is then analyzed, and the maximal correlation coefficient determines the corresponding pose. The next section gives a brief introduction to the NMWF and the algorithm for pose estimation.

2.2 The NMWF Algorithm
In this section, we briefly introduce the tensor representation model and the Non-negative Multi-way Factorization (NMWF) algorithm. Generally speaking, a tensor is a generalized data representation of the 2-dimensional matrix. Given an n-way array T, the objective of tensor factorization is to decompose T into the following N components:
T ≈ Th = Σ_{j=1}^{N} Uj^(1) ⊗ Uj^(2) ⊗ · · · ⊗ Uj^(N)    (1)
This factorization is important for many applications, since each rank-1 component represents a certain feature of the tensor data. Several questions need to be discussed for this tensor factorization. The first is how many components are adequate for the decomposition; this problem is related to matrix rank estimation in the singular value decomposition of matrices. It is usually difficult to estimate the rank of a tensor, because tensor rank estimation is still an NP-hard, open problem; refer to [4] for a tutorial on tensor rank estimation. The second issue is what criterion should be used for finding the tensor components. Commonly used criteria include the least-squares error, the relative entropy, and the more general alpha-divergence. In this paper, we use the relative entropy as the criterion for finding the tensor components. Given a nonnegative tensor T, the relative entropy of the tensor T and its approximation Th is defined as

Div(T || Th) = Σ_{j1,j2,...,jN} ( T_{j1 j2...jN} log( T_{j1 j2...jN} / Th_{j1 j2...jN} ) − T_{j1 j2...jN} + Th_{j1 j2...jN} )    (2)

Denote the inner product of two tensors as ⟨T, R⟩ = Σ_{j1,j2,...,jN} T_{j1 j2...jN} R_{j1 j2...jN}. Then the relative entropy can be rewritten as

Div(T || Th) = ⟨T, log(T)⟩ − ⟨T, log(Th)⟩ − ⟨1, T⟩ + ⟨1, Th⟩.    (3)

To exploit specific properties of the data, such as sparseness and smoothness, constraints may be imposed as additional conditions of the optimization problem. Using the gradient descent approach, we derive the multiplicative learning rule, which updates each element of Uj^(p) as

(Uj^(p))_{kp} ⇐ (Uj^(p))_{kp} · [ Σ_{ki, i≠p} ( T_{k1 k2...kN} / Th_{k1 k2...kN} ) Π_{i≠p} (Uj^(i))_{ki} ] / [ Σ_{ki, i≠p} Π_{i≠p} (Uj^(i))_{ki} ]    (4)

The algorithm is implemented alternately, updating one set of parameters while keeping the others unchanged.
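For intuition, the two-way (matrix) special case of update rule (4) is Lee and Seung's multiplicative update for KL-divergence NMF. A minimal NumPy sketch, with our own initialization and iteration count:

```python
import numpy as np

def kl_nmf(T, r, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing the relative entropy
    Div(T || W H); rule (4) generalizes the same idea to an N-way tensor."""
    rng = np.random.default_rng(0)
    m, n = T.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        Th = W @ H + eps
        W *= ((T / Th) @ H.T) / (H.sum(axis=1) + eps)   # numerator/denominator of (4)
        Th = W @ H + eps
        H *= (W.T @ (T / Th)) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# A rank-1 nonnegative matrix should be fitted almost exactly with r = 1.
T = np.array([[1.0, 2.0], [2.0, 4.0]])
W, H = kl_nmf(T, r=1)
```

As in the tensor case, the updates are multiplicative, so nonnegative factors stay nonnegative throughout.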
3 Face Tensor Representation and Head Pose Estimation
Facial images can be formulated in the framework of tensor representation, with different modes representing different factors. In this paper, we take
Head Pose Estimation Based on Tensor Factorization
three factors into consideration: face images, head poses, and people. The facial tensor discussed in this paper therefore consists of three modes. For a data set of facial images with I_3 people, I_2 poses per person, and I_1-dimensional facial images, the tensor T^{I_1 \times I_2 \times I_3} denotes all facial images. In the decomposition of T^{I_1 \times I_2 \times I_3}, the U_j^{(1)}, j = 1, 2, \dots, N, are the basis functions for facial images, while U_j^{(2)} and U_j^{(3)} denote the pose and people coefficients, respectively, with respect to the basis function U_j^{(1)}. Using all basis functions U_j^{(1)}, j = 1, 2, \dots, N, the k-th pose coefficients U_j^{(2)}(k) (k = 1, 2, \dots, I_2), and the l-th people coefficients U_j^{(3)}(l), the facial image T_{k,l} is reconstructed as

T^h_{k,l} = \sum_{j=1}^{N} U_j^{(1)} \, U_j^{(2)}(k) \, U_j^{(3)}(l).   (5)

Therefore, the parameters U_j^{(2)}(k), U_j^{(3)}(l), j = 1, 2, \dots, N, are taken as the feature representation on the tensor basis U_j^{(1)}, j = 1, 2, \dots, N, of T^{I_1 \times I_2 \times I_3}. To find the tensor representation of an input image, we use the least-squares error method. Given a new facial image X, we attempt to find the coefficients S = (s_1, s_2, \dots, s_N)^T of the tensor representation

X = \sum_{j=1}^{N} s_j U_j^{(1)}.   (6)

The least-squares solution of the above equation is S = A^{-1} B, where A = (A_{i,j}) = ( \langle U_i^{(1)}, U_j^{(1)} \rangle ) and B = (B_j) = ( \langle U_j^{(1)}, X \rangle ).

3.1 Head Pose Estimation
After obtaining the decomposition of the tensor D^{I_1 \times I_2 \times I_3}, we can use the second mode of the model to estimate the poses of faces randomly selected from the testing set. Given a facial image I = U_1 S, where U_1 denotes the basis functions and S the representation of the image on the subspace spanned by the bases, S is obtained by computing S = (U_1^T U_1)^{-1} U_1^T I. The similarity between S and U_{2,k} (k = 1, 2, \dots, I_2) is then calculated by the correlation measure

Corr(S, U_{2,k}) = \frac{\langle S, U_{2,k} \rangle}{\sqrt{\langle S, S \rangle \langle U_{2,k}, U_{2,k} \rangle}}.   (7)

For facial images with different head poses, the feature representations of U_{2,k} in S differ. Therefore, head poses can be estimated by maximizing the similarity between S and U_{2,k}.
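The projection and matching steps above can be sketched as follows (matrix shapes, names, and the cosine normalization are our assumptions):

```python
import numpy as np

def estimate_pose(x, U1, U2):
    """Pose estimation rule of Section 3.1 / Eq. (7).
    U1: (I1 x N) matrix whose columns are basis functions ("TensorFaces");
    U2: (I2 x N) matrix whose rows are pose coefficient vectors;
    x:  flattened facial image of dimension I1."""
    # least-squares representation S = (U1^T U1)^{-1} U1^T x
    S, *_ = np.linalg.lstsq(U1, x, rcond=None)
    # correlation of S with each pose coefficient vector U2[k]
    corr = U2 @ S / (np.linalg.norm(S) * np.linalg.norm(U2, axis=1) + 1e-12)
    return int(np.argmax(corr)), S
```

An image synthesized from a particular pose's coefficients is mapped back to that pose's index by the argmax over the correlations.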
W. Yang, L. Zhang, and W. Zhu

4 Simulations and Results
In this section, we present simulation results to verify the performance of the computational model.

4.1 Face Database
The facial images are selected from the pose subset of the public CAS-PEAL-R1 face database [29]: images of 1040 subjects across 7 poses {±45°, ±30°, ±15°, 0°}, shown in Fig. 2. In our experiments, we first crop the original facial images, using the eye positions, to 300 × 250 patches that include the whole face, and then resize them to 36 × 30 because of limited computational memory. The tensor D in Equation (1) is a 3-mode tensor of pixels, poses, and people. Owing to limited computing resources, images of one hundred subjects across the seven poses are randomly selected, and the facial tensor D^{1080×7×100} is generated for finding the tensor basis functions. All other images in the pose subset are used as the testing set.
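The preprocessing above can be summarized by a small sketch that stacks the cropped-and-resized patches into the 3-mode tensor; the `images[person][pose]` layout is our assumption (the paper gives no code).

```python
import numpy as np

def build_face_tensor(images, h=36, w=30):
    """Stack preprocessed h x w facial patches into the 3-mode tensor
    D of shape (pixels, poses, people) used for finding the basis."""
    n_people = len(images)
    n_poses = len(images[0])
    D = np.empty((h * w, n_poses, n_people))
    for p in range(n_people):
        for q in range(n_poses):
            D[:, q, p] = np.asarray(images[p][q], dtype=float).ravel()
    return D
```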
Fig. 2. Examples of facial images in the CAS-PEAL-R1 face database [29]
4.2 TensorFaces Representation
The tensor D is decomposed by NMWF, and the basis functions, or "TensorFaces," are obtained, as shown in Fig. 3. Careful examination shows that the TensorFaces have favorable characteristics for representing the seven poses. When different facial images are projected onto the tensor basis, the resulting coefficients S of the seven poses are clearly discriminative. Subsets of TensorFaces projected onto subspaces at different yaw rotation angles are shown in Fig. 4.

4.3 Head Pose Estimation
According to the algorithm described in Section 3.1, we estimate the head poses of facial images randomly selected from the testing set; 500 facial images were used in the test stage. The accuracy of pose estimation is shown in Fig. 6. Compared with other methods, ours offers simplicity, effectiveness, and high accuracy. First, testing on the same face database (CAS-PEAL), Ma [30] used a combination of an SVM classifier and
Fig. 3. Basis functions decomposed by the NMWF from facial images across seven poses
Fig. 4. Subsets of “TensorFaces” in different poses. (Left) Pose angle of yaw rotation: -30◦ . (Right) Pose angle of yaw rotation: 0◦ .
Fig. 5. The left figure shows the sample features of four poses projected into 2-dimensional space. The right figure shows the sample features of seven poses projected into 3-dimensional space.
Fig. 6. Pose estimation results. (Left) The X-axis denotes test indexes, and the Y-axis denotes pose estimation results. (Right) The same results rearranged by pose; the X-axis denotes poses in the range {-45°, -30°, -15°, 0°, +15°, +30°, +45°}, and the Y-axis denotes pose estimation results by pose.
LGBPH features for pose estimation; their best reported accuracy is 97.14%. On the same data, the worst accuracy we obtain is 98.6% and the average is 99.2%. Second, in preprocessing, Ma et al. applied geometric normalization, histogram equalization, and a facial mask, and cropped the images to 64×64 pixels, whereas we simply crop each image to include the head and resize it to 36×30 owing to limited computing resources. Our preprocessing is therefore simpler and the results are better. Incidentally, if the facial images are not cropped, the resulting pose estimation accuracy is worse, about 90.4% on average. The training set, randomly selected from the face database, is composed of 100 subjects with seven poses each. Our supplemental simulations show that as the number of subjects in the training set increases, the accuracy of pose estimation further improves, approaching 100%.
5 Discussions and Conclusions
In this paper, we proposed a computational model for head pose estimation based on tensor decomposition. Using the NMWF algorithm, facial images are decomposed into three factors: the TensorFaces, poses, and people. These facial images are then projected onto the subspace spanned by the TensorFaces, and the resulting representations can be used for pose estimation and, further, for face recognition. Taking the correlation between the representation of a facial image and the pose components as the similarity measure, the head poses of the facial images in the public CAS-PEAL database are estimated. The simulation results show that the model is very efficient.

Acknowledgments. The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 2006AA01Z125). Portions of the research in this paper use the CAS-PEAL face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd.
References

1. Bichsel, M., Pentland, A.: Automatic interpretation of human head movements. In: 13th International Joint Conference on Artificial Intelligence (IJCAI), Workshop on Looking At People, Chambery, France (1993)
2. McKenna, S., Gong, S.: Real time face pose estimation. Real-Time Imaging 4(5), 333–347 (1998)
3. Heinzmann, J., Zelinsky, A.: 3D facial pose and gaze point estimation using a robust real-time tracking paradigm. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 142–147 (1998)
4. Bro, R.: PARAFAC. Tutorial and applications. Chemom. Intell. Lab. Syst., Special Issue 2nd Internet Conf. in Chemometrics (INCINC 1996), vol. 38, pp. 149–171 (1997)
5. Xu, M., Akatsuka, T.: Detecting Head Pose from Stereo Image Sequence for Active Face Recognition. In: Proceedings of the 3rd International Conference on Face & Gesture Recognition, pp. 82–87 (1998)
6. Hu, Y.X., Chen, L.B., Zhou, Y., Zhang, H.J.: Estimating face pose by facial asymmetry and geometry. In: Proceedings of Automatic Face and Gesture Recognition, pp. 651–656 (2004)
7. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Proceedings of the Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 154–156 (1998)
8. Moon, H., Miller, M.: Estimating facial pose from a sparse representation. In: International Conference on Image Processing, vol. 1, pp. 75–78 (2004)
9. Yang, Z., Ai, H., Okamoto, T.: Multi-view face pose classification by tree-structured classifier. In: International Conference on Image Processing, vol. 2, pp. 358–361 (2005)
10. Guo, Y., Poulton, G., Li, J., Hedley, M.: Soft margin AdaBoost for face pose classification. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 221–224 (2003)
11. Baluja, S., Sahami, M., Rowley, H.A.: Efficient face orientation discrimination. In: International Conference on Image Processing, vol. 1, pp. 589–592 (2004)
12. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view head pose estimation using neural networks. In: The 2nd Canadian Conference on Computer and Robot Vision, pp. 347–352 (2005)
13. Kruger, V., Sommer, G.: Gabor Wavelet Networks for Object Representation. Journal of the Optical Society of America 19(6), 1112–1119 (2002)
14. Chen, L.B., Zhang, L., Hu, Y.X., Li, M.J., Zhang, H.J.: Head Pose Estimation Using Fisher Manifold Learning. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 203–207 (2003)
15. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
16. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
17. Li, S.Z., Fu, Q.D., Gu, L., Scholkopf, B., Cheng, Y.M., Zhang, H.J.: Kernel Machine Based Learning for Multi-view Face Detection and Pose Estimation. International Conference on Computer Vision 2, 674–679 (2001)
18. Li, S.Z., Lv, X.G., Hou, X.W., Peng, X.H., Cheng, Q.S.: Learning Multi-View Face Subspaces and Facial Pose Estimation Using Independent Component Analysis. IEEE Transactions on Image Processing 14(6), 705–712 (2005)
19. Hu, N., Huang, W.M., Ranganath, S.: Head pose estimation by non-linear embedding and mapping. In: International Conference on Image Processing, vol. 2, pp. 342–345 (2005)
20. Raytchev, B., Yoda, I., Sakaue, K.: Head Pose Estimation By Nonlinear Manifold Learning. In: International Conference on Pattern Recognition, vol. 4, pp. 462–466 (2004)
21. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
22. Saul, L.K., Roweis, S.T.: Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
23. Roweis, S., Saul, L.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000)
24. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13(6), 1450–1464 (2002)
25. Grimes, D.B., Rao, R.P.N.: Bilinear Sparse Coding for Invariant Vision. Neural Computation 17(1), 47–73 (2005)
26. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002)
27. Mørup, M., Hansen, L.K., Parnas, J., Arnfred, S.M.: Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization. Technical report (2006)
28. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
29. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL Large-Scale Chinese Face Database and Evaluation Protocols. Technical Report No. JDL TR 04 FR 001, Joint Research & Development Laboratory, CAS (2004)
30. Ma, B.P., Zhang, W.C., Shan, S.G., Chen, X.L., Gao, W.: Robust Head Pose Estimation Using LGBP. In: Proceedings of the International Conference on Pattern Recognition (2), pp. 512–515 (2006)
Kernel Maximum a Posteriori Classification with Error Bound Analysis Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong {zlxu,kzhuang,jkzhu,king,lyu}@cse.cuhk.edu.hk
Abstract. Kernel methods have been widely used in data classification. Many kernel-based classifiers, like the Kernel Support Vector Machine (KSVM), assume that data can be separated by a hyperplane in the feature space, and do not consider the data distribution. This paper proposes a novel Kernel Maximum A Posteriori (KMAP) classification method, which adopts a Gaussian density assumption in the feature space and can be regarded as a more generalized classification method than other kernel-based classifiers such as Kernel Fisher Discriminant Analysis (KFDA). We also adopt robust methods for parameter estimation. In addition, the error bound analysis for KMAP indicates the effectiveness of the Gaussian density assumption in the feature space. Furthermore, KMAP achieves very promising results on eight UCI benchmark data sets compared with competitive methods.
1 Introduction
Recently, kernel methods have been regarded as state-of-the-art classification approaches [1]. The basic idea of kernel methods in supervised learning is to map data from an input space to a high-dimensional feature space in order to make the data more separable. Classical kernel-based classifiers include the Kernel Support Vector Machine (KSVM) [2], Kernel Fisher Discriminant Analysis (KFDA) [3], and the Kernel Minimax Probability Machine [4,5]. The rationale behind them is that linear discriminant functions in the feature space can represent complex separating surfaces when mapped back to the original input space. However, one drawback of KSVM is that it does not consider the data distribution and cannot directly output probabilities or confidences for classification. It is therefore hard to apply in systems that reason under uncertainty. On the other hand, in statistical pattern recognition, probability densities can be estimated from data, and future examples are then assigned to the class with the Maximum A Posteriori probability (MAP) [6]. One typical probability density function is the Gaussian, which is easy to handle. However, the Gaussian assumption is not easily satisfied in the input space, where it is hard to deal with non-linearly separable problems.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 841–850, 2008. © Springer-Verlag Berlin Heidelberg 2008
To solve these problems, we propose a Kernel Maximum a Posteriori (KMAP) classification method under a Gaussianity assumption in the feature space. Different from KSVM, we make the Gaussian density assumption, which implies that data can be separated by more complex surfaces in the feature space. In general, distributions other than the Gaussian could also be assumed in the feature space; however, under a distribution with a complex form, it is hard to obtain a closed-form solution and easy to over-fit. Moreover, with the Gaussian assumption, a kernelized version of our model can be derived without knowing the explicit form of the mapping functions. In addition, to indicate the effectiveness of our assumption, we calculate a separability measure and the error bound for bi-category data sets. The error bound analysis shows that the Gaussian density assumption can be more easily satisfied in the feature space. This paper is organized as follows. Section 2 derives the MAP decision rules in the feature space and analyzes separability measures and upper error bounds. Section 3 presents experiments against other classifiers. Section 4 reviews related work. Section 5 draws conclusions and lists possible future research directions.
2 Main Results

In this section, our MAP classification model is derived. Then we adopt a special regularization to estimate the parameters, and the kernel trick is used to compute the model. Finally, the separability measure and the error bound are calculated in the kernel-induced feature space.
2.1 Model Formulation
Under the Gaussian distribution assumption, the conditional density function for each class C_i (1 \le i \le m) is written as

p(\Phi(x) | C_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\Big( -\frac{1}{2} (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) \Big),   (1)

where \Phi(x) is the image of x in the feature space, N is the dimension of the feature space (N could be infinite), \mu_i and \Sigma_i are the mean and the covariance matrix of C_i, respectively, and |\Sigma_i| is the determinant of the covariance matrix. According to Bayes' theorem, the posterior probability of class C_i is

P(C_i | x) = \frac{p(x | C_i) P(C_i)}{\sum_{j=1}^{m} p(x | C_j) P(C_j)}.   (2)

Based on Eq. (2), the decision rule is

x \in C_w \quad \text{if} \quad P(C_w | x) = \max_{1 \le j \le m} P(C_j | x).   (3)
This means that a test data point is assigned to the class with the maximum P(C_w | x), i.e., the maximum a posteriori probability. Since the MAP is calculated in the kernel-induced feature space, the resulting model is called KMAP classification. KMAP provides not only a class label but also the probability of a data point belonging to that class. This probability can be viewed as a confidence in classifying new data points and can be used in statistical systems that reason under uncertainty: if the confidence is lower than some specified threshold, the system can refuse to make an inference. Many kernel learning methods, including KSVM, cannot output such probabilities. Taking the negative logarithm of the class-conditional density (and assuming equal class priors), the decision rule amounts to minimizing the discriminant

g_i(\Phi(x)) = (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) + \log |\Sigma_i|.   (4)
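For intuition, here is an input-space sketch of the rule defined by Eq. (4); the kernelized, feature-space version is developed in Section 2.3, and the function names are ours.

```python
import numpy as np

def map_discriminant(x, mu, Sigma):
    """g_i of Eq. (4): squared Mahalanobis distance to the class mean
    plus the log-determinant of the class covariance."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)    # Sigma assumed positive definite
    return float(d @ np.linalg.solve(Sigma, d) + logdet)

def map_classify(x, mus, Sigmas):
    """Assign x to the class with the smallest discriminant (equal priors)."""
    return int(np.argmin([map_discriminant(x, m, S) for m, S in zip(mus, Sigmas)]))
```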
The intuitive meaning of the function is that a class is more likely to be assigned to an unlabeled data point when the Mahalanobis distance from the data point to the class center is smaller.
2.2 Parameter Estimation
In order to compute the Mahalanobis distance function, the mean vector and the covariance matrix of each class must be estimated. Typically, the mean vector \mu_i and the within-class covariance matrix \Sigma_i are obtained by maximum likelihood estimation. In the feature space, they are

\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \Phi(x_j),   (5)

\Sigma_i = S_i = \frac{1}{n_i} \sum_{j=1}^{n_i} (\Phi(x_j) - \mu_i)(\Phi(x_j) - \mu_i)^T,   (6)
where n_i is the number of data points belonging to C_i. Directly employing S_i as the covariance matrix generates quadratic discriminant functions in the feature space; in this case, KMAP is denoted KMAP-M. However, the covariance estimation problem is clearly ill-posed, because the number of data points in each class is usually much smaller than the number of dimensions of the kernel-induced feature space. The treatment of this ill-posed problem is to introduce regularization, and several regularization methods exist. One is to replace the individual within-class covariance matrices by their average, i.e., \Sigma_i = S = \frac{1}{m} \sum_{i=1}^{m} S_i + rI, where I is the identity matrix and r is a regularization coefficient. This substantially reduces the number of free parameters to be estimated; it also reduces the discriminant function between two classes to a linear one, so a linear discriminant analysis method is obtained. Alternatively, we can estimate the covariance matrix by combining the above linear discriminant function with the quadratic one. Instead of estimating the
covariance matrix in the input space [7], we apply this method in the feature space. The formulation in the feature space is

\Sigma_i = (1 - \eta) \tilde{\Sigma}_i + \eta \, \frac{\mathrm{trace}(\tilde{\Sigma}_i)}{n} I,   (7)

where \tilde{\Sigma}_i = (1 - \theta) S_i + \theta S. Here \theta (0 \le \theta \le 1) is a coefficient interpolating between the linear discriminant term and the quadratic one, while \eta (0 \le \eta \le 1) determines the shrinkage toward a multiple of the identity matrix. This approach is more flexible in adjusting the effect of the regularization. The corresponding KMAP is denoted KMAP-R.
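The two-level regularization of Eq. (7) can be sketched as follows (names are ours; `S_pooled` stands for the average covariance S):

```python
import numpy as np

def regularized_cov(S_i, S_pooled, theta, eta):
    """Eq. (7): mix the class covariance S_i with the pooled covariance
    (coefficient theta), then shrink toward a multiple of the identity (eta)."""
    n = S_i.shape[0]
    S_tilde = (1 - theta) * S_i + theta * S_pooled
    return (1 - eta) * S_tilde + eta * (np.trace(S_tilde) / n) * np.eye(n)
```

The settings (theta, eta) = (0, 0) recover the quadratic case and (1, 1) the nearest-mean case, matching the special cases discussed in Section 2.4.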
2.3 Kernel Calculation
We derive methods to calculate the Mahalanobis distance of Eq. (4) with the kernel trick, i.e., we only need to express the function in inner-product form, regardless of the explicit mapping function. To do this, we use the spectral representation of the covariance matrix, \Sigma_i = \sum_{j=1}^{N} \Lambda_{ij} \Omega_{ij} \Omega_{ij}^T, where \Lambda_{ij} \in \mathbb{R} is the j-th eigenvalue of \Sigma_i and \Omega_{ij} \in \mathbb{R}^N the corresponding eigenvector. However, small eigenvalues degrade the performance of the function overwhelmingly, because they are underestimated due to the small number of examples. In this paper, we therefore estimate only the k largest eigenvalues and replace each remaining eigenvalue with a nonnegative constant h_i. Thus Eq. (4) can be reformulated as

g_i(\Phi(x)) = \frac{1}{h_i} \big[ g_{1i}(\Phi(x)) - g_{2i}(\Phi(x)) \big] + g_{3i}(\Phi(x))
= \frac{1}{h_i} \Big( \sum_{j=1}^{N} [\Omega_{ij}^T (\Phi(x) - \mu_i)]^2 - \sum_{j=1}^{k} \Big( 1 - \frac{h_i}{\Lambda_{ij}} \Big) [\Omega_{ij}^T (\Phi(x) - \mu_i)]^2 \Big) + \log \Big( h_i^{N-k} \prod_{j=1}^{k} \Lambda_{ij} \Big).   (8)
In the following, we show that g_{1i}(\Phi(x)), g_{2i}(\Phi(x)), and g_{3i}(\Phi(x)) can all be written in kernel form. To formulate these expressions, we need the eigenvalues \Lambda_{ij} and eigenvectors \Omega_{ij}. The eigenvectors lie in the space spanned by the training samples, i.e., each eigenvector \Omega_{ij} can be written as a linear combination of all training samples:

\Omega_{ij} = \sum_{l=1}^{n} \gamma_{ij}^{(l)} \Phi(x_l) = U \gamma_{ij},   (9)

where \gamma_{ij} = (\gamma_{ij}^{(1)}, \gamma_{ij}^{(2)}, \dots, \gamma_{ij}^{(n)})^T is an n-dimensional column vector and U = (\Phi(x_1), \dots, \Phi(x_n)).
It can be shown that \gamma_{ij} and \Lambda_{ij} are the eigenvectors and eigenvalues of the covariance matrix \Sigma_{G^{(i)}}, where G^{(i)} \in \mathbb{R}^{n_i \times N} is the i-th block of the kernel matrix G relevant to C_i; we omit the proof for lack of space. Accordingly, g_{1i}(\Phi(x)) can be expressed in kernel form:

g_{1i}(\Phi(x)) = \sum_{j=1}^{n} \gamma_{ij}^T U^T (\Phi(x) - \mu_i)(\Phi(x) - \mu_i)^T U \gamma_{ij} = \sum_{j=1}^{n} \Big[ \gamma_{ij}^T \Big( K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \Big) \Big]^2,   (10)

where K_x = (K(x_1, x), \dots, K(x_n, x))^T. In the same way, g_{2i}(\Phi(x)) can be formulated as

g_{2i}(\Phi(x)) = \sum_{j=1}^{k} \Big( 1 - \frac{h_i}{\Lambda_{ij}} \Big) \Omega_{ij}^T (\Phi(x) - \mu_i)(\Phi(x) - \mu_i)^T \Omega_{ij}.   (11)

Substituting (9) into the above, we have

g_{2i}(\Phi(x)) = \sum_{j=1}^{k} \Big( 1 - \frac{h_i}{\Lambda_{ij}} \Big) \Big[ \gamma_{ij}^T \Big( K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \Big) \Big]^2.   (12)

Now the Mahalanobis distance function g_i(\Phi(x)) in the feature space can finally be written in kernel form, where N in g_{3i}(\Phi(x)) is replaced by the number of data points n. The time complexity of KMAP is dominated by the eigenvalue decomposition, which scales as O(n^3); KMAP thus has the same complexity as KFDA.

2.4 Connection to Other Kernel Methods
In the following, we show the connection between KMAP and other kernel-based methods. In the regularization method based on Eq. (7), other kernel-based classification methods can be derived by varying the settings of \theta and \eta. When (\theta = 0, \eta = 0), the KMAP model represents a quadratic discriminant method in the kernel-induced feature space; when (\theta = 1, \eta = 0), it represents a kernel discriminant method; and when (\theta = 0, \eta = 1) or (\theta = 1, \eta = 1), it represents the nearest mean classifier. Therefore, by varying \theta and \eta, different models can be generated from combinations of quadratic discriminant, linear discriminant, and nearest mean methods. We consider a special case of the regularization method with \theta = 1 and \eta = 0. If both classes are assumed to have the same covariance structure for a
binary-class problem, i.e., \Sigma_i = \frac{\Sigma_1 + \Sigma_2}{2}, it leads to a linear discriminant function. Assuming all classes have the same class prior probabilities, g_i(\Phi(x)) can be derived as g_i(\Phi(x)) = (\Phi(x) - \mu_i)^T \big( \frac{\Sigma_1 + \Sigma_2}{2} \big)^{-1} (\Phi(x) - \mu_i), where i = 1, 2. We reformulate this as g_i(\Phi(x)) = w_i \Phi(x) + b_i, where w_i = -4 (\Sigma_1 + \Sigma_2)^{-1} \mu_i and b_i = 2 \mu_i^T (\Sigma_1 + \Sigma_2)^{-1} \mu_i. The decision hyperplane is f(\Phi(x)) = g_1(\Phi(x)) - g_2(\Phi(x)), i.e.,

f(\Phi(x)) = (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} \Phi(x) - \frac{1}{2} (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 + \mu_2).   (13)

Eq. (13) is exactly the solution of KFDA [3]. Therefore, KFDA can be viewed as a special case of KMAP in which all classes share the same covariance structure.

Remark. KMAP provides a rich class of kernel-based classification algorithms through different regularization methods. This makes KMAP a flexible framework for classification, adaptive to the data distribution.

2.5 Separability Measures and Error Bounds
To measure the separability of different classes of data in the feature space, the Kullback-Leibler divergence (a.k.a. K-L distance) between two Gaussians is adopted:

d_{KL}[p_i(\Phi(x)), p_j(\Phi(x))] = \int p_i(\Phi(x)) \ln \frac{p_i(\Phi(x))}{p_j(\Phi(x))} \, d\Phi(x).   (14)

Since the K-L divergence is not symmetric, a two-way divergence is used to measure the distance between two distributions:

d_{ij} = d_{KL}[p_i(\Phi(x)), p_j(\Phi(x))] + d_{KL}[p_j(\Phi(x)), p_i(\Phi(x))].   (15)

Following [6], it can be shown that

d_{ij} = \frac{1}{2} (\mu_i - \mu_j)^T (\Sigma_i^{-1} + \Sigma_j^{-1}) (\mu_i - \mu_j) + \frac{1}{2} \mathrm{trace}(\Sigma_i^{-1} \Sigma_j + \Sigma_j^{-1} \Sigma_i - 2I),   (16)
which can be computed using the trick of Section 2.3. The Bayesian decision rule guarantees the lowest average error rate:

P(\mathrm{correct}) = \sum_{i=1}^{m} \int_{R_i} p(\Phi(x) | C_i) P(C_i) \, d\Phi(x),   (17)

where R_i is the decision region of class C_i. We implement the Bhattacharyya bound in the feature space for the Gaussian density distribution function. Following [6], we have

P(\mathrm{error}) \le \sqrt{P(C_1) P(C_2)} \, e^{-q(0.5)},   (18)
where

q(0.5) = \frac{1}{8} (\mu_2 - \mu_1)^T \Big( \frac{\Sigma_1 + \Sigma_2}{2} \Big)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\big| \frac{\Sigma_1 + \Sigma_2}{2} \big|}{\sqrt{|\Sigma_1| |\Sigma_2|}}.   (19)
Using the results in Section 2.3, the Bhattacharyya error bound can be easily calculated in the kernel-induced feature space.
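For concreteness, here is an input-space sketch of the two-way divergence (16) and the Bhattacharyya bound (18)-(19); in the paper both are evaluated in the feature space via the kernel trick, and the function names are ours.

```python
import numpy as np

def separability(mu1, S1, mu2, S2):
    """Two-way K-L divergence d_ij of Eq. (16) between two Gaussians."""
    d = mu1 - mu2
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    n = len(d)
    return float(0.5 * d @ (S1i + S2i) @ d
                 + 0.5 * np.trace(S1i @ S2 + S2i @ S1 - 2 * np.eye(n)))

def bhattacharyya_bound(mu1, S1, mu2, S2, p1=0.5, p2=0.5):
    """Upper error bound of Eqs. (18)-(19)."""
    d = mu2 - mu1
    Sm = 0.5 * (S1 + S2)
    q = (0.125 * d @ np.linalg.solve(Sm, d)
         + 0.5 * np.log(np.linalg.det(Sm)
                        / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return float(np.sqrt(p1 * p2) * np.exp(-q))
```

For identical class densities the bound is maximal (0.5 with equal priors), and it shrinks as the class means move apart.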
3 Experiments
In this section, we report experiments evaluating the separability measure, the error bound, and the prediction performance of the proposed KMAP.
3.1 Synthetic Data
We compare the separability measure and error bounds on three synthetic data sets, described in [8]. The data sets are named according to their characteristics and are plotted in Fig. 1. We map the data with an RBF kernel to a feature space where Gaussian distributions are approximately satisfied, and then calculate separability measures on all data sets according to Eq. (16). The separability values for Overlap, Bumpy, and Relevance in the original input space are 14.94, 5.16, and 22.18, respectively; the corresponding values in the feature space are 30.88, 5.87, and 3631. The results indicate that data become more separable after being mapped into the feature space, especially for the Relevance data set. For data in the kernel-induced feature space, the error bounds are calculated according to Eq. (18). Figure 1 also plots the prediction rates and the upper error bounds in the input space and in the feature space, respectively. It can be observed that the error bounds are more accurate in the feature space than in the input space.
3.2 Benchmark Data
Experimental Setup. In this experiment, KSVM, KFDA, the Modified Quadratic Discriminant Function (MQDF) [9], and Kernel Fisher's Quadratic Discriminant Analysis (KFQDA) [10] are employed as competitive algorithms, and we implement two variants of KMAP, KMAP-M and KMAP-R. The properties of the eight UCI benchmark data sets are described in Table 1. In all kernel methods, a Gaussian-RBF kernel is used; the parameter C of KSVM and the parameter \gamma of the RBF kernel are tuned by 10-fold cross-validation. In KMAP, we select the k leading pairs of eigenvalues and eigenvectors according to their contribution to the covariance matrix, i.e., the indices j with \Lambda_{ij} / \sum_{q=1}^{n} \Lambda_{iq} \ge \alpha; in MQDF, the range of k is relatively small and we select k by cross-validation. PCA is used as the regularization method in KFQDA and the cumulative decay ratio is set to 99%; the regularization parameter r is set to 0.001 in KFDA.
Fig. 1. The data plots of Overlap, Bumpy, and Relevance, and the comparison of data separability (prediction error and error bound) in the input space and the feature space

Table 1. Data set information

Data Set   # Samples  # Features  # Classes
Iono         351        34          2
Breast       683         9          2
Twonorm     1000        21          2
Sonar        208        60          2
Pima         768         8          2
Iris         150         4          3
Wine         178        13          3
Segment      210        19          7
In both KMAP and MQDF, h_i takes the value of \Lambda_{i,k+1}. In KMAP-R, the extra parameters (\theta, \eta) are tuned by cross-validation. All experimental results are averaged over 10 runs, each executed with 10-fold cross-validation on each data set.

Experimental Results. Table 2 reports the average prediction accuracy with standard errors on each data set for all algorithms. Both variants of KMAP outperform MQDF, an MAP method in the input space. This empirically confirms that the separability among different classes of data becomes larger, and the upper error bounds get tighter and more accurate, after the data are mapped to the high-dimensional feature space. Moreover, the performance of KMAP is competitive with that of other kernel methods; in particular, KMAP-R attains better prediction accuracy than all other methods on most of the data sets. The reason is that the regularization methods in KMAP favorably capture the prior distributions of
Table 2. The prediction results of KMAP and other methods

Data set     KSVM       MQDF       KFDA       KFQDA      KMAP-M     KMAP-R
Iono (%)     94.1±0.7   89.6±0.5   94.2±0.1   93.6±0.4   95.2±0.2   95.7±0.3
Breast (%)   96.5±0.4   96.5±0.1   96.4±0.1   96.5±0.1   96.5±0.1   97.5±0.1
Twonorm (%)  96.1±0.4   97.4±0.4   96.7±0.5   97.3±0.5   97.6±0.7   97.5±0.4
Sonar (%)    86.6±1.0   83.7±0.7   88.3±0.3   85.1±1.9   87.2±1.6   88.8±1.2
Pima (%)     77.9±0.7   73.1±0.4   71.0±0.5   74.1±0.5   75.4±0.7   75.5±0.4
Iris (%)     96.2±0.4   96.0±0.1   95.7±0.1   96.8±0.2   96.9±0.1   98.0±0.0
Wine (%)     98.8±0.1   99.2±1.3   99.1±0.1   96.9±0.7   99.3±0.1   99.3±0.6
Segment (%)  92.8±0.7   86.9±1.2   91.6±0.3   85.8±0.8   90.2±0.2   92.1±0.8
Average (%)  92.38      90.30      91.62      90.76      92.29      93.05
data, since the Gaussian assumption in the feature space can fit a very complex distribution in the input space.
4 Related Work
In statistical pattern recognition, the probability density function can first be estimated from data; future examples are then assigned to the class with the maximum a posteriori probability. One typical example is the Quadratic Discriminant Function (QDF) [11], which is derived from the multivariate normal distribution and achieves the minimum mean error rate under a Gaussian distribution. In [9], a Modified Quadratic Discriminant Function (MQDF) less sensitive to estimation error is proposed, and [7] improves the performance of QDF by covariance matrix interpolation. Unlike QDF, another type of classifier does not assume the probability density function in advance but is designed directly on data samples. An example is Fisher discriminant analysis (FDA), which maximizes the between-class covariance while minimizing the within-class variance; it can be derived as a Bayesian classifier under a Gaussian assumption on the data. [3] develops Kernel Fisher Discriminant Analysis (KFDA) by extending FDA to a non-linear space via the kernel trick. To supplement the statistical justification of KFDA, [10] extends the maximum likelihood method and Bayes classification to their kernel generalizations under a Gaussian Hilbert space assumption. The authors do not directly kernelize the quadratic forms in terms of kernel values; instead, they use an explicit mapping function to map the data to a high-dimensional space, so the kernel matrix is usually used as the input data of FDA. The derived model is named Kernel Fisher's Quadratic Discriminant Analysis (KFQDA).
5 Conclusion and Future Work
In this paper, we present a novel kernel classifier named Kernel-based Maximum a Posteriori (KMAP), which implements a Gaussian distribution in the kernel-induced feature space. Compared to state-of-the-art classifiers, the advantages of KMAP are that prior information on the distribution is incorporated and that it can output a probability or confidence when making a decision. Moreover, KMAP can
850
Z. Xu et al.
be regarded as a more general classification method than other kernel-based methods such as KFDA. In addition, the error bound analysis illustrates that the Gaussian assumption is more easily satisfied in the feature space than in the input space. More importantly, KMAP with proper regularization achieves very promising performance. In future work, we plan to incorporate the probability information into both the kernel function and the classifier.
Acknowledgments The work described in this paper is fully supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4205/04E and Project No. CUHK4235/04E).
References

1. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Chichester (1998)
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.: Fisher discriminant analysis with kernels. In: Proceedings of IEEE Neural Networks for Signal Processing Workshop, pp. 41–48 (1999)
4. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555–582 (2002)
5. Huang, K., Yang, H., King, I., Lyu, M.R., Chan, L.: Minimum error minimax probability machine. Journal of Machine Learning Research 5, 1253–1286 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2000)
7. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
8. Centeno, T.P., Lawrence, N.D.: Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research 7(2), 455–491 (2006)
9. Kimura, F., Takashina, K., Tsuruoka, S., Miyake, Y.: Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 149–153 (1987)
10. Huang, S.Y., Hwang, C.R., Lin, M.H.: Kernel Fisher's discriminant analysis in Gaussian Reproducing Kernel Hilbert Space. Technical report, Academia Sinica, Taiwan, R.O.C. (2005)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)
Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification

Keisuke Kameyama

Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
[email protected]
Abstract. The use of the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification was investigated by comparison with SVMs using other conventional kernels. In the experiments, it became clear that SVMs with LHOM kernels achieve better trainability and give a more stable response to the texture classes than those with conventional kernels. Also, the number of support vectors was kept low, which indicates better class separability in the nonlinearly-mapped feature space.

Keywords: Support Vector Machine (SVM), Higher-Order Moment Spectra, Kernel Function, Texture Classification.
1 Introduction
Classification of image textures according to their local nature has various uses, among them scene understanding in robot vision, document processing, and diagnosis support in medical images. Generally, segmentation of textures relies on local features extracted by a local window. Multiple features are extracted from the windowed image, giving a feature vector of the local texture. The feature vector is then passed to a classifier, which maps it to a class label. One of the common choices among the local features is the local power spectrum of the signal. However, there are cases where second-order features are not sufficient. In such cases, higher-order features can be used to improve the classification ability [1]. However, as the orders of the features grow higher, exhaustive examination of the possible features, or an efficient search for an effective feature, becomes difficult due to the high dimensionality of the feature space. Thus, the use of moment and moment spectra features of higher order has been limited so far. Recently, the use of high-dimensional feature spaces by the so-called kernel-based methods has been actively examined. The Support Vector Machine (SVM) [2][3], which also benefits from this nature, has the potential to make high-dimensional signal features a realistic choice for supervised signal classification. In the literature, kernel functions corresponding to higher-order moments [4,5] and their spectra [6] have been investigated. In this work, the effectiveness of using the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification is examined by comparing it with SVMs using other conventional kernel functions. In the following, local higher-order moment (LHOM) features and SVMs are reviewed in Secs. 2 and 3, respectively. The definition of the LHOM kernel and its characteristics are given in Sec. 4. In Sec. 5, the setup and the results of the texture classification experiments are presented, with discussion focused on the differences introduced by the kernel choice. Finally, Sec. 6 concludes the paper.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 851–860, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Local Moment and Local Moment Spectra

2.1 Feature Extraction by Local Power Spectrum
Let s(t) be a real-valued image signal defined on R^2. One of the most common ways of characterizing a local signal is the local power spectrum. It can be obtained as the Fourier transform of the local second-order moment of signal s as

    M_{s,w,2}(ω, x) = ∫ m_{s,w,2}(τ, x) exp(−j ω^T τ) dτ.                    (1)

The integral is over R^2 where not specified. The local second-order moment of signal s is defined as

    m_{s,w,2}(τ, x) = ∫ w(t) s(t + x) w(t + τ) s(t + x + τ) dt,              (2)

where w is the window function for localizing the signal characterization. In this work, a Gaussian window defined as

    w(t, Σ) = (2π)^{−1} |Σ|^{−1/2} exp(−(1/2) t^T Σ^{−1} t)                  (3)

is used. Here, Σ is a positive definite symmetric real 2 × 2 matrix that modifies the window shape, and the superscript T denotes the matrix transpose.

2.2 Local Higher-Order Moment Spectra (LHOMS)
Moment spectra of orders n higher than two, namely the bispectrum (n = 3), the trispectrum (n = 4) and so on, also contribute to characterizing the texture. The spectrum of order n reflects the correlation of n − 1 frequency components in complex values. Therefore it is possible, for example, to characterize the phase difference among the signal components using the bispectrum, whereas this information is lost in the power spectrum.
The n-th order local moment spectrum is the Fourier transform of the n-th order local moment. By defining Ω_{n−1} = [ω_1^T ω_2^T . . . ω_{n−1}^T]^T and T_{n−1} = [τ_1^T τ_2^T . . . τ_{n−1}^T]^T, we have

    M_{s,w,n}(Ω_{n−1}, x) = ∫_{R^{2(n−1)}} m_{s,w,n}(T_{n−1}, x) exp(−j Ω_{n−1}^T T_{n−1}) dT_{n−1}    (4)

where

    m_{s,w,n}(T_{n−1}, x) = ∫ w(t) s(t + x) ∏_{k=1}^{n−1} { w(t + τ_k) s(t + x + τ_k) } dt.            (5)
Since the n-th order LHOMS is a product of n − 1 frequency components and their sum frequency component, the LHOMS is an especially useful tool for characterizing the phase relations among the fundamental (ω_0) and harmonic (2ω_0, 3ω_0, . . .) frequency components.
3 Support Vector Machine and Kernel Functions
The Support Vector Machine (SVM) [2][3] is a supervised pattern recognition scheme with the following two significant features. First, the SVM realizes an optimal linear classifier (optimal hyperplane) in the feature space, which is theoretically based on the Structural Risk Minimization (SRM) theory. The SVM learning achieves a linear classifier with minimum VC-dimension, thereby keeping the expected generalization error low. Second, utilization of feature spaces of very high dimension is made possible by way of kernel functions. The SVM takes an input vector x ∈ R^L, which is transformed by a predetermined feature extraction function Φ(x) ∈ R^N. This feature vector is classified to one of the two classes by a linear classifier

    y = sgn(u(x)) = sgn(⟨w, Φ(x)⟩ + b),    y ∈ {−1, 1},                      (6)

where w ∈ R^N and b ∈ R are the weight vector and the bias determining the placement of the discrimination hyperplane in the feature space, respectively. Parameters w and b are determined by supervised learning using a training set of input-output pairs {(x_i, y_i) ∈ R^L × {±1}}_{i=1}^ℓ. Assuming that the training set mapped to the feature space {Φ(x_i)}_{i=1}^ℓ is linearly separable, the weight vector w will be determined to maximize the margin between the two classes, by solving an optimization problem with inequality constraints:

    minimize (1/2) ||w||^2    subject to    y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1,  i = 1, . . . , ℓ.

It can be solved by using the Lagrangian function L(w, b, α), minimizing it with respect to the variables w and b, and maximizing it with respect to the multipliers α = [α_1, . . . , α_ℓ].
When the training set is not linearly separable, the constraints in the optimization problem can be relaxed to

    minimize (1/2) ||w||^2 + C Σ_{i=1}^{ℓ} ξ_i    subject to    y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i,  i = 1, . . . , ℓ,    (7)

utilizing so-called soft margins by introducing nonnegative slack variables {ξ_i}_{i=1}^ℓ and a regularization constant C > 0. In both cases, the function u for the classifier is obtained in the form

    u(x) = Σ_{i=1}^{ℓ} y_i α_i ⟨Φ(x_i), Φ(x)⟩ + b.                           (8)
Typically, the optimal solution of the original constrained problem will reside where equality holds for only a very small fraction of the inequality conditions. This leaves the multipliers for the other conditions at zero and makes the use of Eq. (8) practical even for a large ℓ. Training inputs with nonzero multipliers are called support vectors, hence the name SVM. Parameter b in Eq. (8) can be obtained from the (equality) constraints for one of the support vectors. In both training and using the SVM, the feature extraction function Φ(x) is never explicitly evaluated. Instead, the inner product K(x, y) = ⟨Φ(x), Φ(y)⟩, called the kernel function, is always evaluated. Among the well-known conventional kernel functions are the polynomial kernel

    K(x, y) = (⟨x, y⟩ + 1)^d,    d = 1, 2, . . . ,                           (9)

and the radial-basis function (RBF) kernel

    K(x, y) = exp(−||x − y||^2 / γ^2),    γ ∈ R.                             (10)
In some cases of feature extraction, evaluation of ⟨Φ(x), Φ(y)⟩ requires less computation than directly evaluating Φ. This nature enables the SVM to efficiently utilize feature spaces of very high dimension.
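As an illustration of this point (not from the paper): the polynomial kernel of Eq. (9) with d = 2 can be checked against an explicit feature map in a few lines. The function name phi_poly2 and the construction below are ours, a hedged sketch:

```python
import numpy as np
from itertools import combinations_with_replacement

def phi_poly2(x):
    """Explicit feature map whose inner product equals (<x, y> + 1)^2.
    (Illustrative helper; not part of the paper.)"""
    feats = [1.0]                              # constant term
    feats += list(np.sqrt(2.0) * x)            # linear terms, weight sqrt(2)
    for i, j in combinations_with_replacement(range(len(x)), 2):
        c = 1.0 if i == j else np.sqrt(2.0)    # multinomial weights
        feats.append(c * x[i] * x[j])          # quadratic terms
    return np.array(feats)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.3, -0.7])
k_direct = (x @ y + 1.0) ** 2                  # kernel evaluation, O(L)
k_explicit = phi_poly2(x) @ phi_poly2(y)       # explicit map, O(L^2) features
print(abs(k_direct - k_explicit) < 1e-9)       # → True
```

For an L-dimensional input the explicit map already has O(L^2) coordinates at d = 2, which is exactly why evaluating the kernel directly is preferable.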
4 The LHOMS and LHOM Kernels

4.1 LHOMS Kernel Derivation
Here, a kernel function is derived for the case where the LHOMS serves as the nonlinear feature map Φ. This amounts to treating the LHOMS function as a feature vector of infinite dimension over all possible combinations of frequency vectors ω_1, . . . , ω_{n−1}.
The inner product of the n-th order moment spectra of signals s(t) and v(t), localized at x and y respectively, is

    K_{w,n}(s, v ; x, y) = ⟨M_{s,w,n}(ω_1, . . . , ω_{n−1}, x), M_{v,w,n}(ω_1, . . . , ω_{n−1}, y)⟩
                         = ∫ · · · ∫ ( S_w^*( Σ_{k=1}^{n−1} ω_k, x) ∏_{k=1}^{n−1} S_w(ω_k, x) ) ×
                                   × { V_w^*( Σ_{k=1}^{n−1} ω_k, y) ∏_{k=1}^{n−1} V_w(ω_k, y) }^* dω_1 . . . dω_{n−1}
                         = (2π)^{2(n−1)} ∫ ( ∫ w(z) s(z + x) w(z + τ) v(z + y + τ) dz )^n dτ.          (11)
See [6] for the derivation. The function K_{w,n}(s, v ; x, y) will serve as the kernel function in the SVM for signal classification. Notable here is that the increase of computational cost due to the moment order can be avoided as far as the inner product is concerned. In the literature, the inner product of the n-th order moments of two signals s and v, which omits the window w of Eq. (3), has been derived as

    ⟨m_{s,n}, m_{v,n}⟩ = ∫ ( ∫ s(z) v(z + τ) dz )^n dτ,                      (12)
by an early work of McLaughlin and Raviv [7] for use in optical character recognition. Recently, it has been used in the context of modern kernel-based methods for signal and image recognition problems [4,5]. The non-localized and localized versions of this kernel will be referred to as the HOM and LHOM kernels, respectively.

4.2 Equivalence of LHOMS and LHOM Kernels
The following clarifies an important relation between the LHOMS kernel and the LHOM kernel.

Theorem (Plancherel). Define the Fourier transform of a function f(x) ∈ L^2(R^N) as (F f)(ω) = ∫_{R^N} f(x) e^{−j⟨ω, x⟩} dx. Then ⟨F f, F g⟩ = (2π)^N ⟨f, g⟩ holds for functions f, g ∈ L^2(R^N) [8].

Corollary. Denoting the LHOMS and LHOM of an image s(t) as M_{s,w,n} and m_{s,w,n}, respectively,

    ⟨M_{s,w,n}, M_{v,w,n}⟩ = (2π)^{2(n−1)} ⟨m_{s,w,n}, m_{v,w,n}⟩.
Fig. 1. The LHOM(S) kernel unifies the treatment of higher-order moment features (autocorrelations ψ) and higher-order moment spectra features (power spectrum, bispectrum, trispectrum, . . . ; φ) of a signal through the common kernel K(s, v) = ⟨φ, φ⟩ = ⟨ψ, ψ⟩
This corollary shows that the LHOMS and LHOM kernels are proportional, which is clear from the fact that the LHOMS is the Fourier transform of the LHOM, as mentioned in Eq. (4). Therefore, this kernel function introduces a unified view of kernel-based pattern recognition using moments and moment spectra, as shown in Fig. 1.
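The discrete analogue of the Plancherel relation behind this corollary is easy to check numerically. For numpy's unnormalized DFT the proportionality constant becomes the signal length N rather than (2π)^N; a minimal sketch (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(256)
g = rng.standard_normal(256)

# Unnormalized DFTs of both signals
F, G = np.fft.fft(f), np.fft.fft(g)

# Discrete Plancherel: <Ff, Fg> = N <f, g>
lhs = np.vdot(F, G)              # sum_k conj(F_k) * G_k
rhs = len(f) * np.vdot(f, g)     # N * <f, g>  (f, g real)
print(np.allclose(lhs, rhs))     # → True
```

The same proportionality is what lets the LHOM kernel stand in for the LHOMS kernel up to a constant factor.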
5 Texture Classification Experiments

5.1 Experimental Conditions
The SVM with the LHOM(S) kernel was tested on two-class texture classification problems. Image luminance was preprocessed by scaling to the range [0, 1]. The training sets were obtained from random positions of the original texture images. All the experiments used 20 training examples per class, each sized 32 × 32 pixels. The kernel types used in the experiments were the following. 1. Polynomial kernel of Eq. (9), for d = 1, . . . , 5. 2. RBF kernel of Eq. (10), for γ = 1, 10, 100. 3. LHOM kernels in their discrete versions [6] with isotropic Gaussian windows; the order n and window width σ were varied as n = 2, 3, 4, 5 and σ = 4, 8, 12. When testing the SVM after training, images with tiled texture patterns were used, as shown in Figs. 2(c) and 3(c). The classification rates were calculated for the center regions of the test images, allowing edge margins of 16 pixel width. The output class maps are presented as the linear output u in Eq. (8). No post-smoothing of the class maps has been used.

5.2 Sinusoidal Wave with Harmonic Components
Two texture classes were generated by a computer according to the two formulas

    s_A(x) = sin(ω_0^T x) + sin(2 ω_0^T x) + sin(3 ω_0^T x)  and
    s_B(x) = sin(ω_0^T x) + sin(2 ω_0^T x + φ) + sin(3 ω_0^T x).
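These two classes can be generated directly; a Python sketch (ours, not from the paper), using the values ω_0 = [π/6, π/6]^T and φ = π/2 stated in the experiment. A 48-pixel size is chosen here so that all three harmonics fall exactly on the DFT grid, which makes the identical power spectra of the two classes easy to verify:

```python
import numpy as np

def harmonic_texture(size=48, w0=(np.pi / 6, np.pi / 6), phi=0.0):
    """s(x) = sin(w0^T x) + sin(2 w0^T x + phi) + sin(3 w0^T x).
    phi = 0 gives class A; phi = pi/2 gives class B."""
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    arg = w0[0] * ii + w0[1] * jj
    return np.sin(arg) + np.sin(2 * arg + phi) + np.sin(3 * arg)

sA = harmonic_texture(phi=0.0)
sB = harmonic_texture(phi=np.pi / 2)

# The classes differ only in the phase of the second harmonic,
# so their power spectra are identical:
same_power = np.allclose(np.abs(np.fft.fft2(sA)),
                         np.abs(np.fft.fft2(sB)), atol=1e-6)
print(same_power)  # → True
```

This is exactly why a power-spectrum (second-order) feature cannot separate the two classes, while phase-sensitive higher-order features can.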
Fig. 2. Computer generated textures including fundamental, second and third harmonic frequency components. (a) Class 1, (b) Class 2, and (c) tiled collage for testing.
Fig. 3. Natural texture images selected from the Vision Texture collection. (a) Fabric17 texture, (b) Fabric18 texture and (c) tiled collage for testing.
Here, the fundamental frequency vector was set to ω_0 = [π/6, π/6]^T. The phase shift φ = π/2 in the second harmonic component of s_B is the only difference between the two textures. Although the two classes cannot be discriminated using the local power spectrum, they should be easily classified using the phase information extracted by features of higher orders. The classification results in Table 1 show that, as expected, training was successful only for the RBF and LHOM (n = 3, 4, 5) kernels. The higher test rates for the LHOMS kernels are partly due to the clear discrimination of the classes within the feature space, indicated by the small number of support vectors without using soft margins.

5.3 Fabric Texture
The textures in Fig. 3 were selected from the fabric textures of the Vision Texture collection [9]. The results are shown in Table 2. Tendencies similar to the previous experiment are observed, showing the superiority of the LHOMS kernel. Fig. 5(a) indicates that the unthresholded output of the SVM with the RBF kernel has a very small inter-class variance, indicating the instability of the classifier. Use of soft margins resulted in a small improvement for the SVMs with LHOM kernels.
Table 1. Trainability, number of support vectors and the test rate of the SVMs using various kernels for the computer-generated texture

Kernel Type                Trained?   SV number (ratio to TS)   Test rate
Polynomial (d = 1, .., 5)  No         -                         -
RBF (γ = 10)               Yes        39 (97.5%)                80%
LHOM (n = 2) (σ = 8)       No         -                         -
LHOM (n = 3) (σ = 8)       Yes        2 (5%)                    83%
LHOM (n = 4) (σ = 8)       Yes        3 (7.5%)                  82%
LHOM (n = 5) (σ = 8)       Yes        8 (20%)                   84%
Fig. 4. SVM linear outputs and classification rates for the test image of the computer-generated texture using various types of kernels. (a) RBF (80%), (b) LHOM (n = 3) (83%), (c) LHOM (n = 4) (82%) and (d) LHOM (n = 5) (84%).
Table 2. Trainability, number of support vectors and the test rate of the SVMs using various kernels for the natural texture

Kernel Type                Trained?                SV number (No SM / Use SM)   Test rate (No SM / Use SM)
Polynomial (d = 1, .., 5)  Partly (at d = 1 only)  40 / 40                      50% / 50%
RBF (γ = 10)               Yes                     40 / 40                      65% / 65%
LHOM (n = 2) (σ = 8)       Yes                     5 / 40                       69% / 79%
LHOM (n = 3) (σ = 8)       Yes                     27 / 40                      73% / 75%
LHOM (n = 4) (σ = 8)       Yes                     30 / 35                      75% / 75%
LHOM (n = 5) (σ = 8)       Yes                     37 / 40                      67% / 69%
Fig. 5. SVM linear outputs and classification rates for the test image of the natural texture using various types of kernels. (a) RBF (65%), (b) LHOM (n = 2) (69%), (c) LHOM (n = 3) (73%) and (d) LHOM (n = 4) (75%).
5.4 Discussion
The test using the tiled images evaluates the combination of the raw classification rate and the spatial resolution; further investigation separating the two factors is necessary. When non-tiled (pure) test images of the fabric texture were used, SVMs with the RBF (γ = 10) kernel and the LHOMS (σ = 8) kernels for n = 2, 3, 4 and 5 achieved 90%, 95%, 90%, 95% and 80%, respectively. In [1], classifiers using LHOM features of orders up to 3 are reported to have achieved test rates over 96% for 30 classes of natural textures. Although this does not give a direct comparison, the results support the suitability of LHOM(S) features for natural texture classification.
6 Conclusion
Texture classification using SVMs with the LHOM kernel and other conventional kernel functions was compared. It became clear that the SVMs with LHOM kernels achieve better trainability and give a more stable response to the texture classes than those with conventional kernels. Also, the number of support vectors was lower, which indicates a better class separability in the feature space.
References

1. Kurita, T., Otsu, N.: Texture classification by higher order local autocorrelation features. In: Proc. of Asian Conf. on Computer Vision (ACCV 1993), pp. 175–178 (1993)
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
3. Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, Heidelberg (2000)
4. Popovici, V., Thiran, J.P.: Higher order autocorrelations for pattern classification. In: Proceedings of International Conference on Image Processing 2001, pp. 724–727 (2001)
5. Popovici, V., Thiran, J.P.: Pattern recognition using higher-order local autocorrelation coefficients. In: Neural Networks for Signal Processing XII (NNSP), pp. 229–238 (2002)
6. Kameyama, K., Taga, K.: Texture classification by support vector machines with kernels for higher-order Gabor filtering. In: Proceedings of International Joint Conference on Neural Networks 2004, vol. 4, pp. 3009–3014 (2004)
7. McLaughlin, J.A., Raviv, J.: Nth-order autocorrelations in pattern recognition. Information and Control 12(2), 121–142 (1968)
8. Umegaki, H.: Basic Information Mathematics – Development via Functional Analysis (in Japanese). Saiensu-sha (1993)
9. MIT Vision and Modeling Group: Vision texture (1995)
Pattern Discovery for High-Dimensional Binary Datasets

Václav Snášel (1), Pavel Moravec (1), Dušan Húsek (2), Alexander Frolov (3), Hana Řezanková (4), and Pavel Polyakov (5)

(1) Department of Computer Science, FEECS, VŠB – Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic
{pavel.moravec,vaclav.snasel}@vsb.cz
(2) Institute of Computer Science, Dept. of Nonlinear Systems, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic
[email protected]
(3) Institute of Higher Nervous Activity and Neurophysiology, Russian Academy of Sciences, Butlerova 5a, 117 485 Moscow, Russia
[email protected]
(4) Department of Statistics and Probability, University of Economics, Prague, W. Churchill sq. 4, 130 67 Prague, Czech Republic
[email protected]
(5) Institute of Optical Neural Technologies, Russian Academy of Sciences, Vavilova 44, 119 333 Moscow, Russia
[email protected]
Abstract. In this paper we compare the performance of several dimension reduction techniques used as tools for feature extraction. The tested methods include singular value decomposition, semi-discrete decomposition, non-negative matrix factorization, a novel neural-network-based algorithm for Boolean factor analysis, and two cluster analysis methods as well. The so-called bars problem is used as the benchmark: a set of artificial signals generated as Boolean sums of a given number of bars is analyzed by these methods. The resulting images show that Boolean factor analysis is the most suitable method for this kind of data.
1 Introduction

In order to perform object recognition (no matter which kind) it is necessary to learn representations of the underlying characteristic components. Such components correspond to object parts, or features. The data sets may comprise discrete attributes, such as those from market basket analysis, information retrieval, and bioinformatics, as well as continuous attributes such as those in scientific simulations, astrophysical measurements, and sensor networks. Feature extraction from high-dimensional data typically consists of correlation analysis, clustering (including finding efficient representations for clustered data), data classification, and event association. The objective is discovering meaningful information in data so as to be able to represent it appropriately and, if possible, in a space of lower dimension. Feature extraction applied to binary datasets addresses many research and application fields, such as association rule mining [1], market basket analysis [2], discovery of regulation patterns in DNA microarray experiments [3], etc. Many of these problem areas have been described in tests of the PROXIMUS framework (e.g. [4]). Feature extraction methods can use different aspects of images as the features. Such methods either use a heuristic based on the known properties of the image collection, or are fully automatic and may use the original image vectors as an input. Here we concentrate on the case of black-and-white pictures of bar combinations represented as binary vectors, so complex feature extraction methods are unnecessary. The values of the entries of such a vector represent individual pixels, i.e. 0 for white and 1 for black. There are many approaches that can be used for this purpose. This article reports on an empirical investigation of the performance of some of them. For the sake of simplicity we use the well-known bars problem (see e.g. [5]), where we try to isolate separate horizontal and vertical bars from images containing their combinations. Here we concentrate on the category that uses dimension reduction techniques for automatic feature extraction, and we compare results of the most up-to-date procedures. One such method is singular value decomposition, which has already been used successfully many times for automatic feature extraction. In the case of a bars collection (such as our test data), the base vectors can be interpreted as images describing some common characteristics of several input signals. However, singular value decomposition is not suitable for huge collections and is computationally expensive, so other methods of dimension reduction were proposed; here we use the semi-discrete decomposition.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 861–872, 2008. © Springer-Verlag Berlin Heidelberg 2008
Because the data matrix has all elements non-negative, we also tried to apply a new method called non-negative matrix factorization. The question of how the brain forms a useful representation of its environment was behind the development of neural-network-based methods for dimension reduction: the neural-network-based Boolean factor analysis [6,7] and the optimal sparse coding network developed by Földiák [5]. Here we apply only the neural-network-based Boolean factor analysis developed by us, because we do not have the optimal sparse coding network algorithm at our disposal at this moment. However, for a first view of the image structures, we can apply traditional statistical methods, mainly different algorithms for statistical cluster analysis. Analysis of discrete data sets, however, generally leads to NP-complete/hard problems, especially when physically interpretable results in discrete spaces are desired. The rest of this paper is organized as follows. The second section explains the dimension reduction methods used in this study. Section 3 describes the experimental results, and Section 4 draws some conclusions.
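The bars benchmark just described is easy to generate. The following Python sketch (a hypothetical helper of ours, not from the paper) builds 8 × 8 images as Boolean superpositions of horizontal and vertical bar factors:

```python
import numpy as np

def make_bars_dataset(n_samples, grid=8, p_bar=0.125, seed=0):
    """Each image is a Boolean sum (OR) of randomly switched-on bars;
    the 2*grid individual bars are the hidden binary factors."""
    rng = np.random.default_rng(seed)
    factors = np.zeros((2 * grid, grid, grid), dtype=bool)
    for i in range(grid):
        factors[i, i, :] = True          # horizontal bars
        factors[grid + i, :, i] = True   # vertical bars
    F = factors.reshape(2 * grid, -1)                # factor loadings
    S = rng.random((n_samples, 2 * grid)) < p_bar    # factor scores
    X = (S.astype(int) @ F.astype(int)) > 0          # Boolean superposition
    return X, S, F

X, S, F = make_bars_dataset(1000)
print(X.shape)  # → (1000, 64)
```

Thresholding the ordinary matrix product at zero implements the Boolean matrix product, so X, S and F play exactly the roles of the data, score and loading matrices analyzed in Section 2.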
2 Dimension Reduction

We used four promising methods of dimension reduction for our comparison: Singular Value Decomposition (SVD), Semi-Discrete Decomposition (SDD), Non-negative Matrix Factorization (NMF) and Neural-network-based Boolean Factor Analysis (NBFA). For the first analysis we used two statistical clustering methods, the Hierarchical Agglomerative Algorithm (HAA) and Two-Step Cluster Analysis (TSCA). All methods are briefly described below.

2.1 Singular Value Decomposition

The SVD [8] is an algebraic extension of the classical vector model. It is similar to the PCA method, which was originally used for the generation of eigenfaces in image retrieval. Informally, SVD discovers significant properties and represents the images as linear combinations of the base vectors. Moreover, the base vectors are ordered according to their significance for the reconstructed image, which allows us to consider only the first k base vectors as important (the remaining ones are interpreted as "noise" and discarded). Furthermore, SVD is often referred to as more successful in recall when compared to querying whole image vectors [8]. Formally, we decompose the matrix of images A by SVD, calculating the singular values and singular vectors of A. We have a matrix A, which is an n × m rank-r matrix (where m ≥ n without loss of generality), and values σ_1, . . . , σ_r calculated from the eigenvalues λ_i of the matrix AA^T as σ_i = √λ_i. Based on them, we can calculate column-orthonormal matrices U = (u_1, . . . , u_n) and V = (v_1, . . . , v_n), where U^T U = I_n and V^T V = I_m, and a diagonal matrix Σ = diag(σ_1, . . . , σ_n), where σ_i > 0 for i ≤ r, σ_i ≥ σ_{i+1} and σ_{r+1} = . . . = σ_n = 0. The decomposition

    A = U Σ V^T                                                              (1)

is called the singular value decomposition of matrix A, and the numbers σ_1, . . . , σ_r are the singular values of the matrix A. Columns of U (or V) are called the left (or right) singular vectors of matrix A. Now we have a decomposition of the original matrix of images A.
We get r nonzero singular numbers, where r is the rank of the original matrix A. Because the singular values usually fall quickly, we can take only the k greatest singular values with the corresponding singular vector coordinates and create a k-reduced singular decomposition of A. Let us have k (0 < k < r) and the singular value decomposition of A:

    A = U Σ V^T ≈ A_k = (U_k  U_0) [ Σ_k  0 ; 0  Σ_0 ] [ V_k^T ; V_0^T ]     (2)

We call A_k = U_k Σ_k V_k^T a k-reduced singular value decomposition (rank-k SVD). Instead of the A_k matrix, a matrix of image vectors in the reduced space, D_k = Σ_k V_k^T, is used in SVD as the representation of the image collection. The image vectors (columns in D_k) are now represented as points in k-dimensional space (the feature space).
Fig. 1. Rank-k SVD
Rank-k SVD is the best rank-k approximation of the original matrix A. This means that any other rank-k decomposition will increase the approximation error, calculated as a sum of squares (Frobenius norm) of the error matrix B = A − A_k. However, this does not mean that we could not obtain a better result with a different kind of approximation.

2.2 Semi-discrete Decomposition

The SDD is another LSI method, proposed recently for text retrieval in [9]. As mentioned earlier, the rank-k SVD method (called truncated SVD by the authors of the semi-discrete decomposition) produces dense matrices U and V, so the resulting required storage may be even larger than that needed by the original term-by-document matrix A. To improve the required storage size and query time, the semi-discrete decomposition was defined as

    A ≈ A_k = X_k D_k Y_k^T,                                                 (3)

where each coordinate of X_k and Y_k is constrained to have entries from the set ϕ = {−1, 0, 1}, and the matrix D_k is a diagonal matrix with positive coordinates. The SDD does not reproduce A exactly, even if k = n, but it uses very little storage with respect to the observed accuracy of the approximation. A rank-k SDD (although from a mathematical standpoint it is a sum of rank-1 matrices) requires the storage of k(m + n) values from the set {−1, 0, 1} and k scalars. The scalars need to be only single precision because the algorithm is self-correcting. The SDD approximation is formed iteratively. The optimal choice of the triplets (x_i, d_i, y_i) for a given k can be determined using a greedy algorithm based on the residual R_k = A − A_{k−1} (where A_0 is a zero matrix).

2.3 Non-negative Matrix Factorization

The NMF [10] method calculates an approximation of the matrix A as a product of two matrices, W and H. The matrices are usually pre-filled with random values (or H is initialized to zero and W is randomly generated). During the calculation the values in W and H stay positive. The approximation of matrix A, the matrix A_k, can be calculated as A_k = W H.
The original NMF method tries to minimize the Frobenius norm of the difference between A and A_k, using min_{W,H} ||V − W H||_F^2 as the criterion in the minimization problem. Recently, a new method was proposed in [11], where the constrained least squares problem min_{H_j} { ||V_j − W H_j||^2 − λ ||H_j||_2^2 } is the criterion in the minimization problem. This approach yields better results for sparse matrices. Unlike in SVD, the base vectors are not ordered from the most general one, and we have to calculate the decomposition for each value of k separately.
This approach is yields better results for sparse matrices. Unlike in SVD, the base vectors are not ordered from the most general one and we have to calculate the decomposition for each value of k separately. 2.4 Neural Network Based Boolean Factor Analysis The NBFA is a powerful method for revealing the information redundancy of high dimensional binary signals [7]. It allows to express every signal (vector of variables) from binary data matrix X of observations as superposition of binary factors: X=
L
Sl f l ,
(4)
l=1
where S_l is a component of the factor scores and f^l is a vector of factor loadings, and ∨ denotes Boolean summation (0 ∨ 0 = 0, 1 ∨ 0 = 1, 0 ∨ 1 = 1, 1 ∨ 1 = 1). If we denote Boolean matrix multiplication by the symbol ∘, then we can express the approximation of the data matrix X in matrix notation as

    X_k = F ∘ S                                                              (5)
where S is the matrix of factor scores and F is the matrix of factor loadings. Boolean factor analysis implies that the components of the original signals, the factor loadings and the factor scores are all binary. Finding the optimal decomposition (5) by brute-force search is an NP-hard problem and as such is not feasible for high-dimensional data. On the other hand, classical linear methods cannot take into account the non-linearity of Boolean summation and are therefore inadequate for this task. The NBFA is based on a Hopfield-like neural network [12,13]: a fully connected network of N neurons with binary activity (1 = active, 0 = inactive) is used. Each pattern X^m of the learning set is stored in the matrix of synaptic connections J according to the Hebbian rule:

J_ij = Σ_{m=1}^{M} (X_i^m − q^m)(X_j^m − q^m),  i, j = 1, ..., N,  i ≠ j,  J_ii = 0, (6)
where M is the number of patterns in the learning set and the bias q^m = Σ_{i=1}^{N} X_i^m / N is the total relative activity of the m-th pattern. This form of bias corresponds to the biologically plausible global inhibition being proportional to the overall neuron activity. One
special inhibitory neuron was added to the N principal neurons of the Hopfield network. The neuron was activated during the presentation of every pattern of the learning set and was connected with all the principal neurons by bidirectional connections. Patterns of the learning set are stored in the vector j of these connections according to the Hebbian rule:

j_i = Σ_{m=1}^{M} (X_i^m − q^m) = M(q_i − q),  i = 1, ..., N, (7)
where q_i = Σ_{m=1}^{M} X_i^m / M is the mean activity of the i-th neuron in the learning set and q is the mean activity of all neurons in the learning set. We also supposed that the excitability of the introduced inhibitory neuron decreases inversely proportionally to the size of the learning set, being 1/M after all patterns are stored. In the recall stage its activity is then

A(t) = (1/M) Σ_{i=1}^{N} j_i X_i(t) = (1/M) j^T X(t),

where j^T is the transpose of j. The inhibition produced in all principal neurons of the network is given by the vector j A(t) = (1/M) j j^T X(t). Thus, the inhibition is equivalent to the subtraction of

J′ = j j^T / M = M q̄ q̄^T (8)
from J, where q̄ is a vector with components q_i − q. Adding the inhibitory neuron is thus equivalent to replacing the ordinary connection matrix J by the matrix J − J′ (below, J denotes this modified matrix). To reveal factors we suggest the following two-run recall procedure. Its initialization starts with the presentation of a random initial pattern X^in with k_in = r_in N active neurons. The activity k_in is supposed to be smaller than the activity of any factor. On presentation of X^in, the network activity X evolves to some attractor. This evolution is determined by the synchronous discrete-time dynamics. At each time step

X_i(t + 1) = Θ(h_i(t) − T(t)),  i = 1, ..., N,  X_i(0) = X_i^in, (9)

where h_i are the components of the vector of synaptic excitations

h(t) = J X(t), (10)
Θ is the step function, and T(t) is an activation threshold. At each time step of the recall process the threshold T(t) was chosen in such a way that the level of network activity was kept constant and equal to k_in. Thus, at each time step the k_in "winners" (neurons with the greatest synaptic excitation) were chosen and only they were active at the next time step. As shown in [12], this choice of activation threshold enables the network activity to stabilize in point or cyclic attractors
of length two. The fixed level of activity at this stage of the recall process could be ensured by biologically plausible non-linear negative feedback control accomplished by the inhibitory interneurons. It is worth noting that although the convergence of synchronous network dynamics to point attractors or to cyclic attractors of length two was established earlier [14], that result was obtained for a fixed activation threshold T; for fixed network activity it was first shown in [12]. When activity stabilizes at the initial level k_in, the k_in + 1 neurons with maximal synaptic excitation are chosen for the next iteration step, and network activity evolves to an attractor at the new level of activity k_in + 1. The level of activity then increases to k_in + 2, and so on, until the number of active neurons reaches the final level k_f = r_f N, where r = k/N is the relative network activity. Thus, one trial of the recall procedure contains (r_f − r_in)N external steps and several internal steps (usually 2-3) inside each external step to reach an attractor for a given level of activity. At the end of each external step, when network activity stabilizes at the level of k active neurons, a Lyapunov function was calculated by the formula

λ = X^T(t + 1) J X(t) / k, (11)
where X^T(t+1) and X(t) are the two network states in a cyclic attractor (for a point attractor, X(t + 1) = X(t)). The identification of factors along the trajectories of the network dynamics was based on analyzing the change of the Lyapunov function and the activation threshold along each trajectory. In our definition of the Lyapunov function, its value gives the mean synaptic excitation of the neurons belonging to an attractor at the end of each external step.

2.5 Statistical Clustering Methods

Clustering methods help to reveal groups of black pixels, i.e., typical parts of the images. However, a drawback of traditional hard clustering is that it yields only disjunctive clusters. The HAA algorithm starts with each pixel in a group of its own and then merges clusters until a single cluster containing all pixels remains. The user must choose a dissimilarity or similarity measure and an agglomerative procedure. At the beginning, when each pixel represents its own cluster, the dissimilarity between two pixels is defined by the chosen measure. However, once several pixels have been linked together, a linkage (amalgamation) rule is needed to decide when two clusters are sufficiently similar to be merged. Several linkage rules have been proposed; e.g., the distance between two clusters can be defined as the greatest distance between two pixels in the clusters (Complete Linkage method, CL), or as the average distance between all pairs of objects in the two clusters (Average Linkage Between Groups, ALBG). Hierarchical clustering is based on the proximity matrix (dissimilarities for all pairs of pixels) and is independent of the order of the pixels. For binary data, we can choose for example the Jaccard and Ochiai (cosine) similarity measures of two pixels. The former can be expressed as S_J = a / (a + b + c), where a is the number of common
occurrences of ones and b + c is the number of pairs in which one value is one and the second is zero. The latter can be expressed as S_O = a / √((a + b)(a + c)). In the TSCA algorithm, the pixels are arranged into subclusters, known as "cluster-features". These cluster-features are then clustered into k groups using a traditional hierarchical clustering procedure. A cluster feature (CF) represents a set of summary statistics on a subset of the data. The algorithm consists of two phases. In the first, an initial CF tree is built (a multi-level compression of the data that tries to preserve its inherent clustering structure). In the second, an arbitrary clustering algorithm is used to cluster the leaf nodes of the CF tree. A disadvantage of this method is its sensitivity to the order of the objects (pixels in our case). The log-likelihood distance measure can be used to calculate the distance between two clusters.
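For two binary pixel vectors, the counts a, b, c and both similarity measures can be computed directly (a small illustration):

```python
import numpy as np

def jaccard_ochiai(x, y):
    """Jaccard and Ochiai (cosine) similarity of two binary vectors."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)        # common occurrences of ones
    b = np.sum(x & ~y)       # one in x, zero in y
    c = np.sum(~x & y)       # zero in x, one in y
    s_j = a / (a + b + c)                          # Jaccard
    s_o = a / np.sqrt((a + b) * (a + c))           # Ochiai (cosine)
    return s_j, s_o

s_j, s_o = jaccard_ochiai([1, 1, 0, 1], [1, 0, 1, 1])
# here a = 2, b = 1, c = 1, so s_j = 0.5 and s_o = 2/3
```

Both measures ignore joint absences (0,0 pairs), which is why they suit asymmetric binary attributes such as sparse black-and-white pixels.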
3 Experimental Results

To test the above-mentioned methods, we used a generated collection of 1600 32 × 32 black-and-white images containing different combinations of horizontal and vertical lines (bars). Each bar occurs in an image with the same probability, 10/64, i.e., images contain 10 bars on average. An example of several images from the generated collection is shown in Figure 3a. For a first view of the image structures, we applied traditional cluster analysis, clustering the 1024 (32 × 32) pixel positions into 64 and into 32 clusters. The drawback of traditional cluster analysis is that it yields disjunctive clusters, which means we can find, e.g., only horizontal bars and mere parts of vertical bars, and vice versa. We applied HAA and TSCA clustering as implemented in the SPSS system. Because the TSCA method depends on the order of the analyzed images, we used two different orders. For HAA, we tried different similarity measures and found that the linkage method has more influence on the clustering results than the similarity measure. We used the Jaccard and Ochiai (cosine) similarity measures, which are suitable for asymmetric binary attributes, and found both suitable for identifying the bars or their parts. For 64 clusters, the ALBG and CL methods with the Jaccard and Ochiai measures differed only in a few assignments of positions. The following figures illustrate the application of some of these techniques for 64 and 32 clusters. Figures 2a and 2b show the results of the ALBG method with the Jaccard measure, Figure 2a for 32 clusters and Figure 2b for 64 clusters. In the case of 32 clusters, the TSCA method with the second ordering of features found 32 horizontal bars (see Figure 2c). Many of the tested methods were able to generate a set of base images or factors, which should ideally capture all possible bar positions. However, not all methods were truly successful in this.
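A collection with the statistics described above can be generated as follows (a sketch; each of the 64 bars occurs independently with probability 10/64, so images contain 10 bars on average):

```python
import numpy as np

def generate_bars(n_images=1600, size=32, p=10/64, seed=0):
    """Binary bar images: each of the `size` horizontal and `size`
    vertical bars occurs independently with probability p."""
    rng = np.random.default_rng(seed)
    images = np.zeros((n_images, size, size), dtype=bool)
    for img in images:
        img[rng.random(size) < p, :] = True   # horizontal bars (rows)
        img[:, rng.random(size) < p] = True   # vertical bars (columns)
    return images

images = generate_bars()
# A pixel is black if its row bar or its column bar is present,
# so the expected fraction of black pixels is 1 - (1 - 10/64)^2 ≈ 0.29.
```

Note that overlapping bars combine by Boolean OR, which is exactly the non-linearity that the NBFA model assumes and linear methods ignore.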
With the SVD, we obtain the classic singular vectors, the most general ones being among the first; the first few are shown in Figure 3b. We can see that the bars are not separated and different shades of gray appear. The NMF methods yield different results. The original NMF method, based on the adjustment of the random matrices W and H, provides hardly recognizable images even for k = 100 and 1000 iterations (we used 100 iterations for the other experiments).
Fig. 2. (a) 64 and (b) 32 clusters of pixels by ALBG method (Jaccard coefficient) (c) 32 clusters of pixels by TSCA method
Fig. 3. (a) Several bars from generated collection (b) First 64 base images of bars for SVD method (c) First 64 factors for the original NMF method
Moreover, these base images still contain significant salt-and-pepper noise and have poor contrast. The factors are shown in Figure 3c. We must also note that the NMF decomposition yields slightly different results each time it is run, because the matrices are pre-filled with random values. The GD-CLS modification of the NMF method tries to improve the decomposition by solving the constrained least squares problem. This leads to better overall quality; however, the decomposition still depends on the randomly pre-filled matrix H. The result is shown in Figure 4a. The SDD method differs slightly from the previous methods, since each factor contains only values from {−1, 0, 1}. In the factors shown in Figure 4b, gray represents 0; −1 and 1 are represented by black and white, respectively. The base vectors in Figure 4b can be divided into three categories: base vectors containing only one bar; base vectors containing one horizontal and one vertical bar; and other base vectors, containing several bars and in some cases even noise. Finally, we decomposed the images into binary vectors by the NBFA method. Here the factors contain only values from {0, 1} and Boolean arithmetic is used. The factor search was performed under the assumption that the number of ones in a factor is not less
Fig. 4. First (a) 64 factors for GD-CLS NMF method, first 64 base vectors for (b) SDD method and (c) NBFA method
than 5 and not greater than 200. Since the images are obtained by Boolean summation of binary bars, it is not surprising that the NBFA is able to reconstruct all bars as base vectors, providing an ideal solution, as we can see in Figure 4. It is clear that classical linear methods cannot take into account the non-linearity of Boolean summation and are therefore inadequate for the bars problem. But the linear methods are fast and well elaborated, so it was very interesting to compare the linear approach with the NBFA and to compare the results qualitatively.
4 Conclusion

In this paper, we have compared the NBFA with several dimension reduction techniques, including clustering methods, on the so-called bars problem. It is shown that the NBFA perfectly found the basis (factors) from which all the learning pictures can be reconstructed. This is, first, because the model on which the BFA is based is the same as the one used for data generation, and second, because of the robustness of the BFA implementation based on a recurrent neural network. Some experiments show that the resistance against noise is very high; we hypothesize that this is due to the self-reconstruction ability of our neural network. While the SVD is known to provide quality eigenfaces, it is computationally expensive, and if we only need to beat the "curse of dimensionality" by reducing the dimension, the SDD may suffice. As expected, among the methods that allow direct querying in the reduced space, SVD was the slowest but most exact method. The NMF and SDD methods may also be used, but not with the L2 metric, since the distances are not preserved well enough in this case. There are some other newly proposed methods which may be interesting for future testing, e.g., SparseMap [15]. Additionally, a faster pivot selection technique for FastMap [16] may be considered. Finally, testing the used dimension reduction methods with deviation metrics on metric structures should answer the question of projected-data indexability (which is poor for SVD-reduced data).
Cluster analysis in this application is focused on finding the original factors from which the images were generated. The applied clustering methods were quite successful in finding these factors. The problem of cluster analysis is that it provides disjunctive clusters only, so only some bars or their parts were revealed. However, from the overall view of the 64 clusters it is obvious that the images are composed of vertical and horizontal bars (lines). By two-step cluster analysis, 32 horizontal lines were revealed when clustering into 32 clusters.

Acknowledgment. The work was partly funded by the Centre of Applied Cybernetics 1M6840070004, partly by the Institutional Research Plan AV0Z10300504 "Computer Science for the Information Society: Models, Algorithms, Applications", and by the project 1ET100300419 of the Program Information Society of the Thematic Program II of the National Research Program of the Czech Republic.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB 1994: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994)
2. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 255–264. ACM Press, New York (1997)
3. Spellman, P.T., Sherlock, G., Zhang, M.Q., Anders, V.I.K., Eisen, M.B., Brown, P., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297 (1998)
4. Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1), 33–69 (2006)
5. Földiák, P.: Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics 64(22), 165–170 (1990)
6. Frolov, A., Húsek, D., Polyakov, P., Řezanková, H.: New Neural Network Based Approach Helps to Discover Hidden Russian Parliament Voting Patterns. In: International Joint Conference on Neural Networks, Omnipress, pp. 6518–6523 (2006)
7. Frolov, A.A., Húsek, D., Muravjev, P., Polyakov, P.: Boolean Factor Analysis by Attractor Neural Network. IEEE Transactions on Neural Networks 18(3), 698–707 (2007)
8. Berry, M., Dumais, S., Letsche, T.: Computational Methods for Intelligent Information Access. In: Proceedings of the 1995 ACM/IEEE Supercomputing Conference, San Diego, California, USA (1995)
9. Kolda, T.G., O'Leary, D.P.: Computation and uses of the semidiscrete matrix decomposition. ACM Transactions on Information Processing (2000)
10. Shahnaz, F., Berry, M., Pauca, P., Plemmons, R.: Document clustering using nonnegative matrix factorization. Information Processing and Management 42, 373–386 (2006)
11. Spratling, M.W.: Learning Image Components for Object Recognition. Journal of Machine Learning Research 7, 793–815 (2006)
12. Frolov, A.A., Húsek, D., Muravjev, P.: Informational efficiency of sparsely encoded Hopfield-like autoassociative memory. Optical Memory and Neural Networks (Information Optics), 177–198 (2003)
13. Frolov, A.A., Sirota, A.M., Húsek, D., Muravjev, P.: Binary factorization in Hopfield-like neural networks: single-step approximation and computer simulations. Neural Network World, 139–152 (2004)
14. Goles-Chacc, E., Fogelman-Soulie, F.: Decreasing energy functions as a tool for studying threshold networks. Discrete Mathematics, 261–277 (1985)
15. Faloutsos, C.: Gray Codes for Partial Match and Range Queries. IEEE Transactions on Software Engineering 14(10) (1988)
16. Faloutsos, C., Lin, K.: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. ACM SIGMOD Record 24(2), 163–174 (1995)
Expand-and-Reduce Algorithm of Particle Swarm Optimization Eiji Miyagawa and Toshimichi Saito Hosei University, Koganei, Tokyo, 184-0002 Japan
Abstract. This paper presents an optimization algorithm: particle swarm optimization with expand-and-reduce ability. When particles are trapped in a local optimal solution, a new particle is added so that the trapped particle(s) can escape from the trap. Deletion of particles is also used in order to suppress excessive growth of the swarm. The efficiency of the algorithm is verified through basic numerical experiments.
1 Introduction
Particle swarm optimization (PSO) has been used extensively as an effective technique for solving a variety of optimization problems [1]-[6]. It goes without saying that it is hard to find a true optimal solution in practical complex problems; the PSO algorithm tries to find a solution of sufficient accuracy in practical cases. The original PSO was developed by Kennedy and Eberhart [1]. The PSO shares the best information within the swarm and is used for function optimization problems. In order to realize effective PSO search, several interesting approaches exist, including particle dynamics by random sampling from a flexible distribution [3], an improved PSO using speciation [4], growing multiple subswarms [5], and adaptive PSO [6]. The prospective applications of the PSO are many, including the iterated prisoner's dilemma, optimizing RNA secondary structure, mobile sensor networks, and nonlinear state estimation [7]-[10]. However, the basic PSO may not find a desired solution in complex problems. For example, if one particle finds a local optimal solution, the entire swarm may be trapped in it; once the swarm is trapped, it is difficult to escape. This paper presents a novel version of the PSO: PSO with expand-and-reduce ability (ERPSO). When particles are trapped in a local optimal solution, a new particle is added and the particle swarm can grow. In the swarm, particles share the information of added particles. The trapped particle(s) can escape from the trap provided the parameters are selected suitably. Deletion of particle(s) is also introduced in order to suppress excessive swarm growth; without deletion, the swarm grows excessively and causes computational overload. Through basic numerical experiments, the effectiveness of the ERPSO algorithm is verified. It should be noted that growing and/or reducing of swarms has been a key technique in several learning algorithms, including self-organizing maps, and in practical applications [11]. M.
Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 873–881, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Fundamental Version of PSO
As preparation, we introduce the global best version of the PSO with an elementary numerical experiment. The PSO is an optimization algorithm in which particles within the swarm learn from each other and move to become more similar to their "better" neighbors. The social structure of the PSO is determined through the formation of neighborhoods for communication. The global best (gbest) version is the fundamental one, where each particle can communicate with every other particle, forming a fully connected social network. The i-th particle at time t is characterized by its position x_i(t), and the position is updated based on its velocity v_i(t). We summarize the gbest version for finding the global minimum of a function F, where the dimensions of F and x_i are the same:
Step 1: Initialization. Let t = 0. The particle positions x_i(t), i = 1, ..., N, are located randomly in the swarm S(t), where N is the number of particles.
Step 2: Compare the value of F of each particle to its best value (pbest_i) so far: if F(x_i(t)) < pbest_i then (a) pbest_i = F(x_i(t)) and (b) x_pbest_i = x_i(t).
Step 3: Compare the value of F of each particle to the global best value: if F(x_i(t)) < gbest then (a) gbest = F(x_i(t)) and (b) x_gbest = x_i(t).
Fig. 1. Search process by the gbest version. The dot denotes a particle. N = 10 and tmax = 50. ρ1 = ρ2 is uniform random number on [0, 2].
Expand-and-Reduce Algorithm of Particle Swarm Optimization
875
Step 4: Change the velocity vector of each particle:

v_i(t) = W v_i(t) + ρ1 (x_pbest_i − x_i(t)) + ρ2 (x_gbest − x_i(t)), (1)

where W is the inertial term defined by

W = W_max − ((W_max − W_min) / t_max) × t. (2)
ρ1 and ρ2 are random variables.
Step 5: Move each particle to a new position: x_i(t) = x_i(t) + v_i(t).
Step 6: Let t = t + 1. Go to Step 2, and repeat until t = t_max.
In order to demonstrate the performance of the PSO, we apply the gbest algorithm to the following elementary problem:

minimize f1(x1, x2) = x1^2 + x2^2,  subject to |x1| ≤ 50, |x2| ≤ 50. (3)
f1 has a unique extremum at the origin (x1, x2) = (0, 0), which is the optimal solution. Fig. 1 illustrates the learning process of the particles in the gbest version. We can see that the entire swarm converges to the optimal solution. The new algorithm is an improvement of this one: the network has a ring structure, the number of particles is time-variant, and the global best value is replaced with the local best value within the closest neighbors. The detailed definition is given in Section 3.
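Steps 1-6 above can be condensed into a short sketch (not the authors' code; N = 10 and t_max = 50 follow the Fig. 1 caption, W_max = 0.9 and W_min = 0.4 follow the later captions, and the boundary handling is simply omitted):

```python
import numpy as np

def gbest_pso(F, dim=2, N=10, t_max=50, w_max=0.9, w_min=0.4, bound=50.0, seed=0):
    """gbest PSO minimizing F, following Steps 1-6."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, (N, dim))      # Step 1: random positions
    v = np.zeros((N, dim))
    pbest = np.apply_along_axis(F, 1, x)          # personal best values
    pbest_x = x.copy()
    g = int(np.argmin(pbest))
    gbest, gbest_x = pbest[g], pbest_x[g].copy()  # global best
    for t in range(t_max):
        w = w_max - (w_max - w_min) / t_max * t   # inertia, Eq. (2)
        r1 = rng.uniform(0, 2, (N, dim))          # rho_1 on [0, 2]
        r2 = rng.uniform(0, 2, (N, dim))          # rho_2 on [0, 2]
        v = w * v + r1 * (pbest_x - x) + r2 * (gbest_x - x)   # Step 4, Eq. (1)
        x = x + v                                 # Step 5
        f = np.apply_along_axis(F, 1, x)
        better = f < pbest                        # Step 2
        pbest[better], pbest_x[better] = f[better], x[better]
        g = int(np.argmin(pbest))                 # Step 3
        if pbest[g] < gbest:
            gbest, gbest_x = pbest[g], pbest_x[g].copy()
    return gbest, gbest_x

best, _ = gbest_pso(lambda p: p[0] ** 2 + p[1] ** 2)   # problem (3)
```

On the simple sphere problem (3), the swarm should approach the origin, as in Fig. 1; the exact trajectory depends on the random seed.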
3 Expand-and-Reduce Algorithm
The fundamental PSO can find the optimal solution in simple problems, as shown in the previous section. However, if the objective function has a complex shape with many local extrema, the entire swarm may be trapped in a local optimal solution. The velocity of each particle is perturbed by the random variables ρ1 and ρ2; however, it is very hard to adjust these parameters for various problems. Here we present a novel type of PSO in which the swarm can be expanded and reduced. In this expand-and-reduce particle swarm optimization (ERPSO), particles can escape from the trap without adjusting the parameters. In the ERPSO, the i-th particle at time t is characterized by its position x_i(t) and a counter c_i(t) that counts the time-invariance of pbest_i. The closest neighbors in the ring structure are used to update particles, and the number of particles N is a variable depending on the values of c_i(t). We define the ERPSO for a swarm with ring structure and for finding the global minimum of a function F.
Step 1: Initialization. Let t = 0; the number of particles N and the counter values c_i(t) are initialized. The positions x_i(t) are located randomly in the swarm.
Fig. 2. Expansion of PSO and escape from a trap
Step 2: Compare the cost of each particle with its best cost so far: if F(x_i(t)) < pbest_i then (a) pbest_i = F(x_i(t)) and (b) x_pbest_i = x_i(t). The counter value increases if the improvement of pbest_i is not sufficient:

c_i(t) = c_i(t) + 1  if |F(x_i(t)) − pbest_i| < ε, (4)

where ε is a small value.
Step 3 (expansion): If a particle is trapped in a local or the global optimum, as shown in Fig. 2(a) and (b), c_i becomes large. If the counter value exceeds a threshold, a new particle is inserted as the closest neighbor of the trapped particle and is located away from the trap, as shown in Fig. 2(c) and (d). In the case where the i-th particle is trapped,

x_new(t) = x_i(t) + r  if c_i(t) > T_int, (5)
where r is a random variable and the indices are reassigned: j = j + 1 for j > i and x_new = x_{i+1}. Let N = N + 1, and let the counters be initialized, c_i(t) = 0 for all i. This expansion can be effective for escaping from the trap.
Step 4 (reduction): If the value of gbest(t) does not change during some time interval after a new particle is inserted, one of the trapped particles is removed. The reduction aims at suppressing excessive growth, which causes unnecessary computation time.
Step 5: Compare the cost of each particle with its local best cost lbest_i so far: if F(x_i(t)) < lbest_i then (a) lbest_i = F(x_i(t)) and (b) x_lbest_i = x_i(t), where the lbest is taken over both closest neighbors of the i-th particle.
Step 6: Change the velocity vector of each particle:

v_i(t) = W v_i(t) + ρ1 (x_pbest_i − x_i(t)) + ρ2 (x_lbest_i − x_i(t)), (6)

where W is the inertial term defined by Eq. (2) and ρ1 and ρ2 are random variables.
Fig. 3. Searching for the minimum value of Equation (3). The red line is the solution. (a) lbest version without expand-and-reduce for N = 7. (b) ERPSO, N(0) = 3. Parameter values: T_int = 30, W_max = 0.9, W_min = 0.4, t_max = 750, ε = 10^−3. The random variables ρ1 = ρ2 are given by the uniform distribution on [0, 2]. The time average of N(t) is 7.
Step 7: Move each particle to a new position: xi (t) = xi (t) + vi (t). Step 8: Let t = t + 1. Go to Step 2, and repeat until t = tmax .
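The whole procedure (Steps 1-8) can be condensed into a sketch. Where the text leaves details open, the choices below are assumptions: the insertion offset r, the stagnation test derived from Eq. (4), the reduction interval (taken equal to T_int), and which particle is removed (the most stagnant one).

```python
import numpy as np

def erpso(F, dim=2, n0=3, t_max=750, T_int=30, eps=1e-3,
          w_max=0.9, w_min=0.4, bound=50.0, seed=0):
    """Sketch of the ERPSO: ring-topology lbest PSO with particle insertion
    on stagnation (Eqs. 4-5) and deletion when gbest stalls after an insert."""
    rng = np.random.default_rng(seed)
    x = [rng.uniform(-bound, bound, dim) for _ in range(n0)]
    v = [np.zeros(dim) for _ in range(n0)]
    pbest_x = [p.copy() for p in x]
    pbest = [F(p) for p in x]
    c = [0] * n0                                   # stagnation counters, Step 1
    gbest = min(pbest)
    since_insert, gbest_at_insert = None, gbest
    for t in range(t_max):
        w = w_max - (w_max - w_min) / t_max * t    # Eq. (2)
        for i in range(len(x)):                    # Step 2
            f = F(x[i])
            if f < pbest[i] - eps:                 # sufficient improvement
                pbest[i], pbest_x[i], c[i] = f, x[i].copy(), 0
            else:
                c[i] += 1                          # Eq. (4): stagnation
        gbest = min(gbest, min(pbest))
        i_trap = int(np.argmax(c))
        if c[i_trap] > T_int:                      # Step 3: expansion, Eq. (5)
            xn = x[i_trap] + rng.uniform(-bound, bound, dim) * 0.1  # r: assumed offset
            x.insert(i_trap + 1, xn)
            v.insert(i_trap + 1, np.zeros(dim))
            pbest.insert(i_trap + 1, F(xn))
            pbest_x.insert(i_trap + 1, xn.copy())
            c = [0] * len(x)
            since_insert, gbest_at_insert = 0, gbest
        if since_insert is not None:               # Step 4: reduction
            since_insert += 1
            if gbest < gbest_at_insert - eps:
                since_insert = None                # insertion helped; keep swarm
            elif since_insert > T_int and len(x) > 2:
                j = int(np.argmax(c))              # drop the most stagnant particle
                for lst in (x, v, pbest, pbest_x, c):
                    del lst[j]
                since_insert = None
        n = len(x)
        for i in range(n):                         # Steps 5-7: ring lbest update
            j = min(((i - 1) % n, i, (i + 1) % n), key=lambda k: pbest[k])
            r1, r2 = rng.uniform(0, 2, dim), rng.uniform(0, 2, dim)
            v[i] = w * v[i] + r1 * (pbest_x[i] - x[i]) + r2 * (pbest_x[j] - x[i])
            x[i] = x[i] + v[i]
    return gbest

best = erpso(lambda p: p[0] ** 2 + p[1] ** 2)      # problem (3)
```

With 750 iterations on problem (3) this sketch should settle near the origin; on multi-extrema functions the insert/delete cycle is what lets trapped particles escape.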
4 Numerical Experiments
In order to investigate the efficiency of the ERPSO, we performed basic numerical experiments. First, we applied the ERPSO to the problem defined by Equation (3). The result is shown in Fig. 3, where the lbest version is defined by Steps 1 to 7 without Steps 3 and 4. This result suggests that the ERPSO can find the optimal solution slightly faster than the lbest version. The efficiency of the ERPSO appears more remarkable in more complicated problems.
[Figure 4 panels: (a) the function f2, with the optimal solution and a local optimal solution marked; (b) t = 0; (c) t = 250; (d) t = 500]

Fig. 4. Expand-and-reduce search process in f2(x1, x2). Parameter values are as defined in the Fig. 5 caption.
Fig. 5. Searching for the minimum value of Equation (7). The red line is the solution. (a) lbest version without expand-and-reduce, N = 9, which is the time average of N(t) in the ERPSO. (b) ERPSO, N(0) = 3. Parameter values: T_int = 30, W_max = 0.9, W_min = 0.4, t_max = 750, ε = 10^−3. The random variables ρ1 = ρ2 are given by the uniform distribution on [0, 2].
Second, we applied the ERPSO to a multi-extrema function: the modified Shekel's Foxholes function studied in [12]:

minimize f2(x1, x2) = − Σ_{i=1}^{30} 1 / ((x1 − α_i)^2 + (x2 − β_i)^2 + γ_i),
subject to 0 ≤ x1 ≤ 10, 0 ≤ x2 ≤ 10. (7)
Table 1. Success rate for 1,000 trials. Parameters are as in Fig. 5.

Function  Algorithm  Success rate
f1        PSO        100 %
f1        ERPSO      100 %
f2        PSO        37.5 %
f2        ERPSO      56.2 %
(α1 , · · · , α10 ) = (9.681, 9.400, 8.025, 2.196, 8.074, 7.650, 1.256, 8.314, 0.226, 7.305) (α11 , · · · , α20 ) = (0.652, 2.699, 8.327, 2.132, 4.707, 8.304, 8.632, 4.887, 2.440, 6.306) (α21 , · · · , α30 ) = (0.652, 5.558, 3.352, 8.798, 1.460, 0.432, 0.679, 4.263, 9.496, 4.138) (β1 , · · · , β10 ) = (0.667, 2.041, 9.152, 0.415, 8.777, 5.658, 3.605, 2.261, 8.858, 2.228) (β11 , · · · , β20 ) = (7.027, 3.516, 3.897, 7.006, 5.579, 7.559, 4.409, 9.112, 6.686, 8.583) (β21 , · · · , β30 ) = (2.343, 1.272, 7.549, 0.880, 8.057, 8.645, 2.800, 1.074, 4.830, 2.562) (γ1 , · · · , γ10 ) = (0.806, 0.517, 0.100, 0.908, 0.965, 0.669, 0.524, 0.902, 0.531, 0.876) (γ11 , · · · , γ20 ) = (0.462, 0.491, 0.463, 0.714, 0.352, 0.869, 0.813, 0.811, 0.828, 0.964) (γ21 , · · · , γ30 ) = (0.789, 0.360, 0.369, 0.992, 0.332, 0.817, 0.632, 0.883, 0.608, 0.326)
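With the coefficients above, f2 and the reported optimum can be checked numerically (a small sketch):

```python
import numpy as np

# Coefficients of the modified Shekel's Foxholes function, Eq. (7).
alpha = np.array([9.681, 9.400, 8.025, 2.196, 8.074, 7.650, 1.256, 8.314, 0.226, 7.305,
                  0.652, 2.699, 8.327, 2.132, 4.707, 8.304, 8.632, 4.887, 2.440, 6.306,
                  0.652, 5.558, 3.352, 8.798, 1.460, 0.432, 0.679, 4.263, 9.496, 4.138])
beta = np.array([0.667, 2.041, 9.152, 0.415, 8.777, 5.658, 3.605, 2.261, 8.858, 2.228,
                 7.027, 3.516, 3.897, 7.006, 5.579, 7.559, 4.409, 9.112, 6.686, 8.583,
                 2.343, 1.272, 7.549, 0.880, 8.057, 8.645, 2.800, 1.074, 4.830, 2.562])
gamma = np.array([0.806, 0.517, 0.100, 0.908, 0.965, 0.669, 0.524, 0.902, 0.531, 0.876,
                  0.462, 0.491, 0.463, 0.714, 0.352, 0.869, 0.813, 0.811, 0.828, 0.964,
                  0.789, 0.360, 0.369, 0.992, 0.332, 0.817, 0.632, 0.883, 0.608, 0.326])

def f2(x1, x2):
    """Modified Shekel's Foxholes function, Eq. (7)."""
    return -np.sum(1.0 / ((x1 - alpha) ** 2 + (x2 - beta) ** 2 + gamma))

# The reported optimum: f2(8.02, 9.14) ≈ -12.12
```

The dominant term at the optimum is the i = 3 foxhole (α = 8.025, β = 9.152, γ = 0.100), whose small γ makes the pit deep and narrow, which is what makes this function hard for the basic PSO.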
This function has many local optimal solutions, as shown in Fig. 4(a). The optimal solution is f2(x1, x2) = −12.12 at (x1, x2) = (8.02, 9.14). A search process is illustrated in Fig. 5: we can see that particles escape from the trap and move to the optimal solution. We have confirmed that this optimal solution is hard to find by the fundamental gbest version. Fig. 5 and Table 1 compare the results of the ERPSO with those of the lbest version. In these results, the lbest version often fails to find the solution, whereas the ERPSO can find it; in the lbest version, the particles are almost always trapped in some local minimum.
5 Conclusions
A novel version of the PSO is presented in this paper. In the algorithm, the expanding function is effective for escaping from a trap and the reducing function is effective for suppressing computation cost. The efficiency of the algorithm is suggested through basic numerical experiments. This paper is a first step and leaves many open problems, including the following:
1) analysis of the role of the parameters,
2) analysis of the effect of the topology of the particle network,
3) automatic adjustment of the parameters,
4) application to more complicated benchmarks, and
5) application to practical problems.
References
1. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proc. of IEEE/ICNN, pp. 1942–1948 (1995)
2. Engelbrecht, A.P.: Computational Intelligence: An Introduction, pp. 185–198. Wiley, Chichester (2004)
3. Richer, T.J., Blackwell, T.M.: The Levy Particle Swarm. In: Proc. Congr. Evol. Comput., pp. 3150–3157 (2006)
4. Parrott, D., Li, X.: Locating and tracking multiple dynamic optima by a particle swarm model using speciation. IEEE Trans. Evol. Comput. 10(4), 440–458 (2006)
5. Brits, R., Engelbrecht, A.P., van den Bergh, F.: A Niching Particle Swarm Optimizer. In: Proc. of SEAL, vol. 1079 (2002)
6. Hu, X., Eberhart, R.C.: Adaptive Particle Swarm Optimization: Detection and Response to Dynamic Systems. In: Proc. of IEEE/CEC, pp. 1666–1670 (2002)
7. Franken, N., Engelbrecht, A.P.: Particle swarm optimization approaches to coevolve strategies for the Iterated Prisoner's Dilemma. IEEE Trans. Evol. Comput. 9(6), 562–579 (2005)
8. Neethling, M., Engelbrecht, A.P.: Determining RNA secondary structure using set-based particle swarm optimization. In: Proc. Congr. Evol. Comput., pp. 6134–6141 (2006)
9. Jatmiko, W., Sekiyama, K., Fukuda, T.: A PSO-based mobile sensor network for odor source localization in dynamic environment: theory, simulation and measurement. In: Proc. Congr. Evol. Comput., pp. 3781–3788 (2006)
10. Tong, G., Fang, Z., Xu, X.: A particle swarm optimized particle filter for nonlinear system state estimation. In: Proc. Congr. Evol. Comput., pp. 1545–1549 (2006)
11. Oshime, T., Saito, T., Torikai, H.: ART-based parallel learning of growing SOMs and its application to TSP. In: King, I., Wang, J., Chan, L.-W., Wang, D. (eds.) ICONIP 2006. LNCS, vol. 4232, pp. 1004–1011. Springer, Heidelberg (2006)
12. Bersini, H., Dorigo, M., Langerman, S., Geront, G., Gambardella, L.: Results of the first international contest on evolutionary optimisation (1st ICEO). In: Proc. of IEEE/ICEC, pp. 611–615 (1996)
Nonlinear Pattern Identification by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture
Tadashi Kondo
School of Health Sciences, The University of Tokushima, 3-18-15 Kuramoto-cho, Tokushima 770-8509, Japan
[email protected]
Abstract. A revised Group Method of Data Handling (GMDH)-type neural network is applied to nonlinear pattern identification. The GMDH-type neural network combines characteristics of the GMDH and of the conventional multilayered neural network trained by the back-propagation algorithm, and can automatically organize the optimum neural network architecture using the heuristic self-organization method. In the GMDH-type neural network, many types of neurons, described by functions such as the sigmoid function, the radial basis function, high-order polynomials and the linear function, can be used to organize the network architecture, and the neuron characteristics that fit the complexity of the nonlinear system are selected automatically so as to minimize an error criterion defined as Akaike's Information Criterion (AIC) or the Prediction Sum of Squares (PSS). In this paper, the revised GMDH-type neural network is applied to the identification of a nonlinear pattern, and it is shown to be a useful method for this task. Keywords: Neural Network, GMDH, Nonlinear pattern identification.
1 Introduction
The multi-layered GMDH-type neural networks, which combine characteristics of the GMDH [1],[2] and of the conventional multi-layered neural network, have been proposed in our early works [3],[4]. The multi-layered GMDH-type neural network can automatically organize multi-layered neural network architectures using the heuristic self-organization method, and organizes optimum high-order polynomial architectures fitted to the characteristics of the nonlinear complex system. The GMDH-type neural network has several advantages over the conventional multilayered neural network. It can self-select useful input variables: useless input variables are eliminated and useful input variables are selected automatically. It can also self-select the number of layers and the number of neurons in each layer. These structural parameters are automatically determined so as to minimize an error criterion defined as Akaike's Information Criterion (AIC) [5] or the Prediction Sum of Squares (PSS) [6], and the optimum neural network architectures can be organized automatically. Because of this feature, it is very easy to apply this algorithm to the identification problems of practical complex systems. In this paper, the revised GMDH-type neural network is applied to the identification problem of a nonlinear pattern. It is shown that the revised GMDH-type neural network can be applied easily and that it is a useful method for the identification of the nonlinear system.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 882–891, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Heuristic Self-organization Method [1],[2]
The architecture of the GMDH-type neural network is organized automatically by the heuristic self-organization method, which is the basic theory of the GMDH algorithm. The heuristic self-organization method in the GMDH-type neural networks is implemented through the following five procedures:
Separating the Original Data into Training and Test Sets. The original data are separated into training and test sets. The training data are used for estimating the weights of the neural network; the test data are used for organizing the network architecture.
Generating the Combinations of the Input Variables in Each Layer. All combinations of r input variables are generated in each layer. The number of combinations is p!/((p−r)!r!), where p is the number of input variables and r is usually set to two.
Selecting the Optimum Neuron Architectures. For each combination, the optimum neuron architectures describing the partial characteristics of the nonlinear system are calculated by applying regression analysis to the training data. The output variables (yk) of the optimum neurons are called intermediate variables.
Selecting the Intermediate Variables. The L intermediate variables giving the L smallest test errors, calculated using the test data, are selected from the generated intermediate variables (yk).
Stopping the Multilayered Iterative Computation. The L selected intermediate variables become the input variables of the next layer, and the same computation is continued. When the errors on the test data stop decreasing from layer to layer, the iterative computation is terminated. The complete neural network describing the characteristics of the nonlinear system is then constructed from the optimum neurons generated in each layer.
The heuristic self-organization method plays a very important role in the organization of the GMDH-type neural network.
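The combination-generation step above (the second procedure) can be sketched as follows. This is an illustrative snippet, not code from the paper; the value p = 3 is only an example (it matches the three co-ordinate inputs used later in the paper).

```python
from itertools import combinations
from math import comb

# Procedure 2: with p input variables and r = 2, each layer evaluates
# p!/((p-r)!r!) candidate neurons, one per pair of input variables.
p, r = 3, 2                      # example values; r = 2 as in the paper
pairs = list(combinations(range(p), r))
assert len(pairs) == comb(p, r)  # p!/((p-r)!r!) = 3 pairs for p = 3
print(pairs)                     # [(0, 1), (0, 2), (1, 2)]
```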
3 Revised GMDH-Type Neural Network Algorithm
The revised GMDH-type neural network has a common feedforward multilayered architecture. Figure 1 shows the architecture of the revised GMDH-type neural network. This neural network is organized using the heuristic self-organization method.
Fig. 1. Architecture of the revised GMDH-type neural network
Procedures for determining the architecture of the revised GMDH-type neural network conform to the following:
3.1 First Layer
uj = xj   (j = 1, 2, …, p)   (1)
where xj (j=1,2,…,p) are the input variables of the nonlinear system and p is the number of input variables. In the first layer, the input variables are set to the output variables.
3.2 Second Layer
All combinations of r input variables are generated. For each combination, the optimum neuron architecture is automatically selected from the following two neurons. The architectures of the first and second type neurons are shown in Fig. 2; the optimum neuron architecture for each combination is selected from these two.
Fig. 2. Neuron architectures of two type neurons: (a) with two inputs; (b) with r inputs
The revised GMDH-type neural network algorithm proposed in this paper can select the optimum neural network architecture from three neural network architectures: the sigmoid function neural network, the RBF neural network and the polynomial neural network. The neuron architectures of the first and second type neurons in each neural network architecture are as follows.
Sigmoid Function Neural Network
The first type neuron
Σ: (Nonlinear function)
zk = w1ui + w2uj + w3uiuj + w4ui² + w5uj² + w6ui³ + w7ui²uj + w8uiuj² + w9uj³ − w0θ1   (2)
f: (Nonlinear function)
yk = 1 / (1 + e^(−zk))   (3)
Here, θ1 = 1 and wi (i=0,1,2,…,9) are the weights between the first and second layers. The value of r, the number of input variables u in each neuron, is set to two for the first type neuron.
The second type neuron
Σ: (Linear function)
zk = w1u1 + w2u2 + w3u3 + ··· + wrur − w0θ1   (2 < r < p)   (4)
f: (Nonlinear function)
yk = 1 / (1 + e^(−zk))   (5)
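As a concrete illustration, Eqs. (2)-(3) for the first type sigmoid neuron can be written out as follows. This is a sketch only; the weight vector here is a hypothetical example, since in the algorithm the weights are estimated by stepwise regression as described below.

```python
import math

def first_type_sigmoid_neuron(ui, uj, w, theta1=1.0):
    # Eq. (2): third-order polynomial combiner of the two inputs,
    # with weights w = (w0, w1, ..., w9) and bias term w0 * theta1
    zk = (w[1]*ui + w[2]*uj + w[3]*ui*uj + w[4]*ui**2 + w[5]*uj**2
          + w[6]*ui**3 + w[7]*ui**2*uj + w[8]*ui*uj**2 + w[9]*uj**3
          - w[0]*theta1)
    # Eq. (3): sigmoid output function
    return 1.0 / (1.0 + math.exp(-zk))

w = [0.0] * 10   # hypothetical weights; all zero gives zk = 0
print(first_type_sigmoid_neuron(0.3, 0.7, w))   # 0.5
```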
Here, θ1 = 1 and wi (i=0,1,2,…,r) are the weights between the first and second layers. The value of r, the number of input variables u in each neuron, is set to be greater than two and smaller than p for the second type neuron, where p is the number of input variables xi (i=1,2,…,p).
Radial Basis Function Neural Network
The first type neuron
Σ: (Nonlinear function)
zk = w1ui + w2uj + w3uiuj + w4ui² + w5uj² + w6ui³ + w7ui²uj + w8uiuj² + w9uj³ − w0θ1   (6)
f: (Nonlinear function)
yk = e^(−zk²)   (7)
Here, θ1 = 1 and wi (i=0,1,2,…,9) are the weights between the first and second layers; r is set to two for the first type neuron.
The second type neuron
Σ: (Linear function)
zk = w1u1 + w2u2 + w3u3 + ··· + wrur − w0θ1   (2 < r < p)   (8)
f: (Nonlinear function)
yk = e^(−zk²)   (9)
Here, θ1 = 1 and wi (i=0,1,2,…,r) are the weights between the first and second layers; r is set to be greater than two and smaller than p for the second type neuron, where p is the number of input variables xi (i=1,2,…,p).
Polynomial Neural Network
The first type neuron
Σ: (Nonlinear function)
zk = w1ui + w2uj + w3uiuj + w4ui² + w5uj² + w6ui³ + w7ui²uj + w8uiuj² + w9uj³ − w0θ1   (10)
f: (Linear function)
yk = zk   (11)
Here, θ1 = 1 and wi (i=0,1,2,…,9) are the weights between the first and second layers; r is set to two for the first type neuron.
The second type neuron
Σ: (Linear function)
zk = w1u1 + w2u2 + w3u3 + ··· + wrur − w0θ1   (2 < r < p)   (12)
f: (Linear function)
yk = zk   (13)
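The three candidate output functions f (Eqs. (3)/(5), (7)/(9) and (11)/(13)) can be collected in one table. This is a sketch, with the RBF output written as e^(−zk²) following Eqs. (7) and (9); the dictionary layout is only an illustration of the choice the algorithm makes between the three network types.

```python
import math

# The three candidate output functions f; the algorithm later keeps the
# network type whose L best neurons give the smallest mean test error.
ACTIVATIONS = {
    "sigmoid":    lambda z: 1.0 / (1.0 + math.exp(-z)),  # Eqs. (3), (5)
    "rbf":        lambda z: math.exp(-z**2),             # Eqs. (7), (9)
    "polynomial": lambda z: z,                           # Eqs. (11), (13)
}
print(ACTIVATIONS["sigmoid"](0.0),
      ACTIVATIONS["rbf"](0.0),
      ACTIVATIONS["polynomial"](2.5))   # 0.5 1.0 2.5
```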
Here, θ1 = 1 and wi (i=0,1,2,…,r) are the weights between the first and second layers; r is set to be greater than two and smaller than p for the second type neuron, where p is the number of input variables xi (i=1,2,…,p).
The weights wi (i=0,1,2,…) in each neural network architecture are estimated by stepwise regression analysis [7] using AIC.
Estimation Procedure of Weight wi. First, the values of zk are calculated for each neural network architecture as follows.
Sigmoid function neural network:
zk = loge( φ' / (1 − φ') )   (14)
RBF neural network:
zk = √(−loge φ')   (15)
Polynomial neural network:
zk = φ   (16)
where φ' is the normalized output variable, whose values lie between zero and one, and φ is the output variable. The weights wi are then estimated by stepwise regression analysis [7], which selects useful input variables using AIC. Only the useful variables in Eq.(2), Eq.(4), Eq.(6), Eq.(8), Eq.(10) and Eq.(12) are selected by stepwise regression analysis using AIC, and the optimum neuron architectures are organized from the selected variables. AIC [5] is described for the first type neuron by the following equations:
AIC = n loge Sm² + 2(m+1) + C   (17)
Sm² = (1/n) Σα=1..n (φα − zα)²   (18)
zα = w1ui + w2uj + w3uiuj + w4ui² + w5uj² + w6ui³ + w7ui²uj + w8uiuj² + w9uj³ − w0θ1,  α = 1, 2, …, n   (19)
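Eqs. (17)-(18) amount to the following computation. This is a sketch only, with the arbitrary constant C taken as zero by default.

```python
import math

def aic(phi, z, m, C=0.0):
    # Eq. (18): mean squared error of the neuron over the n training data
    n = len(phi)
    Sm2 = sum((p - q) ** 2 for p, q in zip(phi, z)) / n
    # Eq. (17): n*log(Sm^2) penalized by the number of selected terms m
    return n * math.log(Sm2) + 2 * (m + 1) + C

# Example: residuals of +/-1 give Sm^2 = 1, so AIC = 2*(m+1)
print(aic([0.0, 2.0], [1.0, 1.0], m=1))   # 4.0
```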
Here, m is the number of terms in Eq.(19), n is the number of training data and C is a constant. For each combination, three neuron architectures (sigmoid function neuron, RBF neuron and polynomial neuron) are generated, and for each architecture the L neurons that minimize the test error calculated on the test data are selected. From the L selected neurons of each architecture, the mean test error of the L neurons is calculated. The neural network architecture with the minimum mean test error is then selected as the revised GMDH-type neural network architecture from the three candidates: the sigmoid function neural network, the RBF neural network and the polynomial neural network. After the type of the revised GMDH-type neural network architecture is selected, the output variables yk of the L selected neurons are set to the input variables of the neurons in the third layer.
3.3 Third and Successive Layers
In the second layer, the optimum neural network architecture is selected from the three candidate architectures. In the third and successive layers, only the selected neuron architecture (sigmoid function neuron, RBF neuron or polynomial neuron) is used, and the calculation of the second layer is iterated until the AIC values of the L neurons with the selected architecture stop decreasing. When the iterative calculation is terminated, the neural network architecture is produced from the L neurons selected in each layer. By these procedures, the revised GMDH-type neural network self-selecting the optimum neural network architecture is organized.
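The layer-by-layer selection of Secs. 3.2-3.3 can be outlined as below. This is a structural sketch only: `make_candidates` and `test_error` are hypothetical stand-ins for the neuron generation and the test-data (or AIC) evaluation, which are not specified here.

```python
def organize_network(make_candidates, test_error, inputs, L=3, max_layers=20):
    """Keep the L best neurons per layer; stop once the mean error of the
    L selected neurons stops decreasing (the paper's stopping rule applies
    AIC or the test error in the same way)."""
    layers, best = [], float("inf")
    for _ in range(max_layers):
        cands = sorted(make_candidates(inputs), key=test_error)
        selected = cands[:L]
        mean_err = sum(map(test_error, selected)) / len(selected)
        if mean_err >= best:       # criterion stopped decreasing: terminate
            break
        best = mean_err
        layers.append(selected)
        inputs = selected          # selected outputs feed the next layer
    return layers

# Toy check: if every layer halves the error, all max_layers layers are kept
layers = organize_network(lambda ins: [x * 0.5 for x in ins],
                          lambda x: x, [1.0], L=1, max_layers=5)
print(len(layers))   # 5
```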
4 An Application to Nonlinear Pattern Identification
The GMDH-type neural network is applied to nonlinear pattern identification, and the identification results are compared with those obtained by the GMDH algorithm and by a conventional neural network trained with the back-propagation algorithm. Figure 3 shows the pattern identified using the GMDH-type neural network. In this figure, the values of the white marks are 0.1 and the values of the black marks are 0.9. The GMDH-type neural network is identified using the data of the eight points at the corners, and the prediction errors are calculated using the data of the eight other points located near the corners.
Fig. 3. Pattern identified using the GMDH-type neural network
4.1 Identification Results Obtained Using the GMDH-Type Neural Network
Input Variables. The x, y and z co-ordinates are used as the input variables of the GMDH-type neural network. The number of input variables is three.
Number of Selected Neurons. Three neurons are selected in each layer. The optimum neuron architectures are selected automatically using the prediction error criterion defined as AIC.
Architecture of the Neural Network. The sigmoid function neural network architecture was selected from the three kinds of neural network architectures in the second-layer calculation. The calculation of the GMDH-type neural network was terminated in the tenth layer, so the number of layers was eleven. Figure 4(c) shows the variation of the AIC values.
Estimation Accuracy. The estimation accuracy was evaluated using the following equation:
J1 = (1/8) Σi=1..8 |φi − φi*|   (20)
where φi (i=1,2,…,8) were the actual values at the estimation points and φi* (i=1,2,…,8) were the values of φi estimated by the GMDH-type neural network. Figure 4(a) shows the variation of J1 in each layer. From this figure, we can see that the values of J1 decreased gradually and converged at the tenth layer. Table 1 shows the estimation errors obtained by the GMDH-type neural network. The maximum estimation error is -0.0020, so the estimation errors are very small and the GMDH-type neural network is very accurate.
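Eq. (20) (and likewise Eq. (21) below) is a mean absolute error over eight points; it can be sketched as follows, with hypothetical values in the example.

```python
def mean_abs_error(actual, estimated):
    # Eqs. (20)/(21): J = (1/8) * sum |phi_i - phi_i*| over eight points
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

# Hypothetical example: two points, one estimated to within 0.002
print(mean_abs_error([0.1, 0.9], [0.1, 0.898]))
```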
Fig. 4. Variation of the mean errors (J1 and J2) and AIC values: (a) mean error J1; (b) mean error J2; (c) AIC values
Table 1. Errors of the GMDH-type neural network and the GMDH

Estimation
Data No.  Actual value  GMDH-NN (Error)  GMDH (Error)
1         0.1            0.0000          -0.0069
2         0.9           -0.0020          -0.0127
3         0.1            0.0000          -0.0067
4         0.9            0.0012           0.0842
5         0.1            0.0002           0.0513
6         0.1            0.0000          -0.0069
7         0.9            0.0007          -0.0955
8         0.1            0.0000          -0.0067
Maximum Error            -0.0020         -0.0955
J1 (Mean Error)           0.0005          0.0339

Prediction
Data No.  Actual value  GMDH-NN (Error)  GMDH (Error)
9         0.9           -0.0069          -0.1533
10        0.9            0.0011           0.0832
11        0.1            0.0001           0.0212
12        0.9           -0.0057          -0.2594
13        0.1            0.0000          -0.0068
14        0.1            0.0000          -0.0068
15        0.1            0.0000          -0.0059
16        0.1            0.0000          -0.0069
Maximum Error            -0.0069         -0.2595
J2 (Mean Error)           0.0017          0.0679
Prediction Accuracy. The prediction accuracy was evaluated using the following equation:
J2 = (1/8) Σi=9..16 |φi − φi*|   (21)
where φi (i=9,10,…,16) were the actual values at the prediction points and φi* (i=9,10,…,16) were the values of φi predicted by the GMDH-type neural network. Figure 4(b) shows the variation of J2 in each layer. From this figure, we can see that the values of J2 decreased gradually and converged at the tenth layer. The prediction errors obtained by the GMDH-type neural network are shown in Table 1. The maximum prediction error is -0.0069, so the prediction errors are very small and the GMDH-type neural network has good generalization ability.
4.2 Identification Results Obtained Using the GMDH
The same input variables as for the GMDH-type neural network, namely the x, y and z co-ordinates, are used; the number of input variables is three. Three intermediate variables are selected in each layer. The optimum partial polynomials are automatically selected using the prediction error criterion defined as AIC. The calculation of the GMDH was terminated in the tenth layer. The estimation accuracy was evaluated by Eq.(20); the variation of J1 in each layer is shown in Fig. 4(a), and the estimation errors obtained using the GMDH are shown in Table 1. The prediction accuracy was evaluated by Eq.(21); the variation of J2 in each layer is shown in Fig. 4(b), and the prediction errors obtained by the GMDH are shown in Table 1. The estimation and prediction errors of the GMDH are greater than those of the GMDH-type neural network.
4.3 Identification Results Obtained Using the Conventional Neural Network
The nonlinear pattern was also identified using a conventional neural network. The x, y and z co-ordinates are used as the input variables of the conventional neural
network. The number of input variables is three. The neural network has a three-layered architecture. The conventional neural network does not have the ability to self-select the optimum neural network architecture, so the number of neurons in the hidden layer was set to 5, 10, 15 and 20, and the identification accuracy was checked for each architecture. The weights of the neural network were learned using the back-propagation algorithm; the learning of the weights was repeated 10,000 times for each point. The estimation accuracy was evaluated using Eq.(20) and the prediction accuracy using Eq.(21); both are shown in Table 2.
Table 2. Estimation and prediction accuracy of the conventional neural network
Mean Error   Number of neurons in hidden layer
             5        10       15       20
J1           0.0118   0.0118   0.0118   0.0133
J2           0.0358   0.0385   0.0435   0.0444
4.4 Comparison of the Identification Results
For both the GMDH-type neural network and the GMDH algorithm, the estimation and prediction errors decreased gradually and converged at the tenth layer. Table 3 compares the estimation and prediction accuracy of the GMDH-type neural network, the GMDH and the conventional neural network. The GMDH-type neural network is the most accurate of the three networks in both estimation and prediction accuracy, and its maximum estimation and prediction errors are very small compared with those of the GMDH and the conventional neural network. We can therefore see that the GMDH-type neural network is a very accurate identification method with good generalization ability.
Table 3. Comparison of the GMDH-type neural network, the GMDH and the conventional neural network
Error                         GMDH-NN   GMDH      NN
Estimation  J1 (Mean Error)    0.0005    0.0339    0.0118
            Maximum Error     -0.0020   -0.0955   -0.0258
Prediction  J2 (Mean Error)    0.0017    0.0679    0.0358
            Maximum Error     -0.0069   -0.2595   -0.0751
5 Conclusion
The GMDH-type neural network can automatically organize the optimal neural network architecture using the heuristic self-organization method. The optimum neural network architecture is organized automatically from neurons whose architectures are selected from three kinds of neuron architectures so as to minimize AIC. Therefore, it is very easy to apply this algorithm to the identification problems of practical complex systems.
In this paper, the GMDH-type neural network was applied to the identification problem of a nonlinear pattern and compared with the GMDH and the conventional neural network. It was shown that the GMDH-type neural network is accurate and very useful for nonlinear pattern identification.
References
1. Farlow, S.J. (ed.): Self-organizing Methods in Modeling, GMDH-type Algorithms. Marcel Dekker, New York (1984)
2. Ivakhnenko, A.G.: Heuristic self-organization in problems of engineering cybernetics. Automatica 6(2), 207–219 (1970)
3. Kondo, T., Pandya, A.S., Zurada, J.M.: GMDH-type Neural Networks and their Application to the Medical Image Recognition of the Lungs. In: Proceedings of the 38th SICE Annual Conference International Session Papers, pp. 1181–1186 (1999)
4. Kondo, T., Pandya, A.S.: GMDH-type Neural Networks with Radial Basis Functions and their Application to Medical Image Recognition of the Brain. In: Proceedings of the 39th SICE Annual Conference International Session Papers, vol. 331A-2, pp. 1–6 (2000)
5. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Automatic Control AC-19(6), 716–723 (1974)
6. Tamura, H., Kondo, T.: Heuristics free group method of data handling algorithm of generating optimal partial polynomials with application to air pollution prediction. Int. J. Systems Sci. 11(9), 1095–1111 (1980)
7. Draper, N.R., Smith, H.: Applied Regression Analysis. John Wiley and Sons, New York (1981)
Coordinated Control of Reaching and Grasping During Prehension Movement Masazumi Katayama and Hirokazu Katayama Department of Human and Artificial Intelligent Systems, University of Fukui, Japan [email protected]
Abstract. In this paper, we investigate the coordinated control of human reaching and grasping during prehension movements. The interaction between reaching and grasping has been investigated using visual perturbations that unexpectedly change the size or position of an object during the movement. Those studies have reported that reaching and grasping interact with each other. However, the reported interaction may include properties of the visual information processing of the physical properties of the target object. The interaction should therefore be examined without visual perturbations, because the detail of the interaction between reaching and grasping is still unclear. From this point of view, the influence from reaching to grasping has been directly investigated using mechanical perturbations, whereas the influence from grasping to reaching has never been examined. In this study, we examined how grasping affects reaching by giving mechanical perturbations to the finger tips during the movement. As a result, we found that humans adjust the speed of reaching according to the changes in grasping caused by the mechanical perturbations. Moreover, the movement times of grasping and reaching are highly correlated even when grasping is perturbed. Consequently, we confirmed a temporally-coordinated adjustment from grasping to reaching during the prehension movement. This result indicates that the previously proposed control schemes cannot explain our findings, which is quite important for building a computational model of the coordinated control of human reaching and grasping. Keywords: Human Coordinated Control, Reaching, Grasping, Prehension Movement.
1 Introduction
We skillfully manipulate various tools with our own arm and hand in daily life. In arm and hand movements, the prehension movement of reaching out a hand to an object plays an important role in achieving skillful object manipulation. From this point of view, many researchers have investigated the coordinated control mechanisms between reaching and grasping during the movement. For example, Jeannerod [1] observed, from the viewpoint of behavioral psychology, the phenomenon of "preshaping": we gradually generate a hand shape for grasping before our hand reaches a target object. From this observation, Arbib [2] proposed a control scheme which explains preshaping. The scheme makes the basic assumptions that the control of reaching and the control of grasping are executed independently and in parallel, and that a synchronization signal to start grasping an object is sent from the control system of reaching to the control system of grasping. The biological plausibility of the scheme has been investigated by abruptly changing the position or size of an object, that is, by a visual perturbation (e.g., [3,4,5,6,7,8,9]). For instance, Paulignan et al. [3] and Gentilucci et al. [4] examined the influence from reaching to grasping by unexpectedly changing the object position. They reported that the change drastically affects not only reaching but also grasping during the movement, and that grasping changes temporally and spatially depending on changes in reaching. On the other hand, Paulignan et al. [5] examined the influence from grasping to reaching by unexpectedly changing the object size during the movement, and reported that the unpredictable perturbation affects reaching. As a consequence, in the prehension movement, reaching strongly affects grasping and grasping also slightly affects reaching. Therefore, although Arbib's scheme seems to be plausible, strictly speaking, these results reject Arbib's hypothesis. However, some researchers have reported that the influence from grasping to reaching during the movement is negligibly small. Thus, the influence from grasping to reaching differs considerably between measurement conditions. These complicated results indicate that it is really difficult, using visual perturbations, to generate pure perturbations that directly affect one control mechanism without affecting the other.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 892–901, 2008. © Springer-Verlag Berlin Heidelberg 2008
This is because these results include the influence of the visual information processing of the object size that is unexpectedly changed during the movement. Thus, the studies with visual perturbations did not directly investigate the coordinated control mechanism of grasping and reaching during the movement, and the detail of the influence remains unclear. Therefore, we emphasize that the coordinated control mechanism should be investigated directly, without visual manipulation. From the above points of view, Haggard and Wing [10] examined the influence from reaching to grasping by giving mechanical perturbations to the subject's arm during the prehension movement. They reported that pull-perturbations produce temporal reversals of grip aperture. These results clearly show that there is a temporal and spatial influence from reaching to grasping. However, the influence from grasping to reaching has never been investigated using physical perturbations. Therefore, in this study, we directly investigate this influence by giving mechanical perturbations to the finger tips during the movement.
2 Measurement Experiments
Five right-handed subjects (males, 22-24 years old) participated in the present experiments. They were completely naive with regard to the specific purpose of the experiments.
Fig. 1. Experimental environment
Fig. 2. Each position of mechanical perturbations given to the finger tips
Before the measurement experiments, we informed the subjects that unpredictable perturbations would be given to both finger tips during the movement, and they gave informed consent to the experiment. In order to give mechanical perturbations to the finger tips of the thumb and index finger during the prehension movement, we built a measurement system for examining the influence from grasping to reaching (see Fig. 1). This system consists of two haptic interface devices (PHANToM 151AG, SensAble Technologies Inc.) and a three-dimensional motion measurement device (OPTOTRAK 3020, Northern Digital Inc.). The movement task in this study was to reach toward an object, grasp it and lift it up. The movement distance was 30 [cm], as shown in Fig. 2. The target object was a cylinder placed 30 [cm] from the start position; its height was 10 [cm] and its diameter 3.3 [cm]. Each subject was seated in a dark room. Each gimbal of the two haptic interface devices was attached to a finger tip, and an infrared marker of the motion measurement device was attached to the subject's wrist joint. The haptic interface devices were used to give mechanical perturbations to the finger tips during the prehension movement and also to measure the movement trajectories of the finger tips at a sampling frequency of 1 [kHz]. The motion measurement device was used to measure wrist-joint trajectories at 500 [Hz]. The mechanical perturbations were selected, by testing various amplitudes and directions of external force, so as to reduce any direct influence of the perturbation on the reaching movement. The amplitude was 2 [N], the duration was 0.1 [sec], and the perturbations were given at a position of 10, 15, 20 or 25 [cm] from the start position, at random and at a rate of 30 [%] of the total trials.
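The randomized perturbation schedule described above can be sketched as follows. The function name and the use of `random.Random` are illustrative assumptions, not details from the paper.

```python
import random

def make_schedule(n_trials=400, p_perturb=0.30, seed=0):
    """Each trial is either unperturbed (None) or a (position [cm],
    direction) pair drawn at random, at a rate of about 30% of trials."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_trials):
        if rng.random() < p_perturb:
            schedule.append((rng.choice([10, 15, 20, 25]),   # position [cm]
                             rng.choice(["open", "close"])))  # direction
        else:
            schedule.append(None)                             # unperturbed
    return schedule

trials = make_schedule()
print(len(trials), sum(t is not None for t in trials))
```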
Moreover, the perturbations were of two types, opening or closing the grip aperture of the finger tips. In this experiment, 400 trials were measured, and perturbations were given in 120 of the trials. The prehension movements were measured under two conditions:
Normal condition. Subjects perform the above task.
Without grasping. Subjects only reach to the position of the object and do not grasp it. The objective of this condition is to examine whether the perturbations directly affect reaching.
Each subject repeated the task for ten minutes before the measurement experiment, so as to execute the movements with an almost constant movement time. The measurement procedure was as follows:
1. The right hand of the subject was placed at the start position in front of the body.
2. After a beep, the subject started the movement task. After the object was lifted up, the subject maintained his arm posture.
3. After a beep, the right hand was returned to the start position.
3 Results
3.1 Movement Time of Grasping and Reaching
Measured data were filtered with a second-order Butterworth dual-pass filter (cutoff frequency: 15 [Hz]). A few data points outside ±3σ and the few trials in which the subject failed to grasp the object were excluded from the analysis below. We show typical results or the average over all subjects, because the results of all subjects have similar characteristics. Figs. 3(a) and 3(b) show the paths of the finger tips of the thumb and index finger and the paths of the wrist joint, respectively. Figs. 3(c) and 3(d) show the grip aperture, i.e., the distance between the finger tips, and the velocity of the wrist-joint trajectory, respectively. In these figures, N denotes unperturbed trials. Ci and Oi denote perturbations that close or open the grip aperture, respectively, and i denotes the perturbation position: 1, 2, 3 and 4 correspond to 10, 15, 20 and 25 [cm], respectively (see Fig. 2). In the analysis below, we defined the movement times of grasping and reaching as follows. The start time of grasping is when the tangential velocity of the center point between the finger tips exceeds a threshold of 0.01 [m/sec], and the finish time is when the rate of change of the grip aperture becomes smaller than 0.05 [m/sec]. For the condition without grasping, the finish time is when the center position of the finger tips reaches the target object. For reaching, the start time is the same as that of grasping, and the finish time is when the wrist-joint velocity becomes smaller than 0.01 [m/sec].
Grasping movements. Each spatial path of the finger tips is drastically perturbed by the external forces, as shown in Fig. 3(a). For the unperturbed case (solid line) of Fig. 3(c), the grip aperture increases gradually from about 0.2
[Figure 3: (a) Paths of the finger tips; (b) Paths of the wrist joint; (c) Grip apertures; (d) Velocity of reaching (wrist-joint)]
Fig. 3. Paths and trajectories of reaching and grasping when perturbing to the finger tips for a typical subject (TA) for the normal condition (solid line: average of unperturbed trails, the other lines: average of perturbed trails at each condition)
[sec] after movement onset and reaches its maximum at about 0.5 [sec]. The grasping movement finishes at about 0.8 [sec]. In the perturbed cases (dotted and dashed lines) of Fig. 3(c), although the grip apertures are drastically increased or decreased by the external forces, the perturbed movements tend to return to the unperturbed profile, except under the Ci conditions. The finish time of grasping for Ci lags behind that of the other conditions, and the grip aperture of C4 in particular changes drastically. The movement times of grasping are shown in Fig. 4(a). The movement times of grasping perturbed by the external forces Oi are almost the same as those of unperturbed grasping. In contrast, the movement times for Ci change in the latter half of the movement, and in particular the movement times of C3 and C4 are, for all subjects, significantly different from those of the control group (N).

Reaching movements. The spatial paths of the wrist joint under all conditions are almost straight and invariant, as shown in Fig. 3(b). Reaching appears unaffected by the perturbations, because these profiles remain almost invariant even when the finger tips are perturbed. These results suggest that Arbib's hypothesis may be plausible. However, as shown in Fig. 3(d), the velocities differ slightly across conditions in the latter half
Coordinated Control of Reaching and Grasping

[Figure 4 appears here: averaged movement times [sec] for (a) grasping and (b) reaching under measurement conditions N, C1, O1, C2, O2, C3, O3, C4, O4, with significance markers.]

Fig. 4. Averaged movement times in the normal condition for all subjects (Dunnett multiple comparison with respect to the control group (N); **: p < 0.01, *: 0.01 ≤ p < 0.05, ns: not significant; vertical bars express the standard deviation of movement times)

Table 1. Averaged movement times in the normal condition for all subjects. Upper values are movement times [sec], middle values are standard deviations [sec], and bottom entries are the results of the Dunnett multiple comparison with respect to the control group N (**: p < 0.01, *: 0.01 ≤ p < 0.05, ns: not significant).

Conditions   N      C1     O1     C2     O2     C3     O3     C4     O4
Grasping     0.811  0.829  0.833  0.847  0.809  0.859  0.815  0.934  0.826
             0.111  0.130  0.123  0.127  0.113  0.118  0.122  0.124  0.127
             --     ns     ns     ns     ns     **     ns     **     ns
Reaching     0.862  0.877  0.889  0.897  0.860  0.904  0.861  0.999  0.876
             0.131  0.146  0.138  0.147  0.127  0.145  0.141  0.153  0.144
             --     ns     ns     ns     ns     *      ns     **     ns
of the movement. The finish time of reaching for Ci lags behind that of the other conditions, and the velocity of C4 in particular changes drastically. Therefore, perturbed grasping may affect reaching during the prehension movement. Fig. 4(b) shows the movement times of reaching for each condition. The movement times of Oi are not significantly different from those of N. Reaching for Ci changes in the latter half of the movement, as shown in Fig. 3(d), and in particular the movement times of C3 and C4 are, for all subjects, significantly different from those of the control group (N).

Relationship between grasping and reaching. As described above, the movement times of grasping and of reaching show similar characteristics. Fig. 5 shows the relationship between the two movement times for all conditions, and Table 2 lists the correlation coefficients between them. The two movement times are highly correlated. These results show that the movement time of reaching is accurately adjusted according to changes in the movement time of grasping. Thus, in human motor control of reaching and grasping, there is a temporal coordination from
[Figure 5 appears here: movement time of reaching [sec] versus movement time of grasping [sec] for (a) a typical subject (TA) and (b) all subjects.]

Fig. 5. Linear relation between the movement times of grasping and reaching

Table 2. Correlation coefficients between the movement times of grasping and reaching

Subjects                  TA    DN    YI    KO    TN    Average
Correlation coefficient   0.86  0.92  0.92  0.82  0.83  0.87
grasping to reaching. This finding is important because previously proposed computational models do not include this adjustment.

3.2 Reaching without Grasping an Object
As described in Section 2, the mechanical perturbations applied to the finger tips might directly cause the changes in the movement times of reaching. In our experiments, the perturbations were chosen so as to reduce any direct influence of the perturbation on the motor behavior of reaching. Moreover, in order to confirm whether this influence is negligibly small, we measured the movement times of reaching without grasping, while applying perturbations to the finger tips under the same conditions as in the above experiments. If the perturbations directly affected the motor behavior of reaching, the movement time of reaching should change even in this condition without grasping. Figs. 6(a) and 6(b) show the grip aperture, i.e., the distance between the two finger tips, and the velocity of the wrist-joint trajectory, respectively. Although the perturbed grip apertures change drastically, the velocities of the wrist joint are almost invariant under all conditions. The movement times for all subjects are shown in Fig. 7. The direct influence is thus negligibly small, because none of the movement times differ significantly across conditions. We therefore ascertained that the mechanical perturbations applied to the finger tips do not directly cause the changes in the movement times of reaching. We emphasize again that the movement-time adjustment of reaching described in the above sections is performed by a coordinated control mechanism for the prehension movement.
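The preprocessing and movement-time definitions used throughout this section can be sketched in code. The following Python snippet is our illustration, not the authors' analysis code; the sampling frequency and array names are assumptions, while the filter specification and thresholds come from the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500.0  # assumed sampling frequency [Hz] (not stated in the paper)

def smooth(signal, cutoff=15.0, order=2):
    """Second-order Butterworth applied forward and backward (dual pass)."""
    b, a = butter(order, cutoff / (FS / 2.0))
    return filtfilt(b, a, signal)

def movement_times(center_vel, aperture_rate, wrist_vel, fs=FS):
    """Return (grasp_start, grasp_end, reach_end) in seconds.

    center_vel:    tangential velocity of the midpoint between the finger tips
    aperture_rate: time derivative of the grip aperture
    wrist_vel:     tangential velocity of the wrist joint
    """
    idx = np.arange(len(center_vel))
    start = np.argmax(center_vel > 0.01)                           # onset: > 0.01 m/s
    after = idx > start
    grasp_end = np.argmax(after & (np.abs(aperture_rate) < 0.05))  # < 0.05 m/s
    reach_end = np.argmax(after & (wrist_vel < 0.01))              # < 0.01 m/s
    return start / fs, grasp_end / fs, reach_end / fs
```

The same three thresholds then define both movement times, so the grasping and reaching durations compared in Figs. 4 and 7 are extracted from a single pass over each trial.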
[Figure 6 appears here: (a) grip apertures [m] and (b) velocity of reaching (wrist joint) [m/sec] over time [sec], under conditions N, C1, O1, C2, O2, C3, O3, C4, O4.]

Fig. 6. Paths and trajectories of reaching and grasping for a typical subject (TA) without grasping an object (solid line: average of unperturbed trials; other lines: average of perturbed trials for each condition)
[Figure 7 appears here: movement times [sec] for (a) grasping and (b) reaching under measurement conditions N, C1, O1, C2, O2, C3, O3, C4, O4; all comparisons ns.]

Fig. 7. Movement times of grasping and reaching for all subjects without grasping an object (Dunnett multiple comparison with respect to the control group (N); ns: not significant; vertical bars express the standard deviation of movement times)

Table 3. Averaged movement times for all subjects without grasping an object (upper values are movement times [sec], middle values are standard deviations [sec], and bottom entries are the results of the Dunnett multiple comparison with respect to the control group (N); ns: not significant).

Conditions   N      C1     O1     C2     O2     C3     O3     C4     O4
Grasping     0.693  0.703  0.657  0.681  0.676  0.703  0.684  0.708  0.688
             0.193  0.187  0.184  0.194  0.200  0.192  0.177  0.203  0.191
             --     ns     ns     ns     ns     ns     ns     ns     ns
Reaching     0.777  0.787  0.744  0.765  0.768  0.792  0.783  0.795  0.785
             0.146  0.134  0.133  0.137  0.155  0.151  0.146  0.169  0.155
             --     ns     ns     ns     ns     ns     ns     ns     ns
4 Discussion

4.1 Coordinated Mechanism between Reaching and Grasping
Haggard et al. [10] examined in detail the influence of reaching on grasping by applying mechanical perturbations to the subject's arm during prehension movements. Their results clearly show that grasping is skillfully adjusted depending on reaching. In this study, by applying mechanical perturbations to the finger tips during the movement, we directly investigated the influence of grasping on reaching. As a result, we found a temporal adjustment: the movement time of reaching is accurately adjusted to fit the time at which grasping finishes, as shown by the high correlation between the two movement times. There are two possible explanations for this temporal adjustment: automatic adjustment, such as a spinal reflex, and voluntary adjustment via a trans-cortical loop. Moreover, the adjustment may be affected by learning through the iterative trials of the movement task. In our experiment, there seems to be no remarkable learning effect, although the variance of the movements at the beginning of the trials is relatively large. However, there were some failed grasping trials only for Ci, whereas there were no failures for Oi. The lengthened movement time of reaching appears to depend on the number of trials in which the subject failed to grasp the object. From this observation, the temporal adjustment may be caused by voluntary adjustment and/or its learning. Thus, it is possible that the human brain plans the movement time of reaching by an internal simulation of the movement of grasping an object. Alternatively, the movement time of reaching may be adjusted automatically. This can be investigated by measuring the interval between the timing of the perturbation and the time at which reaching is adjusted. From this point of view, we would like to investigate the temporal adjustment mechanism further.

4.2 Biological Plausibility of Computational Models
From a computational point of view, several models have been proposed that explain the coordinated control mechanism between reaching and grasping during the prehension movement [10,11,12]. For example, Haggard et al. [10] proposed a simple computational model with automatic adjustment in which reaching and grasping are strongly coupled. Based on Arbib's scheme, Hoff and Arbib [11] built a computational model that calculates the trajectories of reaching and grasping independently; the Hoff-Arbib model explains motor behavior under unpredictable changes in object position and size and under the speed-accuracy trade-off. Bullock et al. proposed a computational model based on the VITE model (see [12]). However, these computational models are not biologically plausible, because they do not include the temporal adjustment by which grasping affects reaching during the movement. Therefore, these computational models should be extended so as to explain the temporal adjustment we found.
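To make the missing ingredient concrete, here is a deliberately minimal sketch (our illustration, not any published model) of the temporal coupling the text calls for: the reaching controller monitors an estimate of when grasping will finish and stretches its own duration whenever that estimate exceeds it. All function and parameter names are hypothetical.

```python
def adjust_online(t_reach_planned, grasp_eta, dt=0.01):
    """Step a clock through the movement; at each control step the reaching
    duration is stretched to cover the current grasp estimated finish time.

    t_reach_planned: originally planned reaching movement time [s]
    grasp_eta:       function t -> current estimate of grasp finish time [s]
    dt:              control step [s]
    """
    t, duration = 0.0, t_reach_planned
    while t < duration:
        duration = max(duration, grasp_eta(t))
        t += dt
    return duration
```

For example, if a perturbation at 0.4 s pushes the estimated grasp finish time from 0.8 s to 1.0 s, reaching planned for 0.85 s is lengthened to 1.0 s, reproducing the observed pattern in which reach duration tracks grasp duration only when grasping is delayed.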
5 Conclusion
In this paper, by applying mechanical perturbations to the finger tips during the prehension movement, we examined the influence of grasping on reaching in order to investigate the coordinated control system of human reaching and grasping movements. We found a temporal adjustment: the movement time of reaching is accurately adjusted to fit the time at which grasping finishes, as shown by the high correlation between the two movement times. This finding is important for building a computational model of the coordinated control of human reaching and grasping.
References

1. Jeannerod, M.: Intersegmental coordination during reaching at natural visual objects. In: Long, J., Baddeley, A. (eds.) Attention and Performance IX, pp. 153–168. Erlbaum, Hillsdale, NJ (1981)
2. Arbib, M., Iberall, T., Lyons, D.: Coordinated control programs for control of the hands. Experimental Brain Research, Supplement 10, 111–129 (1985)
3. Paulignan, Y., MacKenzie, C.L., Marteniuk, R.G., Jeannerod, M.: Selective perturbation of visual input during prehension movements. 1. The effects of changing object position. Experimental Brain Research 83, 502–512 (1991)
4. Gentilucci, M., Chieffi, S., Scarpa, M., Castiello, U.: Temporal coupling between transport and grasp components during prehension movements: Effects of visual perturbation. Behavioural Brain Research 47, 71–82 (1992)
5. Paulignan, Y., Jeannerod, M., MacKenzie, C.L., Marteniuk, R.G.: Selective perturbation of visual input during prehension movements. 2. The effects of changing object size. Experimental Brain Research 87, 407–420 (1991)
6. Gentilucci, M., Castiello, U., Corradini, M.L., Scarpa, M., Umilta, C., Rizzolatti, G.: Influence of different types of grasping on the transport component of prehension movements. Neuropsychologia 29, 361–378 (1991)
7. Chieffi, S., Fogassi, L., Gallese, V., Gentilucci, M.: Prehension movements directed to approaching objects: Influence of stimulus velocity on the transport and the grasp components. Neuropsychologia 30, 877–897 (1992)
8. Marteniuk, R.G., Leavitt, J.L., MacKenzie, C.L., Athenes, S.: Functional relationships between grasp and transport components in a prehension task. Human Movement Science 9, 149–176 (1990)
9. Zaal, F., Bootsma, R., van Wieringen, P.: Coordination in prehension. Information-based coupling of reaching and grasping. Experimental Brain Research 119, 427–435 (1998)
10. Haggard, P., Wing, A.: Coordinated responses following mechanical perturbation of the arm during prehension. Experimental Brain Research 102, 483–494 (1995)
11. Hoff, B., Arbib, M.: Models of trajectory formation and temporal interaction of reach and grasp. Journal of Motor Behavior 25(3), 175–192 (1993)
12. Ulloa, A., Bullock, D.: A neural network simulating human reach-grasp coordination by continuous updating of vector positioning commands. Neural Networks 16, 1141–1160 (2003)
Computer Simulation of Vestibuloocular Reflex Motor Learning Using a Realistic Cerebellar Cortical Neuronal Network Model

Kayichiro Inagaki1, Yutaka Hirata2, Pablo M. Blazquez1, and Stephen M. Highstein1

1 Washington University School of Medicine, Dept. Otolaryngology, 4566 Scott Avenue, St. Louis, MO 63110, USA
2 Chubu University, Dept. Computer Science, 1200 Matsumoto Kasugai, Aichi, Japan
Abstract. The vestibuloocular reflex (VOR) is under adaptive control to stabilize our vision during head movements. It has been suggested that acute VOR motor learning requires long-term depression (LTD) and long-term potentiation (LTP) at the parallel fiber – Purkinje cell synapses in the cerebellar flocculus. We simulated VOR motor learning based upon LTD and LTP using a realistic cerebellar cortical neuronal network model. In this model, LTD and LTP were induced at the parallel fiber – Purkinje cell synapses by a spike-timing-dependent plasticity rule that considers the timing of spike occurrence in the climbing fiber and the parallel fibers innervating the same Purkinje cell. The model successfully reproduced the changes in eye movement and in Purkinje cell simple spike firing modulation during VOR in the dark after low- and high-gain VOR motor learning.

Keywords: VOR motor learning, Spike neuron model, Spike timing dependent plasticity, Long-term depression, Long-term potentiation.
1 Introduction
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 902–912, 2008.
© Springer-Verlag Berlin Heidelberg 2008

The vestibuloocular reflex (VOR) stabilizes the retinal image during head motion by counter-rotating the eyes in the orbit. The VOR is under adaptive control, which maintains compensatory eye movements in situations where growth, aging, injury, etc., may cause changes in oculomotor plant dynamics. The adaptive control of the VOR requires the cerebellar flocculus: inactivation of the flocculus precludes further VOR motor learning [23] and eliminates the short-term memory of the VOR [19]. It is widely accepted that cerebellar long-term depression (LTD), or the combination of LTD and long-term potentiation (LTP), at the parallel fiber – Purkinje cell synapses is the underlying mechanism of VOR motor learning [3],[11]. However, how this synaptic plasticity modifies signal processing in the cerebellar cortical neuronal network to achieve VOR motor learning is still unclear. It has been shown that parallel fiber – Purkinje cell LTD and LTP are induced depending on the input timing of the parallel
fiber and the climbing fiber [25] innervating the same Purkinje cell. Parallel fiber inputs evoke simple spikes (SSs) in Purkinje cells at about a hundred Hertz, whereas climbing fiber input evokes complex spikes (CSs) in Purkinje cells at ultra-low frequency (∼5 Hz). In cerebellar computational theories, CSs code an error signal, the difference between intended and actual movements, and SSs learn to modulate their firing patterns to minimize the error [26]. To date, several computational models have been proposed to simulate VOR motor learning (e.g. [7],[16],[27]). These models are configured with transfer functions that process neural firing rates and do not take spike timing into consideration in describing the signal processing in the VOR neuronal circuitry; thus, they cannot evaluate the roles of spike-timing-dependent LTD and LTP in VOR motor learning. In our previous work [10], we constructed a VOR model in which the cerebellar cortical neuronal network is explicitly described by integrate-and-fire neurons based upon the known anatomy and physiology [11]. Here, we embedded a cerebellar motor learning algorithm in this model based on parallel fiber – Purkinje cell LTD and LTP. Simulations confirmed that the model can successfully reproduce changes in eye movements and Purkinje cell SS firing modulation after VOR motor learning in squirrel monkeys.
2 Model

2.1 Structure
Figure 1A illustrates the structure of the model. The model consists of 8 subsystems, one of which is a cerebellar flocculus cortical neuronal network composed of spiking neurons. The other subsystems are described as transfer functions. G^{ecopy}_{preFL}, G^{visual}_{preFL} and G^{vestib}_{preFL} describe the characteristics of the pre-floccular pathways that process the efference copy, visual, and vestibular signals, respectively. G^{cf}_{preFL} describes the characteristics of another pre-floccular pathway, including the lateral terminal nucleus – inferior olive – Purkinje cell pathway, which processes the visual signal (retinal slip). G_{postFL} describes the characteristics of the post-floccular pathway, which converts Purkinje cell SS outputs into a part of the motor output. G^{vestib}_{nonFL} and G^{visual}_{nonFL} describe the characteristics of the non-floccular vestibular and visual pathways, respectively. The flocculus neuronal network model consists of 20 Purkinje, 900 Golgi, 60 basket, and 10000 granule cells (Figure 1B). Synaptic connections among these cell types are similar to those in the cerebellar cortical model for eye blink conditioning configured by Medina et al. [18]. Namely, each granule cell receives 6 excitatory mossy fiber inputs and 3 inhibitory Golgi cell inputs. Each Golgi cell receives 20 excitatory mossy fiber and 100 granule cell inputs. Each basket cell receives excitatory synaptic inputs from 250 granule cells. Each Purkinje cell receives excitatory inputs from 10000 granule cells and a single climbing fiber, as well as inhibitory inputs from 10 basket cells. In scaling down the cerebellar cortical network to a computationally feasible dimension, it has been suggested that the convergence/divergence ratios are more important than the cell ratios [18]. Consequently, the model is scaled down so as to maintain the convergence/divergence ratios.
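Wiring such a scaled-down network while preserving the convergence ratios listed above can be sketched as follows. This is our illustration, not the authors' code; in particular, the mossy fiber count and the helper names are assumptions (the paper does not give a mossy fiber number).

```python
import random

# Cell counts from the text; the mossy fiber count is an assumed placeholder.
N = {"purkinje": 20, "golgi": 900, "basket": 60, "granule": 10000, "mossy": 500}

# Convergence: (post, pre) -> number of inputs each post cell receives.
CONV = {
    ("granule", "mossy"): 6,    ("granule", "golgi"): 3,
    ("golgi", "mossy"): 20,     ("golgi", "granule"): 100,
    ("basket", "granule"): 250,
    ("purkinje", "granule"): 10000, ("purkinje", "basket"): 10,
}

def wire(rng=random):
    """Return {(post_type, post_id): {pre_type: [pre ids]}} connectivity."""
    net = {}
    for (post, pre), k in CONV.items():
        for j in range(N[post]):
            # Sample k distinct presynaptic partners; if k equals the whole
            # population (Purkinje <- granule), connect to all of them.
            ids = rng.sample(range(N[pre]), k) if k < N[pre] else list(range(N[pre]))
            net.setdefault((post, j), {})[pre] = ids
    return net
```

Each Purkinje cell additionally receives one climbing fiber, which is omitted here because it is a fixed one-to-one connection.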
[Figure 1 appears here, with panels A and B.]

Fig. 1. Structure of the proposed VOR motor learning model. A: entire model structure. B: structure of the cerebellar cortical neuronal network. In A, the VOR model consists of 8 subsystems, each of which represents a different anatomical pathway processing a different modality of signal [9]. G^{ecopy}_{preFL}, G^{visual}_{preFL} and G^{vestib}_{preFL} are pre-floccular subsystems corresponding to transfer functions of the pre-floccular efference copy pathway, visual (retinal slip) pathway and vestibular pathway, respectively. G^{cf}_{preFL} represents a transfer function of the climbing fiber pathway, which transfers the visual signal to Purkinje cell CS activity. G_{postFL} is a transfer function of the post-floccular pathway, which transfers Purkinje cell SS activity to a part of the motor command. G^{visual}_{nonFL} and G^{vestib}_{nonFL} represent transfer functions of the non-floccular visual and vestibular pathways, respectively. In B, the flocculus is explicitly described by a spiking neuron model to evaluate cerebellar motor learning in terms of spike-timing-dependent plasticity. The organization of the flocculus neuronal network and each of its synaptic connections were determined from anatomical and physiological evidence.
2.2 Description of Subsystems
Neurons in the flocculus are described as conductance-based spiking neuron models. The synaptic conductance g_{syn} in the model is calculated as follows:

  \frac{dg_{syn}}{dt} = \sum_{i=1}^{N} S_i \cdot w_i(t) \cdot (1 - g_{syn}) - \frac{g_{syn}}{\tau}    (1)

where N, S_i, w_i and \tau denote the number of pre-synaptic cells, the spike input from the i-th pre-synaptic cell (0 or 1), the synaptic weight, and the synaptic decay time constant, respectively. The synaptic current I_{syn} is computed by:

  I_{syn}(V(t), t) = \bar{g}_{syn} \cdot g_{syn} \cdot (V(t) - PSP_{max})    (2)

where \bar{g}_{syn} and PSP_{max} are, respectively, the scaling coefficient of synaptic strength and the post-synaptic potential constant, whose sign determines whether the synaptic current is excitatory or inhibitory. The synaptic decay time constant and post-synaptic potential for each cell type are the same as those used in our previous report [10]. The membrane potential V is given by:

  \frac{dV}{dt} = -g_l \cdot (V(t) - E_{leak}) - \sum_{syn=1}^{M} I_{syn}(V(t), t)    (3)

where g_l and E_{leak} represent the leak coefficient and the leak potential of the post-synaptic cell, respectively, and M denotes the number of synapse types connected to a post-synaptic cell. Each cell fires a spike immediately when its membrane potential exceeds the threshold set for its cell type [10]; the membrane potential is then reset to 0 after firing. The other subsystems are described as transfer functions modified from our previous VOR system model [9], as in eqs. (4)–(9):

  G^{ecopy}_{preFL}(s) = (\alpha_e s^2 + \beta_e s + \gamma_e) e^{-p_e s}    (4)

  G^{vestib}_{preFL}(s) = (\alpha_h s^2 + \beta_h s + \gamma_h) e^{-p_h s}    (5)

  G^{visual}_{preFL}(s) = (\alpha_r s^2 + \beta_r s + \gamma_r) e^{-p_r s}    (6)

  G_{postFL}(s) = \frac{e^{-q_e s}}{u_e s^2 + v_e s + w_e}    (7)

  G^{vestib}_{nonFL}(s) = \frac{(a_h s^2 + b_h s + c_h) e^{-q_h s}}{u_h s^2 + v_h s + w_h}    (8)

  G^{visual}_{nonFL}(s) = \frac{(a_r s^2 + b_r s + c_r) e^{-q_r s}}{u_r s^2 + v_r s + w_r}    (9)
In each equation, \alpha, \beta and \gamma denote the coefficients of acceleration, velocity and position of eye movement (e), retinal slip (r) and head movement (h); u, v and w are the coefficients of acceleration, velocity and position of eye movement; and q and p are the latencies of Purkinje cell SS firing in response to the input signal to each subsystem. The subsystem for the other pre-floccular pathway, representing the lateral terminal nucleus – inferior olive – Purkinje cell pathway, is given by

  G^{cf}_{preFL}(s) = (\alpha_c s^2 + \beta_c s + \gamma_c) e^{-p_c s}    (10)

where \alpha_c, \beta_c and \gamma_c denote the coefficients of acceleration, velocity and position of retinal slip, and p_c is a time delay constant. The output of this transfer function, cf(t), is saturated by a sigmoid function before innervating a Purkinje cell:

  CF(t) = \frac{1}{1 + \exp(\kappa (cf(t) - \xi))}    (11)
where \kappa and \xi are free parameters. CF(t) is integrated, and a CS fires immediately when the membrane potential exceeds its threshold level. Parameters in the cerebellar spiking neuron network model were first pre-adjusted so that the model approximates experimental Purkinje cell firing patterns during the optokinetic response (OKR) and VORd (see 2.4). At this point, the same parameter values as in the previous VOR model [9] were used in the transfer function models. Then, the parameters in the transfer function models were fine-tuned together with those in the spiking neuron model so that the model reproduces both eye velocities and Purkinje cell firing patterns during OKR and VORd. We then tested the model by predicting eye velocities and Purkinje cell firing patterns during the VOR enhancement (VORe) and VOR suppression (VORs) paradigms (see Experimental Paradigms below).

2.3 Description of Learning Rule
Parallel fiber – Purkinje cell LTD is induced when a glutamate input from a parallel fiber to a Purkinje cell is followed by an elevation of the calcium ion level elicited by climbing fiber input [25]. On the other hand, LTP is induced by a glutamate input without an elevation of the calcium ion level [6]. In the model, LTD and LTP are represented as changes in the synaptic weight (\delta w) between the parallel fibers and the Purkinje cell, described by the following equation:

  \delta w = \delta_{LTD} \cdot GR(t) \cdot win_{CF}(t) + \delta_{LTP} \cdot GR(t) \cdot (1 - win_{CF}(t))    (12)
where GR(t) is the input from a granule cell to a Purkinje cell conveyed via a parallel fiber; GR(t) is 1 if the granule cell fires a spike and 0 otherwise. win_{CF}(t) is a window function describing the time window of elevated calcium ion level in a Purkinje cell: it is 1 for 35 msec after a climbing fiber spike and 0 otherwise. \delta_{LTD} and \delta_{LTP} are the rates of change of the synaptic weight due to LTD and LTP, respectively. With this learning rule, the synaptic weight w_i(t) in eq. (1) between a granule cell and a Purkinje cell is updated so that it decreases if the granule cell fires within the 35 msec window after a climbing fiber spike (LTD), and increases if the granule cell fires outside the time window (LTP) [6],[25].
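Eq. (12) and the 35 msec window can be sketched per simulation time step as follows. The rate constants below are illustrative placeholders, not the paper's values, and the sign convention (depression as a negative increment) is made explicit rather than taken from the text.

```python
# Illustrative magnitudes only; the paper does not report delta values here.
DELTA_LTD = -0.005   # assumed weight change per granule spike inside the window
DELTA_LTP = +0.001   # assumed weight change per granule spike outside the window
WINDOW = 0.035       # calcium window after a climbing-fiber (CF) spike [s]

def update_weight(w, gr_spike, t, last_cf_time):
    """Apply eq. (12) for one time step.

    w:            current parallel fiber -> Purkinje cell weight w_i(t)
    gr_spike:     1 if the granule cell fired this step, else 0 (GR(t))
    t:            current time [s]
    last_cf_time: time of the most recent CF spike [s], or None if no CF spike yet
    """
    in_window = last_cf_time is not None and 0.0 <= t - last_cf_time <= WINDOW
    win_cf = 1 if in_window else 0  # win_CF(t)
    return w + DELTA_LTD * gr_spike * win_cf + DELTA_LTP * gr_spike * (1 - win_cf)
```

A granule spike 10 ms after a CF spike depresses the synapse; the same spike 100 ms later, or with no preceding CF spike, potentiates it; with no granule spike the weight is unchanged.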
2.4 Experimental Paradigms
Detailed experimental procedures are described elsewhere [7]. Briefly, squirrel monkeys were used for recording flocculus Purkinje cells and eye movements during various visual-vestibular interaction paradigms (see below). Animals were seated in a primate chair with their head fixed. The chair was rotated by a servo motor around the monkey's interaural axis to produce a vertical VOR. Black random dots were projected on a cylindrical screen coaxially surrounding the rotation axis of the servo motor; this is the optokinetic stimulus (OKS). OKR is induced by applying the OKS with the monkey's head stationary, while in VORd the chair rotates in the dark (no OKS). In VORs, the chair rotates in phase with the OKS, while in VORe the chair and OKS rotate out of phase. All paradigms consist of sinusoidal chair rotation and/or OKS (frequency 0.5 Hz, amplitude 40 deg/s). VORe and VORs were used for high- and low-gain training, respectively. Animals were trained toward high or low gain for 3–7 h a day. Vertical and horizontal eye position, chair velocity and OKS velocity were continuously recorded at a sampling frequency of 200 Hz using a Power1401 interface (Cambridge Electronic Design) for display and storage with the Spike2 program. Floccular and ventral parafloccular Purkinje cells were identified by the firing of CSs and their SS discharge patterns. After a unit was isolated, the OKR, VORd, VORs and VORe paradigms were applied. All experiments and surgical procedures were approved by the Animal Welfare and Use Committee of Washington University.
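The four stimulus combinations can be sketched as follows. The phase conventions follow the text (VORs: OKS in phase with the chair; VORe: out of phase); the function name and the tuple layout are our assumptions.

```python
import numpy as np

FREQ, AMP = 0.5, 40.0  # stimulus frequency [Hz] and amplitude [deg/s]

def paradigm(name, t):
    """Return (chair_velocity, oks_velocity) for paradigm `name` over times t.

    OKS velocity is None for VORd (rotation in the dark, no visual stimulus).
    """
    s = AMP * np.sin(2 * np.pi * FREQ * t)
    table = {
        "OKR":  (0 * s, s),   # head stationary, OKS moving
        "VORd": (s, None),    # chair rotation in the dark
        "VORs": (s, s),       # OKS in phase with the chair (suppression)
        "VORe": (s, -s),      # OKS out of phase with the chair (enhancement)
    }
    return table[name]
```

Driving the model with `paradigm("VORe", t)` or `paradigm("VORs", t)` for 45 cycles corresponds to the high- and low-gain training runs reported in Sect. 3.2.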
3 Results

3.1 Simple and Complex Spike Firing Before Learning
Figure 2 illustrates the simulated Purkinje cell SS and CS firing during VORe, VORs and VOR in the light (VORl), averaged over 30 stimulus cycles. According to the experimental evidence [15], CS firing modulates out of phase with the SS firing pattern. Furthermore, the SS firing pattern modulates in phase with head rotation during VORe and out of phase with head rotation during VORs [2],[7]. In our model simulation, these SS modulations are well reproduced in both the VORe (A, D, G) and VORs (B, E, H) paradigms, and the CS firing rate in these paradigms is out of phase with the SS firing modulation. During the VORl simulation (C, F, I), the SS modulated slightly in phase with head rotation [21], whereas the CS did not modulate significantly.

3.2 Simulation of VOR Motor Learning
It has been reported that the gain of the VOR can be acutely increased or decreased by the continuous application of VORe or VORs, respectively, in squirrel monkeys and other animal species [14],[22],[24]. We simulated VOR motor learning in our model using the VORe and VORs paradigms. In each training paradigm, vestibular and optokinetic stimuli were applied for 45 stimulus cycles. To quantify the newly acquired VOR memory, we measured the gain of VORd every 15 stimulus cycles. Figure 3 illustrates the adaptive gain change in eye movement during VORe training (filled circles), VORs training (filled squares) and VORl (filled triangles). The model simulation demonstrated that the gain of the VOR changed from 0.79 to 1.30 (+0.51) after VORe training and from 0.79 to 0.49 (−0.30) after VORs training. After VORl training, the gain of the VOR did not change. These changes in VOR gain after VORe, VORs and VORl training are comparable to those demonstrated experimentally in monkeys [14]. Figure 4 illustrates Purkinje cell SS firing modulation in an experiment using a squirrel monkey [7] (A, B, C) and in our simulation (D, E, F) during VORd before and after VOR motor learning (A and D: VORs training, C and F: VORe training). It has been shown that Purkinje cell SS firing modulation during VORd changes after motor learning [2],[7]. In the experiment, before any training, Purkinje cell SSs during VORd hardly modulate (Fig. 4B). After VORs training (Fig. 4A), SSs modulated slightly out of phase with the head rotation. After VORe training (Fig. 4C), SSs modulated in phase with the head rotation. In agreement with these experimental findings, our model reproduced these changes in Purkinje cell SS modulation during VORd after VORs and VORe training.

Fig. 2. SS and CS of the floccular Purkinje cell during VORe (A, D, G), VORs (B, E, H), and VORl (C, F, I) in the computational simulation. Each panel shows: top, mean firing rate of SS; middle, mean firing rate of CS; bottom, head velocity (solid black line) and OKS (dashed gray line). Mean firing rates of SS and CS were determined as averages over 30 stimulus cycles.

Fig. 3. Adaptive gain change during VORe training (filled circles), VORs training (filled squares) and VORl (filled triangles) in the simulation. The gain of the VOR increased by 0.51 (from 0.79 to 1.30) after VORe training and decreased by 0.30 (from 0.79 to 0.49) after VORs training. During VORl, the gain of the VOR did not change.

Fig. 4. Purkinje cell SS firing modulation in the experiment using a squirrel monkey [7] (A, B, C) and in our simulation (D, E, F) during VORd before and after VOR motor learning (A and D: VORs training, C and F: VORe training). In each panel: top, Purkinje cell SS response; bottom, eye (solid line) and head velocity (dashed line).
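The VORd gain measurements above can be reproduced conceptually by fitting sinusoids at the stimulus frequency to the head and eye velocity traces and taking the amplitude ratio. This is standard practice for sinusoidal VOR testing, though not necessarily the authors' exact method.

```python
import numpy as np

def vor_gain(head_vel, eye_vel, t, freq=0.5):
    """Least-squares sinusoidal amplitude of each trace at `freq`;
    gain = eye amplitude / head amplitude (sign/phase is discarded)."""
    basis = np.column_stack([np.sin(2 * np.pi * freq * t),
                             np.cos(2 * np.pi * freq * t),
                             np.ones_like(t)])  # include a DC offset term

    def amp(x):
        coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
        return np.hypot(coef[0], coef[1])

    return amp(eye_vel) / amp(head_vel)
```

Applied every 15 training cycles to simulated VORd segments, such a fit would trace out the learning curves of Fig. 3.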
4 Discussion and Conclusion
Conceptual theory of cerebellar motor learning was proposed by Marr[17], Albus[1] and Ito[11]. The cerebellar Purkinje cells receive parallel fiber and climbing fiber input. The parallel fibers send vestibular, visual and efference copy of motor output to the Purkinje cell and evoke SS, whereas the climbing fiber sends error signal to
the same Purkinje cell and evokes CS. In the conceptual theory, the SS activities encode motor performance in their firing rate, while the CS activities convey an error signal to the Purkinje cell to guide motor learning and change SS activity. It has been suggested that cerebellar motor learning is induced by LTD and LTP [3],[27], and it has also been shown that LTD or LTP is evoked depending on the input timing of a parallel fiber and a climbing fiber [25]. In the present study, we simulated VOR motor learning using a realistic cerebellar cortical neuronal network model with a spike-timing-dependent learning rule for LTD and LTP. The model successfully reproduced the firing relation between Purkinje cell SS and CS, as well as the adaptive changes in motor performance and SS activity. Our principal finding is that VOR motor learning is induced when the Purkinje CS is modulated (during VORe and VORs), and is not induced when the CS hardly modulates (during VORl). In our model, CS firing modulation reflects retinal slip information, which is one of the candidate error signals in VOR motor learning, and the modulation of the CS changes the Purkinje SS firing pattern to guide VOR motor learning. The CS modulates out of phase with the SS during VORe (Fig. 2D) and in phase with the SS during VORs (Fig. 2E). During VORl, the CS does not modulate significantly (Fig. 2F). Our simulation showed that these CS modulations can change the balance of LTD and LTP at the synapses between parallel fibers and a Purkinje cell, and guide the motor performance toward the respective goals: high, low or normal gain. This conclusion, derived from our computational simulation, is consistent with the theory of CS-guided motor learning [4],[20]. Several types of plasticity in the cerebellum have been reported in in vitro studies (e.g., [5],[13]). In contrast, we could account for VOR motor learning solely by parallel fiber-Purkinje cell LTD and LTP.
However, other types of synaptic plasticity [5],[13] might be required for more complex VOR motor learning, such as up-down or left-right asymmetric VOR motor learning [8],[28].
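A minimal sketch of a spike-timing-dependent LTD/LTP rule of the kind discussed above, for a single parallel fiber-Purkinje cell synapse: a parallel fiber spike followed within a short window by a climbing fiber spike depresses the synapse, while an unpaired parallel fiber spike potentiates it. The pairing window, learning rates and soft bounds are illustrative assumptions, not the parameters of our network model.

```python
def update_pf_weight(w, pf_spike_t, cf_spike_times, ltd_window=0.05,
                     ltd_rate=0.01, ltp_rate=0.005, w_min=0.0, w_max=1.0):
    """One plasticity step for a parallel fiber-Purkinje cell synapse.

    A PF spike at time `pf_spike_t` (s) paired with a climbing fiber spike
    within `ltd_window` seconds afterwards induces LTD; an unpaired PF
    spike induces LTP. All parameters are hypothetical.
    """
    paired = any(0.0 <= cf_t - pf_spike_t <= ltd_window
                 for cf_t in cf_spike_times)
    if paired:
        w -= ltd_rate * w            # LTD: depress toward zero
    else:
        w += ltp_rate * (w_max - w)  # LTP: potentiate toward the ceiling
    return min(max(w, w_min), w_max)
```

With CS modulated in phase with head rotation (as during VORe), pairings concentrate on one half of the stimulus cycle, shifting the LTD/LTP balance and hence the SS modulation, which is the mechanism the simulation exploits.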
References
1. Albus, J.S.: A Theory of Cerebellar Function. Mathematical Biosciences 10, 25–61 (1971)
2. Blazquez, P.M., Hirata, Y., Heiney, S.A., Green, A.M., Highstein, S.M.: Cerebellar Signatures of Vestibulo-ocular Reflex Motor Learning. J. Neurosci. 23, 9742–9751 (2003)
3. Boyden, E.S., Katoh, A., Raymond, J.L.: Cerebellum-dependent Learning: The Role of Multiple Plasticity Mechanisms. Annu. Rev. Neurosci. 27, 581–609 (2004)
4. Gilbert, P.F.C., Thach, W.T.: Purkinje Cell Activity During Motor Learning. Brain Res. 128, 309–328 (1977)
5. Hansel, C., Linden, D.J., D’Angelo, E.: Beyond Parallel Fiber LTD: The Diversity of Synaptic and Nonsynaptic Plasticity in the Cerebellum. Nature Neurosci. 4, 467–475 (2001)
6. Hirano, T.: Depression and Potentiation of the Synaptic Transmission Between a Granule Cell and a Purkinje Cell in Rat Cerebellar Culture. Neurosci. Lett. 119, 141–144 (1990)
7. Hirata, Y., Highstein, S.M.: Acute Adaptation of the Vestibuloocular Reflex: Signal Processing by Floccular and Ventral Parafloccular Purkinje Cells. J. Neurophysiol. 85, 2267–2288 (2001)
Computer Simulation of Vestibuloocular Reflex Motor Learning
8. Hirata, Y., Lockard, J.M., Highstein, S.M.: Capacity of Vertical VOR Adaptation in Squirrel Monkey. J. Neurophysiol. 88, 3194–3207 (2002)
9. Hirata, Y., Takeuchi, I., Highstein, S.M.: A Dynamical Model for the Vertical Vestibuloocular Reflex and Optokinetic Response in Primate. Neurocomputing 52–54, 531–540 (2003)
10. Inagaki, K., Hirata, Y., Blazquez, P., Highstein, S.: Model of VOR Motor Learning with Spiking Cerebellar Cortical Neuronal Network. In: Proceedings of the 16th Annu. CNS Meeting (2006)
11. Ito, M.: The Cerebellum and Neural Control. Raven Press (1984)
12. Ito, M.: Long-term Depression. Annu. Rev. Neurosci. 12, 85–102 (1989)
13. Kano, M., Rexhausen, U., Dreessen, J., Konnerth, A.: Synaptic Excitation Produces a Long-lasting Rebound Potentiation of Inhibitory Synaptic Signals in Cerebellar Purkinje Cells. Nature 356, 601–604 (1992)
14. Kuki, Y., Hirata, Y., Blazquez, P.M., Heiney, S.A., Highstein, S.M.: Memory Retention of Vestibuloocular Reflex Motor Learning in Squirrel Monkeys. Neuroreport 15, 1007–1011 (2004)
15. Kobayashi, Y., Kawano, K., Takemura, A., Inoue, Y., Kitama, T., Gomi, H., Kawato, M.: Temporal Firing Patterns of Purkinje Cells in the Cerebellar Ventral Paraflocculus During Ocular Following Responses in Monkeys II. Complex Spikes. J. Neurophysiol. 80, 832–848 (1998)
16. Lisberger, S.G., Sejnowski, T.: Motor Learning in a Recurrent Neural Network Model Based on the Vestibulo-ocular Reflex. Nature 360, 159–161 (1992)
17. Marr, D.: A Theory of Cerebellar Cortex. J. Physiol. 202, 437–470 (1969)
18. Medina, J.F., Garcia, K.S., Nores, W.L., Taylor, N.M., Mauk, M.D.: Timing Mechanisms in the Cerebellum: Testing Predictions of a Large-scale Computer Simulation. J. Neurosci. 20, 5516–5525 (2000)
19. Nagao, S., Kitazawa, H.: Effects of Reversible Shutdown of the Monkey Flocculus on the Retention of Adaptation of the Horizontal Vestibulo-ocular Reflex. Neuroscience 118, 563–570 (2003)
20. Ojakangas, C.L., Ebner, T.J.: Purkinje Cell Complex and Simple Spike Changes During a Voluntary Arm Movement Learning Task in the Monkey. J. Neurophysiol. 68, 2222–2236 (1992)
21. Omata, T., Kitama, T., Mizukoshi, A., Ueno, T., Kawato, M., Sato, Y.: Purkinje Cell Activity in the Middle Zone of the Cerebellar Flocculus During Optokinetic and Vestibular Eye Movement in Cats. Jpn. J. Physiol. 50, 357–370 (2000)
22. Pastor, A., De La Cruz, R.R., Baker, R.: Characterization and Adaptive Modification of the Goldfish Vestibuloocular Reflex by Sinusoidal and Velocity Step Vestibular Stimulation. J. Neurophysiol. 68, 2003–2015 (1992)
23. Rambold, H., Churchland, A., Selig, Y., Jasmin, L., Lisberger, S.G.: Partial Ablations of the Flocculus and Ventral Paraflocculus in Monkeys Cause Linked Deficits in Smooth Pursuit Eye Movements and Adaptive Modification of the VOR. J. Neurophysiol. 87, 912–924 (2002)
24. Robinson, D.A.: Adaptive Gain Control of Vestibuloocular Reflex by the Cerebellum. J. Neurophysiol. 39, 954–969 (1976)
25. Sakurai, M.: Synaptic Modification of Parallel Fibre-Purkinje Cell Transmission in In Vitro Guinea-pig Cerebellar Slices. J. Physiol. (Lond.) 394, 463–480 (1987)
K. Inagaki et al.
26. Stone, L.S., Lisberger, S.G.: Visual Responses of Purkinje Cells in the Cerebellar Flocculus During Smooth-pursuit Eye Movements in Monkeys. II. Complex Spikes. J. Neurophysiol. 63, 1962–1975 (1990)
27. Tabata, H., Yamamoto, K., Kawato, M.: Computational Study on Monkey VOR Adaptation and Smooth Pursuit Based on the Parallel Control-pathway Theory. J. Neurophysiol. 87, 2176–2189 (2002)
28. Yoshikawa, A., Yoshida, M., Hirata, Y.: Capacity of the Horizontal Vestibuloocular Reflex Motor Learning in Goldfish. In: Proc. of the 26th Ann. Int’l Conf. of the IEEE EMBS, pp. 478–481 (2005)
Reflex Contributions to the Directional Tuning of Arm Stiffness

Gary Liaw1, David W. Franklin1,2,3, Etienne Burdet4, Abdelhamid Kadi-allah4, and Mitsuo Kawato3

1 National Institute of Information and Communications Technology, 2-2-2 Hikaridai, Keihanna Science City, Kyoto, 619-0288, Japan
2 Department of Engineering, University of Cambridge, Cambridge, United Kingdom
3 ATR Computational Neuroscience Laboratories, Keihanna Science City, Kyoto, Japan
4 Department of Bioengineering, Imperial College London, London, United Kingdom
Abstract. It has been shown that during arm movement, humans selectively change the endpoint stiffness of their arm to compensate for the instability in an unstable environment. When the direction of the instability is rotated with respect to the direction of movement, it was found that humans modify the antisymmetric component of their endpoint stiffness. The antisymmetric component of stiffness arises due to reflex responses suggesting that the subjects may have tuned their reflex responses as part of the feedforward adaptive control. The goal of this study was to examine whether the CNS modulates the gain of the reflex response for selective tuning of endpoint impedance. Subjects performed reaching movements in three unstable force fields produced by a robotic manipulandum, each field differing only in the rotational component. After subjects had learned to compensate for the field, allowing them to make unperturbed movements to the target, the endpoint stiffness of the arm was estimated in the middle of the movements. At the same time electromyographic activity (EMG) of six arm muscles was recorded. Analysis of the EMG revealed differences across force fields in the reflex gain of these muscles consistent with stiffness changes. This study suggests that the CNS modulates the reflex gain as part of the adaptive feedforward command in which the endpoint impedance is selectively tuned to overcome environmental instability. Keywords: Mechanical impedance, limb stiffness, internal model, stretch reflex, impedance control, co-contraction, electromyography.
1 Introduction

In everyday life, we perform activities in our environment that require us to interact with different objects such as tools. These interactions with tools are often inherently unstable [1]. For example, when using a screwdriver, the direction of the applied force needs to be parallel to the screwdriver in order to maintain contact with the screw. This is further complicated because the force applied by the user is susceptible to fluctuations due to inherent signal-dependent noise [2]. These variations

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 913–922, 2008. © Springer-Verlag Berlin Heidelberg 2008
in the direction of the applied force could produce movements that cause the screwdriver to slip and lose contact with the screw. In order to successfully perform this and other unstable tasks, the neuromuscular system must overcome such instability. Viscoelastic properties of muscle play an important role in motor control, as muscle reacts without delay to disturbances caused by instabilities. The greater the viscoelastic impedance of the arm, the better it can resist perturbations that affect the intended position or trajectory. The ability to control viscoelastic impedance is therefore important, especially in unstable environments [3] or unpredictable circumstances [4]. Hogan first suggested that impedance can be selectively controlled by the central nervous system (CNS) [5], and evidence for this selective control was later demonstrated by others [3,6,7]. In the latter studies, point-to-point arm reaching movements were used as a platform to investigate the level of sophistication of impedance control by the CNS. It was demonstrated that arm stiffness could be modulated by the CNS for the magnitude [6] as well as the direction [7] of the environmental instability. It was suggested that such modulation was facilitated by feedforward changes in muscle activation, specifically by modifying the activations in three muscle groups: the shoulder muscles, the elbow muscles and the biarticular muscles. Endpoint stiffness can be decomposed into two components: a symmetric component and an antisymmetric component [8]. The symmetric component can be produced by a combination of passive, intrinsic, or symmetric reflexive stiffness, while the antisymmetric component is thought to be mainly due to heteronymous reflexive responses [5]. This means that a large antisymmetric component in the endpoint stiffness can be interpreted as evidence of contribution from reflexes.
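The decomposition referred to above is K = K_s + K_a with K_s = (K + K^T)/2 and K_a = (K - K^T)/2. A minimal sketch, where the numerical stiffness values are hypothetical:

```python
def decompose_stiffness(K):
    """Split a 2x2 endpoint stiffness matrix K = [[Kxx, Kxy], [Kyx, Kyy]]
    into its symmetric part (passive/intrinsic/symmetric reflexive
    stiffness) and its antisymmetric part (attributed mainly to
    heteronymous reflex responses)."""
    (kxx, kxy), (kyx, kyy) = K
    sym = [[kxx, (kxy + kyx) / 2], [(kxy + kyx) / 2, kyy]]
    anti = [[0.0, (kxy - kyx) / 2], [(kyx - kxy) / 2, 0.0]]
    return sym, anti

# Hypothetical stiffness after adapting to a rotated field:
# a large Kyx cross term with Kxy kept small.
sym, anti = decompose_stiffness([[800.0, 50.0], [450.0, 400.0]])
# anti[1][0] = (450 - 50) / 2 = 200 N/m -> a sizable reflex contribution
```

A purely co-contraction-based stiffness would leave `anti` near zero; a large antisymmetric part is the signature of reflex involvement discussed in the text.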
Our previous study [7] found that adaptation to environmental instability in different directions resulted in changes to the antisymmetric component of endpoint stiffness. The present study aims to solidify those results by varying only the cross terms of the environmental instability. We investigate whether the CNS modulates the reflex contributions to stiffness by modifying the reflex gain as part of the impedance controller.
2 Methods

Four healthy, right-handed subjects (three male, one female) participated in the study. The experiment was approved by the institutional ethics committee and subjects gave informed consent.

2.1 Apparatus

Subjects were seated in a chair with their shoulder held against the back of the chair by seatbelts to restrict trunk movement. The height of the chair was adjusted such that the workspace was at the subject’s shoulder level. A custom-molded thermoplastic cuff was used to restrict the subject’s wrist motion, and a horizontal beam was secured to the subject’s forearm for support. The subject’s right arm, along with the cuff and the beam, was coupled to the parallel-link direct drive air-magnet floating manipulandum (PFM) (Fig. 1). Details of the PFM and setup can be found elsewhere [9].
Subjects performed point-to-point reaching movements in the horizontal plane from a start target to an end target. The start and end targets were marked with circles of diameter 2.5 cm and 3 cm, respectively, located 31 cm and 56 cm, respectively, from and directly in front of the shoulder joint. The subject’s arm was restricted to keep the movement to two degrees of freedom. Subjects were asked to perform the movements in 600 ± 100 ms. A computer monitor provided feedback to the subjects on whether the previous movement was successful or not. During force field trials, safety boundaries were set up 5 cm from the center on either side, beyond which the force field would revert to a null field. A successful movement was one in which the subject completed the movement by reaching the target within the desired duration range and did not exceed the safety boundaries. An opaque table above the workspace blocked the subject's view of the manipulandum and the subject's arm. The position of the hand and the start and end target circles were projected onto the surface of the table to provide subjects with visual feedback.
Fig. 1. Experimental Setup. The subject’s right arm was attached to the handle of the PFM with a cuff, restricting wrist motion. Reaching movements were performed from a start position [0, 31] cm to a target position [0, 56] cm relative to the shoulder joint. The PFM either applied a computer-controlled force field or controlled displacements for stiffness estimation.
2.2 Force Fields

Subjects performed point-to-point reaching movements in three different unstable divergent force fields as well as a null field (NF). The three force fields exerted a force away from the center line (y-axis) of the workspace that was proportional to the size of the deviation from the center line; the subjects experienced no force on the hand if there was no deviation. The three force fields differed in direction and magnitude. As a baseline for adaptation, a simple divergent force field (DF), oriented perpendicular to the direction of movement, was simulated. The equation was:
\[
\begin{bmatrix} F_x \\ F_y \end{bmatrix} =
\begin{bmatrix} 200 & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \tag{1}
\]
where Fx and Fy are the forces applied by the PFM in the x and y directions, respectively. The other two divergent force fields simply varied in terms of the cross term linking an error in x with the force produced in y. Each of these rotated the forces either in the clockwise direction (CW DF) or counter clockwise direction (CCW DF). The equations of the force fields were:
CW DF:
\[
\begin{bmatrix} F_x \\ F_y \end{bmatrix} =
\begin{bmatrix} 200 & 0 \\ -600 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix};
\]
CCW DF:
\[
\begin{bmatrix} F_x \\ F_y \end{bmatrix} =
\begin{bmatrix} 200 & 0 \\ 600 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \tag{2}
\]
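The fields of Eqs. (1) and (2), including the reversion to a null field beyond the ±5 cm safety boundary described in the Apparatus section, can be sketched as follows. Units are assumed to be metres, newtons and N/m, and the function name and interface are illustrative:

```python
def field_force(x, y, kind="DF", boundary=0.05):
    """Force (Fx, Fy) in newtons applied by the manipulandum for a hand
    displacement (x, y) in metres from the centre line. Beyond the ±5 cm
    safety boundary, or in the null field, no force is applied."""
    if kind == "NF" or abs(x) > boundary:
        return (0.0, 0.0)
    # Only the Fy/x cross term differs between the three fields (Eqs. 1-2)
    cross = {"DF": 0.0, "CW": -600.0, "CCW": 600.0}[kind]
    return (200.0 * x, cross * x)

fx, fy = field_force(0.01, 0.4, "CCW")  # 1 cm rightward error, CCW field
```

Note that `y` never enters the force computation: all three fields push the hand only in response to deviation from the centre line, which is what makes them divergent and unstable.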
It has been shown that subjects increase limb stiffness to compensate for the instability of imposed force fields [3,6]. In order to increase one cross term (e.g., Kyx) without modifying the other (e.g., Kxy), it has been suggested that subjects change the reflex gain [7]. These fields were therefore designed to examine whether changes in the reflex gain occur as part of the impedance controller that compensates for the environment.

2.3 Protocol
All subjects practiced making movements in the NF on at least one day prior to the experiment. These training trials were used to accustom the subjects to the equipment and to the movement speed and accuracy requirements. Each force field was learned on a different day. Subjects first learned the DF, after which their endpoint stiffness was estimated. The order of training in the other two rotated DF force fields was randomized across subjects. For each force field, three procedures were conducted on the same day: learning, stiffness estimation, and after-effects trials. During learning of the DF, subjects first completed 30 successful movements in the NF, after which the DF was activated, although no information was provided to the subjects about when activation of the DF would occur. Subjects then made movements in the DF until 100 successful trials were completed. For learning the rotated DF fields, they first performed 110 movements in the DF before making movements in the rotated DF until 100 successful trials were achieved. There was a short break before the stiffness estimation. Details of the method and analysis for the stiffness estimation can be found in [6,7,10]. Briefly, subjects first completed 20 successful trials in whichever force field was being tested. Following this, an additional 160 successful trials were completed in the force field, of which 80 trials were randomly selected for stiffness estimation. During these trials, a 300 ms displacement was applied near the midpoint of the trajectory. The displacement was divided into three periods: a 100 ms ramp up, a 100 ms hold, and a 100 ms ramp down. The average force and displacement measured during the final 80 ms of the hold period were used to estimate the 2 x 2 endpoint stiffness matrix (K) by linear regression. Finally, after effects were recorded by randomly interspersing 20 null field trials with 80 force field trials.
On these null field trials, subjects expected to be moving in the force field, but no errors were induced by the force field. This provides a clear picture of the feedforward control during the movements. On each experimental day, electromyographic activity of six arm muscles was measured. Surface EMG was recorded from two monoarticular shoulder muscles: pectoralis major and posterior deltoid; two biarticular muscles: biceps brachii and
long head of the triceps; and two monoarticular elbow muscles: brachioradialis and the lateral head of the triceps. EMG was recorded using pairs of silver-silver chloride surface electrodes. The EMG signals were Butterworth band-pass filtered between 25 Hz and 1.0 kHz and sampled at 2.0 kHz.
Fig. 2. Endpoint Stiffness. A. Endpoint stiffness was adapted differently in each force field. The stiffness ellipses for each of the four subjects participating in the study are shown for each of the three unstable force fields (DF: filled ellipse; CW DF: solid line; CCW DF: dotted line). The direction of the increased stiffness for each unstable force field is similar across all of the subjects. These directions are close to the directions of instability in the environment. B. The orientation of the endpoint stiffness ellipses. C. The shape of the endpoint stiffness ellipses. D. The size of the endpoint stiffness ellipses. Relative to the DF stiffness, the stiffness measured in the two rotated divergent force fields is larger and modified in orientation.
3 Results

Subjects were able to adapt to all force fields and make smooth straight movements to the final target. While the onset of the force fields caused disturbed trajectories in early learning, subjects were able to reduce these errors gradually as learning progressed. After learning was completed, stiffness measurements were performed in each of the three unstable force fields.

3.1 Endpoint Stiffness
The endpoint stiffness can be represented in terms of the eigenvalues and eigenvectors of the stiffness matrix using an ellipse. Singular value decomposition of the stiffness matrix K was used to find the eigenvalues [11]. Subjects modified the endpoint stiffness of their arms in different directions according to the different environments in which they were moving (Fig 2). Relative to the baseline stiffness in
the DF, the orientation of the endpoint stiffness in the two rotated DF fields had changed. In each case the reorientation occurred in the direction in which the environmental stiffness had been rotated. While the shape was fairly consistent across all fields, and more anisotropic than seen in null fields [7], the size of the stiffness was generally larger in the rotated DF fields. The endpoint stiffness can also be examined in terms of the four components of the stiffness matrix (Fig 3). For both the CW DF and the CCW DF, the Kxx and Kyy terms were increased relative to the baseline DF level. However, the biggest difference in terms of the adaptation to the two force fields occurred in the Kyx term. For the CW DF, this term was decreased relative to the baseline, whereas in the CCW DF this term had been increased. In contrast, the Kxy term was maintained close to zero in both cases. The effect of these terms was analyzed (Fig 3C). The change in the cross stiffness terms (Kxy and Kyx) was such that it compensated for the environmental instability with an oppositely directed force.
Fig. 3. Changes in the stiffness relative to the DF stiffness. A,B. Mean endpoint stiffness across subjects. Each panel shows the mean values for one of the rotated DF fields relative to the stiffness in the baseline DF field (light grey bars). C. Effect of the change in cross terms of the endpoint stiffness matrix (Kxy and Kyx). For each of the rotated divergent force fields, the resulting change in the cross terms relative to the baseline DF stiffness was calculated. The force (solid line) resulting from a 1 cm displacement (dotted line) was calculated for each of the ΔKxy and ΔKyx terms. The Kxy terms after adaptation produced only small changes in force. The ΔKyx produced opposite effects in each of the CW DF and CCW DF fields. These forces were directed to resist the oppositely directed forces from the divergent force field.
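The estimation procedure behind these plots, linear regression of the measured forces on the applied displacements followed by a singular value decomposition to obtain the ellipse parameters, can be sketched on synthetic data. The ground-truth matrix, noise level and sign convention (restoring force opposing displacement) are assumptions for illustration, not the authors' analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth stiffness (N/m) with the large Kyx cross term
# characteristic of adaptation to a rotated divergent field.
K_true = np.array([[800.0, 50.0], [450.0, 400.0]])

# 80 perturbation trials: 8 mm displacements in random directions, with a
# restoring force F = -K d plus measurement noise (assumed convention).
angles = rng.uniform(0.0, 2.0 * np.pi, 80)
D = 0.008 * np.column_stack([np.cos(angles), np.sin(angles)])  # m
F = -D @ K_true.T + rng.normal(0.0, 0.05, (80, 2))             # N

# Linear regression: least-squares solution of F ~ -D K^T gives K.
K_est = -np.linalg.lstsq(D, F, rcond=None)[0].T

# Ellipse parameters from the singular value decomposition of K_est.
U, s, _ = np.linalg.svd(K_est)
size = np.pi * s[0] * s[1]      # ellipse area
shape = s[1] / s[0]             # minor/major axis ratio (1 = isotropic)
orientation = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
```

With 80 trials the regression recovers the cross terms, and hence the antisymmetric component, well within the between-field differences reported in Fig. 3.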
3.2 Electromyographic Activity
The large difference between the two stiffness terms (Kyx and Kxy) (Fig 3) cannot normally be produced by pure co-contraction of opposing muscles [5, 7]. This difference, produced by modifying the antisymmetric part of the stiffness matrix, therefore suggests that subjects may have modulated part of their stiffness by
Fig. 4. The long latency reflex response of the Pectoralis Major changes depending on the force field. During the stiffness measurement the hand was moved by 8 mm in each of eight directions as illustrated by the arrow insert in the middle of the figure. The reflex response scaled by the background activity in unperturbed trials is shown for two intervals: a long latency reflex interval (60-125 ms) and the interval which could contribute to the stiffness measurement (95-175 ms). The response in each force field is plotted as separate bars: DF (left), CW DF (center) and CCW DF (right), with error bars denoting standard error of the mean.
modifying the reflex gain. We investigated this by analyzing the reflex responses produced by the perturbations applied to measure stiffness. The integrated EMG was calculated during two periods after the perturbation: 60-125 ms and 95-175 ms. The second interval was chosen as the period of muscle activation which would contribute to the stiffness estimate. This interval was determined by the fact that the stiffness was estimated between 120-200 ms after the onset of the perturbation and that the force produced by muscle activity (EMG) in the arm is delayed by approximately 25 ms [12]. The background EMG normally present at this time in the movement (taken from unperturbed trials) was subtracted from the integrated EMG, and the difference was divided by this background value, producing a gain value which can be compared across the three fields.
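This normalization can be sketched as follows; the traces, sampling rate and simple rectangular integration are illustrative assumptions:

```python
def reflex_gain(emg_perturbed, emg_unperturbed, t, window):
    """Reflex gain for one perturbation direction: integrate rectified EMG
    over `window` (s), subtract the background activity integrated over
    the same window in unperturbed trials, and scale by that background,
    i.e. (perturbed - background) / background."""
    lo, hi = window
    integ = lambda sig: sum(v for v, ti in zip(sig, t) if lo <= ti < hi)
    background = integ(emg_unperturbed)
    return (integ(emg_perturbed) - background) / background

# Illustrative 1 kHz traces: constant background activity of 1.0, plus a
# long-latency burst between 60 and 125 ms after perturbation onset.
t = [i / 1000.0 for i in range(300)]
quiet = [1.0] * 300
burst = [1.0 + (0.5 if 0.060 <= ti < 0.125 else 0.0) for ti in t]

g = reflex_gain(burst, quiet, t, (0.060, 0.125))  # > 0 -> facilitation
```

A gain above zero indicates an excitatory reflex response relative to background, and a gain below zero indicates inhibition, which is how the bars in Figs. 4 and 5 should be read.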
Fig. 5. The long latency reflex response of the Posterior Deltoid changes depending on the force field. During the stiffness measurement the hand was moved by 8 mm in each of eight directions as illustrated by the arrow insert in the middle of the figure. The reflex response scaled by the background activity in unperturbed trials is shown for two intervals: a long latency reflex interval (60-125 ms) and the interval which could contribute to the stiffness measurement (95-175 ms). The response in each force field is plotted as separate bars: DF (left), CW DF (center) and CCW DF (right), with error bars denoting standard error of the mean.
The reflex responses in the Pectoralis Major (Fig 4) demonstrate large differences in the long latency reflex response depending on the force field to which the subject has adapted. For perturbations in the +x, +x -y, and -y directions, the reflex response is inhibited for the CW DF field, and perhaps slightly enhanced in the CCW DF. For oppositely directed perturbations the opposite effect is seen: the CW DF shows an excitatory reflex response whereas the CCW DF is again inhibited. Similarly, the reflex responses in the Posterior Deltoid (Fig 5) show the same effect: the long latency reflex response depends on the force field to which the subject has adapted. For perturbations in the +x, +x -y, and -y directions, the reflex response is excitatory for the CW DF field. For oppositely directed perturbations the
opposite effect is seen: the CW DF shows an inhibited reflex response to the perturbations whereas the CCW DF is unaffected compared to the baseline field. Clearly, changes in the reflex gain occurred as part of the adaptation to the environment. Similar changes in the reflex responses were seen in all six arm muscles that were recorded. These reflex responses were modified appropriately to compensate for the destabilizing effects of the force fields.
4 Discussion

We have investigated the ability of the neuromuscular system to control the direction of the impedance of the arm during movement. Subjects adapted to three differently oriented, position-dependent unstable force fields. We found that the endpoint stiffness rotated towards the direction of the instability in each force field. The change in the stiffness matrix occurred in the appropriate components to compensate for the environment. The three fields differed only in the Kyx component: the component which determines the force applied along the y-axis in response to an error along the x-axis. In response, while the Kxx and Kyy terms were increased in the two rotated DF fields, the main change occurred in the Kyx term of the endpoint stiffness of the arm. This change was opposite in sign to the imposed force field Kyx term in order to compensate for it. Such a change was brought about by changing the antisymmetric component of the endpoint stiffness. Changes in the antisymmetric stiffness are thought to be produced by changes in the reflex response, particularly heteronymous reflex responses [8]. Investigation of the reflex responses produced by the perturbations demonstrated that the reflex gains had been modulated by the CNS. The long latency responses were either inhibited or excited relative to the baseline level depending on the force field in which the subjects were moving. Evidence of feedforward changes in feedback gain has been observed previously in activities such as cycling [13], ball catching [14] and arm reaching movements [15]. However, here we show that these reflex responses are tuned according to the instability in the environment as part of the impedance controller. The CNS carefully modified the long latency reflex responses of the limb, tuning them to appropriately counteract the disturbing force field.
This work demonstrates that the neuromuscular system attempts to increase the impedance selectively in the direction of instability rather than globally. This provides compelling evidence for the existence and utility of an impedance controller in the CNS. It also shows that this impedance controller not only controls the feedforward co-activation of the muscles but also adjusts the reflex gains in order to appropriately tune the overall response of the neuromuscular system to the environment.
Acknowledgments We thank T. Yoshioka for his assistance in running the experiments as well as for programming the PFM. DWF is supported by a fellowship from NSERC, Canada.
References
1. Rancourt, D., Hogan, N.: Stability in force-production tasks. J. Mot. Behav. 33, 193–204 (2001)
2. Harris, C.M., Wolpert, D.M.: Signal-dependent noise determines motor planning. Nature 394, 780–784 (1998)
3. Burdet, E., Osu, R., Franklin, D.W., Milner, T.E., Kawato, M.: The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature 414, 446–449 (2001)
4. Takahashi, C.D., Scheidt, R.A., Reinkensmeyer, D.J.: Impedance control and internal model formation when reaching in a randomly varying dynamical environment. J. Neurophysiol. 86, 1047–1051 (2001)
5. Hogan, N.: The mechanics of multi-joint posture and movement control. Biol. Cybern. 52, 315–331 (1985)
6. Franklin, D.W., So, U., Kawato, M., Milner, T.E.: Impedance control balances stability with metabolically costly muscle activation. J. Neurophysiol. 92, 3097–3105 (2004)
7. Franklin, D.W., Liaw, G., Milner, T.E., Osu, R., Burdet, E., Kawato, M.: End-point stiffness of the arm is directionally tuned to instability in the environment. J. Neurosci. (2007)
8. Mussa-Ivaldi, F.A., Hogan, N., Bizzi, E.: Neural, mechanical, and geometric factors subserving arm posture in humans. J. Neurosci. 5, 2732–2743 (1985)
9. Gomi, H., Kawato, M.: Human arm stiffness and equilibrium-point trajectory during multi-joint movement. Biol. Cybern. 76, 163–171 (1997)
10. Burdet, E., Osu, R., Franklin, D.W., Yoshioka, T., Milner, T.E., Kawato, M.: A method for measuring endpoint stiffness during multi-joint arm movements. J. Biomech. 33, 1705–1709 (2000)
11. Gomi, H., Osu, R.: Task-dependent viscoelasticity of human multijoint arm and its spatial characteristics for interaction with environments. J. Neurosci. 18, 8965–8978 (1998)
12. Ito, T., Murano, E.Z., Gomi, H.: Fast force-generation dynamics of human articulatory muscles. J. Appl. Physiol. 96, 2318–2324 (2004)
13. Grey, M.J., Pierce, C.W.P., Milner, T.E., Sinkjær, T.: Soleus stretch reflex during cycling. Motor Control 1, 36–49 (2001)
14. Lacquaniti, F., Maioli, C.: Anticipatory and reflex coactivation of antagonist muscles in catching. Brain Res. 406, 373–378 (1987)
15. Gomi, H., Saijyo, N., Haggard, P.: Coordination of multi-joint arm reflexes is modulated during interaction with environments. In: 12th Annual Meeting of Neural Control of Movement, E-09, Naples, Florida (2002)
Analysis of Variability of Human Reaching Movements Based on the Similarity Preservation of Arm Trajectories Takashi Oyama1 , Yoji Uno2 , and Shigeyuki Hosoe1 1 Bio-Mimetic Control Research Center, RIKEN, Shimoshidami, Moriyama-ku, Nagoya 403-0003, Japan 2 Graduate School of Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
Abstract. Human movements exhibit some variability. If one assumes that the main factor of movement variability is noise added to motor commands during motion execution, the planned trajectory can be considered the same across repetitions of the same task. Humans, however, might not be able to plan the same trajectory for the same task because of factors such as the uncertainty of target location perception. We analyzed the similarity preservation of trajectories in reaching movements. The observed similarity preservation cannot be reproduced by random noise in motion execution. We argue that movement variability arises in motion planning.
1 Introduction
Human movements exhibit some variability, quantified as the variance or standard deviation of points along the trajectories. Several hypotheses have been suggested regarding the source of this variability. Harris & Wolpert [1] proposed "signal-dependent noise" (SDN) as the source of movement variability: SDN is noise added to a motor command during its execution, and its magnitude increases with the magnitude of the motor command. van Beers et al. [2] measured reaching movements along many directions and distances and showed that SDN could reproduce the variability of the movements. Under the assumption that the main factor of movement variability is SDN, a planned trajectory can be considered the same in every trial. For example, the minimum torque change model [3] and the minimum commanded torque change model [4] determine exactly one desired trajectory once the start and target positions are provided. To compute a trajectory that precisely reaches the target, the perception of the target location must be accurate. However, large deviation and variability exist between the visual and proprioceptive perception of a target location [5]. Sakamoto et al. [6] suggested the uncertainty of target localization in motion planning as the source of movement variability. We conjecture that the influence of noise during motion execution on variability is not significant and that the variability is mainly attributable to the uncertainty of target localization in motion planning. If trajectories vary because of noise, the similarity between the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 923–932, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Difference in the trajectory similarities depending on the factor of variability: (a) trajectory variability caused by the noise in motion execution; the second parts (B) of the trajectories are not similar even if the first parts (A) are similar; (b) trajectory variability caused by the uncertainty in motion planning; the second parts (B) of the trajectories are similar if the first parts (A) are similar
trajectories in the first parts may not be preserved in the latter half because of the noise (Fig. 1a). On the other hand, if movement variability arises from the uncertainty of target localization in motion planning, the similarity between the trajectories will be preserved (Fig. 1b). To investigate whether the source of movement variability is the noise in motion execution or the uncertainty of target localization in motion planning, we analyzed the similarities between the torque profiles of measured trajectories, of trajectories simulated with noise in motion execution, and of minimum commanded torque change trajectories whose end points were varied on the basis of the uncertainty of target localization in motion planning.
2 Experiment

2.1 Measurement
Five subjects participated in this experiment. Two-joint (shoulder and elbow) reaching movements of the right arm between two points in the horizontal plane were examined. With the subject's right shoulder taken as the origin of Cartesian coordinates, the start and end positions of the movements were located at (-0.25 m, 0.3 m) and (0.05 m, 0.45 m), respectively. The target was a circle approximately 1 cm in diameter. The subjects could perceive their limbs and the target on the basis of visual information prior to motion execution. The subjects were asked to close their eyes as soon as they noticed a signal indicating the onset of movement and to aim at the target in the absence of any visual information. The subjects were asked to aim using a single motion; they were allowed to open their eyes and see the trial result after each movement. A total of 100 trials were measured for each subject.

2.2 Simulated Trajectories with Noise
The procedure to compute the trajectories under the influence of noise is shown in Fig. 2 [2]. The mean measured trajectory was computed by averaging the points of the measured trajectories for each subject; the movement duration of each trajectory was normalized. The magnitude of the added noise depends on the motor command and is expressed as σ(t) = N(0, 1)√(k_SDN² u²(t) + k_CONST²),
Fig. 2. Procedure to compute trajectories deformed by noise (mean measured trajectory → inverse kinematics/dynamics → motor commands, to which noise is added → forward dynamics/kinematics → deformed trajectories)
Fig. 3. Index of the variability of trajectories: (a) the 95% concentration ellipse of the end points is used as the index of end-point variability; (b) the standard deviations of the intersection points of the measured trajectories with the plane normal to the mean trajectory are used as the index of intermediate-point variability
as shown in van Beers et al. [2]. Here, N(0, 1) denotes Gaussian noise with zero mean and unit variance, k_SDN determines the extent of noise depending on the motor command, and k_CONST determines the extent of noise independent of the motor command. u(t) is the motor command and is expressed as u(t) = t_e t_a d²τ(t)/dt² + (t_e + t_a) dτ(t)/dt + τ(t), as shown in van der Helm & Rozendaal [7]. Here, t_e and t_a represent time constants with values of 30 and 40 ms, respectively. Further, the movement duration of the trajectories was also changed stochastically according to the normal distribution N(μ_md, k_TIME), where μ_md is the mean movement duration of the measured trajectories. The movement duration of a measured trajectory was defined as the time interval during which the tangential velocity of the movement exceeded 0.01 m/s. As shown by Hollerbach & Flash [8], when the mean movement duration is μ_md and that of the computed trajectory is c·μ_md, the motor command of the computed trajectory is set to 1/c² times the original value. The noise parameters k_SDN, k_CONST and k_TIME were adjusted so that the variability of either (a) the end points or (b) the intermediate points of the measured trajectories was reproduced. The 95% concentration ellipse of the end points was used as the index of end-point variability, and the standard deviations of the intersection points of the measured trajectories with the plane normal to the mean trajectory were used as the index of intermediate-point variability (Fig. 3).

(a) Simulated trajectories with noise that reproduce end-point variability. The noise parameters k_SDN, k_CONST and k_TIME were adjusted so that the variability of the end points of the measured trajectories was
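The noise-generation pipeline described above can be sketched in a few lines. This is a hypothetical reimplementation for illustration, not the authors' code: the torque profile is a placeholder and the numerical differentiation scheme is our choice; only the filter constants (t_e = 30 ms, t_a = 40 ms) and the Subject A parameters from Table 1 are taken from the text.

```python
import numpy as np

def motor_command(tau, dt, te=0.03, ta=0.04):
    """Filter of van der Helm & Rozendaal:
    u(t) = te*ta*tau''(t) + (te + ta)*tau'(t) + tau(t)."""
    dtau = np.gradient(tau, dt)
    ddtau = np.gradient(dtau, dt)
    return te * ta * ddtau + (te + ta) * dtau + tau

def add_execution_noise(u, k_sdn, k_const, rng):
    """Signal-dependent plus constant noise:
    sigma(t) = N(0, 1) * sqrt(k_sdn^2 u(t)^2 + k_const^2)."""
    sigma = np.sqrt(k_sdn**2 * u**2 + k_const**2)
    return u + rng.standard_normal(u.shape) * sigma

def scale_duration(u, c):
    """Hollerbach & Flash: if the duration scales by c,
    the motor command scales by 1/c^2."""
    return u / c**2

# Placeholder joint-torque profile, 400 ms at 1 ms steps
dt = 0.001
t = np.arange(0.0, 0.4, dt)
tau = np.sin(np.pi * t / 0.4)            # hypothetical torque [Nm]
u = motor_command(tau, dt)
rng = np.random.default_rng(0)
u_noisy = add_execution_noise(u, k_sdn=0.018, k_const=0.589, rng=rng)  # Subject A, Table 1
```

Feeding the noisy commands back through the forward dynamics and kinematics (not sketched here) would yield the deformed trajectories of Fig. 2.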
Table 1. The values of noise parameters k_SDN, k_CONST and k_TIME that reproduce the variability of the end points of measured trajectories

Subject   A      B      C      D      E
k_SDN     0.018  0.162  0.061  0.069  0.042
k_CONST   0.589  0.568  0.943  0.605  0.353
k_TIME    0.909  1.195  1.365  1.260  1.045
Fig. 4. The 95% concentration ellipses of the end points of measured trajectories (solid lines) and Noise_end (dashed lines) for subjects A–E

Fig. 5. The standard deviations of the intersection points of measured trajectories (solid lines) with the plane normal to the mean trajectory and Noise_end (dashed lines) for subjects A–E
reproduced (Table 1). These simulated trajectories with noise are hereinafter referred to as Noise_end. The 95% concentration ellipses of the end points of the measured trajectories (solid lines) and the simulated trajectories with noise (dashed lines) are shown in Fig. 4, and the standard deviations of the intersection points of the measured trajectories (solid lines) with the plane normal to the mean trajectory and the simulated trajectories with noise (dashed lines) are shown in Fig. 5. The horizontal axis shows the time elapsed from the start of movement. The variability of the intermediate points of the simulated trajectories was less than that of the measured trajectories; that is, the variability of the actual trajectories could not be reproduced.

(b) Simulated trajectories with noise that reproduce intermediate-point variability. The noise parameters k_SDN, k_CONST and k_TIME were adjusted so that the variability of the intermediate points of the measured
Table 2. The values of noise parameters k_SDN, k_CONST and k_TIME that reproduce the variability of the intermediate points of measured trajectories

Subject   A      B      C      D      E
k_SDN     0.292  0.586  0.608  0.547  0.307
k_CONST   0.100  0.567  0.566  0.271  0.100
k_TIME    0.909  1.195  1.365  1.260  1.045
Fig. 6. The 95% concentration ellipses of the end points of measured trajectories (solid lines) and Noise_inter (dashed lines) for subjects A–E

Fig. 7. The standard deviations of the intersection points of measured trajectories (solid lines) with the plane normal to the mean trajectory and Noise_inter (dashed lines) for subjects A–E
trajectories were reproduced (Table 2). These simulated trajectories with noise are hereinafter referred to as Noise_inter. Note that the parameters k_TIME are the same as those in Table 1, which indicates that the influence of the variability of movement duration on the variability of intermediate points is small. The 95% concentration ellipses of the end points of the measured trajectories (solid lines) and the simulated trajectories with noise (dashed lines) are shown in Fig. 6, and the standard deviations of the intersection points of the measured trajectories (solid lines) with the plane normal to the mean trajectory and the simulated trajectories with noise (dashed lines) are shown in Fig. 7. The variability of the end points of the simulated trajectories was greater than that of the measured trajectories; that is, the variability of the actual trajectories could not be well reproduced in this case.
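The 95% concentration ellipse used as the end-point index can be computed from the sample mean and covariance of the end points. The following is a minimal sketch under stated assumptions (the χ² threshold for two degrees of freedom is the standard value 5.991, and the scattered end points are synthetic placeholders, not the measured data):

```python
import numpy as np

def concentration_ellipse_95(points):
    """Center, semi-axis lengths, and axis directions of the 95%
    concentration ellipse of 2-D points, assuming Gaussian scatter.
    chi2(0.95, dof=2) = 5.991 scales the covariance eigenvalues."""
    chi2_95 = 5.991
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    semi_axes = np.sqrt(chi2_95 * eigvals)    # semi-minor, then semi-major axis
    return center, semi_axes, eigvecs

# Synthetic end-point scatter around the target position (0.05 m, 0.45 m)
rng = np.random.default_rng(1)
endpoints = rng.multivariate_normal(
    mean=[0.05, 0.45], cov=[[4e-4, 1e-4], [1e-4, 2e-4]], size=100)
center, semi_axes, axes = concentration_ellipse_95(endpoints)
```

Matching the size and orientation of this ellipse between measured and simulated end points is, in our reading, how the parameters of Tables 1 and 2 were fitted.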
Fig. 8. Measured trajectories (left) and MCTC_un (right). The 95% concentration ellipse of the end points is the same in both cases.
Fig. 9. The standard deviations of the intersection points of measured trajectories (solid lines) with the plane normal to the mean trajectory and MCTC_un (dashed lines) for subjects A–E
Although both the end-point and intermediate-point variability could be reproduced by suitably adjusting the noise parameters, the parameter values obtained in the two conditions are not close to each other (Tables 1 and 2). It therefore appears difficult to reproduce the variability of the end points and of the intermediate points simultaneously.

2.3 Simulated Trajectories with the Uncertainty of Target Perception
As described in the introduction, a planned trajectory may itself exhibit variability if we assume uncertainty of target localization in the planning stage [5,6]. In this section, minimum commanded torque change trajectories between two points (start and end points) were computed to reproduce the uncertainty of target localization in motion planning. These simulated trajectories are hereinafter referred to as MCTC_un. The end points were given variability such that the 95% concentration ellipse of the end points of the measured trajectories was reproduced. The measured trajectories and the corresponding minimum commanded torque change trajectories are shown in Fig. 8. The variability of the intermediate points of the measured
trajectories (solid lines) and the minimum commanded torque change trajectories (dashed lines) are shown in Fig. 9. The variability of the intermediate points of the minimum commanded torque change trajectories was approximately in agreement with that of the measured trajectories, even though the minimum commanded torque change trajectories were calculated so that only the 95% concentration ellipse of the end points of the measured trajectories was reproduced.

2.4 Analysis of the Similarity Preservation of Trajectories
The time series of the joint torques were used to evaluate the similarity between trajectories, because trajectories can be distinguished by their motor commands even when their paths appear similar. Denote by S_m, S_ne, S_ni and S_mctc the sets of all measured trajectories, simulated trajectories with noise (Noise_end and Noise_inter), and minimum commanded torque change trajectories (MCTC_un), respectively. For each set S (where S denotes S_m, S_ne, S_ni or S_mctc), the similarity between the trajectories is evaluated as follows:

(1) Select one of the trajectories in S.
(2) Compute the squared sum of the differences of torques during a certain time interval A (see Fig. 1) between the selected trajectory and each of the other trajectories.
(3) Choose the N trajectories from S with the smallest squared sums. This set of N trajectories is referred to as S_A.
(4) Repeat procedures (2) and (3) for a time interval B that comes after A (see Fig. 1). Denote by S_B the set of the corresponding N trajectories.
(5) Denote by M the number of trajectories that belong to both S_A and S_B.
(6) Repeat (1) to (5) for every trajectory in S.
(7) Average M over all the selections in (1). This average is the index of the similarity preservation of S.

These procedures were applied to the experimental data. When the parameters were set to N = 20, A = 100–200 ms, and B = 250–350 ms, the squared sums of the differences of torques for a particular trajectory during intervals A and B, sorted in ascending order, are as shown in Fig. 10. The black circles in the figure represent the trajectories with the 20 smallest squared sums during time interval A; the same trajectories are marked with black circles for time interval B. The number of black circles within the top 20 ranks during time interval B gives the index of the similarity preservation of trajectories (= M).
For example, M is 8 for the trajectory shown in Fig. 10. The procedure was applied to the 100 measured trajectories, the 100 trajectories with noise (Noise_end and Noise_inter each), and the 100 minimum commanded torque change trajectories (MCTC_un) for every subject. The time intervals A and B were selected depending on the mean movement duration (μ_md) of each subject (A = 0.2μ_md–0.4μ_md, B = 0.5μ_md–0.7μ_md), and N = 20 was used. We also tested other parameter sets, as described later.
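Steps (1)–(7) can be sketched as follows. This is an illustrative reimplementation under stated assumptions: torque profiles are stored as a (trials × timesteps) array, the squared torque differences are summed over the samples in each interval, and the random profiles are placeholders, not the measured data.

```python
import numpy as np

def similarity_preservation(torques, interval_a, interval_b, n=20):
    """Mean number M of trajectories that remain among the n nearest
    neighbours (smallest squared torque difference) from interval A to B.
    torques: array of shape (trials, timesteps); intervals: slice objects."""
    trials = torques.shape[0]
    overlaps = []
    for i in range(trials):                          # step (1): pick a trajectory
        # step (2): squared sum of torque differences over each interval
        d_a = ((torques[:, interval_a] - torques[i, interval_a]) ** 2).sum(axis=1)
        d_b = ((torques[:, interval_b] - torques[i, interval_b]) ** 2).sum(axis=1)
        d_a[i] = d_b[i] = np.inf                     # exclude the trajectory itself
        s_a = set(np.argsort(d_a)[:n])               # step (3): n nearest in A
        s_b = set(np.argsort(d_b)[:n])               # step (4): n nearest in B
        overlaps.append(len(s_a & s_b))              # step (5): M for this trajectory
    return np.mean(overlaps)                         # steps (6)-(7): average M

# Placeholder data: 100 trials of a 400-sample torque profile (1 ms steps)
rng = np.random.default_rng(2)
torques = rng.standard_normal((100, 400))
m = similarity_preservation(torques, slice(100, 200), slice(250, 350), n=20)
```

As a point of comparison, for fully independent random profiles the expected overlap of two random 20-element subsets out of 99 candidates is about 20·20/99 ≈ 4, so values of M well above this chance level indicate preserved similarity.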
Fig. 10. Example of trajectory similarity evaluation. Black circles represent trajectories in which the squared sum of the differences of torques is small during the interval of 100–200 ms (left). The number of black circles within the top 20 ranks during the interval of 250–350 ms (right) indicates the measure of trajectory similarity.
Fig. 11. The mean of the number M of trajectories preserving similarity (unfilled: measured trajectories; light mesh: Noise_end; thick mesh: Noise_inter; filled: MCTC_un; error bars denote standard deviation)
2.5 Results
The means of the number M of trajectories preserving similarity for the measured trajectories (unfilled), Noise_end (light mesh), Noise_inter (thick mesh), and MCTC_un (filled) are shown in Fig. 11. For all subjects, the number M for the trajectories with noise was significantly smaller than that for the measured trajectories (p < 0.001). No significant difference was observed between M for the measured trajectories and for the minimum commanded torque change trajectories for subjects C and E. Although a significant difference existed between M for the measured trajectories and for the minimum commanded torque change trajectories for subjects A, B and D, the direction of the difference was inconsistent across subjects. The parameter N (the number of trajectories considered in ascending order of the squared sum of the differences of torques) and the intervals A and B (over which the squared sum of the differences of torques is evaluated) were then varied and the procedure was repeated. When N
was 15 and 30, and the intervals A and B were moved earlier, moved later, or extended temporally (e.g., A = 0.3μ_md–0.5μ_md and B = 0.6μ_md–0.8μ_md; A = 0.2μ_md–0.5μ_md and B = 0.5μ_md–0.8μ_md), the value of M for the trajectories with noise was always significantly less than that for the measured trajectories (p < 0.001).
3 Discussion
In this study, the similarity between trajectories was investigated to test whether the primary factor of movement variability is noise in motion execution or uncertainty of location perception in motion planning. If the variability of the actual trajectories results from random noise in motion execution, the preservation of the similarity between trajectories would depend on the probability distribution of the noise. In contrast, if variability is introduced in motion planning, the similarity between the realized trajectories is well preserved. The number M of trajectories preserving similarity was significantly smaller for the trajectories with noise than for the measured trajectories. The minimum commanded torque change trajectories, whose end points exhibited almost the same variability as the measured trajectories (based on the assumption of uncertainty of location perception in motion planning), reproduced M of the measured trajectories better than the trajectories with noise did. Further, the minimum commanded torque change trajectories could roughly reproduce the variability of the intermediate points of the measured trajectories even though this was not specified. It is therefore more plausible that the primary factor of movement variability is the uncertainty of location perception in motion planning rather than noise randomly added to motor commands during motion execution. In this experiment, the subjects were asked not to use visual feedback during motion execution, so the influence of the available somatosensory feedback on movement variability should be considered. If a subject correctly localized the target position in motion planning and relied heavily on somatosensory feedback during motion execution, the trajectories would tend to converge on the perceived position as time passes and the movement variability would decrease. As shown in Fig. 5, however, the variability of the intermediate points of the measured trajectories increases as time passes.
We therefore conjecture that the subjects performed the reaching movements under almost purely feedforward control and that the influence of somatosensory feedback on the results is small. Schmidt et al. [9] showed that the variability of end points increases when the movement distance and movement duration are longer. How can these results be explained under the assumption that movement variability arises in motion planning? The first explanation is that when a subject rapidly moves his hand toward a target, applying feedback control in the vicinity of the target is difficult. Since the subject is instructed to stop his hand with certainty at the end of the movement in a typical reaching task, he may be more concerned about stopping his hand than about the accuracy of reaching; hence, the variability of the end points increases. The second explanation is that the difficulty of computing a desired trajectory increases the amount of
computation as the movement duration increases. When a subject performs a slow, time-consuming movement, the amount of planned trajectory data may increase, which increases the movement variability. We have measured very slow reaching movements (movement duration of 6 s) and fast reaching movements (movement duration of 0.5 s) and found that the variability of the intermediate points of the slow movements was greater than that of the fast movements [10]. Movement variability thus reflects the difficulty of computing a planned trajectory. We propose the uncertainty of location perception as one of the factors of variability in motion planning. A planned trajectory is not necessarily invariant, such as a mean trajectory for the same task; it exhibits large variability. How a planned trajectory is expressed and computed, and what perceptual information it requires, should be investigated further. The method of estimating a planned trajectory from a measured trajectory [11] is useful for investigating variability in motion planning.

Acknowledgment. This work was supported by Grant-in-Aid for Scientific Research (B) 18360202 from JSPS.
References

1. Harris, C.M., Wolpert, D.M.: Signal-dependent noise determines motor planning. Nature 394, 780–784 (1998)
2. van Beers, R.J., Haggard, P., Wolpert, D.M.: The role of execution noise in movement variability. J. Neurophysiol. 91, 1050–1063 (2004)
3. Uno, Y., Kawato, M., Suzuki, R.: Formation and control of optimal trajectory in human multijoint arm movement. Biol. Cybern. 61, 89–101 (1989)
4. Nakano, E., Imamizu, H., Osu, R., Uno, Y., Gomi, H., Yoshioka, T., Kawato, M.: Quantitative examinations of internal representations for arm trajectory planning: minimum commanded torque change model. J. Neurophysiol. 81, 2140–2155 (1999)
5. Kitagawa, T., Fukuda, H., Fukumura, N., Uno, Y.: Investigation of error in human perception of hand position during arm movement (in Japanese). IEICE J89-D, 1429–1439 (2006)
6. Sakamoto, T., Fukumura, N., Uno, Y.: Variability in human reaching movements depends on perception of targets (in Japanese). Technical Report of IEICE NC2002-175, 19–24 (2003)
7. van der Helm, F.C.T., Rozendaal, L.A.: Musculoskeletal systems with intrinsic and proprioceptive feedback. In: Winters, J.M., Crago, P.E. (eds.) Biomechanics and Neural Control of Posture and Movement, pp. 164–174 (2000)
8. Hollerbach, J.M., Flash, T.: Dynamic interaction between limb segments during planar arm movement. Biol. Cybern. 44, 67–77 (1982)
9. Schmidt, R.A., Zelaznik, H., Hawkins, B., Frank, J.S., Quinn Jr., J.T.: Motor-output variability: a theory for the accuracy of rapid motor acts. Psychological Review 86, 415–451 (1979)
10. Oyama, T., Uno, Y.: Variability of human arm reaching movements occurs in motion planning (in Japanese). In: BPES20th, pp. 149–152 (2005)
11. Oyama, T., Uno, Y.: Estimation of a human planned trajectory from a measured trajectory (in Japanese). IEICE J88-DII, 800–809 (2005)
Directional Properties of Human Hand Force Perception in the Maintenance of Arm Posture

Yoshiyuki Tanaka and Toshio Tsuji

Department of Artificial Complex Systems Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-hiroshima, 739-8527, Japan
{ytanaka,tsuji}@bsys.hiroshima-u.ac.jp
http://www.bsys.hiroshima-u.ac.jp
Abstract. This paper discusses the directional properties of human hand force perception during maintained arm posture in operation of a robotic device, by means of robotic and psychological techniques. A series of perception experiments with varying force magnitudes and directions is carried out in different arm postures using an impedance-controlled robot. The experimental results demonstrate that human hand force perception is strongly affected by the stimulus direction and can be expressed with an ellipse. Finally, the relationship between hand force perception properties and hand force manipulability is analyzed by means of the human force manipulability ellipse. Keywords: Force perception, multi-joint arm, human force manipulability.
1 Introduction
Humans can control the dynamic properties of their own bodies according to the task by utilizing perceived information about environmental characteristics. For example, in the door open-close task, which commonly appears as a constrained task in our daily activities, we can carry out a smooth operation without feeling an excessive load by simultaneously controlling hand movements and arm configuration according to the kinematic and dynamic characteristics of the door. If the relationship between human sensation and the dynamic properties of movements for given environmental characteristics could be described quantitatively, it would be useful for evaluating and developing novel human-machine systems in which the operator can manipulate the machine comfortably. In the field of robotics, several methods have been proposed to evaluate the manipulability of a robot with a serial link mechanism for each posture from the kinematic and dynamic viewpoints in the operational task space [2,3]. These methods quantitatively indicate the directional dependence of robot performance by representing the shape and size of an ellipse for the specified posture. Some studies applied robot manipulability to the analysis of human movements [4] and the evaluation of welfare equipment [5] in order to develop more comfortable and safer mechanical interfaces for a human operator.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 933–942, 2008.
© Springer-Verlag Berlin Heidelberg 2008

The robot
manipulability, however, cannot account for the kinetic characteristics of the human musculoskeletal system, which should be taken into account when evaluating human movements. To address this problem, Tanaka and Tsuji developed human force manipulability based on human joint-torque characteristics, which can estimate human force capability in good agreement with experimental data, and applied it to the layout problem of driving interfaces in a human-robotic system [6][7]. However, these previous studies did not consider how a human operator would feel the dynamic properties of the designed devices, although the improvement of operational feeling was stated as a major purpose. On the other hand, a large number of studies on the properties of human perception have been carried out in the psychometric field, and there are some well-known laws based on experimental findings, e.g., the Weber-Fechner law: a just-noticeable difference in a stimulus is proportional to the magnitude of the original stimulus [8]. Many experimental studies have also been reported on human force perception [10]–[15]. In particular, Jones [13] examined the perceived magnitude of forces exerted by muscle groups in the finger, upper arm, and forearm, and reported that human force perception ability is strongly affected by muscle size. This experimental evidence suggests that directional properties might be involved in human hand force perception, since the muscles used change depending on the direction of the force exerted in multi-joint arm movement. To the authors' knowledge, however, no research has been reported on the directional dependency of human hand force perception. The objective of the present paper is to clarify whether the force direction affects human hand force perception during maintained arm posture in operation of a robotic device.
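The Weber-Fechner law cited above can be made concrete with a toy sketch. This is our illustration, not from the paper; the Weber fraction of 0.1 is an arbitrary example value, not a measured one.

```python
import math

def jnd(intensity, weber_fraction=0.1):
    """Weber's law: the just-noticeable difference Delta-I is proportional
    to the stimulus intensity I (the 0.1 Weber fraction is hypothetical)."""
    return weber_fraction * intensity

def sensation(intensity, i0=1.0):
    """Fechner's logarithmic sensation scale, obtained by integrating
    Weber's law: equal stimulus ratios give equal sensation increments."""
    return math.log(intensity / i0)

# Equal force ratios (5 -> 10 N and 15 -> 30 N) yield equal sensation steps
step_low = sensation(10.0) - sensation(5.0)
step_high = sensation(30.0) - sensation(15.0)
```

A directional dependence of force perception, as investigated in this paper, would mean the effective Weber fraction itself varies with the direction φ of the applied force.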
The experimental findings will be useful as basic data for designing the kinematic and dynamic properties of a robotic system, as well as control strategies that assist the operator's motion more safely and comfortably. This paper is organized as follows: Section 2 explains the experimental system and method, using an impedance-controlled robot, for investigating human hand force perception, and presents typical experimental results. Section 3 discusses the directional dependency of human hand force perception and the relationship between hand force perception properties and hand force manipulability by using the human force manipulability ellipse [7].
2 Hand Force Perception Experiment

2.1 Experimental System
Fig. 1 shows an overview of the constructed experimental apparatus. The robot is composed of two linear motor tables, each with one degree of freedom (NSK Ltd.; maximum driving force: x axis 100 [N], y axis 400 [N]; encoder resolution: x and y axes 4 [μm]), placed orthogonally in order to carry out two-dimensional hand motion exercises [16]. The hand force generated by the human operator is measured using a six-axis force/torque sensor on the handle (Nitta Corp.; rated capacity: force x and y axes 200 [N], z axis 400 [N],
Fig. 1. Experimental apparatus for the human hand force perception (φ: target direction; X0: initial position; X: hand position; Xd: target position of equilibrium)
torque 18 [Nm]). The hand position $X \in \mathbb{R}^2$ is measured using an encoder built into the linear motor table. The robotic device is impedance-controlled [17], and can provide force loads $F \in \mathbb{R}^2$ to the operator's hand by adjusting the impedance parameters. The dynamics of the robot is thus

$-F(t) = M\ddot{X}(t) + B\dot{X}(t) + K(X(t) - X_v(t))$   (1)
where $M = \mathrm{diag}(m_r, m_r)$, $B = \mathrm{diag}(b_r, b_r)$, $K = \mathrm{diag}(k_r, k_r) \in \mathbb{R}^{2 \times 2}$ are the robot inertia, viscosity, and stiffness matrices, and $X_v \in \mathbb{R}^2$ is the equilibrium of $K$. The robot control is performed using a DSP system (A&D Company, AD5410) that provides stable control based on the real-time simulation output of Matlab/Simulink (MathWorks Inc.), as well as high-quality data measurement at a high sampling rate. The biofeedback display, programmed in OpenGL, shows the target direction of the hand force φ with an arrow and the current hand position X with a circle.

2.2 Experimental Method
Experiments were carried out based on a magnitude estimation method [8]. A human subject was seated in front of the robot as shown in Fig. 2. The shoulders of the subject were restrained by a shoulder harness belt to the chair back, and the elbow of the right arm was hung from the ceiling by a rope to maintain the right arm posture in the horizontal plane without excessive co-contraction of the muscles. The right wrist and hand were tightly fixed to the robot handle by a molded plastic cast to eliminate the influence of tactile sensation as much as possible [9]. The subject was instructed to keep his hand at the initial position X0 by monitoring the information on the biofeedback display. The equilibrium position of the robot stiffness Xv was smoothly moved to the target position Xd
Fig. 2. Experimental condition on the arm posture
located on the circle of radius 4 [cm] in the specified direction φ within two seconds: the force stimulus was gradually increased to the target value over two seconds. After that, the subject perceived the reaction force for three seconds and reported the perceived value as a percentage of the standard force stimulus. The presented force series was composed of six different magnitudes, F = 5, 10, 15, 20, 25, 30 [N], and eight different directions, φ = 0, 45, 90, 135, 180, 225, 270, 315 [deg.], in consideration of the performance of the employed robot and of a human subject. The presented values of force F and direction φ were randomly determined by the experimental system. The standard force magnitude was set at Fs = 15 [N] and the standard force direction at φs = 0 [deg.], i.e., the standard force stimulus was (Fs, φs) = (15 [N], 0 [deg.]). Each of the six force magnitudes was presented five times in each direction, so that the total number of trials was 240 (= 30 × 8) for each arm posture. The standard force stimulus was presented after every three trials so that a human subject could remember it throughout the perception test. Under the above conditions, the arm posture θ = (θ1, θ2) ∈ ℝ² was changed as θ1 = 30, 60, 90 [deg.] under θ2 = 60 [deg.]. The presented force magnitude was generated by changing the robot stiffness kr. The robot inertia was set at mr = 5 [kg], and the robot viscosity br was automatically adjusted for critical damping to avoid oscillation of hand movements during the perception test. The sampling frequency for robot control was 1 [kHz] in this study. Fig. 3 shows typical measured signals during the perception test with the standard force stimulus. The hand displacement from the initial position for each axis and the hand force for each axis are presented in order from the top.
It can be seen that the smooth change of force amplitude is realized using the developed experimental system, and that hand motion is almost constant during the perception term for three seconds.
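The stimulus-generation scheme above (ramping the equilibrium Xv over two seconds while choosing kr for the target force and br for critical damping) can be sketched as follows. The function names, and the simplification that the hand is held fixed at X = 0, are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def stimulus_params(F_target, d=0.04, m_r=5.0):
    """Stiffness [N/m] and critically damped viscosity [Ns/m] for a
    target steady-state force F_target [N] over an equilibrium shift
    of d [m] (a sketch; the paper states only that k_r sets the force
    and b_r is tuned for critical damping)."""
    k_r = F_target / d              # steady state of Eq. (1): F = k_r * d
    b_r = 2.0 * np.sqrt(m_r * k_r)  # critical damping
    return k_r, b_r

def force_profile(F_target, d=0.04, T=7.0, dt=1e-3, ramp=2.0):
    """Force on a hand held at X = 0 while the equilibrium X_v ramps
    linearly to distance d over `ramp` seconds, so F = k_r * x_v."""
    k_r, _ = stimulus_params(F_target, d)
    t = np.arange(0.0, T, dt)
    x_v = d * np.clip(t / ramp, 0.0, 1.0)
    return t, k_r * x_v
```

For the standard stimulus Fs = 15 [N] this gives kr = 375 [N/m], and the force rises linearly during the two-second ramp before holding at the target.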
Fig. 3. Time profiles of the measured signals during a force perception test with the standard force stimulus
2.3 Experimental Results
Six right-handed volunteers (male university students, aged 23-25) participated in the perception test. The subjects carried out practice trials until they understood the point of the perception test and had enough confidence in their answers. Fig. 4 shows the results of the force perception test for Subject A in the case of (θ1, θ2) = (30 [deg.], 60 [deg.]), where the vertical axis in each graph is the perceived force amplitude Fp, and the horizontal axis is the true force Ft normalized by the standard force magnitude Fs (= 15 [N]). The solid line for each direction denotes a regression curve with a logarithmic function, y = A ln(x) + B, obtained by fitting all the data (30 samples) using the least-squares method, and the dotted lines denote the 95% prediction interval under the assumption that the data are normally distributed. The black circle • is the perception result for Fs, and r² represents the coefficient of determination between true and perceived values.
[Fig. 4: eight panels, one per direction φ; regression curves y = A ln(x) + B with A = 48.6-62.5, B = -178.0 to -95.8, and r² = 0.88-0.98.]
Fig. 4. Relationship between true and perceived hand forces under θ1 = 30 [deg.] (Subject A)
The subject almost correctly perceived the standard force stimulus (Fs, φs) in each specified arm posture, and there exist ranges where the subject perceives the hand force as larger or smaller than the true one. The perceived force tends to be larger than the true one when the presented force magnitude is smaller than the standard one, and smaller when the presented force magnitude is larger. The coefficient of determination is over 0.88, so the perceived force is almost proportional to the logarithm of the presented force. That is, the Weber-Fechner law is almost satisfied in hand force perception for all combinations of the specified directions and arm postures. Analysis of variance (ANOVA) revealed a significant effect of the perception direction φ on the force perceived for the standard force magnitude (the mark • in Fig. 4) in each of the specified arm postures at a significance level of p < 0.001: F(7, 32) = 20.94 for θ1 = 30 [deg.]; F(7, 32) = 8.14 for θ1 = 60 [deg.]; F(7, 32) = 6.24 for θ1 = 90 [deg.]. These characteristics of hand force perception were also observed for the other subjects.
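The logarithmic regression y = A ln(x) + B and its coefficient of determination can be reproduced with an ordinary least-squares fit; the sketch below shows the procedure (the function name is ours, and any data passed in would be synthetic, not the subjects' measurements).

```python
import numpy as np

def fit_log_curve(Ft, Fp):
    """Least-squares fit of Fp = A*ln(Ft) + B, as used for the
    regression curves in Fig. 4, plus the coefficient of
    determination r^2 (a sketch; the paper's fitting code is not
    given)."""
    X = np.column_stack([np.log(Ft), np.ones_like(Ft)])
    (A, B), *_ = np.linalg.lstsq(X, Fp, rcond=None)
    pred = A * np.log(Ft) + B
    ss_res = np.sum((Fp - pred) ** 2)
    ss_tot = np.sum((Fp - np.mean(Fp)) ** 2)
    return A, B, 1.0 - ss_res / ss_tot
```

Because the model is linear in A and B after the ln(x) transform, a single linear least-squares solve suffices; no iterative fitting is needed.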
3 Directional Properties of Human Hand Force Perception

3.1 Human Force Perception Ellipse
Further analysis on the directional properties of human hand force perception was executed using “Point of Subjective Equality (PSE)” [8] with respect to the standard force stimulus. The PSE in this paper was defined as the value of
a regression curve at Fp = 100, that is, the force magnitude that makes a human subject perceive the standard force stimulus. The PSE was calculated backward using the regression curve for each of the directions and arm postures. Fig. 5 shows the results of the PSE analysis for all subjects depending on the arm posture θ1, where the vertical axis is the force magnitude corresponding to the PSE and the horizontal axis is the perception direction φ. It can be seen that the PSE changes with the perception direction as well as with the arm posture, although some individual differences exist. Next, the change of the PSE with the direction φ was expressed with an ellipse defined by the following quadratic form:

$\begin{pmatrix} F_p\cos\phi - d \\ F_p\sin\phi - e \end{pmatrix}^T \begin{pmatrix} a & b \\ b & c \end{pmatrix} \begin{pmatrix} F_p\cos\phi - d \\ F_p\sin\phi - e \end{pmatrix} = 1$   (2)

where a, b, c, d, and e are determined by fitting the PSE shown in Fig. 5 with a least-squares method. Fig. 6 shows the resulting human hand force perception ellipse (HFPE) depending on the arm posture for Subjects A, B, and C, where the radius between the center of an HFPE and a white circle denotes the value of the PSE. It can be seen that the directional change of the PSE is well expressed with an ellipse. The major axis of the ellipse represents the direction in which a force larger than the standard force Fs (= 15 [N]) is required to make a human subject have the illusion of perceiving Fs, while the minor axis represents the direction in which a smaller force suffices. Accordingly, the HFPE indicates that a human subject perceives Fs as smaller toward the direction of the major axis and as larger toward the direction of the minor axis. It can also be found that the HFPE changes depending on the specified arm posture, and that the major axis tends to be oriented toward the shoulder. Similar tendencies were observed for all subjects, although some individual exceptions were found.

3.2 Relationship with Human Force Manipulability
Finally, the directional properties of human hand force perception were associated with the human hand force manipulability ellipse (HFME) [7], defined by

$f^T \left(J(\theta)^T T(\theta)^{-1}\right)\left(J(\theta)^T T(\theta)^{-1}\right)^T f \le 1$   (3)
where f denotes a hand force generated by a human, J denotes the Jacobian matrix of the hand position with respect to the arm posture, and T denotes a matrix representing the characteristics of human arm joint torque under maximum effort. The size and shape of the HFME can be utilized as a performance index of the maximum hand force that can be generated in each operational direction under the specified arm posture θ: a large operational force can easily be exerted in the major-axis direction, while this is difficult in the minor-axis direction. The HFME for each of the arm postures is drawn as a gray ellipse under the HFPE in Fig. 6, where the human arm is modeled as a rigid
Fig. 5. PSE depending on the arm postures and the direction of motion for the subjects
Fig. 6. HFPE and HFME for the three subjects (Subjects A, B and C)
link structure with two rotational joints (see Fig. 2). The lengths of the forearm and upper arm were measured for each subject, and the joint-torque matrix was determined based on the previous work [7]. It can be seen that the orientation of the HFME is similar to that of the HFPE in each arm posture for these subjects. The orientation difference Δψ between the HFPE and the HFME is then summarized for all six subjects in Fig. 7. Although some differences exist, the orientation of the HFPE almost agrees with that of the HFME. The results indicate that a human perceives a reaction force exerted on the hand as smaller than the actual force in the direction in which he can easily generate a large hand force, and vice versa. Possible reasons why the anisotropy of the HFPE is less pronounced than that of the HFME are that 1) the HFME expresses the directional properties of generating the maximum hand force under maximum effort, and 2) a human perceives a reaction force as smaller than the actual one when the force magnitude is sufficiently large, since the Weber-Fechner law is almost satisfied in hand force perception.
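As a rough sketch of the HFME computation of Eq. (3) for a two-link planar arm like that of Fig. 2, the code below builds the Jacobian and extracts the major-axis orientation of the ellipse by eigendecomposition. The link lengths and the diagonal joint-torque matrix T are placeholder values for illustration, not the measured subject data of [7].

```python
import numpy as np

def jacobian(theta1, theta2, l1=0.30, l2=0.35):
    """Planar two-link arm Jacobian dX/dtheta (link lengths [m] are
    placeholder values, not measured subject data)."""
    s1, c1 = np.sin(theta1), np.cos(theta1)
    s12, c12 = np.sin(theta1 + theta2), np.cos(theta1 + theta2)
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def hfme_orientation(theta1, theta2, T=np.diag([60.0, 40.0])):
    """Major-axis orientation [rad] of the HFME
    f^T (J^T T^-1)(J^T T^-1)^T f <= 1 of Eq. (3); T is an assumed
    diagonal joint-torque matrix [Nm]."""
    J = jacobian(theta1, theta2)
    A = J.T @ np.linalg.inv(T)
    M = A @ A.T                 # ellipse matrix: f^T M f <= 1
    w, V = np.linalg.eigh(M)    # ascending eigenvalues
    major = V[:, 0]             # smallest eigenvalue -> longest axis
    return np.arctan2(major[1], major[0])
```

The orientation difference Δψ of Fig. 7 would then be the angle between this axis and the fitted HFPE major axis, taken modulo 180° since an ellipse axis has no sign.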
Fig. 7. Orientation differences between HFPE and HFME for all subjects
A series of experimental and simulation results demonstrates the directional properties of human hand force perception and their close relationship with the motor performance of hand force generation.
4 Conclusion
This paper experimentally analyzed the influence of direction on the perception of human hand force during maintained arm posture by means of robotic and psychological techniques. The main results are summarized as follows: 1) humans have directional hand force perception properties that change the magnitude of subjective sensation according to the direction of the exerted force; 2) these directional properties change according to arm posture and can be expressed with an ellipse; and 3) there is a close relationship between the HFPE and the HFME. These findings on human sensorimotor characteristics will be useful in designing advanced user-friendly machines, virtual reality systems, and rehabilitation systems using robotic devices. Future research will examine other arm configurations while considering muscle co-contractions and the sensation of active movements during the force perception test, in order to clarify the mechanism of human hand force perception physiologically. It is also planned to evaluate and design human-machine systems, such as a neuro-rehabilitation system, based on human sensorimotor characteristics. Acknowledgments. This research was supported in part by a Grant-in-Aid for Scientific Research from the Japanese Ministry of Education, Science and Culture (18760193, 15360226). The authors are grateful to A&D Co. Ltd. for their kind support with the DSP instrument.
References

1. Ito, K.: Physical wisdom systems. Kyoritsu Shuppan (in Japanese) (2005)
2. Asada, H.: Geometrical representation of manipulator dynamics and its application to arm design. Transactions of the ASME, Journal of Dynamic Systems, Measurement, and Control 105, 105-135 (1983)
3. Yoshikawa, T.: Analysis and control of robot manipulators with redundancy. In: Brady, M., Paul, R. (eds.) Robotics Research: The First International Symposium, pp. 735-747. MIT Press, Cambridge (1984)
4. Hada, M., Yamada, D., Tsuji, T.: Equivalent inertia of human-machine systems under constraint environments. In: Proceedings of the Third Asian Conference on Multibody Dynamics, vol. 00643 (August 2006)
5. Otsuka, A., Tsuji, T., Fukuda, O., Shimizu, M.E., Sakawa, M.: Development of an internally powered functional prosthetic hand with a voluntary closing system and thumb flexion and radial abduction. In: Proceedings of the 2000 IEEE International Workshop on Robot and Human Interactive Communication, pp. 405-410 (2000)
6. Tanaka, Y., Yamada, N., Masamori, I., Tsuji, T.: Manipulability analysis of lower extremities based on human joint-torque characteristics (in Japanese). Transactions of the Society of Instrument and Control Engineers 40(6), 612-618 (2004)
7. Tanaka, Y., Yamada, N., Nishikawa, K., Masamori, I., Tsuji, T.: Manipulability analysis of human arm movements during the operation of a variable-impedance controlled robot. In: Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3543-3548 (2005)
8. Oyama, T., Imai, S., Wake, T.: Sensory and Perceptual Handbook. Seishin Shobo (in Japanese) (1994)
9. Tanaka, Y., Abe, T., Tsuji, T., Miyaguchi, H.: Motion dependence of impedance perception ability in human movements. In: Proceedings of the First International Conference on Complex Medical Engineering, pp. 472-477 (2005)
10. Gandevia, S.C., McCloskey, D.I.: Sensations of heaviness. Brain 100, 345-354 (1977)
11. Jones, L.A.: Role of central and peripheral signals in force sensation during fatigue. Experimental Neurology 81, 497-503 (1983)
12. Miall, R.C., Ingram, H.A., Cole, J.D., Gauthier, G.M.: Weight estimation in a "deafferented" man and in control subjects: are judgments influenced by peripheral or central signals? Experimental Brain Research 133, 491-500 (2000)
13. Jones, L.A.: Perceptual constancy and the perceived magnitude of muscle forces. Experimental Brain Research 151, 197-203 (2003)
14. Yamakawa, S., Fujimoto, H., Manabe, S., Kobayashi, Y.: The necessary conditions of the scaling ratio in master-slave systems based on human difference limen of force sense. IEEE Transactions on Systems, Man, and Cybernetics, Part A 35(2), 275-282 (2005)
15. Sasaki, H., Fujita, K.: Experimental analysis of the role of visual information in hardness cognition displayed by a force display system and the effect of altered visual information (in Japanese). Journal of the Virtual Reality Society of Japan 5(1), 795-802 (2000)
16. Tanaka, Y., Matsushita, K., Tsuji, T.: Sensorimotor characteristics in human arm movements during a virtual curling task (in Japanese). Transactions of the Society of Instrument and Control Engineers 42(12), 1288-1294 (2006)
17. Hogan, N.: Impedance control: an approach to manipulation, Parts I-III. ASME Journal of Dynamic Systems, Measurement, and Control 107(1), 1-24 (1985)
Computational Understanding and Modeling of Filling-In Process at the Blind Spot Shunji Satoh and Shiro Usui Laboratory for Neuroinformatics, RIKEN Brain Science Institute, Japan [email protected]
Abstract. A visual model for filling-in at the blind spot is proposed. The general scheme of standard regularization theory is used to derive a visual model deductively. First, we indicate problems of the diffusion equation, which is frequently used for various kinds of perceptual completion. Then, we investigate the computational meaning of a neural property discovered by Matsumoto and Komatsu (J. Neurophysiology, vol. 93, pp. 2374-2387, 2005) and introduce second-derivative quantities related to image geometry into the a priori knowledge of the missing image at the blind spot. Moreover, two different information pathways for filling-in (slow conduction paths of horizontal connections in V1, and fast feedforward/feedback paths via V2) are regarded as the neural embodiment of an adiabatic approximation of the V1-V2 interaction. Numerical simulations show that the outputs of the proposed filling-in model are consistent with a neurophysiological experimental result, and that the model is a powerful tool for digital image inpainting.
1 Introduction
The blind spot is the area in the visual field that corresponds to the lack of photoreceptors on the retina. Although no receptors exist to detect light stimuli at the blind spot (BS), we do not see a black disk or a strange pattern; rather, we perceive the same color or pattern as the surroundings. This phenomenon is referred to as perceptual filling-in at the blind spot. Perceptual filling-in is not a special phenomenon limited to the blind spot. It is observed in various situations, e.g., an artificial scotoma produced by transcranial magnetic stimulation [1], patients with retinal scotoma [2], a normal visual field (no defect of the visual field) with stabilized retinal images [3], and others [4]. These findings imply that perceptual filling-in is a general and common process in our visual system. Consequently, research on BS computation contributes to understanding the general processes of the visual system. Komatsu and colleagues reported a quantitative analysis of V1 neural responses of awake macaque monkeys to bar stimuli presented on the blind spot
This work was partially supported by the Grant-in-Aid for Young Scientists (#17700244), the Ministry of Education, Culture, Sports, Science and Technology, Japan.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 943-952, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. (a) Schematic examples of bar stimuli. BS represents the blind spot. RF is the receptive field of a recorded neuron. (b) Retinal inputs corresponding to the above stimuli. (c) Responses of the recorded neuron. (d) Abstract model proposed by Matsumoto & Komatsu (adapted from Fig. 3 in J. Neurophysiology, vol. 93, pp. 2374-2387, 2005); this is also the neural representation of the filling-in algorithm proposed in this article. (e) Simulation result of the model proposed in this article.
[5,6]. The receptive fields (RFs) of the recorded neurons overlapped with the BS area (Fig. 1(a)). Bar stimuli of various lengths were presented at the BS. One end of the bar stimulus was fixed and the other end was varied (Figs. 1(a1)-1(a4)). Some V1 neurons showed a significant increase in their activities when the bar end exceeded the BS (Fig. 1(c)), although those activities remained constant as long as the end was within the BS area. These results indicate that V1 neurons perform the filling-in process in addition to orientation detection. In addition, Matsumoto and Komatsu posited the existence of two different pathways with different velocities of visual signals: (i) fast feedforward [ff] and feedback [fb] connections via V2, and (ii) slow horizontal connections [hc] connecting V1 neurons (see Fig. 1(d)). However, no discussion exists about the necessity of these properties. Why are the different velocities necessary? What is the significance of the two different pathway velocities? As shown in Fig. 1(c), an abrupt increase of the neural response was observed even though the retinal stimulation increased only slightly, depending on whether one end of a bar appears or not (see Figs. 1(b3) and 1(b4)). This phenomenon implies the complexity and nonlinearity of the filling-in process at the BS. It also shows the difficulty of obtaining appropriate patterns. The different velocities of the pathways and the existence of V2 neurons might be the keys to solving the difficult problem of the filling-in process. We therefore studied the computational necessity of the different conduction velocities and the role of V2 neurons. For this purpose, we interpreted the physiological findings of Matsumoto and Komatsu [6] from a theoretical point of view. We constructed a visual model so that it reproduces the data shown in Fig. 1(c). We will deductively obtain
Fig. 2. (a) Filling-in process is applied in BS area B (a gray rectangle). (b) Example of a desired filling-in pattern. (c) Filling-in by the diffusion equation. (d) Filling-in by the proposed visual model.
our visual model based on standard regularization theory, because the theory enables us to understand our filling-in model as an optimizer of a pre-defined evaluation function. Moreover, the dynamics derived from standard regularization is generally a reaction-diffusion equation with few parameters, which can easily be implemented as a neural network. Applying standard regularization theory to explain the filling-in process may seem a trivial approach, because the theory is applicable to various types of completion problems that recover missing information, e.g., estimation of optical flow, restoration of surfaces, aperture problems [7], and so on. However, no successful work exists on BS filling-in based on standard regularization theory that reflects the physiological characteristics described above. One reason why no mathematical model for BS filling-in exists may be the high nonlinearity and complexity of the filling-in problem. To reiterate, the neural properties related to the BS should be keys to solving this complex problem. We expect the filling-in algorithm in our visual system to be an effective image-processing algorithm. For example, digital image inpainting (DII) is a technique that repairs a damaged or removed region of an image using automatic mechanisms [8]. Other applications of DII include restoration of video images, image transmission through narrow-band systems, and so on [9]. Our visual model is simply an algorithm for DII if we regard the BS area as the area to be restored. In addition to the initial purpose of this article, we also evaluate the effectiveness of our visual model as a DII algorithm using color input images.
2 Evaluation Function for Filling-In

2.1 Problems of the Diffusion Equation for the Filling-In Process
To formulate the filling-in problem, we define an evaluation function E[I] (an energy function, a functional of I) for filling-in patterns I(x) such that E takes a small value when the image I(x) is a desired one, where I signifies the brightness and x = (x, y) is a spatial position within the BS area. The BS area and its boundary are referred to as B and ∂B, respectively. A desired filling-in for
Fig. 2(a) is a completed bar, as shown in Fig. 2(b), in which the spatial change of I(x) in B should be small (no intensity fluctuation along the x-axis). A simple functional that evaluates intensity change is

$E_1 = \frac{1}{2}\int_B d^2x\, \|\nabla I(x,t)\|^2 = \frac{1}{2}\int_B d^2x\, I_\xi^2(x)$   (1)

where $I_\xi$ is the directional derivative of I in the direction ξ, which is parallel to the gradient of I at each point. We obtain

$I_\xi = \frac{\partial I}{\partial \xi} = (\cos\xi)\frac{\partial I}{\partial x} + (\sin\xi)\frac{\partial I}{\partial y}$   (2)
Hereafter, we indicate directional derivatives using subscripts; for example, $I_{\xi\eta}$ means the second-order directional derivative of I in the directions ξ and η. As shown in Fig. 3, the direction of ∇I (the gradient vector of I) is referred to as ξ, and the orthogonal direction as η. The algorithm we use is an iterative update of I such that E[I] decreases as time progresses. We obtain the dynamics (update rule) of I by applying the steepest descent method to the functional E:

$\frac{\partial}{\partial t} I(x,t) = \begin{cases} \nabla^2 I(x,t), & \text{if } x \in B \\ \nabla^2 I(x,t) - \left(\frac{\partial}{\partial x} + \frac{\partial}{\partial y}\right) I(x,t), & \text{if } x \in \partial B \end{cases}$   (3)

Equation (3) is the diffusion equation of the brightness I. The steady state of I is referred to as $\bar{I}$; it is the result of the filling-in process. Figure 2(c) shows the result of a numerical simulation of (3). The result is far from our expectation. We confirmed that this was not attributable to local minima of the steepest descent method (the value of $E_1$ for Fig. 2(c) was smaller than that for Fig. 2(b)). To obtain Fig. 2(b) as the steady state $\bar{I}$, we applied other possible methods, e.g., multi-resolution, multi-grid, and anisotropic diffusion. However, every method failed. Apparently, we will never obtain the desired result as long as we use (1) as the evaluation function.

2.2 New Functional with Curvature Terms
The term $I_\xi$ in (1) is the first-order derivative of the image, and it represents the output of a V1 neuron selective to the ξ-orientation. This computational aspect reveals a flaw of (1): the functional $E_1$ uses only V1-coded visual information. On the other hand, Matsumoto and Komatsu found that V2 neurons contribute to filling-in. That is, from a computational viewpoint, we should introduce visual information coded by V2 neurons into the evaluation function. Hence, we introduce angular information of contours, because some V2 neurons are selective to angles embedded within V-shaped patterns [10]. Important quantities associated with contour angles are (i) κ, the curvature of the level sets, and (ii) μ, the curvature of the flow lines; the former signifies the curvature of
Fig. 3. Quantities used in this work (see the text for detail)
contour lines of I, and the latter is the curvature of the lines crossing the contour lines at right angles (see Fig. 3(b)). Both κ and μ represent the smoothness of edges or image contours. We then propose the following new functional E based on $E_1$:

$E = \frac{1}{2}\int d^2x\, \left(\bar{\kappa}^2(x,t) + \bar{\mu}^2(x,t)\right) I_\xi^2(x,t)$   (4)

In that equation,

$\bar{\kappa} = \kappa I_\xi = I_{\eta\eta} = \frac{I_y^2 I_{xx} - 2 I_x I_y I_{xy} + I_x^2 I_{yy}}{I_x^2 + I_y^2}$   (5)

$\bar{\mu} = \mu I_\xi = I_{\xi\eta} = \frac{(I_x^2 - I_y^2) I_{xy} - I_x I_y (I_{yy} - I_{xx})}{I_x^2 + I_y^2}$   (6)

3 Dynamics for Filling-In at the Blind Spot
We expect the dynamics for filling-in and the corresponding neural network to emerge deductively by applying the steepest descent method to (4). However, the resultant dynamics is very complex, as shown below:

$\frac{\partial I}{\partial t} = \left( I_y^6 I_{yyyy} I_x^2 + 3 I_y^4 I_{yyyy} I_x^4 + 3 I_y^2 I_{yyyy} I_x^6 + \cdots \text{(74 terms)} \cdots + 3 I_y^6 I_x^2 I_{xxxx} + 3 I_y^4 I_x^4 I_{xxxx} + I_y^2 I_x^6 I_{xxxx} \right) \big/ \left( I_x^2 + I_y^2 \right)^{3/2}$   (7)
The problematically complex equation (7) comprises 80 terms. It seems impossible to analyze all 80 terms to investigate their physiological meanings. Similarly, we cannot expect a physiologically plausible visual model from (7). Again, we attempt to use a physiological phenomenon as a key to solving the problem. The key is the different pathways with different conduction velocities: fast pathways via V2, and slow pathways within V1 using horizontal connections. The faster conduction velocity of the visual pathway via V2 implies that $\bar{\kappa}(x,t)$ and $\bar{\mu}(x,t)$ are optimized faster in time than $I_\xi(x,t)$. In the
Fig. 4. (a) Spatial distributions of receptive fields formulated as Gaussian derivatives (σ = 1). The RF model is selective to horizontal bars when θ = 90°. (b) $\tilde{\mu}_{\xi\theta\theta}$ is the sum of four neural outputs selective to about 27° angular differences of lines when θ = 90°. (c) $\tilde{\kappa}_{\eta\theta\theta}$ is the sum of four neural outputs selective to those four patterns when θ = 90°.
extreme situation of velocity difference, we can assume a constant value of $I_\xi(x,t)$ with respect to time t. This assumption corresponds to variable separation and adiabatic approximation. Applying the steepest descent method to (4) under this assumption, we obtain the following dynamics:

$\frac{\partial}{\partial t} I(x,t) = \frac{\partial}{\partial \eta}\tilde{\kappa} + \frac{\partial}{\partial \xi}\tilde{\mu} - \left(\frac{\partial^2}{\partial \xi^2} + \frac{\partial^2}{\partial \eta^2}\right)\tilde{\kappa} + \lambda I_{\eta\eta}$   (8)

$= \tilde{\kappa}_\eta + \tilde{\mu}_\xi - \Delta\tilde{\kappa} + \lambda I_{\eta\eta}$   (9)

where $\tilde{\kappa} = I_\xi^2 \kappa$, $\tilde{\mu} = I_\xi^2 \mu$, and λ is a positive constant. The fourth term of (9) is a consistency term that alleviates the drawbacks of variable separation and adiabatic approximation. Compared with (7), (9) is a simple, analyzable equation that can be implemented as a neural network. However, we have no guarantee that desired filling-in patterns are obtained by (9). We will demonstrate the validity of (9) in Section 5 using numerical simulations.
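The curvature quantities $\bar{\kappa}$ and $\bar{\mu}$ of Eqs. (5)-(6), which enter the dynamics through $\tilde{\kappa}$ and $\tilde{\mu}$, can be evaluated on a discrete image with central differences. The following is our own discretization sketch, not the authors' code; the small ε regularizes flat regions where |∇I| = 0.

```python
import numpy as np

def curvature_terms(I, eps=1e-12):
    """kappa-bar = I_eta_eta and mu-bar = I_xi_eta of Eqs. (5)-(6),
    evaluated with central differences on a unit grid (a sketch;
    eps avoids division by zero where the gradient vanishes)."""
    Iy, Ix = np.gradient(I)        # first derivatives (rows = y, cols = x)
    Iyy, _ = np.gradient(Iy)       # second derivatives
    Ixy, Ixx = np.gradient(Ix)
    denom = Ix**2 + Iy**2 + eps
    kappa_bar = (Iy**2 * Ixx - 2 * Ix * Iy * Ixy + Ix**2 * Iyy) / denom
    mu_bar = ((Ix**2 - Iy**2) * Ixy - Ix * Iy * (Iyy - Ixx)) / denom
    return kappa_bar, mu_bar
```

As a sanity check, a linear intensity ramp has straight level sets and straight flow lines, so both quantities should vanish everywhere.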
4 Neural Dynamics for Filling-In at the Blind Spot
The intention of this study is the construction of a V1 model for filling-in. We next consider the dynamics of orientation-selective neurons and their neural implementation. To simulate the neurophysiological experiments illustrated in Fig. 1(d), we must consider the dynamics of V1 neurons selective to bar stimuli. In this section, we derive the dynamics of orientation-selective neurons from Eq. (9). We apply a Gaussian derivative (GD) model as the receptive fields (RFs) of V1 neurons [11]. The RF model selective to the θ-orientation is written as $\partial^2 g_\sigma(x)/\partial\theta^2$, where $g_\sigma$ is the Gaussian function with variance $\sigma^2$. Figure 4(a) shows examples of θ-preferring RFs. We use a linear neuron model. The output of a θ-preferring
Computational Understanding and Modeling of Filling-In Process
[Fig. 5 appears here. (a) Schematic of the model: V1 and V2 layers with feedforward [ff], feedback [fb], and horizontal-connection [hc] pathways; RFs and an effective RF over the blind spot (BS). (b) Plot of normalized response (0–1) against bar length (0–16 pixels).]
Fig. 5. (a) Schematic representation of the proposed filling-in model. (b) Simulation results of a computational V1 neuron using horizontal bars of various lengths.
V1 neuron whose RF center is x is referred to as sθ (x, t). Presume that Iξ (x, t) is constant around x and that σ is a small value. We obtain
∂s_θ(x, t)/∂t = ∂/∂t (∂²g/∂θ² ∗ I) = ∂²/∂θ² (∂I/∂t)  (10)
              = κ̃_ηθθ + μ̃_ξθθ − κ̃_ηηθθ − κ̃_ξξθθ + λ ∂²s_θ/∂η² .  (11)

Equation (11) indicates that a V1 neuron s_θ is affected by κ̃_ηθθ, μ̃_ξθθ and other terms. We expect those terms to be the outputs of V2 neurons. Because of page limitations, details are not presented in this article. However, we found that the value of μ̃_ξθθ is the sum of four neurons selective to V-shaped patterns or junctions with about 27° angular difference, as illustrated in Fig. 4(b). In addition, κ̃_ηθθ is the sum of neurons selective to the patterns in Fig. 4(c). We also found angular selectivity in κ̃_ηηθθ and μ̃_ξξθθ. The fifth term of (11) represents intra-cortical interaction between V1 neurons connected by horizontal connections [12]. Results show that our V1 model neurons, s_θ(x, t), are affected by the output of V2 model neurons, which encode angular information of lines, and which are in turn affected by V1 neurons through horizontal connections. Figure 5(a) depicts a schematic representation of our model formulated by (11). Comparing Fig. 5(a) to Fig. 1(d), we conclude that our computational model is consistent with the physiological abstract model. However, explicit intra-cortical connections in V2 do not emerge in (11). This problem will be addressed in future work.
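To make the GD receptive-field model concrete, the sketch below builds a second-derivative-of-Gaussian RF like those of Fig. 4(a) and applies it as a linear neuron (a single inner product). The function names and window size are our own choices; the paper's θ-derivative notation is realized here as the second directional derivative of g_σ taken across the preferred orientation.

```python
import numpy as np

def gd_rf(theta_deg, sigma=1.0, size=9):
    """Gaussian-derivative RF: second directional derivative of a 2-D
    Gaussian, taken across the preferred bar orientation theta."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    t = np.deg2rad(theta_deg)
    u = x * np.cos(t) + y * np.sin(t)      # coordinate across the bar
    # analytic second derivative of a Gaussian along u
    return ((u / sigma**2)**2 - 1.0 / sigma**2) * g

def linear_response(patch, rf):
    """Linear V1 neuron model: inner product of image patch and RF."""
    return float(np.sum(patch * rf))
```

As expected from the caption of Fig. 4, a horizontal bar drives the θ = 90° RF far more strongly than the θ = 0° one.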
5
Numerical Simulations
First, numerical simulations of (9) are performed to investigate whether the expected filling-in pattern is obtained using Fig. 2(a) as the initial value of I. The parameter is λ = 0.1. Figure 2(d) is the steady state of I (the filling-in pattern). We find the expected pattern in Fig. 2(d), in which the broken bar of Fig. 2(a) is completed.
S. Satoh and S. Usui
Fig. 6. (a) A baboon with graffiti. (b) Inpainted (restored) baboon using the proposed visual model. (c) Almost half of the visual information is missing. (d) Image restored by the proposed visual model.
Next, we evaluate the effectiveness of our visual model as a digital image inpainting (DII) algorithm. Results are shown in Fig. 6 and Fig. 7. The areas B of Fig. 6(a) and Fig. 6(c) are, respectively, black curvy lines drawn by one of the authors and a checkered orange area. (Color results are available in the electronic version of this article.) Simulations for color images are executed as follows: decompose a color image into three (R, G, B) intensity channels, apply (9) to each color channel, and unify the three steady states into one image. Images restored by our visual model are shown in Figs. 6(b) and 6(d). We find that our visual model is effective as a DII algorithm. The situation portrayed in Fig. 6(c) is possible, for example, in the case of block loss because of a packet drop during wireless transmission, gap padding for image magnification, and so on. We compare our model with the Spot Healing Brush Tool of Adobe Photoshop CS2 (options at default settings). Neither method repaired texture areas such as the baboon's fur, but our model restores strong edges, whereas the Photoshop tool
Fig. 7. (a) The black rectangle is area B to be filled in. (b) Result of the proposed visual model. (c) Result of Adobe Photoshop CS using default settings.
gives a blurred image. The reason our model is not applicable to textured areas is that the evaluation function (9) contains no texture information. Finally, we simulate (11) to investigate whether our model neuron s_θ reproduces the physiological result of Fig. 1(c). The widths of the horizontal bars are two pixels; the length varies from 0 to 14 pixels in 1-pixel steps. Parameters are θ = π/2 (not θ = 0) and σ = 1, such that the neuron s_θ is selective to the horizontal bars. The receptive field of the simulated neuron overlaps BS area B. Figure 5(b) illustrates the steady values of s_θ. We find consistency between the physiological results and our model. One end of the bar emerges from BS area B, as in Fig. 1(b4), when the bar length is greater than 9 pixels. In this situation, neuron s_θ implicitly performs orientation detection for a completed bar, as in Fig. 1(a4), through its intrinsic filling-in process. For that reason, s_θ shows a considerable increase in its activity when the bar length becomes greater than nine pixels.
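The colour procedure described earlier in this section — decompose into R, G, B channels, run the gray-level dynamics (9) on each, and recombine the steady states — can be sketched as a thin wrapper. The name `inpaint_color` and the `fill_channel` callback are our own stand-ins, not the paper's API; any single-channel filling-in routine can be passed in.

```python
import numpy as np

def inpaint_color(img, mask, fill_channel):
    """Apply a single-channel filling-in routine to each of the three
    colour channels independently, then recombine into one image."""
    out = img.astype(float).copy()
    for c in range(out.shape[-1]):
        out[..., c] = fill_channel(out[..., c], mask)
    return out
```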
6
Summary
To solve the filling-in problem, we employed two physiological findings in a visual model and presented novel interpretations of those findings: variable separation and adiabatic approximation. Results showed the physiological consistency and plausibility of our model and demonstrated its effectiveness as an algorithm for digital image inpainting. As a basis for computational modeling, standard regularization theory and the steepest descent method are used to expose the sort of problem our model solves or optimizes. Our visual model optimizes an evaluation function representing a priori knowledge of missing images. We obtained the desired patterns and neural responses for bar stimuli. However, we have not yet answered the following question: why is adiabatic approximation between V1 and V2 suitable for the filling-in process? That remains an open problem.
We should develop appropriate means for texture filling-in. We expect that new algorithms or visual models will be derived from theoretical considerations reflecting other neural properties in our fundamental functional. For example, a new functional including higher-order image properties should be effective for texture filling-in. The functional E is defined by the authors from theoretical viewpoints. An exciting challenge will be the self-organization of E, because E represents a priori knowledge of various kinds of images; it should reflect and represent the statistical features of those images.
References
1. Kamitani, Y., Shimojo, S.: Manifestation of scotomas created by transcranial magnetic stimulation of the human visual cortex. Nature Neuroscience 2, 767–771 (1999)
2. Gerrits, H.J., Timmerman, G.J.: The filling-in process in patients with retinal scotoma. Vision Research 9, 439–442 (1969)
3. Gerrits, H.J., De Haan, B., Vendrik, A.J.: Experiments with retinal stabilized images. Vision Research 6, 427–440 (1966)
4. Komatsu, H.: The neural mechanisms of perceptual filling-in. Nature Reviews Neuroscience 7, 220–231 (2006)
5. Komatsu, H., Kinoshita, M., Murakami, I.: Neural responses in the retinotopic representation of the blind spot in the macaque V1 to stimuli for perceptual filling-in. J. Neuroscience 20, 9310–9319 (2000)
6. Matsumoto, M., Komatsu, H.: Neural responses in the macaque V1 to bar stimuli with various lengths presented on the blind spot. J. Neurophysiology 93, 2374–2387 (2005)
7. Hildreth, E.C.: The computation of the velocity field. Proc. R. Soc. Lond. B 221, 189–220 (1984)
8. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proc. SIGGRAPH 2000, pp. 417–424. ACM Press, New York (2000)
9. Rane, S.D., Sapiro, G., Bertalmio, M.: Structure and texture filling-in of missing image blocks in wireless transmission and compression applications. IEEE Trans. on Image Processing 12, 296–303 (2003)
10. Ito, M., Komatsu, H.: Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. J. Neuroscience 24, 3313–3324 (2004)
11. Young, R.A., Lesperance, R.M., Meyer, W.W.: The Gaussian derivative model for spatial-temporal vision: I. Cortical model. Spatial Vision 14, 261–319 (2001)
12. Satoh, S., Usui, S.: Image reconstruction: another computational role of long-range horizontal connections in the primary visual cortex. Neural Computation (under review)
Biologically Motivated Face Selective Attention Model

Woong-Jae Won¹, Young-Min Jang², Sang-Woo Ban³, and Minho Lee²

¹ Dept. of Mechatronics, Intelligent Vehicle Research Team, Daegu Gyeongbuk Institute of Science and Technology, 711 Hosan-dong, Dalseo-Gu, Taegu 704-230, Korea
[email protected]
² School of Electrical Engineering and Computer Science, Kyungpook National University, 1370 Sankyuk-Dong, Puk-Gu, Taegu 702-701, Korea
[email protected], [email protected]
³ Dept. of Information and Communication Engineering, Dongguk University, 707 Seokjang-Dong, Gyeongju, Gyeongbuk 780-714, Korea
[email protected]
Abstract. In this paper, we propose a face selective attention model based on biologically inspired visual selective attention for human faces. We use radial frequency information and a skin color filter to localize candidate regions of human faces, reflecting the roles of V4 and infero-temporal (IT) cells. Ellipse matching based on a symmetry axis is applied to check whether a candidate region contains a face contour feature. Finally, face detection is conducted by a face form perception model implemented by an auto-associative multi-layer perceptron (AAMLP) that mimics the roles of face-selective cells in the IT area. Based on both the face-color-preferable attention and the face-form perception mechanism, the proposed model shows plausible performance for localizing face candidates in real time.

Keywords: Face selective attention, biologically motivated selective attention, saliency map.
1 Introduction

Recently, social development mechanisms have been considered for Autonomous Mental Development (AMD) in the construction of more intelligent robots [1, 2, 3]. Such development becomes possible if robots can increase their own knowledge through interaction with the environment and with humans, as humans do. In order to embody an intelligent robot with the social development concept, we need to equip machines with more human-like sensors, such as retina, electronic nose, touch, smell, and acoustic sensors. We also need to develop an intelligent model that pays attention to interesting objects based on primitive sensory information. Furthermore, it is important that humans and the environment can share knowledge in interactive ways [1, 2, 3]. In order to implement a truly human-like robot system, face detection is one of the most important functions for realizing a social development mechanism [1, 2, 3]. Human babies learn from their mother after focusing their eyes on the mother's face, and they come to feel emotions and acquire social functions through experience with learning. No conventional face attention system has shown comparable performance with the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 953–962, 2008. © Springer-Verlag Berlin Heidelberg 2008
W.-J. Won et al.
system of a human being yet. Recently, biologically motivated approaches have been developed by L. Itti, T. Poggio, and C. Koch [4, 5, 6]. Some research groups have developed human-like intelligent robots using these kinds of approaches [2, 3, 7], and an attention model was introduced for face detection [8]. However, these models have not yet shown plausible results for the face attention problem in complex scenes. In this paper, we propose a real-time face candidate localizer that imitates the function of the human visual pathway based on a biologically motivated selective attention mechanism, which can focus on a face preferentially and reject non-face areas, in order to implement a social developmental robot vision system. When the task is to find a specific object, not only should the features for the saliency map (SM) in the bottom-up processing be biased as differently weighted color features, but the task-specific shape features should also be fed back from the top-down process to reject uninteresting areas. If the specific task is to find a face, the skin color characteristic of human faces can be considered the dominant feature in the selective attention model to intensify face areas, and a face shape can be considered to reject non-face areas. Thus, we simply consider color-biased information, namely a color-filtered intensity, an R·G color opponent, and edge information of the R·G color opponent feature, for generating the preference for human face areas by intensifying the low-level features related to human faces in an SM. Moreover, in order to reject non-face areas among the selected face candidate areas, we consider elliptical face contour shape information based on the symmetry axis of a human face. Face inner form features are also considered in order to reflect more complicated face form information; this is implemented using an auto-associative multi-layer perceptron (AAMLP) model.
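As a concrete sketch of the bottom-up side just described — feature maps combined into an SM, from which salient windows are selected — the code below builds a centre–surround feature map over a Gaussian pyramid, sums feature maps into an SM, and picks maximum-local-energy windows with inhibition of return. All function names and parameter choices are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gauss_blur(img, sigma=1.0):
    """Separable Gaussian blur with edge replication (no external deps)."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2)); k /= k.sum()
    p = np.pad(img, ((r, r), (0, 0)), mode="edge")
    img = sum(k[i] * p[i:i + img.shape[0]] for i in range(2 * r + 1))
    p = np.pad(img, ((0, 0), (r, r)), mode="edge")
    return sum(k[i] * p[:, i:i + img.shape[1]] for i in range(2 * r + 1))

def center_surround(img, levels=3):
    """Fine-minus-coarse map: blur/subsample into a Gaussian pyramid,
    upsample the coarsest level back, and take the absolute difference."""
    fine = img.astype(float)
    coarse = fine
    for _ in range(levels - 1):
        coarse = gauss_blur(coarse)[::2, ::2]
    for _ in range(levels - 1):
        coarse = coarse.repeat(2, axis=0).repeat(2, axis=1)
    h, w = fine.shape
    return np.abs(fine - coarse[:h, :w])

def saliency_map(feature_maps):
    """SM = summation of the (already biased) feature maps."""
    return np.sum([center_surround(f) for f in feature_maps], axis=0)

def salient_points(sm, win=7, n_points=2):
    """Repeatedly pick the window of maximum summed energy, then zero it
    (inhibition of return) so the next point comes from elsewhere."""
    sm = sm.astype(float).copy()
    pts = []
    for _ in range(n_points):
        e = sliding_window_view(sm, (win, win)).sum(axis=(2, 3))
        iy, ix = np.unravel_index(np.argmax(e), e.shape)
        pts.append((iy + win // 2, ix + win // 2))  # window centre
        sm[iy:iy + win, ix:ix + win] = 0.0
    return pts
```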
This paper is organized as follows: Section 2 describes the proposed face localization model using bottom-up processing with face-color task-biased signals for the face candidate areas and face form perception. Experimental results follow in Section 3. Section 4 presents our conclusions and discussion.
2 Biologically Motivated Selective Attention Model for Localizing Human Faces

When humans pay attention to a target object, the prefrontal cortex gives a competitive bias signal, related to the target object, to the infero-temporal (IT) and V4 areas [9]. The IT and V4 areas then generate target-object-dependent information, which is transmitted to the low-level processing part in order to make a filter for the areas that satisfy the target-object-dependent features. In the proposed model, therefore, we simply consider a skin-color bias signal and elliptical face contour shape information as the face-specific top-down feedback bias signals, for real-time operation. Moreover, we consider more complicated face inner form features. We propose a face candidate localizer based on the biologically motivated bottom-up SM model, as shown in Fig. 1. The bottom-up SM can preferentially focus on face candidate areas by a simple face-specific color bias filter using face color filtered
Fig. 1. The proposed face candidate localizer based on the saliency map (SM) model; I: intensity, E: edge, R·G: red–green opponent coding feature, B·Y: blue–yellow opponent coding feature, Ī: intensity feature map, Ē: edge feature map, C̄: color feature map, SM: saliency map, CSP: candidate salient point, SP: salient point
intensity, an R·G color opponent, and edge information of the R·G color opponent feature. The candidate regions are then checked for how well the localized areas match an elliptical shape based on a symmetry axis and how similar they are to trained face form features.

2.1 Face Color Biased Selective Attention

In the bottom-up processing, intensity, edge, and color features are extracted in the retina. These features are transmitted to the visual cortex through the lateral geniculate nucleus (LGN). While those features are transmitted to the visual cortex, intensity, edge, and color feature maps are constructed using the on-set and off-surround mechanism of the LGN and the visual cortex, and those feature maps form a bottom-up SM in the lateral intraparietal cortex (LIP) [10]. In order to implement a human-like visual attention function, we consider the simplified bottom-up SM model [11]. In our approach, we use an SM model that reflects the functions of the retina cells, the LGN, and the visual cortex. Since the retina cells can extract edge and intensity information as well as color opponency, we use these factors as the basic features of the SM model [10-12]. In order to provide the proposed model with a face color preference property, a skin-color-filtered intensity feature is considered together with the original intensity feature. According to the given task, those two intensity features are differently biased. For face-preferable attention, the skin-color-filtered intensity feature works as the dominant feature in generating the intensity feature map. The ranges of red (r), green (g), and blue (b) for skin
color filtering are obtained from many natural sample face images. The real color components R, G, B, Y are extracted using normalized color coding [10]. Reflecting the function of the LGN and the ganglion cells, we implement the on-center and off-surround operation by Gaussian pyramid images with different scales from the 0th to the n-th level, whereby each level is made by sub-sampling by 2^n; thus we construct four feature bases: intensity (I), edge (E), and color (RG and BY) [11, 12]. This reflects the non-uniform distribution of the retinotopic structure. The center-surround mechanism is then implemented in the model as the difference operation between fine and coarse scales of the Gaussian pyramid images [11, 12]. Consequently, the three feature maps Ī, Ē, and C̄ can be obtained by the center-surround difference algorithm [11]. However, in this paper, we consider only the R·G color opponent features for the color feature map and the edge feature map, to intensify face areas as a bias signal. An SM is generated by the summation of these three feature maps. The salient areas are obtained by searching for a maximum local energy with a fixed window size, shifting pixel by pixel in the SM. After obtaining the candidate salient points for human faces, a proper scale for the obtained areas is computed using an entropy maximization approach [13].

2.2 Ellipse Fitting Based on Symmetry Axes
Fukushima's neural network models a symmetry-axis extraction mechanism inspired by the human visual pathway. The model consists of a number of layers connected in a hierarchical manner: a contrast layer UG, an edge-extracting layer of a simple type (US), an edge-extracting layer of a complex type (UC), and a symmetry-axis-extracting layer (UH) [14]. In Fukushima's model, the output of the cells in UG, which resemble the function of ganglion cells, is sent to the orientation quantization layer US, which resembles the function of simple cells in the primary visual cortex. The output of layer US is fed to layer UC, where a blurred version of the response of layer US is generated, resembling the function of complex cells in the primary visual cortex. Finally, the output of the cells of UC is sent to the UH layer, which resembles the function of hypercomplex cells, to analyze the symmetry axis [14]. In our model, we extract the symmetry axis for the face candidate areas selected by the simplified bottom-up face-color-preferable attention model. Thus, we can obtain the edge feature in each candidate face area for layer UG. Differently from Fukushima's model, we apply quantization of the edge feature, using the edge and its orientation in a face candidate area, to generate the orientation feature for the US layer, reducing the computational load. The orientation features sent to the UC layer are blurred using Eq. (1):
u_Cm(n, k) = Σ_{|v| < A_Cm} g_Cm(v) · u_S(n + v, k),   (k = 0, 1, ..., K − 1), (m = 0, 1, ..., M − 1)   (1)
where K is the quantization level of orientation and g_Cm is a Gaussian filter with radius A_Cm. However, we use M-level Gaussian pyramid images with a fixed A_Cm, instead of varying A_Cm, to reduce the computational load. After extracting the M levels of blurred orientation features in the UC layer, the symmetry axis is extracted in the UH layer using Eq. (2):
u_H(n, k) = φ[ Σ_{m=0}^{M−1} β_m Σ_{κ=0}^{K−1} { γ_κ · (u_Cm(n_r, k + κ) + u_Cm(n_l, k̄ − κ)) − δ_κ · |u_Cm(n_r, k + κ) − u_Cm(n_l, k̄ − κ)| } ],   (k = 0, 1, ..., K/2 − 1)   (2)

where φ(x) = max(x, 0); δ_κ and β_m are positive parameters that determine how much asymmetry is allowed; and k̄ is the orientation opposite to orientation k (if k̄ = k, then k̄ = k + κ − K/2). n is the pixel position at which the symmetry-axis magnitude is obtained; that is, n = (x, y), n_r = (x_r, y_r), and n_l = (x_l, y_l), in which x_r, y_r, x_l, and y_l are given by Eq. (3):

x_r = x + a · cos(α_k),   y_r = y + a · sin(α_k)
x_l = x − a · cos(α_k),   y_l = y − a · sin(α_k)   (3)
where α_k = 2kπ/K, and a is the distance from the current pixel position to the other pixel positions compared for obtaining symmetry information. Because the symmetry axes extracted in the UH layer do not form a unique line, we need to find the main symmetry-axis line. After finding the symmetry-axis lines by searching in the UH layer, the main axis with the maximum length is selected among the several symmetry-axis lines. Fig. 2 shows an example result of each layer for symmetry-axis extraction. Here, we set K = 16, γ_κ = 1, δ_κ = 1.5, β_m = 1, M = 3, and a = 2 × m to extract the symmetry axis for a face candidate area. Finally, we reject non-face areas by checking the length of the symmetry-axis line and the matching degree between the segmented face candidate area and the ellipse obtained from the symmetry axis and its orthogonal axis.

2.3 AAMLP for Face Localization
The upper part of Fig. 1 shows the architecture of the proposed model for face detection. We have modeled the face detection mechanism in the IT and V4 areas using an AAMLP, by which the characteristic information of a face form is trained and memorized in the connections of the artificial neurons. Also, a human being perceives important characteristic information for a specific object rather than very detailed information. To mimic this role, as well as for computational efficiency, we extracted the eigenvectors with large eigenvalues using a principal component
Fig. 2. The experimental result of each layer output for symmetry axis extraction
analysis (PCA) for extracting the important features of a face object. To perceive face-related information, we mimic its retrieval from the AAMLP using a correlation computation between the input and output of the AAMLP. The AAMLP has been used successfully in many partially-exposed environments [15]. Face detection is also one of the partially-exposed problems, with tremendous within-class variability [15]. Let F(·) denote an auto-associative mapping function, and let x_i and y_i indicate an input and output vector, respectively. The function F(x_i) is usually trained to minimize the mean square error given by Eq. (4):

E = Σ_{i=1}^{n} ||x_i − y_i||² = Σ_{i=1}^{n} ||x_i − F(x_i)||²   (4)
where n denotes the number of output nodes. After the training process is successfully finished, eight directional Gabor filters are applied to each localized face candidate region. Then, a log-polar transform is applied to obtain orientation-invariant form features. The coefficients of the log-polar transformed features projected onto the principal components are applied to the input nodes of the AAMLP. Then, we calculate the correlation between the input values and the corresponding outputs of the AAMLP. If the degree of correlation is above a threshold, we regard the face candidate region as containing a face.
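The AAMLP decision rule of this subsection — train F to minimise (4), then accept a region when the correlation between input and reconstruction exceeds a threshold — can be sketched as a tiny two-layer auto-associative network. The class and parameter choices (hidden size, learning rate, initialisation) are our own illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

class AAMLP:
    """Tiny auto-associative MLP: tanh hidden layer, linear output,
    trained by gradient descent to minimise E = sum_i ||x_i - F(x_i)||^2."""
    def __init__(self, d_in, d_hid):
        self.W1 = rng.normal(0.0, 0.1, (d_hid, d_in))
        self.W2 = rng.normal(0.0, 0.1, (d_in, d_hid))

    def forward(self, x):
        self.h = np.tanh(self.W1 @ x)
        return self.W2 @ self.h

    def train(self, X, lr=0.1, epochs=2000):
        for _ in range(epochs):
            for x in X:
                y = self.forward(x)
                e = y - x                              # error signal
                gW2 = np.outer(e, self.h)              # dE/dW2
                gh = self.W2.T @ e                     # backprop to hidden
                gW1 = np.outer(gh * (1 - self.h**2), x)
                self.W2 -= lr * gW2
                self.W1 -= lr * gW1

def is_face(model, x, thresh=0.9):
    """Decision rule from the text: correlation between the input and the
    AAMLP's reconstruction must exceed a threshold."""
    y = model.forward(x)
    return np.corrcoef(x, y)[0, 1] > thresh
```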
3 Experimental Results

We prepared 174 sample scenes including 176 human faces with various poses, captured in the laboratory. The scenes were obtained in an indoor laboratory with an illumination range between 104 and 124 lux. The face color components were obtained from hand-segmented areas in the captured scenes. From the color components, we obtained intensity ranges of R varying from 67 to 229, G from 34 to 148, and
B from 33 to 139. The obtained ranges are used as a face color filter. We set K = 12, γ_κ = 1, δ_κ = 1.5, β_m = 1, M = 3, and a = 2 × m to extract the symmetry axis for each face candidate area. Fig. 3 shows the experimental result of the simplified bottom-up face-color-preferable attention model with scale information. Fig. 4 shows the experimental result of rejecting non-face areas by checking the length of the symmetry-axis line
Fig. 3. The experimental result of simplified bottom-up face color preferable attention
Fig. 4. Face candidate localization results by rejecting non-face areas using symmetry axis length and ellipse shape matching degree
and the matching degree between the segmented face candidate area and the ellipse obtained using the symmetry axis and its orthogonal axis. Fig. 5 shows the experimental result of the proposed face candidate localizer. The proposed bottom-up face-color-preferable attention model can preferentially intensify face areas by considering the face-color-biased signal. The ellipse matching based on the symmetry axis can efficiently reject non-face areas within each face candidate area. Table 1 shows the performance of the proposed face candidate localization model on the KNU database, which was obtained under illumination environments varying from 104 to 192 lux in the indoor laboratory [16]. The detection rate for human faces is 96.44% at the bottom-up face-preferable attention level, and the non-face area rejection rate is 81.85% at the ellipse-matching level based on the symmetry axis. Moreover, our proposed model shows a correct face detection rate of 93.9%, with a 72.71% non-face rejection ratio, on the Georgia Tech Face Database [17]. The proposed system can successfully find human faces in real time, within 0.187~0.234 sec. We also compared the face detection rates of our proposed model with the AdaBoost face detector included in the OpenCV library [18]. Even though the face detection rates of the proposed model are slightly lower than those of the AdaBoost face detector, as shown in Table 1, the proposed method may give better results for rotated faces from various fields of view, which is under evaluation.

Table 1. Quantitative performance of the proposed face candidate localizer

                          KNU DB (104~192 lux)   Georgia Tech DB
                          Proposed model         Proposed model   AdaBoost
  # of total face areas   1124                   525              525
  Selected face areas     1084                   493              501
  %                       96.44%                 93.90%           95.43%
  Reject ratio            81.85%                 72.71%           —
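The rates in Table 1 follow directly from the raw counts; the small check below recomputes them (a convenience of ours, not part of the paper).

```python
# Detection rate (%) = selected face areas / total face areas, as in Table 1.
def rate(selected, total):
    return round(100.0 * selected / total, 2)

knu = rate(1084, 1124)      # KNU DB, proposed model -> 96.44
gt_prop = rate(493, 525)    # Georgia Tech DB, proposed model -> 93.9
gt_ada = rate(501, 525)     # Georgia Tech DB, AdaBoost -> 95.43
```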
Fig. 5. The experimental results of the proposed face candidate localizer
It is hard to discriminate a human hand from a human face when considering only human face color and elliptical shape. However, the proposed AAMLP model can successfully discriminate a human hand from a human face, as shown in Fig. 6.
Fig. 6. The experimental results of the proposed face indication by AAMLP; (a) input scene, (b) face candidate regions without considering AAMLP, (c) face localization after considering AAMLP
4 Conclusion

We proposed a face selective attention model that localizes human face areas in real time by combining face-preferable attention, a non-face area rejection function, and an AAMLP. The proposed model not only successfully localizes face areas but also appropriately rejects non-face areas. Even though the proposed model gives plausible results for selecting human-face regions, we need to verify its performance through intensive experiments using complex benchmark databases.

Acknowledgments. This research was funded by the Brain Neuroinformatics Research Program of the Ministry of Commerce, Industry and Energy, Korea, and the Daegu Gyeongbuk Institute of Science and Technology (DGIST) Basic Research Program of the MOST.
References
1. Asada, M., MacDorman, K.F., Ishiguro, H., Kuniyoshi, Y.: Cognitive developmental robotics as a new paradigm for the design of humanoid robots. Robotics and Autonomous Systems 37, 185–193 (2001)
2. Breazeal, C.: Designing Sociable Robots. MIT Press, Cambridge (2002)
3. Scassellati, B.: Foundations for a Theory of Mind for a Humanoid Robot. PhD Thesis, Dept. of Electrical Engineering and Computer Science, MIT (2001)
4. Walther, D., Itti, L., Riesenhuber, M., Poggio, T., Koch, C.: Attentional selection for object recognition – a gentle way. In: Bülthoff, H.H., Lee, S.-W., Poggio, T.A., Wallraven, C. (eds.) BMCV 2002. LNCS, vol. 2525, pp. 472–479. Springer, Heidelberg (2002)
5. Serre, T., Riesenhuber, M., Louie, J., Poggio, T.: On the role of object-specific features for real world object recognition in biological vision. In: Bülthoff, H.H., Lee, S.-W., Poggio, T.A., Wallraven, C. (eds.) BMCV 2002. LNCS, vol. 2525, pp. 387–397. Springer, Heidelberg (2002)
6. Navalpakkam, V., Itti, L.: An Integrated Model of Top-down and Bottom-up Attention for Optimal Object Detection. In: CVPR, pp. 2049–2056 (2006)
7. Orabona, F., Metta, G., Sandini, G.: Object-based Visual Attention: a Model for a Behaving Robot. In: 3rd International Workshop on Attention and Performance in Computational Vision (2005)
8. Siagian, C., Itti, L.: Biologically-Inspired Face Detection: Non-Brute-Force-Search Approach. In: CVPRW 2004, Washington, DC, USA, vol. 5, pp. 62–69 (2004)
9. Schiller, P.H.: Area V4 of the primate visual cortex. American Psychological Society 3(3), 89–92 (1994)
10. Goldstein, E.B.: Sensation and Perception, 4th edn. International Thomson Publishing, USA (1996)
11. Park, S.J., An, K.H., Lee, M.: Saliency map model with adaptive masking based on independent component analysis. Neurocomputing 49, 417–422 (2002)
12. Choi, S.B., Jung, B.S., Ban, S.W., Niitsuma, H., Lee, M.: Biologically motivated vergence control system using human-like selective attention model. Neurocomputing 69, 537–558 (2006)
13. Kadir, T., Brady, M.: Scale, saliency and image description. International Journal of Computer Vision 45, 83–105 (2001)
14. Fukushima, K.: Use of non-uniform spatial blur for image comparison: symmetry axis extraction. Neural Networks 18, 23–32 (2005)
15. Ban, S.W., Lee, M., Yang, H.S.: A Face Detection Using Biologically Motivated Bottom-up Saliency Map Model and Top-down Perception Model. Neurocomputing 56, 475–480 (2004)
16. ftp://abr.knu.ac.kr/DB/Saliencymap_DB/TopDownSM_DB/Face_DB/
17. ftp://ftp.ee.gatech.edu/pub/users/hayes/facedb/
18. Viola, P., Jones, M.J.: Rapid Object Detection using a Boosted Cascade of Simple Features. In: IEEE CVPR 2001, pp. 511–518 (2001)
Multi-dimensional Histogram-Based Image Segmentation

Daniel Weiler¹ and Julian Eggert²

¹ Darmstadt University of Technology, Darmstadt D-64283, Germany
² Honda Research Institute Europe GmbH, Offenbach D-63073, Germany
Abstract. In this paper we present an approach for multi-dimensional histogram-based image segmentation. We combine level-set methods for image segmentation with probabilistic region descriptors based on multi-dimensional histograms. Contrary to claims by other authors, we show that colour-space histograms provide a reasonable and efficient description of image regions. In contrast to Gaussian-mixture-model-based algorithms, no parameter learning or estimation of the number of mixture components is required. Compared to recent level-set-based segmentation methods, satisfying segmentation results are achieved without specific features (e.g. texture). A comparison with state-of-the-art image segmentation methods shows that the proposed approach yields competitive results.
1
Introduction
In the field of image segmentation, two major approaches can be distinguished: multi-region segmentation and figure-background segregation. While the former tries to group similar (by their image features f) and related (by their spatial properties, like location) pixels of an image into separate regions, the latter attempts to find a salient region of an image, considering it as a foreground "figure" and labelling all the remainder, without any further differentiation, as background. In this paper we address the problem of figure-background segregation based on multi-dimensional histogram-based region descriptors. In state-of-the-art figure-background segregation algorithms (see "GrabCut" [1], "Graph cut" [2], "Knockout 2" [3] and "Bayes Matte" [4]) probabilistic colour distribution models are commonly used. In recent years level-set methods [5,6,7,8,9] have also become a powerful tool for image segmentation. The former algorithms model colour distributions in a three-dimensional colour space, whereas state-of-the-art level-set methods are able to work on arbitrary feature maps [10]. These feature maps may incorporate the three colour components but might be extended by any other characteristic property of a region (e.g. texture and motion [11]). So far, level-set methods assume the feature maps to be independent, which constitutes a major difference to the algorithm proposed here. The method presented in this paper combines the multi-dimensional approach to colour distributions of state-of-the-art figure-background segregation algorithms with the feature maps used by level-set methods. The combined algorithm is formulated in a two-region level-set framework. Whereas state-of-the-art

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 963–972, 2008. © Springer-Verlag Berlin Heidelberg 2008
964
D. Weiler and J. Eggert
image segmentation methods commonly model the colour distribution by means of Gaussian Mixture Models, we use colour space histograms that require neither parameter learning nor estimation of the number of mixture components and are thus more efficient to implement. In contrast to state-of-the-art level-set methods, it is shown that competitive segmentation results are achieved without any additional specific feature maps, such as texture. Level-set methods [5] separate all image pixels into two disjoint regions by favouring homogeneous image properties for pixels within the same region and distinct image properties for pixels belonging to different regions. The level-set formalism describes the region properties using an energy functional that implicitly contains the region description and that has to be minimised. The formulation of the energy functional dates back to e.g. Mumford and Shah [6] and to Zhu and Yuille [7]. Later on, the functionals were reformulated and minimised using the level-set framework by e.g. [8] and [9]. Among all segmentation algorithms from computer vision (see Sect. 2), level-set methods provide perhaps the closest link with the biologically motivated, connectionist models as e.g. represented by [12]. Similar to neural models, level-set methods work on a grid of nodes located in image/retinotopic space, interpreting the grid as having local connectivity, and using local rules for the propagation of activity in the grid. Time is included explicitly in the model by a formulation of the dynamics of the nodes' activity. Furthermore, the external influence from other sources (larger network effects, feedback from other areas, inclusion of prior knowledge) can be readily integrated on a node-per-node basis, which makes level-sets appealing for integration into biologically motivated system frameworks.
In this paper, we apply an extended level-set formalism to compare the representation of region characteristics by several independent features and by features located in a common feature space, and show the advantages of the latter. In Sect. 2, state-of-the-art figure-background segregation algorithms are briefly described. Section 3 introduces the level-set method we use for image segmentation and its extension to multi-dimensional histogram-based region descriptors. In Sect. 4 we present the results of the proposed algorithm. A short discussion concludes the paper.
2
State-of-the-Art Figure-Background Segregation
In [1] a comprehensive summary of recent figure-background segregation methods is given. The remainder of this section compares two major approaches: “trimap”-based algorithms, introduced in Sect. 2.1, and level-set methods, described in Sect. 2.2. Inspired by these two methods, we introduce an extension to standard level-set methods for image segmentation in Sect. 3.
2.1 “Trimap”-Based Methods
A number of state-of-the-art figure-background segregation algorithms (e.g.: “GrabCut” [1], “Graph cut” [2], “Knockout 2” [3] and “Bayes Matte” [4])
Multi-dimensional Histogram-Based Image Segmentation
965
perform the image segmentation task based on “trimaps”. Starting with an initial “trimap” T = {TB, TU, TF} – which specifies known background TB, known foreground TF and unknown TU regions of the image – the pixels of the unknown region are assigned to the foreground and background regions. The assignment is commonly based on probabilistic colour distribution models. Depending on the algorithm, the assignment is made in a binary or probabilistic manner, and the probabilistic colour distribution models are either computed from the initial “trimap” only or iteratively updated using the previous assignments within the region TU. To represent the probabilistic colour distribution models, different approaches have been proposed. For grey values, histograms are often used, whereas a common choice for the RGB colour space is Gaussian Mixture Models. According to [1] it is impractical to construct adequate colour space histograms, a claim that will be disproved in this paper. In addition to the “trimap”, a smoothness term is used to control the granularity of the segmentation. The smoothness term acts in a way that encourages coherence of the assignments of neighbouring, unknown pixels within the region TU. Therefore, adjacent pixels are forced towards similar assignments depending on the difference of their corresponding colour or grey values. The more similar the pixel values are, the stronger the force to assign them to the same region TF or TB, respectively.
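To make the trimap machinery concrete, here is a minimal sketch (our own illustration, not code from the cited algorithms): the trimap is a label array, and each unknown pixel is assigned to the region whose grey-value histogram gives its value higher probability. The image, labels, and bin count are illustrative stand-ins for the probabilistic colour models described above.

```python
import numpy as np

# Minimal trimap sketch: 0 = known background (T_B), 1 = unknown (T_U),
# 2 = known foreground (T_F). Grey-value histograms stand in for the
# probabilistic colour models; all values below are toy data.
B, U, F = 0, 1, 2
img = np.array([[10,  20, 200],
                [15,  12, 210],
                [12, 200, 205]], dtype=float)
trimap = np.array([[B, U, F],
                   [B, U, F],
                   [B, U, F]])

def region_model(values, bins=8, rng=(0, 256)):
    # Normalised grey-value histogram of one region (a simple density model)
    h, _ = np.histogram(values, bins=bins, range=rng, density=True)
    return h + 1e-6  # avoid zero probabilities

hb = region_model(img[trimap == B])   # background model from T_B
hf = region_model(img[trimap == F])   # foreground model from T_F

# Binary assignment of the unknown pixels: each goes to the region whose
# histogram assigns its grey value higher probability.
bin_idx = np.clip((img[trimap == U] / 256 * 8).astype(int), 0, 7)
assign = np.where(hf[bin_idx] > hb[bin_idx], F, B)
```

Iteratively re-estimating `hb`/`hf` from the current assignments, as some of the cited methods do, would only add a loop around the last two lines.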
2.2 Level-Set Methods
Level-set methods are front propagation methods. Starting with an initial contour, the figure-background segregation task is solved by iteratively moving the contour according to the solution of a partial differential equation (PDE). The PDE is often originated from the minimisation of an energy functional. Famous representatives of energy functionals for image segmentation problems are those by Mumford and Shah [6] and by Zhu and Yuille [7]. While the former work in its original version on grey value images (i.e. on scalar data), utilise the mean grey value of a region as a simple region descriptor and were only later extended to vector valued data [10] (e.g. colour images), the latter use more advanced probabilistic region descriptors that are based on the distributions of each feature channel inside and outside the contour. In many cases it is sufficient to model these distributions by unimodal Gaussian distributions. In some rare cases the distributions are approximated in a multimodal way [9] e.g. by Gaussian Mixture Models or Nonparametric Parzen Density Estimates [13]. Regardless of the way the distributions are modeled, the features are in all approaches assumed to be independent. Thus, they are not located in a common feature space which leads to a separate model for each feature. Within a region the models of all features together add up to the region descriptor. Similar to the “trimap”-based approaches, level-set methods use a smoothness term to control the granularity of the segmentation. A common way is to penalise the length of the contour, that can be formulated in the energy functional by simply adding the length of the contour to the energy that is to be minimised.
In doing so, few large objects are favoured over many small objects, as well as smooth object boundaries over ragged object boundaries. Compared to “active contours” (snakes) [14], which also constitute front propagation methods and explicitly represent a contour by supporting points, level-set methods represent contours implicitly by a level-set function that is defined over the complete image plane. The contour is defined as an iso-level of the level-set function, i.e. the contour is the set of all locations where the level-set function has a specific value. This value is commonly chosen to be zero, so the inside and outside regions can easily be determined by the Heaviside function H(x)¹.
3
Multi-dimensional Histogram-Based Image Segmentation
3.1
Standard Level-Set Based Region Segmentation
The proposed multi-dimensional histogram-based image segmentation framework is based on a standard two-region level-set method [9,15]. In a level-set framework, a level-set function φ : Ω → R is used to divide the image plane Ω into two disjoint regions, Ω1 and Ω2, where φ(x) > 0 if x ∈ Ω1 and φ(x) < 0 if x ∈ Ω2. Here we adopt the convention that Ω1 indicates the background and Ω2 the segmented object. A functional of the level-set function φ can be formulated that incorporates the following constraints:
– Segmentation constraint: the data within each region Ωi should be as similar as possible to the corresponding region descriptor ρi.
– Smoothness constraint: the length of the contour separating the regions Ωi should be as short as possible.
This leads to the expression²

E(\phi) = \nu \int_{\Omega} |\nabla H(\phi)| \, dx - \sum_{i=1}^{2} \int_{\Omega} \chi_i(\phi) \log p_i \, dx    (1)

with the Heaviside function H(φ) and χ1 = H(φ) and χ2 = 1 − H(φ). That is, the χi's act as region masks, since χi = 1 for x ∈ Ωi and 0 otherwise. The first term acts as a smoothness term that favours few large regions as well as smooth region boundaries, whereas the second term contains the assignment probabilities p1(x) and p2(x) that a pixel at position x belongs to the inner and outer regions Ω1 and Ω2, respectively, favouring a unique region assignment. Minimisation of this functional with respect to the level-set function φ using gradient descent leads to

\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ \nu \, \mathrm{div}\!\left( \frac{\nabla \phi}{|\nabla \phi|} \right) + \log \frac{p_1}{p_2} \right]    (2)

¹ H(x) = 1 for x > 0 and H(x) = 0 for x ≤ 0.
² Remark that φ, χi and pi are functions of the image position x.
A region descriptor ρi(f) that depends on the image feature vector f serves to describe the characteristic properties of the outer vs. the inner regions. The assignment probabilities pi(x) for each image position are calculated based on an image feature vector via pi(x) := ρi(f(x)). The parameters of the region descriptor ρi(f) are gained in a separate step using the measured feature vectors f(x) at all positions x ∈ Ωi of a region i. For standard images, there may be only a single feature vector component, like the pixel grey values. The case with several image features is – in standard level-set based region segmentation – covered by assuming independent contributions from each feature vector channel fj, using assignment probabilities p_1 = \prod_j p_{1j} and p_2 = \prod_j p_{2j}. In many cases, the pij's are modeled by unimodal Gaussian region descriptor distributions, so that p_{ij}(x) = \mathcal{N}_{f_j}(\mu_{ij}, \sigma_{ij}) [10], with mean μij and variance σij. Furthermore, μij and σij may act as locally calculated parameters that depend on the pixel position x. Remark that if we assume a single μij and σij for the entire region, (1) reduces to the standard Mumford-Shah functional as used in [8]. There are also approaches where the distributions are approximated in a multimodal way [9], e.g. by Gaussian Mixture Models or Nonparametric Parzen Density Estimates [13].
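As an illustration of the gradient-descent update (2), the following sketch (our own minimal implementation, not the authors' code) evolves a level-set function on a toy image with fixed assignment probabilities p1, p2. The smoothed Heaviside/delta functions, step size, and number of iterations are illustrative choices.

```python
import numpy as np

def delta(phi, eps=1.0):
    # Smoothed Dirac delta, a common regularisation in level-set codes
    return (eps / np.pi) / (eps**2 + phi**2)

def curvature(phi):
    # div(grad(phi)/|grad(phi)|) via central differences
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx**2 + gy**2) + 1e-8
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def levelset_step(phi, p1, p2, nu=0.5, dt=0.5):
    # One step of Eq. (2): dphi/dt = delta(phi)[nu*curvature + log(p1/p2)]
    force = nu * curvature(phi) + np.log((p1 + 1e-8) / (p2 + 1e-8))
    return phi + dt * delta(phi) * force

# Toy data: a bright square on a dark background, with assignment
# probabilities derived directly from brightness for illustration
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
p1 = np.where(img > 0.5, 0.9, 0.1)   # probability of region 1 (phi > 0)
p2 = 1.0 - p1
y, x = np.mgrid[0:32, 0:32]
phi = 10.0 - np.sqrt((x - 16.0)**2 + (y - 16.0)**2)  # initial circle
for _ in range(200):
    phi = levelset_step(phi, p1, p2)
inside = phi > 0   # region mask chi_1 after evolution
```

After a few hundred steps the zero level-set settles near the square's boundary; the log-likelihood ratio drives the front, while the curvature term keeps it smooth.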
3.2 A Multi-dimensional Histogram-Based Level-Set Method for Image Segmentation
For the multi-dimensional histogram-based level-set method presented in this paper, we propose to use multi-dimensional nonparametric region descriptor functions. In comparison to the commonly used Gaussian Mixture Models, we present an approach that represents the region descriptors extensively in a multi-dimensional grid-based way. Thus, the feature vector channels fj are no longer assumed to contribute independently from each other to the assignment probabilities pi via the pij's, but span a single multi-dimensional feature space ρi(f). To this end, we calculate for the entire feature space f inside a region i a normalised histogram-vector h_i with single entries indexed by k = (k_1, k_2, \cdots, k_j, \cdots, k_J)^T, where

h_{ik} = \frac{\int_{\Omega} \chi_i(\phi) \, \hat{h}_k(x) \, dx}{\int_{\Omega} \chi_i(\phi) \, dx}    (3)

and

\hat{h}_k(x) = \prod_j \left[ H(f_j(x) - b_{k_j}) - H(f_j(x) - b_{k_j + 1}) \right]    (4)

with hyper-bins indexed by the vector k and the borders of the histogram hyper-bins defined by b_k³. For equally spaced b_k's, the hyper-bins become hyper-cubes in the feature space of f. Smoothed versions of the multi-dimensional histogram h_i can be gained by convolving it with a multi-dimensional Gaussian kernel of the
³ Assuming for simplicity the same bin spacing for all feature dimensions j.
same dimensionality, but in our applications smoothing the histogram did not change the results substantially. The standard level-set method as described in the above section is extended by using the normalised multi-dimensional histogram h_i as the feature-dependent region descriptor ρi(f). The region assignment probability is then calculated by

p_i(x) = \sum_{k} \hat{h}_k(x) \, h_{ik} := \sum_{k_1} \sum_{k_2} \cdots \sum_{k_j} \cdots \sum_{k_J} \hat{h}_k(x) \, h_{ik}    (5)

i.e., by extracting the histogram entry of h_i that corresponds to the hyper-bin indicated by f(x). In this way, both the region descriptor function as well as the computation of the region assignment become computationally inexpensive, since they amount to calculating and extracting single entries from normalised multi-dimensional histograms.
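The histogram construction (3)–(4) and the lookup (5) can be sketched as follows. This is a simplified re-implementation under our own assumptions: features scaled to [0, 1), equally spaced bins, and NumPy's `histogramdd` in place of the explicit product of Heaviside differences.

```python
import numpy as np

def region_histogram(features, mask, bins_per_dim=4):
    # Normalised multi-dimensional histogram h_i of Eqs. (3)-(4):
    # count the region's feature vectors per hyper-bin, then normalise.
    J = features.shape[-1]
    edges = [np.linspace(0.0, 1.0, bins_per_dim + 1)] * J
    h, _ = np.histogramdd(features[mask], bins=edges)
    return h / max(h.sum(), 1.0)

def assignment_probability(features, hist, bins_per_dim=4):
    # Lookup of Eq. (5): map each f(x) to its hyper-bin index and read h_{ik}
    idx = np.clip((features * bins_per_dim).astype(int), 0, bins_per_dim - 1)
    return hist[tuple(idx[..., j] for j in range(features.shape[-1]))]

rng = np.random.default_rng(0)
features = rng.random((16, 16, 3))     # toy 3-channel feature map in [0, 1)
mask = np.zeros((16, 16), dtype=bool); mask[4:12, 4:12] = True
h1 = region_histogram(features, mask)  # inside-region descriptor rho_1
h2 = region_histogram(features, ~mask) # outside-region descriptor rho_2
p1 = assignment_probability(features, h1)  # per-pixel p_1(x)
p2 = assignment_probability(features, h2)  # per-pixel p_2(x)
```

Both steps are single array operations, which matches the paper's point that the histogram descriptors are cheap to build and to evaluate.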
4
Main Results
In order to show the performance and some internal details of the proposed algorithm, two exemplary source images were chosen. Both images are coloured, given in the RGB colour space, and used without further preprocessing; thus the segmentation is based on three feature channels, namely the red, green and blue colour channels. The method proposed in this paper is not constrained to these specific features or to exactly three features, since other features, e.g. texture, might be utilised as well. The usage of other features was deliberately omitted to show the capability of the algorithm even in the elementary and commonly used RGB colour space. The first image shows a zebra standing in its natural environment, the steppe. The image consists of the black, white and shades of grey of the zebra, which
Fig. 1. Initial (left) and final (right) level-set contour of the zebra test image. The segmentation result was achieved after 37 iterations with the multi-dimensional, histogram-based RGB region-descriptor and without any further specific feature channel (e.g. texture).
Multi-dimensional Histogram-Based Image Segmentation
969
Fig. 2. Progress of the (normalised) energy over iterations. The energy converges after 29 iterations. The algorithm requires eight consecutive iterations to detect the convergence and stop the segmentation process.
Fig. 3. Distribution (multi-dimensional colour histograms) inside (left) and outside (right) of the final level-set contour of the zebra test image, shown in the three-dimensional feature space spanned by the three colours red, green and blue. Larger and smaller blobs indicate larger and smaller histogram values, respectively. Only colours with a contribution greater than 1% are displayed.
constitutes the object to segment, and the green and beige colouring of the surrounding steppe. Zebra images are common test images for texture-based segmentation algorithms. Here we show that even without a description of texture the segmentation task can be successfully accomplished. Figure 1 shows the image overlaid with the initial and final level-set contours of the segmentation
Fig. 4. Final contour of the llama test image from [1] achieved with the segmentation method proposed in this paper. The segmentation result shows an error rate of 1.28% misclassified pixels based on the error measurement and ground-truth data provided in [1].
Fig. 5. Final contour of exemplary test images from the database provided in [1]. The segmentation results show an error rate of 1.63%, 0.72% and 1.43% misclassified pixels based on the error measurement and ground-truth data provided in [1] (from left to right). A preliminary evaluation of the proposed method with all 50 benchmark images (without special tuning to the database) resulted in an average error rate of 2.25%.
process. On the left, the initial level-set contour, a circle centred in the middle of the image with a radius of one fourth of the smallest image dimension, is displayed. This initial level-set contour is commonly used to express the expectation of an object, e.g. gained by a preprocessing stage prior to the segmentation framework that focuses on salient points, as in autonomous mobile robotics. Figure 1, right, displays the final level-set contour that is obtained after 37 iterations of (2). The evolution of the level-set function is stopped according to the development of the value of the energy functional (1). Figure 2 displays the progress of the values of the energy functional over the iterations. For convenience, the values are normalised to the interval [0, 1]. After 29 iterations, the energy has converged to its minimum. The algorithm needs eight consecutive iterations to detect the convergence and stop the segmentation process.
Figure 3 displays the region descriptors for the inside and outside regions of the final level-set contour, ρ1(f) and ρ2(f), respectively. In the case of using the RGB colour space as the only features, the region descriptors equal the colour distributions of the object and its surroundings. In Fig. 3, left, the distribution of the colours belonging to the zebra, which is mainly composed of black, white and shades of grey, can be observed: the colours are grouped along the diagonal from black to white. The colour distribution of the outside, which mainly consists of a green and beige colouring, can be seen in Fig. 3, right, where the colours lie in the “greenish” corner of the colour space. The second image is used in [1] to compare different state-of-the-art image segmentation methods. It was chosen to show the competitive results of the approach proposed in this paper. Figure 4 displays the final level-set contour of the segmentation process, as described in the preceding paragraph. With the ground-truth data provided in [1] and the error measurement introduced by [1], we achieve an error rate of 1.28% misclassified pixels w.r.t. the number of initially unclassified pixels. This error rate is comparable to the average error rate of the best performing state-of-the-art image segmentation method, which is specified as 1.36% in [1]. In Fig. 5 we show segmentation results for additional exemplary test images from the database provided in [1]. The segmentation results show error rates of 1.63%, 0.72% and 1.43% misclassified pixels.
5
Conclusion
We have presented an approach for multi-dimensional histogram-based image segmentation that is embedded in a level-set framework for two-region segmentation. Contrary to standard level-set methods for image segmentation, we assumed that the features on which the segmentation is based are part of a single feature space. In contrast to recent state-of-the-art image segmentation methods, we did not model the feature distributions with Gaussian Mixture Models, but applied multi-dimensional histogram-based feature models and showed that the proposed approach yields competitive results. Furthermore, no specific features (e.g. texture) were needed to achieve the presented results. A number of state-of-the-art image segmentation methods provide an alpha mask as segmentation result, which assigns each pixel in a probabilistic manner to the inside and outside region, respectively. In a level-set framework, an alpha mask is not explicitly incorporated, but it can easily be extracted as a by-product by evaluating the pi(x) of (5) as α(x) = p2(x) / (p1(x) + p2(x)).
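For illustration, the alpha-mask by-product is a one-line operation on the two assignment-probability maps (the values below are toy data):

```python
import numpy as np

# Sketch of the alpha-mask by-product: alpha(x) = p2(x) / (p1(x) + p2(x))
# gives a soft object assignment from the two region probabilities.
p1 = np.array([[0.9, 0.5, 0.1]])         # background-region probabilities (toy)
p2 = np.array([[0.1, 0.5, 0.9]])         # object-region probabilities (toy)
alpha = p2 / np.maximum(p1 + p2, 1e-12)  # guard against empty histogram bins
```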
References
1. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)
2. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 1, pp. 105–112 (2001)
3. Corel Corporation: Knockout User Guide (2002)
4. Chuang, Y.Y., Curless, B., Salesin, D., Szeliski, R.: A Bayesian approach to digital matting. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 264–271 (2001)
5. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79, 12–49 (1988)
6. Mumford, D., Shah, J.: Optimal approximation by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 42, 577–685 (1989)
7. Zhu, S.C., Yuille, A.L.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(9), 884–900 (1996)
8. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. Image Process. 10(2), 266–277 (2001)
9. Kim, J., Fisher, J.W., Yezzi, A.J., Çetin, M., Willsky, A.S.: Nonparametric methods for image segmentation using information theory and curve evolution. In: International Conference on Image Processing, Rochester, New York, vol. 3, pp. 797–800 (2002)
10. Rousson, M., Deriche, R.: A variational framework for active and adaptative segmentation of vector valued images. In: IEEE Workshop on Motion and Video Computing, Orlando, Florida (2002)
11. Brox, T., Rousson, M., Deriche, R., Weickert, J.: Unsupervised segmentation incorporating colour, texture, and motion. Computer Analysis of Images and Patterns 2756, 353–360 (2003)
12. Grossberg, S., Hong, S.: A neural model of surface perception: Lightness, anchoring, and filling-in. Spatial Vision 19(2-4), 263–321 (2006)
13. Parzen, E.: On the estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065–1076 (1962)
14. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
15. Chan, T., Sandberg, B., Vese, L.: Active contours without edges for vector-valued images. Journal of Visual Communication and Image Representation 11(2), 130–141 (2000)
A Framework for Multi-view Gender Classification Jing Li and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China {jinglee,bllu}@sjtu.edu.cn
Abstract. This paper proposes a novel framework for dealing with multi-view gender classification problems and shows its feasibility on the CAS-PEAL database of face images. The framework consists of three stages. First, a wavelet transform is used to intensify multi-scale edges and remove the effects of illumination and noise. Second, instead of the traditional Euclidean distance, the image Euclidean distance, which considers the spatial relationships between pixels, is used to measure the distance between images. Last, a two-layer support vector machine is proposed, which divides face images into different poses in the first layer and then recognizes the gender with different support vector machines in the second layer. Compared with traditional support vector machines and the min-max modular network with support vector machines, our method achieves higher classification accuracy and requires less training and test time.
1
Introduction
With the increasing requirements for advanced surveillance and monitoring systems, gender classification based on face images has received increasing attention in recent years [1,2,3,4,5,6,7]. Many approaches to the problem include three steps: preprocessing, feature extraction and pattern classification. The preprocessing step often includes geometric normalization, masking, and histogram equalization. Then the images are converted into vectors according to the gray level of pixels. Feature extraction methods include shape or texture information extraction [5], and subspace transformations, such as PCA, ICA, and LDA [8,9,10]. Pattern classification methods include k-nearest-neighbor, Fisher linear discriminant [9], neural networks [1,11,12,13], and support vector machines [2,4]. As mentioned above, one representative work is Moghaddam and Yang's RBF-kernel SVM method based on the gray level of pixels [2], which achieved very good results on the FERET database. However, their work deals only with frontal face images. In real-world applications, we are often required to recognize gender based on face images with different poses. In the case of multi-view gender
To whom correspondence should be addressed. This work was partially supported by the National Natural Science Foundation of China under the grant NSFC 60473040, and the Microsoft Laboratory for Intelligent Computing and Intelligent Systems of Shanghai Jiao Tong University.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 973–982, 2008. c Springer-Verlag Berlin Heidelberg 2008
974
J. Li and B.-L. Lu
classification problems, Lian and Lu [5], and Luo and Lu [6] have used SVM and M3-SVM on face images with different poses from the CAS-PEAL database [14]. However, they only used images with pose angles of less than 30 degrees, and their experiments were done on each pose separately. In this paper, we propose a new framework for dealing with the multi-view gender classification problem. The proposed framework has the following three advantages over existing approaches. a) We propose a multi-scale edge enhancement (MSEE) method to strengthen the edge information and eliminate the non-edge information of face images. The reason is that edges in images are often located at the boundaries of important image structures and reflect shapes; therefore edges are more important than non-edge areas, especially in gender classification. b) Almost all existing methods simply convert images into vectors according to the gray level of pixels. However, spatial relationships between pixels are lost after this conversion. On the other hand, Wang et al. have proposed the Image Euclidean Distance (IMED) [15], which considers spatial relationships between pixels. They have also proved that IMED can be realized by simply applying a linear transformation to images before feeding them to the classification algorithm. In our framework, we apply this linear transformation to images before converting them into vectors. c) Although support vector machines have achieved very good performance in gender classification, their time complexity limits their use in large-scale applications. To reduce the training time and increase the classification accuracy, we propose a layered support vector machine (LSVM) that divides a complicated, large-scale problem into several easy subproblems and then solves these subproblems with different SVM modules in different feature spaces.
2 Preprocessing

2.1 Multi-scale Edge Enhancement
Suppose ψ(t) is the derivative of a smoothing function θ(t). The wavelet transform of f(x) at scale s and position x is defined by

W_s f(x) = (f * \psi_s)(x)    (1)

where \psi_s(x) = \frac{1}{s} \psi\!\left(\frac{x}{s}\right) is the dilation of the basic wavelet ψ(x) by the scale factor s. Let s = 2^j (j ∈ Z, where Z is the set of integers); the WT is then called the dyadic WT. According to Mallat's algorithm [16], the dyadic WT of a digital signal can be calculated iteratively by convolution with two complementary filters, the low-pass and the high-pass filters, as illustrated in Fig. 1 (a). The down-sampling step in Fig. 1 (a) removes the redundancy of the signal representation. As a by-product, it separates f(x) into fragments and reduces the temporal resolution of the wavelet coefficients for increasing scales. To keep the continuity and the temporal resolution at different scales, we use the same sampling rate at all scales, which is achieved by interpolating the filter impulse
Fig. 1. Two implementations of the dyadic discrete wavelet transform. (a) Mallat's algorithm; (b) algorithme à trous.
Fig. 2. Wavelet decomposition of a face image. The upper row, from left to right: the original image and the modulus images of (W_{2^j} f)_{1≤j≤6}. The lower row, from left to right: the multi-scale edge enhancement image and (S_{2^j} f)_{1≤j≤6}.
responses of the previous scale, as illustrated in Fig. 1 (b). This algorithm is called the algorithme à trous [17], and the detailed decomposition step is defined by

S_{2^j} f(n) = H\!\left(z^{2^{j-1}}\right) * S_{2^{j-1}} f(n)
W_{2^j} f(n) = G\!\left(z^{2^{j-1}}\right) * S_{2^{j-1}} f(n)    (2)

In this paper, we use a quadratic spline originally proposed in [18] as the prototype wavelet ψ(t). The Fourier transform of the quadratic spline is

\hat{\Psi}(\omega) = i\omega \left( \frac{\sin(\omega/4)}{\omega/4} \right)^4    (3)

where the symbol 'ˆ' represents the discrete Fourier transform. The corresponding wavelet transform of a face image is shown in Fig. 2, from which we can see that in each decomposition step, S_{2^j} f is smoothed and the edge information at the corresponding scales is removed. At small scales, such as 2^1, M_{2^j} f contains not only the edge information but also much noise. At large scales, such as 2^5 and 2^6, the edge information reflected by M_{2^j} f is almost meaningless. On the other hand, S_{2^5} f and S_{2^6} f mainly reflect the effects of illumination and non-edge information; eliminating them removes these effects.
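A minimal 1-D sketch of the à trous scheme of Eq. (2) follows. This is our own illustration: a generic smoothing filter stands in for the paper's quadratic-spline filter pair, and the detail coefficients are taken as differences of successive approximations, a common à trous variant. The point is the hole-insertion (zero-padding) that dilates the filter while keeping full temporal resolution at every scale.

```python
import numpy as np

def atrous(signal, h=(0.25, 0.5, 0.25), levels=3):
    # At level j the filter taps are spread 2^j apart by inserting zeros
    # ("holes"), so no down-sampling is needed and every scale keeps the
    # full signal length. h is an illustrative smoothing filter.
    s = np.asarray(signal, dtype=float)
    approx, details = [s], []
    for j in range(levels):
        step = 2 ** j
        hj = np.zeros((len(h) - 1) * step + 1)
        hj[::step] = h                        # dilated low-pass H(z^{2^j})
        s_next = np.convolve(approx[-1], hj, mode="same")
        details.append(approx[-1] - s_next)   # detail: what smoothing removed
        approx.append(s_next)
    return approx, details

approx, details = atrous(np.sin(np.linspace(0, 4 * np.pi, 64)))
```

Because every scale retains the full resolution, the decomposition is trivially invertible here: the coarsest approximation plus all detail layers reproduces the input.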
Although the wavelet transform can be used to extract edges from M_{2^j} f, the calculation is time-consuming and needs user-defined parameters such as thresholds at each scale. Extracted edges also need to be linked by some morphological operations. In this paper, our goal is to strengthen the effect of edges without heavy calculation. So we calculate the difference D_S of S_{2^j} f at a small scale 2^{j_1} and a large scale 2^{j_2}:

D_S = S_{2^{j_1}} f(x, y) - S_{2^{j_2}} f(x, y)    (4)
D_S mainly contains the information of edges from scale 2^{j_1+1} to scale 2^{j_2}. Histogram equalization of D_S can increase the contrast and make the edges clearer. We call the histogram-equalized image of D_S the multi-scale edge-enhanced image. An example image is shown in the bottom-left corner of Fig. 2, from which we can see that the contour of the face is enhanced and the right part is much clearer than in the original image.
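The MSEE step can be sketched as follows. This is our own simplified stand-in: box filters approximate the smoothed images S_{2^j} f, and the scales, test image, and 8-bit equalization range are illustrative.

```python
import numpy as np

def box_smooth(img, k):
    # Separable box filter as a cheap surrogate for the spline smoothing
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kern, mode="same"), 0, out)

def hist_equalize(img, levels=256):
    # Standard histogram equalization via the cumulative distribution
    flat = np.clip(img, 0, levels - 1).astype(int).ravel()
    cdf = np.cumsum(np.bincount(flat, minlength=levels)) / flat.size
    return (cdf[flat] * (levels - 1)).reshape(img.shape)

img = np.zeros((32, 32)); img[:, 16:] = 200.0   # toy image with a vertical edge
ds = box_smooth(img, 3) - box_smooth(img, 9)    # Eq. (4): small minus large scale
msee = hist_equalize(ds - ds.min())             # contrast-stretched edge map
```

The difference `ds` is large only near the edge and near zero in flat regions, and the equalization spreads that response over the full grey range.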
2.2 Image Euclidean Distance (IMED)
The traditional Euclidean distance compares the gray values of two images pixel by pixel. It does not take into account the spatial relationships of pixels. If the images are not aligned well, the distance between them will be large even though they may look very similar. Unlike the traditional Euclidean distance, IMED takes into account the spatial relationships of pixels. Therefore, it is robust to small perturbations of images. IMED defines the distance of two images x and y as
d_E^2(x, y) = (x - y)^T G (x - y) = \sum_{i,j=1}^{MN} g_{ij} \left( x^i - y^i \right)\left( x^j - y^j \right)    (5)
where g_{ij} is the metric coefficient indicating the spatial relationship between pixels P_i and P_j. In this paper, g_{ij} is defined by

g_{ij} = f(|P_i - P_j|) = \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{|P_i - P_j|^2}{2\sigma^2} \right)    (6)

where σ is the width parameter, which is set to 1 in this paper, and |P_i − P_j| is the spatial distance between P_i and P_j on the image lattice. IMED can be embedded in classification algorithms that are based on the Euclidean distance by applying the following linear transformation (G^{1/2}) to the original images x and y,

u = G^{1/2} x, \qquad v = G^{1/2} y    (7)

then calculating the IMED between x and y reduces to calculating the traditional Euclidean distance between u and v as follows:

(x - y)^T G (x - y) = (x - y)^T G^{1/2} G^{1/2} (x - y) = (u - v)^T (u - v)    (8)
As a result, embedding IMED in a classification algorithm amounts to simply applying the linear transformation G^{1/2} to the images before feeding them to the classification algorithm. From this point of view, we treat the transformation G^{1/2} as a preprocessing step before classification in this paper.
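The preprocessing view of IMED can be checked directly in code. This is our own illustration on a tiny 4×4 pixel lattice; an eigendecomposition provides G^{1/2}, which is valid since the Gaussian metric matrix G of Eq. (6) is symmetric positive definite.

```python
import numpy as np

# Build G over a small H x W pixel lattice (Eq. 6), sigma = 1 as in the paper
H, W, sigma = 4, 4, 1.0
ys, xs = np.mgrid[0:H, 0:W]
coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
G = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# Matrix square root via eigendecomposition: G = V diag(w) V^T
w, V = np.linalg.eigh(G)
G_half = V @ np.diag(np.sqrt(np.maximum(w, 0))) @ V.T

# IMED on the originals equals plain Euclidean distance after G^{1/2} (Eq. 8)
x = np.random.default_rng(1).random(H * W)
y = np.random.default_rng(2).random(H * W)
imed = (x - y) @ G @ (x - y)
euclid_after = np.sum((G_half @ x - G_half @ y) ** 2)
```

Since G depends only on the image size, G^{1/2} is computed once and reused for every training and test image, which is what makes the preprocessing view cheap.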
3 Layered Support Vector Machine

3.1 Algorithm
Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {1, −1}, a support vector machine [19] requires the solution of the following optimization problem:

\min_{W, b, \xi} \; \frac{1}{2} W^T W + C \sum_{i=1}^{l} \xi_i
\text{subject to} \quad y_i \left( W^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0    (9)

Here the training vectors x_i are mapped into a higher-dimensional space by the function φ. SVM then finds a linear separating hyperplane with the maximal margin in this higher-dimensional space. C > 0 is the penalty parameter of the error term. Furthermore, a kernel function can be defined according to φ:

K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)    (10)
The great success of SVM should be attributed to the introduction of kernel functions. The nonlinear mapping of input vectors into a high-dimensional feature space can turn a nonlinearly separable problem into a linearly separable one. The key point of using SVM for classification is to find a suitable feature space. But a single feature space may not be enough for large-scale problems, since the distributions of data are often complicated in real-world applications. These data may be classified more correctly in different feature spaces. In many large-scale applications, training data inherently belong to different subproblems, and it is easier to divide them into these subproblems than to divide them into categories with complicated hidden meanings. So we propose a layered support vector machine (LSVM). The first layer of the LSVM is a support vector machine that divides the problem into several subproblems, while the second layer has several SVMs to solve these subproblems individually. The SVMs in the second layer are independent, so they can have different feature spaces, like experts in different fields solving problems with different methods. When a test instance arrives, the first layer decides which subproblem it belongs to, and it is then classified by the corresponding ‘expert’ in the second layer. Clearly, the accuracy of the first-layer SVM influences the final accuracy. So we emphasize that the proposed LSVM should be used only in circumstances where the original problem can easily be divided into different subproblems, which is not a strict requirement since many large-scale problems inherently consist of different subproblems.
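The two-layer structure can be sketched as follows. This is our own dependency-free illustration: nearest-centroid classifiers stand in for the SVMs, since only the routing architecture (first layer picks the subproblem, e.g. the pose; the routed second-layer expert decides the final label, e.g. the gender) is being shown. Data and labels are toy values.

```python
import numpy as np

class NearestCentroid:
    # Minimal stand-in classifier (the real framework uses SVMs here)
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

class LayeredClassifier:
    # First layer routes to a subproblem; second layer has one expert per subproblem
    def fit(self, X, subproblem, label):
        self.router = NearestCentroid().fit(X, subproblem)
        self.experts = {s: NearestCentroid().fit(X[subproblem == s], label[subproblem == s])
                        for s in np.unique(subproblem)}
        return self
    def predict(self, X):
        routes = self.router.predict(X)
        out = np.empty(len(X), dtype=int)
        for s, expert in self.experts.items():
            m = routes == s
            if m.any():
                out[m] = expert.predict(X[m])
        return out

# Toy data: feature 0 encodes the subproblem, feature 1 the label
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]], dtype=float)
sub = np.array([0, 0, 1, 1])   # e.g. pose index
lab = np.array([0, 1, 0, 1])   # e.g. gender
pred = LayeredClassifier().fit(X, sub, lab).predict(X)
```

Because each expert is fitted independently, each could in principle use its own feature space (e.g. a different kernel per pose), which is exactly the flexibility the paragraph above argues for.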
J. Li and B.-L. Lu

3.2 Complexity Analysis
Theoretically, LSVM can not only improve the classification accuracy but also save training and test time. The time complexity of a standard SVM QP solver is O(M³), where M denotes the number of training samples. A decomposition method [20] has a complexity of O(k(Mq + q³)), where q is the size of the working set, which is often related to the number of support vectors, and k is the number of iterations. Of course, k is expected to increase as M and the number of support vectors increase. The time complexity of traditional SVM can thus also be written as O(M^p), where p is between 2 and 3. In layered SVM, the training data set of the first-layer SVM is the same as that of a traditional SVM. But the former problem is easier than the latter, which means fewer support vectors and a smaller q, so the training time can be greatly reduced. The time complexity of one SVM in the second layer is O((M/K)^p), where K is the number of SVMs in the second layer, and for simplicity we suppose that each second-layer SVM has the same number of training data. If we train the second-layer SVMs in parallel, the total time complexity of the second layer is O((M/K)^p), which is much less than O(M^p). If we train them in serial, the total time complexity is O(K(M/K)^p), which is still less than that of traditional SVM. During the recognition phase, the main time cost is calculating the kernel between the test vector and the support vectors in the high-dimensional input space. So the test time complexity of traditional SVM is O(n), where n is the number of support vectors. In layered SVM, the test instance is first fed into the first-layer SVM, with time complexity O(n1), where n1 is the number of support vectors of the first-layer SVM. The test instance is then classified by one SVM in the second layer according to the output of the first layer, with time complexity O(n2,i), where n2,i is the number of support vectors of the i-th SVM in the second layer.
The average test time complexity is O(n1) + (Σ_{i=1}^{K} O(n2,i))/K. Since n1 and (Σ_{i=1}^{K} n2,i)/K are much less than n, the test time is reduced compared with traditional SVM.
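As a sanity check, the ratios implied by the O(M^p) model work out to M^p/K^(p−1) for serial second-layer training and M^p/K^p in parallel. The values of M, p, and K below are illustrative, not measurements from the paper:

```python
# Theoretical training-cost model for LSVM's second layer, following
# the O(M^p) analysis in the text (M, p, K are illustrative values).
def second_layer_cost(M, K, p, parallel):
    per_svm = (M / K) ** p          # one second-layer SVM: O((M/K)^p)
    return per_svm if parallel else K * per_svm

M, p, K = 16372, 2.5, 3             # sizes echoing the later experiments
single = M ** p                     # traditional SVM on all M samples
serial = second_layer_cost(M, K, p, parallel=False)
par = second_layer_cost(M, K, p, parallel=True)

# Serial cost equals single/K^(p-1); parallel cost equals single/K^p.
assert abs(serial - single / K ** (p - 1)) < 1e-6 * single
assert abs(par - single / K ** p) < 1e-6 * single
print(f"serial speedup {single / serial:.1f}x, parallel {single / par:.1f}x")
```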
4 Experimental Results

4.1 Experiment Setup
CAS-PEAL-R1 [14] is a large-scale face database that currently contains 21,832 images of 1,040 individuals (595 males and 445 females) in its 'pose' subdirectory. Each individual is asked to look upwards, forward, and downwards, respectively. In each pose, 7 images are obtained from left to right, as shown in Fig. 3. In our experiments, the images are scaled according to the eye coordinates and cropped so that only the face area remains. No masking template is used, because we believe the outline of the face is important for gender classification, and this information would be removed by a masking template. The final resolution is 60 × 48 pixels. We use 5,460 images of 260 individuals as the test data set, while the
A Framework for Multi-view Gender Classification
Fig. 3. Different poses of one individual in the CAS-PEAL-R1 database
remaining 16,372 images of 780 individuals form the training data set. All the images of 600 individuals in the training data set are divided into 3 groups: looking left (from 22° to 90° to the left), looking middle (from 22° to the left to 22° to the right), and looking right (from 22° to 90° to the right). The images of the remaining 180 individuals in the training data set are grouped into the fourth training data set. The detailed information of each data set is listed in Table 1.

4.2 Experiments on MSEE and IMED
We perform experiments on all the data sets with different preprocessing methods: histogram equalization on the original images, histogram equalization with IMED, MSEE, and MSEE with IMED. The nearest neighbor classifier and support vector machines are used as classifiers. The classification accuracies are shown in Table 2. From this table, we can see that higher classification accuracies are achieved when the poses of the training data set are the same as those of the test data set. In that case, IMED and MSEE achieve better classification accuracies than the original images, and MSEE with IMED achieves the best performance, whether combined with the nearest neighbor classifier or with SVMs.

4.3 Experiments on LSVM
Now we carry out experiments on layered SVMs. Since the outlines of faces in different poses are distinct, it is much easier to divide face images into different poses than into different genders. Other researchers have also reported high accuracies in pose classification with support vector machines [1,21]. So the first layer in our LSVM is a support vector machine that divides images into left, middle, and right poses. The second layer contains 3 support vector machines for gender classification at the different poses.
Table 1. Number of images in each data set

No.  Pose   Train                   Test
            Male   Female  Total    Male   Female  Total
1    left   2112   1714    3826     927    708     1635
2    mid    2538   2403    4941     1191   999     2190
3    right  2112   1714    3826     927    708     1635
4    all    2687   1092    3779     3045   2415    5460
All  all    9449   6923    16372    3045   2415    5460

Table 2. Classification accuracies (%) of nearest neighbor classifier (NN) and support vector machines (SVMs) with different preprocessing methods
Train  Test   NN                                      SVM
              Orig.   IMED    MSEE    MSEE+IMED       Orig.   IMED    MSEE    MSEE+IMED
1      1      70.64   72.05   71.13   73.70           87.89   88.93   89.60   89.97
1      2      65.66   66.53   65.80   64.79           76.67   76.48   75.89   76.71
1      3      60.43   60.67   61.10   60.61           67.71   67.09   69.48   70.76
1      All    65.59   66.43   65.99   66.21           77.34   76.98   78.08   78.90
2      1      59.94   63.00   66.73   66.60           67.09   67.34   70.83   70.83
2      2      76.39   76.67   77.44   78.86           89.95   90.14   92.01   92.33
2      3      67.83   70.09   71.25   71.56           73.46   71.62   77.80   78.17
2      All    68.90   70.60   72.38   73.00           78.17   77.77   81.41   81.65
3      1      56.94   57.74   57.31   54.92           66.61   66.73   69.11   69.30
3      2      63.88   64.89   64.79   68.40           76.99   77.58   75.84   76.48
3      3      77.25   78.23   78.35   79.39           90.58   91.01   92.54   92.84
3      All    65.81   66.74   66.61   67.66           77.95   77.53   78.83   79.23
4      1      64.46   65.44   66.17   66.91           82.14   83.06   83.06   83.61
4      2      73.84   73.93   74.79   75.53           85.62   86.35   87.12   87.40
4      3      73.15   73.39   74.13   74.19           86.73   86.73   86.42   86.79
4      All    70.82   71.23   72.00   72.55           84.91   84.98   85.70   86.08
All    1      69.97   72.05   72.60   73.64           87.95   88.69   89.72   90.83
All    2      78.49   78.68   78.45   80.18           91.14   91.42   91.74   92.37
All    3      79.14   79.14   78.72   80.24           91.62   91.93   92.84   92.97
All    All    76.14   76.83   76.78   78.24           90.33   90.62   91.47   92.09
Table 3. Results of using SVM, M3-SVM and LSVM on all the training data. Here, the left column in 'Time' means training in serial, while the right column means training in parallel, and the unit is 's'.

Preprocessing   SVM                       M3-SVM                         LSVM
                Acc    nSV    Time        Acc    nSV    Time             Acc    nSV    Time
original        90.33  7887   28,554      91.06  7447   3,051   541      91.12  4814   2,131   1,119
IMED            90.62  6822   21,201      91.15  6128   2,997   520      91.03  4158   1,900   901
MSEE            91.47  6970   24,651      91.98  8473   3,843   693      92.14  6143   2,755   1,355
MSEE+IMED       92.09  7067   24,022      92.20  8777   3,164   556      92.44  4680   2,021   974
An RBF kernel is chosen for each SVM, and the parameters are chosen by five-fold cross validation on the training data set. The results are shown in Table 3. Traditional SVM and M3-SVM are also used for comparison. In the M3-SVM, we divide the images of each class into three poses, and then 9 SVMs are trained and combined according to the min-max combination rules [22]. All the experiments are performed on a 2.8 GHz Pentium 4 PC with 2 GB RAM, and LIBSVM [23] is used for the implementation of SVM. We can see that LSVM achieves the best classification accuracy among the three methods. Furthermore, LSVM has the smallest number of support vectors, which leads to the shortest response time. Finally, the training time of LSVM is comparable to that of M3-SVM and much less than that of traditional SVM, whether training in parallel or in serial. On average, the training time of LSVM is only 4.4% (in parallel) and 9.0% (in serial) of that of traditional SVM. In the LSVM experiments, the 'pose divider' may misclassify some test images into the wrong poses. But we have found that these improperly divided images always have pose directions near 22°. Suppose an image X with a left pose direction slightly greater than 22° is misclassified into the middle pose by the 'pose divider'. Since the training set of middle poses contains some images with a left pose slightly smaller than 22°, these images can still be helpful in classifying X into the correct gender.
5 Conclusions
In this paper we have proposed a multi-view gender classification framework that consists of three steps. First, we use the multi-scale edge enhancement method on the original images to intensify edges and eliminate illumination and non-edge information. Then, the image Euclidean distance, which takes the geometric relationships between pixels into account, is used to make the distance measure between images more reasonable. Last, a layered SVM, which divides face images into different poses in the first layer and then recognizes the gender with different support vector machines in the second layer, is proposed to increase the classification accuracy and reduce the training and test time. Experiments on the CAS-PEAL face database show the effectiveness of the proposed framework.
References

1. Gutta, S., Huang, J., Jonathon, P., Wechsler, H.: Mixture of experts for classification of gender, ethnic origin, and pose of human faces. IEEE Transactions on Neural Networks 11(4), 948–960 (2000)
2. Moghaddam, B., Yang, M.H.: Learning gender with support faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 707–711 (2002)
3. Khan, A.: Combination and optimization of classifiers in gender classification using genetic programming. International Journal of Knowledge-Based and Intelligent Engineering Systems 9(1), 1–11 (2005)
4. Lian, H.C., Lu, B.L., Takikawa, E., Hosoi, S.: Gender recognition using a min-max modular support vector machine. In: Wang, L., Chen, K., Ong, Y.S. (eds.) ICNC 2005. LNCS, vol. 3611, pp. 438–441. Springer, Heidelberg (2005)
5. Lian, H.C., Lu, B.L.: Multi-view gender classification using local binary patterns and support vector machines. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3972, pp. 202–209. Springer, Heidelberg (2006)
6. Luo, J., Lu, B.L.: Gender recognition using a min-max modular support vector machine with equal clustering. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3972, pp. 210–215. Springer, Heidelberg (2006)
7. Kim, H.C., Kim, D., Ghahramani, Z., Bang, S.Y.: Appearance-based gender classification with Gaussian processes. Pattern Recognition Letters 27(6), 618–626 (2006)
8. Balci, K., Atalay, V.: PCA for gender estimation: Which eigenvectors contribute. Proceedings of Sixteenth International Conference on Pattern Recognition 3, 363–366 (2002)
9. Jain, A., Huang, J.: Integrating independent components and linear discriminant analysis for gender classification. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 159–163 (2004)
10. O'Toole, A.J., Deffenbacher, K.A., Valentin, D., McKee, K., Huff, D., Abdi, H.: The perception of face gender: The role of stimulus structure in recognition and classification. Memory and Cognition 26(1), 146–160 (1998)
11. Cottrell, G.W., Metcalfe, J.: EMPATH: face, emotion, and gender recognition using holons. In: Proceedings of the 1990 Conference on Advances in Neural Information Processing Systems, pp. 564–571 (1990)
12. Edelman, B., Valentin, D., Abdi, H.: Sex classification of face areas: how well can a linear neural network predict human performance. Journal of Biological Systems 6(3), 241–264 (1998)
13. Golomb, B., Lawrence, D., Sejnowski, T.: SexNet: A neural network identifies sex from human faces. In: Proceedings of the 1990 Conference on Advances in Neural Information Processing Systems, pp. 572–577 (1990)
14. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL large-scale Chinese face database and baseline evaluations. Technical report of JDL (2004), http://www.jdl.ac.cn/peal/pealtr.pdf
15. Wang, L., Zhang, Y., Feng, J.: On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1334–1339 (2005)
16. Mallat, S.: Multifrequency channel decompositions of images and wavelet models. IEEE Transactions on Acoustics, Speech, and Signal Processing 37(12), 2091–2110 (1989)
17. Cohen, A., Kovacevic, J.: Wavelets: the mathematical background. Proceedings of the IEEE 84(4), 514–522 (1996)
18. Mallat, S., Zhong, S.: Characterization of signals from multiscale edges. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(7), 710–732 (1992)
19. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
20. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning, pp. 185–208 (1999)
21. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 154–156 (1998)
22. Lu, B.L., Wang, K.A., Utiyama, M., Isahara, H.: A part-versus-part method for massively parallel training of support vector machines. In: 2004 IEEE International Joint Conference on Neural Networks, vol. 1, pp. 735–740 (2004)
23. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
Japanese Hand Sign Recognition System Hirotada Fujimura, Yuuichi Sakai, and Hiroomi Hikawa Oita University, Oita, 870-1192 Japan
Abstract. This paper discusses a Japanese hand sign recognition system with a simple classifier network. In the system, input hand images are preprocessed through horizontal/vertical projection followed by discrete Fourier transforms (DFTs) that calculate the magnitude spectrum. The magnitude spectrum is used as the feature vector. Use of the magnitude spectrum makes the system very robust against changes in the position of the hand image. The final classification is carried out by the classifier network, which uses simple neurons. Each neuron evaluates the possibility that the input vector belongs to its assigned cluster. From the evaluation results, the hand sign is identified. The feasibility of the system is verified by simulations. The simulation results show that the average recognition rate of the system is 93% even though the hand positions are changed randomly.
1 Introduction
The number of approaches to video-based hand gesture recognition has grown in recent years [1]. The use of hand gestures provides an attractive alternative to cumbersome interface devices for human-computer interaction (HCI). Visual interpretation of hand gestures can make it possible to migrate into HCI the natural means that humans employ to communicate with each other. Generally, hand gestures are either static hand postures [2][3] or dynamic hand gestures [4][5]. Hand sign recognition is a kind of pattern recognition, i.e., a mapping process from input vectors to a finite set of clusters, each cluster being associated with a posture. The input image is converted to a feature vector, and its class is determined by searching for the closest cluster prototype, the one that minimizes the distance to the input vector. As the distance measure, the Euclidean distance is widely used, and the radial basis function (RBF) network used for various pattern classification/recognition tasks [6][7][8] employs a Gaussian function associated with the Euclidean distance. However, the computational costs of these functions are prohibitive. In [9], a very simple method to estimate the distance between input vectors and a predefined cluster has been proposed. This method is employed in the neuron for pattern classification. It should also be noted that the method can be implemented in hardware at low cost, as it requires no complicated calculations such as multiplication or nonlinear functions. This paper describes the Japanese hand sign recognition system with the new classifier network built from the neuron mentioned above; it classifies the input

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 983–992, 2008. © Springer-Verlag Berlin Heidelberg 2008
H. Fujimura, Y. Sakai, and H. Hikawa

Fig. 1. Hand sign recognition system (processing flow: input image in P × Q, RGB color format (R(x, y), G(x, y), B(x, y)) → binary quantization I(x, y) → horizontal projection PH(y) and vertical projection PV(x) → DFT → FH(n) and FV(n) → classifier network → class)
vectors as the RBF network does. In the system, input hand images are preprocessed through horizontal/vertical projection followed by discrete Fourier transforms (DFTs) that calculate the magnitude spectrum. The magnitude spectrum is used as the feature vector. Use of the magnitude spectrum makes the system very robust against the position changes of the hand image.
2 Hand Posture Recognition System
The process flow of the Japanese hand sign recognition system is shown in Fig. 1. The input image is P × Q pixels in RGB color format, and it is preprocessed to obtain feature vectors. The preprocessing consists of a binary quantization and horizontal/vertical projections followed by two discrete Fourier transforms (DFTs). The preprocessing generates FH(n) and FV(n), which form the feature vector of the input image. The D-dimensional feature vector is fed to the classifier network, which finally identifies the hand sign. The network consists of neurons using the evaluation method proposed in [9].

2.1 Preprocessing
Signal examples of the system are depicted in Fig. 2. First, the color image is converted to a binary image prior to further processing, as most image recognition systems do. This subsystem not only quantizes the input image but also removes the background and extracts the hand portion only. To make the hand shape
Fig. 2. Example of the signals in the hand sign recognition system. (A) Binary image, (B) horizontal projection PH(y), (C) vertical projection PV(x), (D) magnitude spectrum FH(n), (E) magnitude spectrum FV(n), (F) the same hand posture in a different position yields identical FH(n), FV(n).
extraction process easier, the users are required to wear a red glove. Hence the extraction/binary quantization process is simplified as

I(x, y) = g(Red(x, y), Green(x, y) + Blue(x, y)) · g(Red(x, y), ρ)   (1)

where I(x, y) is the binary pixel value at coordinate (x, y); Red(x, y), Green(x, y), and Blue(x, y) are the intensity levels of the red, green, and blue channels at (x, y) of the input color image, respectively; ρ is a threshold parameter; and g(·) is the threshold function

g(x, θ) = 1 if x ≥ θ, 0 otherwise   (2)

An example of the binary image is shown in Fig. 2(A). The use of the red glove makes it possible to remove the background and arm portion, and the result is a binary image of the hand portion only. I(x, y) is then fed to the horizontal and vertical projection subsystems. Projection is defined here as an operation that maps a binary image into a one-dimensional array called a histogram. The values of the histogram are the sums of the white pixels along a particular direction. Two types of histograms are defined: the horizontal projection histogram PH(y) and the vertical projection histogram PV(x).

PH(y) = Σ_{x=0}^{P−1} I(x, y)   (3)

PV(x) = Σ_{y=0}^{Q−1} I(x, y)   (4)
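The projection and DFT stages (eqs. (3)–(4) plus the magnitude spectrum) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: it assumes the image is a 2-D array indexed [y, x] and that D is the final feature dimension.

```python
import numpy as np

def features(binary_img, D=22):
    """Projection histograms (eqs. (3)-(4)) followed by DFT magnitude;
    the lowest D/2 components of each spectrum form the feature vector."""
    ph = binary_img.sum(axis=1)           # PH(y): sum of white pixels over x
    pv = binary_img.sum(axis=0)           # PV(x): sum of white pixels over y
    fh = np.abs(np.fft.fft(ph))[:D // 2]  # magnitude spectrum FH(n)
    fv = np.abs(np.fft.fft(pv))[:D // 2]  # magnitude spectrum FV(n)
    return np.concatenate([fh, fv])
```

With D = 22 and a 128 × 128 image this reproduces the dimensionality reduction described later, from 16,384 raw pixels to a 22-element feature vector.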
The horizontal and vertical projection histograms of the binary image of Fig. 2(A) are depicted in Fig. 2(B) and (C), respectively. Two DFTs then calculate FH(n) and FV(n), the magnitude spectra of PH(y) and PV(x), respectively. Two images with the same hand posture placed in different positions yield different projection histograms. However, the only difference between the two histograms is their position; their shapes are identical. Thus the spectrum magnitudes FH(n) and FV(n) calculated from images of the same hand posture placed in different positions are all identical. The two images (A) and (F) in Fig. 2 yield the same magnitude spectra (D) and (E), as they are the same hand sign in different positions. In this way, the use of the magnitude spectra FH(n) and FV(n) provides position-independent hand posture identification. As can be seen in Fig. 2(D) and (E), the features of the image concentrate in the lower frequency components. Thus the lowest D/2 frequency components are taken from each of the spectra FH(n) and FV(n) to form a D-dimensional feature vector. The dimension of the vector is thereby reduced from P × Q to D, and usually D ≪ P × Q.
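The position-independence claim is easy to verify numerically: a circular shift of the image shifts both projection histograms, and the DFT magnitude of a circularly shifted sequence is unchanged. The blob below is a stand-in for a hand image (an assumption of this sketch), and the shifts are chosen so the blob does not wrap around the frame, matching an in-frame translation:

```python
import numpy as np

img = np.zeros((64, 64))
img[10:30, 5:25] = 1                                   # stand-in "hand" blob
moved = np.roll(np.roll(img, 7, axis=0), 11, axis=1)   # same posture, new position

for axis in (0, 1):
    p1, p2 = img.sum(axis=axis), moved.sum(axis=axis)
    assert not np.array_equal(p1, p2)                  # histograms are shifted copies
    m1, m2 = np.abs(np.fft.fft(p1)), np.abs(np.fft.fft(p2))
    assert np.allclose(m1, m2)                         # magnitude spectra identical
print("magnitude spectra are position independent")
```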
3 Classifier Network
The input to the network is a D-dimensional vector x from the preprocessing, and each input vector element is written as xn. Each neuron is associated with a single cluster, and it evaluates how close the input vector is to its assigned cluster. The basic idea of the evaluation method in [9] is the assumption that if a vector belongs to a given cluster, then the probability of its elements falling within a certain range is expected to be very high. M upper and lower limit sets are defined for each vector element from the training data set, and the evaluations are carried out by testing whether the input vector elements are within the upper and lower limits:

U_nm^(s) = μ_n^(s) + α_m · σ_n^(s), m = 1, 2, . . . , M   (5)
L_nm^(s) = μ_n^(s) − α_m · σ_n^(s), m = 1, 2, . . . , M   (6)

Here μ_n^(s) and σ_n^(s) are the mean and standard deviation of the n-th vector element over the training vectors belonging to class s, and α_m is a coefficient that adjusts the upper and lower limits. Using U_nm^(s) and L_nm^(s), the following range tests are carried out on a given input vector; the number of range tests performed by each neuron is D × M:

r_nm^(s)(x_n) = 1 if U_nm^(s) > x_n > L_nm^(s), 0 otherwise   (7)

An example of the upper–lower data set is shown in Fig. 3. In this example, the dimension of the vector is four, and the vectors are divided into two classes.
Fig. 3. Vector element ranges (∗: training vectors in class 1; ◦: training vectors in class 2; arrows: input vector elements; upper limits U1 = μ + α1 · σ and U2 = μ + α2 · σ; lower limits L2 = μ − α2 · σ and L1 = μ − α1 · σ; μ: average)
Using α1 and α2 (α1 > α2), each vector element is associated with two upper–lower ranges (M = 2). The mean μ and the standard deviation σ are calculated from the training vectors. The neuron output E^(s) is given by equation (8):

E^(s) = Σ_{m=1}^{M} Σ_{n=1}^{D} r_nm^(s)(x_n) · w_m   (8)

where w_m is a weight. The weight value is determined by considering the size of the range determined by α_m: as shown in Fig. 3, the smaller the α_m, the larger the weight that should be assigned to its range check. E^(s) becomes larger as the possibility of the input vector belonging to the neuron's cluster increases. Suppose the input vector indicated by the arrows in Fig. 3 is given and the weights are w1 = 1, w2 = 2. Then the neuron outputs are E^(1) = 1 + 3 + 1 + 3 = 8 and E^(2) = 0 + 1 + 0 + 1 = 2, respectively, so the class of the input vector is decided as class 1. If M = 1 (only α1 is used), then E^(1) = 1 + 1 + 1 + 1 = 4 and E^(2) = 0 + 1 + 0 + 1 = 2. In this case, the difference between E^(1) and E^(2) is only two, while the difference is six in the above example with M = 2. Larger differences between the E^(s) make the classification more reliable; thus, more precise classification is expected as M increases.
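Equations (5)–(8) and the winner-takes-all decision amount to only a few lines of NumPy. This is an illustrative sketch on synthetic data, not the paper's implementation; the default α and weight values echo the M = 2 setting used later, but any values could be substituted.

```python
import numpy as np

class RangeCheckNeuron:
    """Eqs. (5)-(8): a weighted count of how many input elements fall
    inside the per-class mean +/- alpha_m * std ranges."""

    def __init__(self, alphas=(1.0, 0.8), weights=(1, 3)):
        self.alphas, self.weights = alphas, weights

    def fit(self, X):                 # X: training vectors of ONE class s
        self.mu = X.mean(axis=0)      # mu_n^(s)
        self.sigma = X.std(axis=0)    # sigma_n^(s)
        return self

    def score(self, x):               # E^(s) for one input vector x
        e = 0
        for a, w in zip(self.alphas, self.weights):
            upper = self.mu + a * self.sigma                # eq. (5)
            lower = self.mu - a * self.sigma                # eq. (6)
            e += w * np.sum((x > lower) & (x < upper))      # eqs. (7)-(8)
        return e

def classify(neurons, x):
    """Winner-takes-all over one neuron per class (the Fig. 4 idea)."""
    return max(neurons, key=lambda s: neurons[s].score(x))
```

Note that only comparisons, additions, and a small fixed set of weights are needed, which is exactly what makes the hardware implementation discussed in Sect. 3.2 cheap.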
Fig. 4. Classifier network (the D inputs x1, . . . , xD feed R neurons, each computing eq. (8) to produce E^(1), E^(2), . . . ; a MAX value finder selects the winning class among C1, C2, . . . , CR)
3.1 Classifier Network for Hand Sign Recognition

The classifier network for the hand sign recognition is shown in Fig. 4. Each neuron performs eq. (8). Winner-takes-all competition by the maximum value finder circuit is employed for the final classification. The circuit searches for the maximum output among the neurons; the neuron with that output is the winner, and the class assigned to it is given as the recognition result.

3.2 Hardware Configuration
As the proposed neuron requires no complicated calculations, it can be realized by simple digital circuits. The hardware implementation of the proposed neuron is depicted in Fig. 5. The neuron examines all input vector elements to see whether they are within the predefined ranges. A range check circuit, shown in Fig. 6, performs the range test given by equation (7): the comparator becomes active and yields '1' if the input element is between the upper and lower limits. The sum of all the range check circuit outputs is then given as the evaluation index E_m^(s). M copies of the E_m^(s) calculation circuit with different α_m are included in the neuron, and the final estimation index E^(s) is given as the sum of the E_m^(s), which indicates how close the input vector is to the neuron's cluster.
4 Simulation
The feasibility of the proposed system is tested by simulations. The Japanese hand sign system consists of 41 static hand signs and 5 dynamic hand signs. The proposed system is designed to classify the 41 static hand signs shown in Fig. 7. The image size is 128 × 128 (P = Q = 128), in RGB color format. For the conversion from RGB to binary, the threshold ρ = 150 is used.
Fig. 5. Block diagram of the neuron (for class s, each of the M stages m = 1, . . . , M applies D range check (RC) circuits to x1, . . . , xD; the RC outputs are summed and weighted by w_m to give E_m^(s), and the E_m^(s) are summed to give E^(s))

Fig. 6. Range check circuit (two comparators test μ_n^(s) + α_m · σ_n^(s) > x_n and x_n > μ_n^(s) − α_m · σ_n^(s))
Each class consists of 100 images; thus 4,100 images are used for the simulation. For each class, 50 images are used for training and the remaining 50 images are used for the recognition test. During the recognition test, the positions of the hand images are intentionally shifted from the center, as shown in Fig. 8. In the following simulations, the 11 lowest frequency components are taken from FH(n) and FV(n), respectively, so a 22-dimensional feature vector is fed to the network. In this case the dimension of the vector is reduced from 16,384 (128 × 128) to 22.

4.1 Recognition Rate and Circuit Size
Four classifiers with the same structure shown in Fig. 4, using the different types of neurons listed below, are tested.
Fig. 7. 41 static Japanese hand signs to be recognized
Fig. 8. Hand postures in different positions
1. the proposed neuron, which calculates eq. (8) with M = 1 and M = 2,
2. a neuron calculating the Manhattan distance,
3. a neuron calculating the Gaussian function.

The parameters used for the proposed classifier are given in Table 1. The Manhattan distance and the Gaussian function are given by the following equations:

E_M^(s) = Σ_{i=1}^{D} |x_i − c_i^(s)|   (9)

E_G^(s) = exp(−Σ_{i=1}^{D} (x_i − c_i^(s))² / (2σ²))   (10)
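For reference, the two comparison measures can be sketched as below; the class prototype c and the width σ are assumptions of this sketch. Note the difference in direction: eq. (9) is a distance (smaller is closer, hence the minimum value finder mentioned next), while eq. (10) is a similarity (larger is closer).

```python
import numpy as np

def manhattan_score(x, c):
    """Eq. (9): Manhattan distance to the class prototype c."""
    return np.sum(np.abs(x - c))

def gaussian_score(x, c, sigma=1.0):
    """Eq. (10): Gaussian of the squared Euclidean distance to c."""
    return np.exp(-np.sum((x - c) ** 2) / (2 * sigma ** 2))
```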
Networks with the Manhattan neuron use a minimum value finder circuit instead of the maximum value finder. The recognition rates of the four types of network are obtained by simulations written in C. In addition, the proposed network and the network with the Manhattan neuron are described in VHDL, and their circuit sizes are evaluated with the Xilinx tool. The recognition rates and the circuit sizes of the classifiers are summarized in Table 2.
Table 1. Classifier parameters

Neuron type   α                   weight
M = 1         α1 = 1              w1 = 1
M = 2         α1 = 1, α2 = 0.8    w1 = 1, w2 = 3
Table 2. Recognition rate and circuit size comparison

Neuron type           Recognition rate   Equivalent gate count
Range check (M = 1)   88.78%             2,464
Range check (M = 2)   93.61%             4,939
Manhattan distance    93.76%             6,043
Gaussian              81.50%             -
From the recognition point of view, the recognition rate of the new classifier with M = 2 is comparable to that of the classifier with the Manhattan distance neuron. As expected, the performance of the proposed classifier network improves as the number of range tests per vector element increases: the classifier with M = 2 is about 5% better than that with M = 1. The recognition rate of the classifier with the Gaussian function neuron is inferior to those of the other neuron types; to improve its performance, more tuning of the function and the structure of the network would be necessary. Regarding the hardware cost, the required hardware size of the proposed network is doubled by increasing M from 1 to 2, but it is still smaller than that of the classifier with the Manhattan distance. Due to the complex function required by the Gaussian neuron, the circuit cost of the Gaussian network is easily expected to be much higher than those of the other two networks.
5 Conclusions
This paper has described a Japanese hand sign recognition system with a new classifier network. The input image of the hand sign is preprocessed through horizontal/vertical projection followed by DFTs that calculate the magnitude spectrum of the projection data. The magnitude spectrum data is used as the feature vector, so that the system is robust against changes in the position of the hand image. The feature vector is classified by the classifier network. As the neuron used in the network can be realized with simple functions, the network is suitable for hardware implementation. The simulation results show that the average recognition rate of the system is 93.7% even though the hand positions are changed randomly. It has also been verified that the hardware cost of the proposed classifier network is smaller
than that of the network using the Manhattan distance measure, while providing a comparable recognition rate. The hardware implementation of the whole system is left for further research. In particular, the development of a simple circuit that can substitute for the DFT used in this system is underway, in order to reduce the overall hardware cost. A hardware implementation of the proposed architecture will make it possible to recognize hand postures in real time. In addition, the hardware system can be extended to recognize dynamic actions, i.e., sequences of hand postures.

Acknowledgements. This work was supported by KAKENHI, Grant-in-Aid for Scientific Research (C) (19500153) from the Japan Society for the Promotion of Science (JSPS).
References

1. Pavlovic, V.I., Sharma, R., Huang, T.S.: Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Trans. Pattern Analysis and Machine Intelligence 19, 677–695 (1997)
2. Triesch, J., von der Malsburg, C.: A System for Person-Independent Hand Posture Recognition against Complex Backgrounds. IEEE Trans. Pattern Analysis and Machine Intelligence 23(12), 1449–1453 (2001)
3. Hoshino, K., Tanimoto, T.: Realtime Hand Posture Estimation with Self-Organizing Map for Stable Robot Control. IEICE Trans. on Information and Systems E89-D(6), 1813–1819 (2006)
4. Bobick, A.F., Wilson, A.D.: A state-based approach to the representation and recognition of gesture. IEEE Trans. Pattern Analysis and Machine Intelligence 19(12), 1325–1337 (1997)
5. Yang, H.-H., Ahuja, N., Tabb, M.: Extraction of 2D motion trajectories and its application to hand gesture recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 24(8), 1061–1074 (2002)
6. Chu, F., Wang, L.: Applying RBF Neural Network to Cancer Classification Based on Gene Expressions. In: Proc. IJCNN 2006, pp. 3888–3892 (2006)
7. Chen, S., Harris, C.J., Hong, X.: Construction of RBF Classifiers with Tunable Units Using Orthogonal Forward Selection Based on Leave-One-Out Misclassification Rate. In: Proc. IJCNN 2006, pp. 6390–6394 (2006)
8. Bors, A.G., Gabbouj, G.: Minimal topology for a radial basis function neural network for pattern classification. Digital Signal Processing: a review journal 2(2), 302–309 (1994)
9. Matsubara, S., Hikawa, H.: Hardware Friendly Vector Quantization Algorithm. In: Proc. IEEE ISCAS 2005, pp. 3623–3626 (2005)
An Image Warping Method for Temporal Subtraction Images Employing Smoothing of Shift Vectors on MDCT Images Yoshinori Itai1, Hyoungseop Kim1, Seiji Ishikawa1, Shigehiko Katsuragawa2, Takayuki Ishida3, Ikuo Kawashita3, Kazuo Awai2, and Kunio Doi4 1
Kyushu Institute of Technology, Japan [email protected] 2 Kumamoto University, Japan 3 Hiroshima International University, Japan 4 The University of Chicago, USA
Abstract. Detection of subtle lesions on computed tomography (CT) images is a difficult task for radiologists, because subtle lesions such as small lung nodules tend to be of low contrast, and a large number of CT images must be interpreted in a limited time. A temporal subtraction image, which is obtained by subtraction of a previous image from a current one, can be used for enhancing interval changes (such as shapes of new lesions and the interval changes in existing abnormalities) on medical images by removing most of the normal structures. For detection of lesions in chest radiographs, the temporal subtraction technique has been applied successfully to clinical cases. In this paper, we propose a new method for removing the artifacts in temporal subtraction images obtained from chest multiple-detector CT (MDCT) images by smoothing the shift vectors. With our method, a more accurate correspondence of normal structures between current and previous images is determined. The new method was applied to 6 clinical cases of MDCT images. Preliminary results indicated that interval changes on the subtraction images were enhanced considerably, with reduced misregistration artifacts. Keywords: Temporal subtraction method, Lung nodule, CT image, Image warping.
1 Introduction

Detection of subtle lesions on CT images is a difficult task for radiologists, because subtle lesions such as small lung nodules tend to be of low contrast, and a large number of CT images must be interpreted in a limited time. A temporal subtraction image, which is obtained by subtraction of a previous image from a current one, can be used for enhancing interval changes (such as shapes of new lesions and the interval changes in existing abnormalities) on medical images by removing most of the normal structures. For detection of lesions in chest radiographs, the temporal subtraction technique has been applied successfully to clinical cases [1]. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 993–1001, 2008. © Springer-Verlag Berlin Heidelberg 2008
In the temporal subtraction technique, it is important to employ an image warping technique that accurately deforms the previous image to match the current image. If the warping is incorrect, normal structures will produce artifacts in the subtraction image, and the image quality can be degraded. Many warping techniques for two-dimensional (2-D) images in chest radiographs have been developed in the field of computer-aided diagnosis (CAD) [1-7]. In temporal subtraction with the MDCT volume images used in thoracic examination, it is necessary to employ a more complex and accurate three-dimensional (3-D) registration and warping of the lung regions between the current and previous images [8, 9]. In this paper, we propose a new method for removing the artifacts in temporal subtraction images obtained from MDCT images by smoothing the shift vectors, which indicate the extent of deformation/warping. With our method, a more accurate correspondence of normal structures between current and previous images is determined. The new method was applied to 6 clinical cases of MDCT images. Preliminary results indicated that interval changes on the subtraction images were enhanced considerably, with reduced misregistration artifacts.
2 Overall Scheme of the Temporal Subtraction Method

For a temporal subtraction method, registration between the current and previous images is the most important task, because the image quality of the subtraction image would be degraded by artifacts caused by incorrect registration. In our temporal subtraction method, the registration is achieved by warping the previous image to match the current image. For the warping, shift vectors, which indicate the extent of deformation/warping of the previous image, are obtained at all voxels in the current image. In our method, the voxel value in the temporal subtraction image is represented as

f_{sub}(\mathbf{x}) = f_{cur}(\mathbf{x}) - f_{pre}\bigl(\mathbf{x} + \mathbf{T}(\mathbf{x})\bigr), \qquad \mathbf{T}(\mathbf{x}) = \bigl[\, T_x(\mathbf{x}) \;\; T_y(\mathbf{x}) \;\; T_z(\mathbf{x}) \,\bigr]^{T}, \qquad (1)

where f_cur(x), f_pre(x) and f_sub(x) denote the voxel values of the current, previous and temporal subtraction images at x = (x, y, z), respectively. In equation (1), T(x) is the shift vector, which is obtained in the matching process. The overall scheme of our 3-D temporal subtraction is illustrated in Fig. 1. In our method, we first adjust the sizes of the voxels in the two images, which differ because of differences in the field of view (FOV), by enlarging or minifying the previous image. In the next step, the shift vectors are obtained by the global, local and 3-D elastic matching techniques for the warping process. With these matching techniques, a shift vector is determined only at the center of each VOI (volume of interest); the VOIs are located automatically in a segmented lung region of the current image, equally spaced along the x, y and z axes, to reduce the processing time. Because it is hard to obtain optimal shift vectors at the centers of VOIs at the edge of the lung, these shift vectors are obtained by tri-linear interpolation from the shift vectors determined in neighboring VOIs. The segmentation of the lung is achieved by a thresholding method and a morphological filter [8]. The distance between adjacent VOIs is half the size of the template VOI in each axis direction.

Fig. 1. The overall scheme of 3-D temporal subtraction: normalization of the voxel size; global image matching by using a 2-D cross-correlation technique; local image matching by using a 3-D cross-correlation technique; smoothing of shift vectors by using a 3-D elastic matching technique; 3-D image warping; 3-D image subtraction.

The adjacent VOIs thus overlap by one half, so as to share the same image structure. The sizes and the distances were determined empirically. With the shift vectors obtained, the shift vector at every voxel in the current image is determined by use of tri-linear interpolation. An initial processing step performs a global matching to correct for the global displacement caused by variation in patient positioning. For accurate registration, a local matching based on 3-D cross-correlation is then applied. After these matching steps, a 3-D elastic matching technique is employed for smoothing the obtained shift vectors. Finally, the warped image based on the shift vectors is subtracted from the current image, thus providing a temporal subtraction image.

2.1 Global Matching by Use of a 2-D Template Matching Technique

A global shift value, which can correct for global temporal displacement caused by patient positioning, is determined for each of the center voxels of all VOIs. In this matching process, 2-D template matching based on a 2-D cross-correlation method is employed [8], as shown in Fig. 2. First, blurred images are obtained from the previous and current images by employing a Gaussian filter, and the matrix size is reduced to a quarter in the x-y plane to reduce the computational time and to match only the large structures. In each 2-D image including the center voxel of a VOI, a rectangular region including the lung area is determined as a template image. The template image is moved over the corresponding 2-D image of the previous image, in order to find the maximum of the 2-D cross-correlation value, which indicates the similarity between the current and previous images. The correlation value is given by
C_{cc} = \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \frac{\bigl( f_{cur}(\mathbf{x}) - \bar{f}_{cur} \bigr)\bigl( f_{pre}(\mathbf{x}) - \bar{f}_{pre} \bigr)}{\sigma_{cur}\, \sigma_{pre}}, \qquad (2)

\bar{f}_{cur} = \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} f_{cur}(\mathbf{x}), \qquad \bar{f}_{pre} = \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} f_{pre}(\mathbf{x}).
Here, X and Y are the numbers of rows and columns of the template image, respectively. In equation (2), σ_cur and σ_pre are the standard deviations of the template and the search region, respectively.

2.2 Local Matching by Use of a 3-D Template Matching in VOIs

In order to achieve high matching accuracy between the current and the previous images, a local matching technique based on 3-D template matching is employed for each VOI in the current image [9]. In the local matching, search region VOIs corresponding to the template VOIs are established in the previous image.

Fig. 2. Global matching: a template image taken from the current CT images is matched against the previous CT images by template matching.

Fig. 3. Local matching: a template VOI is matched within a search region VOI to find the best matched image and the local shift vector.
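A direct (unoptimized) implementation of the normalized cross-correlation matching of equation (2) might look as follows; the function and variable names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def ncc(template, window):
    # Normalized cross-correlation of equation (2) between a template
    # and an equally sized window of the search image.
    t = template - template.mean()
    w = window - window.mean()
    denom = template.std() * window.std()
    if denom == 0.0:            # flat region: correlation undefined
        return 0.0
    return float((t * w).mean() / denom)

def global_match(template, search):
    # Slide the template over the search image and return the offset
    # (row, col) with the maximum correlation, as in the global matching.
    h, w = template.shape
    best_c, best_off = -np.inf, (0, 0)
    for r in range(search.shape[0] - h + 1):
        for c in range(search.shape[1] - w + 1):
            v = ncc(template, search[r:r + h, c:c + w])
            if v > best_c:
                best_c, best_off = v, (r, c)
    return best_off, best_c
```

In the paper the same correlation is extended to three dimensions (equation (3)) for the local VOI matching; only a third summation axis is added.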
The corresponding search region VOIs are selected from the previous image by using the global shift vector. For each VOI pair, 3-D template matching is performed to obtain the local shift vector. The template VOI is moved over the 3-D previous image in order to find the maximum of the 3-D cross-correlation value, as shown in Fig. 3. The 3-D cross-correlation value is given by

C_{cc} = \frac{1}{XYZ} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \sum_{z=0}^{Z-1} \frac{\bigl( f_{cur}(\mathbf{x}) - \bar{f}_{cur} \bigr)\bigl( f_{pre}(\mathbf{x}) - \bar{f}_{pre} \bigr)}{\sigma_{cur}\, \sigma_{pre}}, \qquad (3)

\bar{f}_{cur} = \frac{1}{XYZ} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \sum_{z=0}^{Z-1} f_{cur}(\mathbf{x}), \qquad \bar{f}_{pre} = \frac{1}{XYZ} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \sum_{z=0}^{Z-1} f_{pre}(\mathbf{x}).
Here, X, Y and Z are the dimensions of the template VOI. In equation (3), σ_cur and σ_pre are the standard deviations of the template and the search region VOI, respectively.

2.3 Smoothing of Shift Vectors by Use of a 3-D Elastic Matching Technique

The shift vector gives the correspondence between two voxels in the current and the previous images, as mentioned above. The shift vector should be determined based on the similarity (cross-correlation) between the two images to provide a proper correspondence in the local matching as well as the global matching [8]. Although the most similar match can be obtained through this scheme, the subsequent warping process would not necessarily be successful: the shift vector having the largest similarity may not be the true one once the other shift vectors are taken into consideration. For example, a shift vector with a cross-correlation value of 0.891 may be the true one rather than another shift vector with a cross-correlation value of 0.897, if the shift vectors in the adjacent VOIs are similar to the former. Therefore, the shift vector should be determined based not only on the similarity between the two images but also on the consistency among neighboring shift vectors. In other words, shift vectors whose size and direction differ considerably from the adjacent shift vectors are smoothed (Fig. 4). Fig. 4(a) shows that, without the elastic matching technique, some shift vectors are mismatched. To overcome this problem, in our matching process the shift vectors are obtained by use of a 3-D elastic matching method. With a 2-D elastic matching method, it is possible to obtain local shift vectors that preserve both a high cross-correlation coefficient and high consistency with the other local shift vectors, as Li et al. [7] have shown. We employed a 3-D elastic matching technique to deal with the shift vectors in 3-D space.
In the elastic matching method, the shift vectors are obtained by minimizing a cost function E that is a weighted sum of an internal energy E_int(x) and an external energy E_ext(x):

E = \sum_{\mathbf{x} \in V} \bigl( E_{ext}(\mathbf{x}) + \omega_1 E_{int}(\mathbf{x}) \bigr), \qquad (4)
where V is the set of center voxels of the VOIs in the current image. Here, the internal energy is given by squared sums of the first- and second-order differences of the shift vectors:
E_{int}(\mathbf{x}) = \omega_2\, a(\mathbf{x}) + b(\mathbf{x}), \qquad
a(\mathbf{x}) = \|\mathbf{T}(\mathbf{x})_x\|^2 + \|\mathbf{T}(\mathbf{x})_y\|^2 + \|\mathbf{T}(\mathbf{x})_z\|^2, \qquad
b(\mathbf{x}) = \|\mathbf{T}(\mathbf{x})_{xx}\|^2 + \|\mathbf{T}(\mathbf{x})_{yy}\|^2 + \|\mathbf{T}(\mathbf{x})_{zz}\|^2, \qquad (5)

where

\mathbf{T}(\mathbf{x})_x = \mathbf{T}(x + X, y, z) - \mathbf{T}(x, y, z), \qquad
\mathbf{T}(\mathbf{x})_y = \mathbf{T}(x, y + Y, z) - \mathbf{T}(x, y, z), \qquad
\mathbf{T}(\mathbf{x})_z = \mathbf{T}(x, y, z + Z) - \mathbf{T}(x, y, z). \qquad (6)
Here, the smoother the shift vectors, the smaller the internal energy. On the other hand, the external energy is equal to the negative of the 3-D cross-correlation value in equation (3), which is obtained for each 3-D template VOI.
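As a concrete illustration of this energy-minimizing update, the following sketch smooths a grid of shift vectors by greedily choosing, for each VOI center, the candidate shift that minimizes E_ext + ω1·E_int. It is a simplified, assumption-laden version of the scheme: only the first-order part of the internal energy of equations (5)-(6) is kept, and the external energy is supplied by the caller as a function rather than computed from the cross-correlation of equation (3).

```python
import itertools
import numpy as np

def smoothness(T, idx, v, w2=0.6):
    # First-order part of the internal energy (cf. eqs. (5)-(6)):
    # squared forward differences from candidate vector v to the
    # neighboring grid vectors; the second-order term is omitted.
    e = 0.0
    for ax in range(3):
        nxt = list(idx)
        nxt[ax] += 1
        if nxt[ax] < T.shape[ax]:
            e += w2 * float(np.sum((T[tuple(nxt)] - v) ** 2))
    return e

def smooth_shift_vectors(T, e_ext, w1=0.001, n=1, alpha=5.0, iters=10):
    # T     : shift-vector grid, shape (nx, ny, nz, 3), updated in place.
    # e_ext : callable (idx, v) -> external energy of candidate shift v
    #         at grid node idx (negative cross-correlation in the paper).
    # Each vector moves to the candidate within +/- n voxels that
    # minimizes e_ext + w1 * smoothness; sweeps repeat until no more
    # than alpha percent of the vectors change, as in Section 2.3.
    offsets = [np.array(o, float)
               for o in itertools.product(range(-n, n + 1), repeat=3)]
    nodes = list(itertools.product(*(range(s) for s in T.shape[:3])))
    for _ in range(iters):
        changed = 0
        for idx in nodes:
            old = T[idx].copy()
            cands = [old + o for o in offsets]
            costs = [e_ext(idx, v) + w1 * smoothness(T, idx, v)
                     for v in cands]
            T[idx] = cands[int(np.argmin(costs))]
            changed += int(not np.allclose(T[idx], old))
        if changed * 100.0 <= alpha * len(nodes):
            break
    return T
```

With a constant external-energy landscape favoring a single shift, every vector converges to that shift; in the full method `e_ext` would be derived from the 3-D cross-correlation of each candidate VOI position.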
Fig. 4. Appearance of shift vectors: (a) without 3-D elastic matching and (b) with 3-D elastic matching.
The shift vectors are then updated one by one, each being replaced by the vector with the minimum energy within an (N x N x N) area around the point indicated by the vector of the previous step. This procedure is applied to the shift vectors of all template VOIs, and it is repeated over the entire 3-D lung region until no more than α% of the shift vectors in all of the VOIs are updated. When the final shift vectors have been obtained at the center points of all VOIs, the shift vectors at all voxels in the current image are obtained by use of 3-D tri-linear interpolation.
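Once per-voxel shift vectors are available, the warping and subtraction of equation (1) reduce to a gather operation. The sketch below is a minimal NumPy illustration: nearest-voxel lookup with boundary clamping stands in for the paper's tri-linear interpolation, and the array names are assumptions.

```python
import numpy as np

def temporal_subtraction(f_cur, f_pre, T):
    # f_sub(x) = f_cur(x) - f_pre(x + T(x))  -- cf. equation (1).
    # f_cur, f_pre : 3-D volumes (current / previous CT).
    # T            : shift vectors, shape (3,) + f_cur.shape, already
    #                interpolated to every voxel; T[0], T[1], T[2] are
    #                the x-, y- and z-components.
    zz, yy, xx = np.meshgrid(*(np.arange(s) for s in f_cur.shape),
                             indexing="ij")
    # Displaced coordinates, rounded to the nearest voxel and clamped
    # to the volume boundary (the paper uses tri-linear interpolation).
    iz = np.clip(zz + np.rint(T[2]).astype(int), 0, f_pre.shape[0] - 1)
    iy = np.clip(yy + np.rint(T[1]).astype(int), 0, f_pre.shape[1] - 1)
    ix = np.clip(xx + np.rint(T[0]).astype(int), 0, f_pre.shape[2] - 1)
    return f_cur - f_pre[iz, iy, ix]
```

When the previous volume shifted by T exactly reproduces the current one, the subtraction vanishes in the interior, leaving only genuine interval changes.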
3 Results

We used MDCT images obtained with a multi-slice CT scanner (Light Speed QXi, GE, Milwaukee, USA) with four detector rows to apply the temporal subtraction technique to volume data. The matrix size of each slice image was 512 x 512, and the voxel size was 0.67 mm on the x and y axes and 5.00 mm on the z axis. In order to increase the resolution in the z direction, interpolated images between the slice images were employed. We applied this temporal subtraction method to 6 chest MDCT cases. The volume data used in this study consisted of 2 normal and 4 abnormal cases with lung nodules. In our experiment, the sizes of the template VOI and the search region VOI were 32 x 32 x 16 and 64 x 64 x 64, respectively. The size of the Gaussian kernel was 15 x 15. In the 3-D elastic matching, the size of the update region N was 1, and the weights ω1 and ω2 were 0.001 and 0.6, respectively. The update threshold α of Section 2.3 was 5. These parameters were determined empirically.

Fig. 5 shows one of the results of our temporal subtraction technique. Fig. 5(a) and (b) show a previous and a current CT image for one case, and Fig. 5(c) and (d) show the subtraction images without and with smoothing of the shift vectors. In the subtraction image with smoothing, shown in Fig. 5(d), the two lung nodules are enhanced clearly by the removal of normal structures such as the large vessels (arrows). Furthermore, the entire subtraction image was improved for all cases by our new method in comparison with the method without smoothing of the shift vectors.

Fig. 5. Comparison of 3-D MDCT images including nodules only on the current image: (a) previous CT image, (b) current CT image with nodules, (c) subtraction image without smoothing, (d) subtraction image with smoothing.
4 Discussion and Conclusion

We have developed a temporal subtraction method for MDCT images with smoothing of shift vectors. Our results demonstrated several benefits of the method. In particular, artifacts caused by displacement of normal structures such as blood vessels and chest walls were reduced by smoothing the shift vectors, as shown in Fig. 5. In our method, several parameters need to be determined properly. For more accurate registration of the current and previous images, some parameters might be made variable rather than fixed. For example, the size of the template VOI might be reduced or enlarged when small or large normal structures are included in it, and the weights in equation (5) might vary with the location in the lung region. Such optimization of the parameters might lead to more accurate deformation; however, a more complicated process would be needed. With the temporal subtraction images obtained by our method, it would be possible to develop a CAD system for detection of newly developed abnormalities. Generally, normal structures are similar to abnormalities in features such as shape and intensity; in our temporal subtraction images, most of the normal structures were removed. On the other hand, some problems remain. One is that some artifacts remain on the temporal subtraction image. These artifacts would be caused by subtle differences of intensity or shape between the current and previous images. To overcome this problem, registration at the voxel level, the element of the 3-D image, would be needed. In conclusion, we have developed a computerized method for providing temporal subtraction images of chest MDCT images. We believe that the temporal subtraction images would be useful for radiologists in detecting interval changes on MDCT images.
References 1. Kakeda, S., Nakamura, K., Kamada, K., Watanebe, H., Nakata, H., Katsuragawa, S., Doi, K.: Improved detection of lung nodules by using a temporal subtraction technique. Radiology 224(1), 145–151 (2002) 2. Kano, A., Doi, K., MacMahon, H., Hassell, D.D., Giger, M.L.: Digital image subtraction of temporally sequential chest images for detection of interval change. Med. Phys. 21(3), 445–461 (1994)
3. Ishida, T., Ashizawa, K., Engelmann, R., Katsuragawa, S., MacMahon, H., Doi, K.: Application of temporal subtraction for detection of interval changes on chest radiographs: Improvement of subtraction images using automated initial image matching. J. Digital Imag. 12(2), 77–86 (1999) 4. Ishida, T., Katsuragawa, S., Nakamura, K., MacMahon, H., Doi, K.: Iterative image warping technique for temporal subtraction of sequential chest radiographs to detect interval change. Med. Phys. 26(7), 1320–1329 (1999) 5. Katsuragawa, S., Tagashira, H., Li, Q., MacMahon, H., Doi, K.: Comparison of temporal subtraction images obtained with manual and automated methods of digital chest radiography. J. Digital Imag. 12(4), 166–172 (1999) 6. Difazio, M.C., MacMahon, H., Xu, X.W., Tsai, P., Shiraishi, J., Aramato III, S.G., Doi, K.: Digital chest radiography: Effect of temporal subtraction images on detection accuracy. Radiology 202(2), 447–452 (1997) 7. Li, Q., Katsuragawa, S., Doi, K.: Improved contralateral subtraction images by use of elastic matching technique. Med. Phys. 27(8), 1934–1943 (2000) 8. Ishida, T., Katsuragawa, S., Abe, H., Ashizawa, K., Doi, K.: Development of 3D CT temporal subtraction based on nonlinear 3D image warping technique. In: Proc. 91st Radiological Society of North America (RSNA), Chicago, USA, vol. 111 (2005) 9. Itai, Y., Kim, H., Ishikawa, S., Ishida, T., Kawashita, I., Awai, K., Li, Q., Doi, K.: 3D elastic matching for temporal subtraction employing thorax MDCT image. In: Proc. of the World Congress on Medical Physics and Biomedical Engineering, pp. 2181–2191 (2006)
Conflicting Visual and Proprioceptive Reflex Responses During Reaching Movements David W. Franklin1,2,3, Udell So3, Rieko Osu2,3, and Mitsuo Kawato3 1
Department of Engineering, University of Cambridge, Cambridge, United Kingdom 2 National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan 3 ATR Computational Neuroscience Laboratories, Keihanna Science City, Kyoto, Japan
Abstract. The visually and mechanically induced corrective motor responses of hand position during reaching movements were investigated. Subjects performed reaching movements on a robotic manipulandum, with the hand position presented to the subjects by means of a projected display. On random reaching trials the subject's hand position was mechanically perturbed relative to the predicted hand trajectory. The visual representation of the hand position was either perturbed in the same direction as the hand, mirrored relative to the hand, or not perturbed at all. The visually induced reflexive responses were still elicited after a mechanical perturbation, whether or not the information agreed with the mechanical perturbation. The visually induced reflexes contributed to limb stiffness from 200 ms after the perturbation onset. If the visual and mechanical errors were consistent, the restoring force to a perturbation (or the effective stiffness) was increased at long latencies. The results suggest that on short time scales, error signals from different sensory modalities are processed separately and combined only at the output stage. Keywords: Motor control, reaching movements, stretch reflex, on-line control, electromyography, vision, hand movements, contributions to limb stiffness.
1 Introduction Humans are skilled at making reaching movements and interacting with the environment. We are able to do so even in the presence of perturbations. Several levels of feedback mechanisms act at different delays in order to ensure that the arm compensates appropriately for the disturbance and is able to reach the intended target. In response to a perturbation of the hand, an immediate response is produced by the intrinsic stiffness of the muscle and limb [1]. This response produces a restoring force which may not be directed exactly towards the original hand location. However, the size and orientation of the stiffness of the limb can be tuned to the environment, producing different levels and directions of responses with learning [2-4]. This immediate response is followed by several feedback responses at varying delays. These are both visual and proprioceptive in nature. At a short delay, the short latency reflex responds to either a stretch or shortening of a muscle with an increase (excitation) or decrease (inhibition) of the same muscle’s activation level [5]. However, it has been M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1002–1011, 2008. © Springer-Verlag Berlin Heidelberg 2008
shown that this simple description of the short latency stretch reflex does not always hold true in a multi-joint arm, which can show excitation in muscles shortened by the perturbation [6,7]. The general finding is that the short latency reflex counteracts the kinematic error at the local joint. However, at a slightly longer latency, the long loop stretch reflex will respond to the perturbation. This reflex, which acts through the cortex [8], produces a more complex reaction to a perturbation of a multi-joint limb [6,9], with a coordinated response across joints such that the overall stability of the limb is maintained. More recent studies have started to examine the role of visual feedback of the hand in responding to perturbations [10-12]. In these studies it has been shown that the visual system will correct for an error in either the position or the direction of the hand relative to the target [11,12]. These responses act at a delay approaching 150 ms from the onset of the perturbation. In these previous studies, the visually induced feedback of the hand trajectory was studied without contributions from proprioceptive feedback, in order to characterize its properties. We wished to understand how visual and proprioceptive reflexes could work together to produce a combined response to a perturbation. It is also of interest to examine conditions in which the two feedback mechanisms convey opposite and conflicting information. How would the CNS respond to this situation? Would the presence of early proprioceptive sensory information in disagreement with the visual sensory information suppress the visual response? There are two possible ways in which the CNS could combine the different sensory information. In terms of Bayesian integration of sensory information, it would be expected that the two sensory error signals would be combined within the CNS to determine a single estimate of the error [13].
From this error information a single control network would then determine the appropriate motor command response. The other possibility is that the error information in the two sensory modalities is processed separately, each modality producing the appropriate motor command for its own error signal, with the commands combined only at the output stage. In order to examine the interaction between these two sensory modalities in responding to perturbations, visual and proprioceptive perturbations of the hand trajectory were applied during reaching movements.
2 Methods

Six healthy, right-handed subjects (2 female, 4 male) participated in the study. The experiment was approved by the institutional ethics committee, and subjects gave informed consent. Subjects performed the experiments over one day. On a day prior to the start of the experimental sessions, subjects were first asked to practice reaching movements in a null force field to accustom them to the equipment and the experimental protocol.

2.1 Apparatus

Subjects were seated in a chair with their shoulder held against the back of the chair by seatbelts to restrict trunk movement. The height of the chair was adjusted such that the workspace was at the subject's shoulder level. A custom-molded thermoplastic
cuff was used to restrict the subject's wrist motion, and a horizontal beam was further secured to the subject's forearm for support. The subject's arm, along with the cuff and the beam, was coupled to the parallel-link direct drive air-magnet floating manipulandum (PFM) (Fig. 1). Details of the PFM and setup can be found elsewhere [14]. Subjects performed point-to-point reaching movements in the horizontal plane from a start target to an end target located 31 cm and 56 cm, respectively, directly in front of the shoulder joint of the subject. Subjects were asked to perform each movement within 700 ± 100 ms. A computer monitor provided feedback to the subjects indicating whether the previous movement was successful or not. An opaque table above the workspace blocked the subject's view of the manipulandum and the subject's arm. The current position of the hand (0.5 cm diameter) and the start (1.5 cm diameter) and end target (2.5 cm diameter) circles were projected onto the surface of the table to provide subjects with visual feedback.

2.2 Mechanical Perturbations

During the reaching movements performed in this study, small mechanical perturbations of the hand were applied. These perturbations consisted of a smoothly varying 100 ms ramp away from the subject's predicted trajectory, a 200 ms hold phase where the hand position was held at a constant offset relative to the predicted location of the hand, and a 100 ms smooth ramp back to where the hand would have been had no perturbation been applied. The start of the perturbation occurred at 150 ms after the start of the movement. Descriptions of the prediction algorithm and its use in estimating the endpoint and joint stiffness of the hand have been published previously [2-4,15].

2.3 Visual Perturbations

During the reaching movements performed in this study, small visual perturbations of the hand were applied.
These perturbations consisted of a smoothly varying 100 ms ramp away from the subject's actual position, a 200 ms hold phase where the visual display of the hand position was held at a constant offset relative to the actual location of the hand, and a 100 ms smooth ramp back to the actual position. The start of the perturbation occurred at 150 ms after the start of the movement.

2.4 Joint Stiffness Estimation

Using the average force and displacement during the hold portion of the mechanical perturbation window, an estimate of the 2x2 endpoint stiffness matrix (K) was obtained by linear regression of the mean change in hand force and the mean change in position, as represented by the equation:
\begin{bmatrix} \Delta F_x \\ \Delta F_y \end{bmatrix} = \mathbf{K} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}. \qquad (1)
The joint stiffness (R) was calculated from the endpoint stiffness (K) using the relation:
\mathbf{R} = \begin{bmatrix} R_{ss} & R_{se} \\ R_{es} & R_{ee} \end{bmatrix} = \mathbf{J}^T \mathbf{K} \mathbf{J} + \frac{\partial \mathbf{J}^T}{\partial \boldsymbol{\theta}} \mathbf{F}, \qquad (2)
where J is the Jacobian, a matrix which represents the geometric transformation of small changes in joint angles to small changes in endpoint position, and F is the endpoint force [16]. An estimate of the joint stiffness was made every 10 ms throughout the hold phase of the mechanical perturbation, from 125 ms after the onset of the perturbation until 275 ms. For each estimate, the mean values of the appropriate measurements were calculated using the surrounding 20 ms of data. This produced an estimate of joint stiffness that changes in time during the movement.
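The regression of equation (1) can be sketched directly as a least-squares fit. The example below is an illustrative reconstruction; the variable names and the synthetic stiffness values are assumptions, not measured data:

```python
import numpy as np

def endpoint_stiffness(dx, dF):
    # Estimate the 2x2 endpoint stiffness matrix K of equation (1),
    # dF = K dx, by least squares over n perturbation trials.
    # dx, dF : arrays of shape (n, 2) -- mean changes in hand position
    #          and hand force during the perturbation hold phase.
    K_T, *_ = np.linalg.lstsq(dx, dF, rcond=None)  # solves dx @ K.T = dF
    return K_T.T

# Synthetic check: 8 mm perturbations in the four directions used in
# the experiment (45, 135, 225, 315 degrees) under a known K.
angles = np.deg2rad([45.0, 135.0, 225.0, 315.0])
dx = 0.008 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
K_true = np.array([[300.0, 50.0], [40.0, 200.0]])  # assumed values, N/m
dF = dx @ K_true.T
K_est = endpoint_stiffness(dx, dF)
```

Equation (2) then maps this endpoint stiffness into joint space through the Jacobian.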
Fig. 1. The experimental set-up. A. The manipulandum. Subjects were attached to a 2-DOF planar manipulandum (PFM) by means of a rigid thermoplastic cast and made reaching movements directly in front of the shoulder joint. The targets and the subject's hand position were projected onto an opaque table just above the subject's arm. B. The mechanical perturbations. On random trials, while the subject was making a reaching movement to the target (black line), a mechanical perturbation was produced by the robotic manipulandum moving the hand a fixed distance (8 mm) away from the predicted hand trajectory (dotted light gray line). While the perturbation illustrated is in the 0° direction, perturbations were actually given in the four directions indicated by the arrows (45°, 135°, 225°, and 315°). C. The visual perturbations. When a given mechanical perturbation was applied, moving the hand along a perturbed trajectory (dotted thick gray line), one of three visual perturbations was given. The visual representation of the hand would either agree with the mechanical perturbation (2.0 cm in amplitude) (gray line), be a mirrored image of the mechanical perturbation (2.0 cm in amplitude) (black line), or show no perturbation (light gray line). Subjects were instructed not to react to any perturbation.
2.5 Electromyography On each experimental day, electromyographic activity of eight arm muscles was measured. Surface EMG was recorded from: pectoralis major, posterior deltoid, short
and long heads of the biceps brachii, brachioradialis, and the medial, lateral and long heads of the triceps. EMG was recorded using pairs of silver-silver chloride surface electrodes. The EMG signals were Butterworth band-pass filtered between 25 Hz and 1.0 kHz and sampled at 2.0 kHz.

2.6 Protocol

Subjects performed point-to-point reaching movements in one of two oppositely rotated curl force fields, performing movements in each force field on separate days. After a learning session to accustom subjects to the force field, subjects performed the test session: thirty pre-trials in the force field were followed by 340 trials (of which 170 contained perturbations and 170 contained no perturbations). Subjects were again instructed to make natural movements to the target and to ignore any perturbations. The perturbation trials contained a mixture of proprioceptive (mechanical) and visual perturbations of similar timing and shape. The mechanical perturbations were in one of four directions (45°, 135°, 225°, 315°) relative to the x-axis, with an amplitude of 8 mm. The visual perturbations would either show the actual mechanical perturbation enhanced to 2.0 cm amplitude (enhanced), no perturbation (no visual perturbation), or an opposite perturbation of 2.0 cm amplitude (mirrored). By examining the effects of visual and proprioceptive errors that agree or disagree to various degrees, we can examine how the two systems are combined to produce a response.
3 Results 3.1 Electromyography Subjects made reaching movements while attached to the robotic manipulandum. On random trials the hand was both mechanically and visually perturbed away from the hand’s normal trajectory to the target. The visual perception of the movement of the hand was enhanced, mirrored, or presented as no movement. The electromyographic activity of eight arm muscles were recorded and analyzed. The mean rectified muscle activation was smoothed and combined across subjects to examine changes in the reflex responses depending on the visual condition. Reflexes were analyzed in three different intervals roughly corresponding to the expected timing of three different reflex contributions. The first interval, the short latency interval, corresponding to the short latency stretch reflex response was examined from 50–100 ms from the onset of the perturbation. The second interval, the long latency interval, corresponding to the long latency stretch reflex response was examined from 100–150 ms from the onset of the perturbation. Finally the visuo-motor reflex interval, corresponding to the timing of the visually induced motor reflex was examined from 150-300 ms following the onset of the perturbation. No differences in the reflex responses across the visual conditions were seen in either the short or long latency intervals (Fig. 2). Only during the visuo-motor reflex interval was the muscle activation different for the three visual conditions. When compared to the reflexes elicited by the short latency reflex response, the enhanced
Conflicting Visual and Proprioceptive Reflex Responses
1007
Fig. 2. Responses in the Posterior Deltoid to mechanical perturbations of the hand in either the 135 degree direction (left) or 315 degree direction (right). The mean muscle activation in unperturbed trials for the same movement is shown with the dashed line. The mean responses in the three visual conditions associated with the mechanical stretch are shown: no visual perturbation (light gray); enhanced visual perturbation (gray); mirrored visual perturbation (black). The relative timings of the expected reflex contributions are illustrated with the shaded regions: short latency reflex interval (50–100 ms, dark gray); long latency reflex interval (100–150 ms, light gray); visuo-motor reflex interval (150–300 ms, medium gray). The mean rectified and smoothed responses are averaged across all subjects. The timing of the visual perturbation (onset, start of hold, end of hold, and offset) is shown with dotted lines. Time is shown relative to the onset of the movement (0 ms).
visual condition showed a reflex response in the same direction compared to the no visual response. In contrast, the mirrored visual condition showed the opposite response: an inhibition if the short latency response was excitatory, or an excitation if the short latency response was inhibitory. Despite the presence of mechanical perturbations which would unambiguously signal whether the limb had been moved and in which direction, the visual condition presented to the subject elicited motor reflexes which would have compensated for the expected change in the hand position in visual space. 3.2 Contributions to Joint Stiffness The change in the reflex responses produced by the different visual information presented to the subject in the three conditions can also be examined by looking at the change in the endpoint force. The endpoint force reflects the overall effect produced by the net responses of reflexes in all of the arm muscles. The mean force response produced in each of the three visual conditions was calculated. The perturbation response in the no visual information condition was subtracted from the forces recorded in the other two visual conditions (Fig. 3). Both the enhanced and mirrored visual conditions produced a clear change in the endpoint force. This effect occurred at a latency of 200 ms from the onset of the perturbations. The enhanced visual information increased the restoring force to bring the hand back to the planned trajectory. In the case of the mirrored condition, a similar level of force acted to bring the visual representation of the hand back to the desired trajectory even though the hand had been perturbed in the opposite direction, and therefore the effect was also opposite.
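The baseline subtraction behind Fig. 3 can be sketched as below. This is a minimal reconstruction; the condition labels and array layout are our own assumptions, not the authors' code.

```python
import numpy as np

def visual_condition_force_change(forces):
    """Change in endpoint force attributable to the visual condition.

    forces: dict mapping condition name ('none', 'enhanced', 'mirrored')
    to an (n_trials, n_samples) array of x-axis force traces.
    The mean trace of the no-visual-perturbation trials is subtracted
    from the mean trace of each of the other conditions.
    """
    baseline = forces["none"].mean(axis=0)
    return {cond: f.mean(axis=0) - baseline
            for cond, f in forces.items() if cond != "none"}
```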
1008
D.W. Franklin et al.
Fig. 3. The change in endpoint force (x-axis) due to the visual condition presented to the subject during the mechanical perturbation. The mean force during the no visual perturbation condition has been subtracted from either the enhanced visual condition (gray trace) or the mirrored visual condition (black trace). The zero force level is indicated with the black line. Timing of the visual perturbation (onset, start of hold, end of hold, and offset) are shown with dotted lines. Time is shown relative to the onset of the movement (0 ms). Each box of the figure contains the responses for a particular mechanical perturbation direction as indicated by the labels in the figure and the arrows.
The forces peaked at about 1 N and were equal in size (but opposite in direction) for both the enhanced and mirrored conditions. The changes in the endpoint force can also be examined in terms of the net response of the limb to a perturbation. The endpoint stiffness of the limb can be used to assess the restoring forces in response to a perturbation. In terms of the moving limb, because the endpoint stiffness is strongly affected by the changing limb geometry, it is easier to assess the contributions in terms of the joint stiffness. The joint stiffness of the limb was estimated using a moving 20 ms window every 10 ms throughout the perturbation. The results for movement in the counterclockwise curl force field are shown in Fig. 4. Both force fields showed similar responses. The visual conditions did not affect the estimates of joint stiffness prior to 200 ms following the perturbation. After this time, the enhanced visual perturbation produced a larger stiffness relative to the no-visual condition. On the other hand, the mirrored visual condition produced a smaller stiffness, which approached zero near the end of the intervals over which stiffness was estimated.
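The sliding-window joint stiffness estimate (20 ms window, stepped every 10 ms) can be sketched as a regression of perturbation-induced torque changes on joint-angle changes. This is a schematic of the general regression approach used for stiffness estimation [14,15], with our own data layout; it is not the authors' exact estimator.

```python
import numpy as np

def sliding_joint_stiffness(dq, dtau, fs=2000.0, win_ms=20.0, step_ms=10.0):
    """Estimate the 2x2 joint stiffness matrix R (d_tau = -R d_q) in sliding windows.

    dq:   (n_trials, n_samples, 2) perturbation-induced changes in
          shoulder/elbow angle (rad)
    dtau: (n_trials, n_samples, 2) corresponding changes in joint torque (Nm)
    Returns a list of (window start index, R) pairs, pooling samples
    across trials within each window.
    """
    win = int(win_ms * fs / 1000.0)    # 40 samples at 2 kHz
    step = int(step_ms * fs / 1000.0)  # 20 samples at 2 kHz
    out = []
    n_samples = dq.shape[1]
    for start in range(0, n_samples - win + 1, step):
        Q = dq[:, start:start + win, :].reshape(-1, 2)    # stacked angle changes
        T = dtau[:, start:start + win, :].reshape(-1, 2)  # stacked torque changes
        # Row form of d_tau = -R d_q is T = Q @ (-R.T); solve by least squares.
        X, *_ = np.linalg.lstsq(Q, T, rcond=None)
        out.append((start, -X.T))
    return out
```

Averaging over a 20 ms window acts as a low-pass filter on the stiffness estimate, which is relevant to the onset-latency comparison discussed later.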
[Fig. 4 panels: the four joint stiffness components (Rss, Rse, Res, Ree) in Nm/rad, each plotted against time from the start of the perturbation (100–300 ms).]
Fig. 4. Effect of visual feedback condition on the joint stiffness of the limb. An estimate of the joint stiffness of the arm was made every 10 ms throughout the hold phase of the mechanical perturbation. Each of the four components of the joint stiffness is shown in one of the above boxes. The no visual perturbation condition is shown with the light gray dots. The enhanced visual and the mirrored visual conditions are shown with the gray and black dots respectively.
4 Discussion Subjects made movements in one of two force fields. After adapting to the force fields, both visual and mechanical perturbations were applied in the middle of the reaching movements on random trials. The mechanical perturbations were in one of four directions, whereas the simultaneous visual signal either provided information about the hand location consistent with the mechanical perturbation, provided no visual information, or provided conflicting information. The visual perturbation signal produced a reflex response in the limb muscles which would have returned the hand to the planned trajectory had it been perturbed, even if the hand had been perturbed in the opposite direction. The visually induced motor reflex occurs at a longer latency (150 ms) compared to the proprioceptively induced stretch reflexes. The joint stiffness of the limb was estimated separately for each visual feedback condition. When the visual feedback agreed with the mechanical perturbation of the hand, the effective limb stiffness was increased over longer latencies (>200 ms after perturbation onset). When the visual feedback was oppositely directed to the mechanical perturbation the
overall stiffness was reduced at these latencies. The difference in onset time of the visual responses between electromyographic activity and stiffness is produced by a combination of two factors. First the force produced by muscle activation is delayed with respect to the electrical activity by approximately 25 ms [17]. Secondly, our estimate of stiffness uses an average of 20 ms data in order to compute a single measure of stiffness. This produces a low pass filtering effect on the stiffness measurement. The visually induced motor reflex responses produced by a shift in the visual hand position during movement occur at a latency of 150 ms from the onset of the perturbation. This latency, equal to that measured previously under similar conditions [10-12] is slightly slower than similar visually induced reflex responses from either a shift in the target location (110 ms) [18] or shift of the background (100 ms) [19]. Despite the presence of a mechanical perturbation which would indicate to the nervous system the actual movement of the hand relative to the normal trajectory, the visual response produced a similar size of response to the presence of a visual perturbation whether or not it agreed with the mechanical perturbation. This suggests that the visually induced reflex response and the proprioceptive stretch reflex response are processed separately in the brain, combined only at the output stage. Otherwise it would be expected that the combination of conflicting sensory signals would have produced less change in endpoint force than the combination of agreeing sensory signals. These results suggest that while the voluntary response to information in separate sensory modalities may be integrated in a Bayesian fashion [13], the fast reflex responses to these different sensory modalities are processed separately with no combination between the senses. 
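The low-pass effect of the 20 ms averaging can be illustrated directly: a step change smeared by a boxcar window only reaches its full amplitude about one window-length later. The boxcar demonstration below is our own illustration of this effect, not the authors' estimator.

```python
import numpy as np

fs = 2000.0                      # 2 kHz sampling, as in the recordings
win = int(0.020 * fs)            # 20 ms averaging window = 40 samples
step = np.zeros(400)
step[200:] = 1.0                 # abrupt change at sample 200 (t = 100 ms)
smoothed = np.convolve(step, np.ones(win) / win, mode="same")

# Midpoint of the smeared edge stays at the true step time, but full
# amplitude is only reached half a window (10 ms) later.
half_idx = int(np.argmax(smoothed >= 0.499))
full_idx = int(np.argmax(smoothed >= 0.999))
```

Together with the roughly 25 ms electromechanical delay [17], this accounts for the later apparent onset of the stiffness change relative to the EMG response.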
These results suggest that due to the increased processing time for combination across different sensory modalities, the Bayesian integration occurs only in longer voluntary corrections to error information. On the shorter reflex response scale, separate processing of the sensory modalities still predominates. The results indicate that the visual signal of hand position during movements also contributes to the restoring forces at a reflex latency if the hand is displaced. While the latency is somewhat longer than either the simple monosynaptic short latency stretch reflex or the more complex long latency stretch reflex responses, the response will contribute to the overall mechanical stiffness of the limb. These quick visual responses therefore can contribute to the stability of movements at a delay greater than 200 ms. They respond not only to errors in the position of the hand, but also to errors in the direction in which the hand is moving [11,12]. They are another way in which the CNS monitors and corrects for errors in the hand trajectory during movements. In combination with proprioceptive feedback, they provide a corrective response to ensure that the hand is able to reach the final target.
Acknowledgments We thank T. Yoshioka for his assistance in running the experiments as well as for programming the PFM. We also thank G. Liaw for his comments on a previous version of the manuscript. The experimental work was funded by NICT and partially supported by SCOPE. DWF is supported by a fellowship from NSERC, Canada.
References
1. Mussa-Ivaldi, F.A., Hogan, N., Bizzi, E.: Neural, mechanical and geometric factors subserving arm posture in humans. J. Neurosci. 5, 2732–2743 (1985)
2. Burdet, E., Osu, R., Franklin, D.W., Milner, T.E., Kawato, M.: The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature 414, 446–449 (2001)
3. Franklin, D.W., So, U., Kawato, M., Milner, T.E.: Impedance control balances stability with metabolically costly muscle activation. J. Neurophysiol. 92, 3097–3105 (2004)
4. Franklin, D.W., Liaw, G., Milner, T.E., Osu, R., Burdet, E., Kawato, M.: Endpoint stiffness of the arm is directionally tuned to instability in the environment. J. Neurosci. 27, 7705–7716 (2007)
5. Liddell, E.G.T., Sherrington, C.S.: Reflexes in response to stretch (myotatic reflexes). Proc. R. Soc. London Ser. B 96, 212–242 (1924)
6. Lacquaniti, F., Soechting, J.F.: Responses of mono- and bi-articular muscles to load perturbations of the human arm. Exp. Brain Res. 65, 135–144 (1986)
7. Lacquaniti, F., Maioli, C.: Anticipatory and reflex coactivation of antagonist muscles in catching. Brain Res. 406, 373–378 (1987)
8. Matthews, P.B.C.: The human stretch reflex and the motor cortex. Trends Neurosci. 14, 87–91 (1991)
9. Gielen, C.C.A.M., Ramaekers, L., van Zuylen, E.J.: Long latency stretch reflexes as coordinated functional responses in man. J. Physiol. 407, 275–292 (1988)
10. Saunders, J.A., Knill, D.C.: Humans use continuous visual feedback from the hand to control fast reaching movements. Exp. Brain Res. 152, 341–352 (2003)
11. Saunders, J.A., Knill, D.C.: Visual feedback control of hand movements. J. Neurosci. 24, 3223–3234 (2004)
12. Saunders, J.A., Knill, D.C.: Humans use continuous visual feedback from the hand to control both the direction and distance of pointing movements. Exp. Brain Res. 162, 458–473 (2005)
13. Kording, K.P., Wolpert, D.M.: Bayesian integration in sensorimotor learning. Nature 427, 244–247 (2004)
14. Gomi, H., Kawato, M.: Human arm stiffness and equilibrium-point trajectory during multijoint movement. Biol. Cybern. 76, 163–171 (1997)
15. Burdet, E., Osu, R., Franklin, D.W., Yoshioka, T., Milner, T.E., Kawato, M.: A method for measuring endpoint stiffness during multi-joint arm movements. J. Biomech. 33, 1705–1709 (2000)
16. McIntyre, J., Mussa-Ivaldi, F.A., Bizzi, E.: The control of stable postures in the multi-joint arm. Exp. Brain Res. 110, 248–264 (1996)
17. Ito, T., Murano, E.Z., Gomi, H.: Fast force-generation dynamics of human articulatory muscles. J. Appl. Physiol. 96, 2318–2324 (2004)
18. Brenner, E., Smeets, J.B.: Fast responses of the human hand to changes in target position. J. Mot. Behav. 29, 297–310 (1997)
19. Saijo, N., Murakami, I., Nishida, S., Gomi, H.: Large-field visual motion directly induces an involuntary rapid manual following response. J. Neurosci. 25, 4941–4951 (2005)
An Involuntary Muscular Response Induced by Perceived Visual Errors in Hand Position
David W. Franklin 1,2,3, Udell So 3, Rieko Osu 2,3, and Mitsuo Kawato 3
1 Department of Engineering, University of Cambridge, Cambridge, United Kingdom
2 National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan
3 ATR Computational Neuroscience Laboratories, Keihanna Science City, Kyoto, Japan
Abstract. Visually induced corrective motor responses to errors in hand position during reaching movements were investigated. Subjects performed reaching movements on a robotic manipulandum where the hand position was presented to the subjects by means of a projected display. On random reaching trials the projected hand position was perturbed relative to the actual hand position while the hand was constrained to the straight line to the target. Electromyographic activity of eight arm muscles was collected. A corrective muscular response starting at 150 ms from the onset of the visual perturbation was found. This response was found to be reflexive in nature and not suppressed by prior instruction. A second study found that the reflexive response was not modified by changes in the background muscle activity level or by the size of the perturbation. The results suggest that the visual system elicits simple motor reflexes in response to visual errors from the expected hand position. Keywords: Motor control, reaching movements, reflex, on-line control, electromyography, vision, hand movements, corrective response.
1 Introduction During reaching movements, the arm is guided by both visual and proprioceptive feedback. If the arm is disturbed on the way to the target, these feedback mechanisms ensure that the arm is returned to the planned path to the target. While the proprioceptive feedback mechanisms have been studied extensively since the time of Sherrington, it is only recently that the visual feedback mechanisms have begun to be studied in the context of motor control [1-3]. In these studies it has been shown that the visual system will correct for an error in either position or direction of the hand relative to the target [2,3]. Previous studies have shown reflex responses to the movement of either the target [4] or the entire background [5]. However, in the work examining the motor responses to shifts in the visual representation of the hand, it is not clear whether this response is reflexive in nature or due to voluntary correction. In order to test this, we conducted a series of experiments. The first experiment was designed to examine whether this visual motor response is reflexive in nature. This was done by determining the fastest possible timing of the voluntary response to the visual perturbation. The second experiment examined whether the response to the visual perturbation is dependent on the background activity of the muscles. In contrast to previous experiments, the hand was constrained to a straight line to the target during the visual perturbation. Similarly, the visual perturbation itself was returned to the actual hand trajectory before the end of the movement. Therefore no correction to the visual perturbation was necessary in order to perform the task.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1012–1020, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Methods Six healthy, right-handed subjects (2 female, 4 male) participated in the study. The experiment was approved by the institutional ethics committee and subjects gave informed consent. Subjects performed the experiments over four days (one day for experiment 1; three days for experiment 2). Prior to the start of the experimental sessions, subjects first spent a day making null field movements to accustom them to the equipment and the type of movements which they were required to perform during the experiments. 2.1 Apparatus Subjects were seated in a chair with their shoulder held against the back of the chair by seatbelts to restrict trunk movement. The height of the chair was adjusted such that the workspace was at the subject's shoulder level. A custom-molded thermoplastic cuff was used to restrict the subject's wrist motion and a horizontal beam was further secured to the subject's forearm for support. The subject's arm, along with the cuff and the beam, was coupled to the parallel-link direct drive air-magnet floating manipulandum (PFM) (Fig. 1a). Details of the PFM setup can be found elsewhere [6]. Subjects performed point-to-point reaching movements in the horizontal plane from a start target to an end target located respectively 31 cm and 56 cm directly in front of the shoulder joint of the subject. Subjects were asked to perform each movement within 700 ± 100 ms. A computer monitor provided feedback to the subjects indicating whether the previous movement was successful or not. An opaque table above the workspace blocked the subject's view of both the manipulandum and the subject's arm. The current position of the hand (0.5 cm diameter) and the start (1.5 cm diameter) and end target (2.5 cm diameter) circles were projected onto the surface of the table to provide subjects with visual feedback.
These perturbations consisted of a smoothly varying 100 ms ramp away from the subject’s actual position, a 200 ms hold phase where the visual display of the hand position was held at a constant offset relative to the actual location of the hand, and a 100 ms smooth ramp back to the actual position. The start of the perturbation occurred at 150 ms after the start of the movement. During the visual perturbations the hand was constrained to move along an unperturbed trajectory (similar to a mechanical channel) to the target so that it neither assisted nor hindered the subject to reach the final target. In order to limit the movement of the hand due to any
1014
D.W. Franklin et al.
corrective responses while still producing the smallest change in the trajectory, a prediction algorithm was used to predict the future movement of the hand based on previous movements. The hand trajectory was then constrained to this predicted trajectory. Details of the prediction algorithm have been previously published [7]. Any forces that the subject exerted during the visual perturbations were recorded against the channel wall. 2.3 Electromyography On each experimental day, electromyographic activity of eight arm muscles was measured. Surface EMG was recorded from: pectoralis major, posterior deltoid, short and long heads of the biceps brachii, brachioradialis, and the medial, lateral and long head of the triceps. EMG was recorded using pairs of silver-silver chloride surface electrodes. The EMG signals were Butterworth band-pass filtered between 25 Hz and 1.0 kHz and sampled at 2.0 kHz. 2.4 Experiment 1 Subjects performed point-to-point reaching movements in a null force field. After twenty pre-trials in the null force field, ninety test trials were performed. Of these ninety test trials, a random thirty constrained the subject’s movement while providing visual feedback about the hand trajectory. Of these thirty trials, ten contained no perturbations, the rest contained 2 cm perturbations either to the right or left, with ten trials each (Fig. 1b). When subjects perceived a visual perturbation, they were asked to make a voluntary movement in the direction of the perturbation as fast as possible. This experiment was designed to assess the fastest voluntary reaction time to the visual perturbation in order to determine if the visual responses are voluntary or involuntary. 2.5 Experiment 2 Subjects performed point-to-point reaching movements in a null force field and two oppositely rotated curl force fields. Each field was performed on a separate day and the order was randomized across subjects. 
These three force fields were used to create different levels of baseline muscle activation during the movements. The null field required minimal activation in all muscles, whereas the clockwise curl field (CF+) and counterclockwise curl field (CF-) preload the flexor and extensor muscles of the arm, respectively, during forward reaching movements.
CF+: \begin{bmatrix} F_x \\ F_y \end{bmatrix} = \begin{bmatrix} 0 & 16 \\ -16 & 0 \end{bmatrix} \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix}, \qquad CF-: \begin{bmatrix} F_x \\ F_y \end{bmatrix} = \begin{bmatrix} 0 & -16 \\ 16 & 0 \end{bmatrix} \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix}. \qquad (1)
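Eq. (1) can be evaluated directly. Curl fields of this form are conventionally velocity-dependent (gains in Ns/m), so the sketch below treats the state vector as hand velocity; that interpretation, and the variable names, are our assumptions.

```python
import numpy as np

# Curl-field gain matrices from Eq. (1)
B_CF_PLUS = np.array([[0.0, 16.0], [-16.0, 0.0]])   # clockwise field (CF+)
B_CF_MINUS = np.array([[0.0, -16.0], [16.0, 0.0]])  # counterclockwise field (CF-)

def curl_force(velocity, B):
    """Endpoint force (N) applied by the manipulandum for a hand velocity (vx, vy) in m/s."""
    return B @ np.asarray(velocity, dtype=float)
```

For a forward (positive y) movement, CF+ pushes the hand to the right and CF- to the left, which is what loads the flexor and extensor muscles differently in the two fields.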
Subjects initially adapted to the force field in which they were moving. After learning was completed, subjects made 100 movements in the same field, fifty of which contained visual perturbations, and fifty of which contained no perturbations. On the visually perturbed trials a predictive channel trial was imposed to prevent any kinematic errors. Five different visual perturbation trials were used: 1 cm and 2 cm
Fig. 1. The experimental set-up. A. The manipulandum. Subjects were attached to a 2 DOF planar manipulandum (PFM) by means of a rigid thermoplastic cast. They made reaching movements directly in front of the shoulder joint. The targets and the subject's hand position were projected onto an opaque table just above the subject's arm. B. Experiment 1. The subject made reaching movements in a null force field. On random trials the hand cursor was visually perturbed 2 cm to either side of the actual movement or along the movement itself while the hand was constrained to move along the undisturbed trajectory. Subjects were instructed to move as fast as possible in the direction of the visual perturbation as soon as they noticed the perturbation. C. Experiment 2. The subject made forward reaching movements in one of three different force fields: null field, clockwise curl field or counterclockwise curl field. On randomly selected trials a visual perturbation of the hand was made in one of two directions and two amplitudes (dark and light gray lines) or of zero amplitude (black line), while the hand was constrained to move along the direction to the final target (large dotted gray line). Subjects were instructed not to react to any perturbation.
amplitude perturbations to the right or left of the subject’s trajectory and a 0 cm perturbation for comparison (Fig. 1c). Subjects were instructed to make natural movements to the target and ignore any perturbations.
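The ramp-hold-ramp time course of the visual perturbation (Sect. 2.2: 100 ms ramp out, 200 ms hold, 100 ms ramp back, starting 150 ms into the movement) can be sketched as below. A half-cosine ramp is assumed for the "smoothly varying" segments, since the exact ramp shape is not specified in the text.

```python
import numpy as np

def visual_perturbation(t_ms, amp_cm, ramp_ms=100.0, hold_ms=200.0, onset_ms=150.0):
    """Offset of the displayed cursor from the true hand position (cm) at time t_ms
    after movement onset. Half-cosine ramps are an assumption."""
    t = t_ms - onset_ms
    if t <= 0:
        return 0.0
    if t < ramp_ms:                        # smooth ramp away from the hand
        return amp_cm * 0.5 * (1 - np.cos(np.pi * t / ramp_ms))
    if t < ramp_ms + hold_ms:              # constant offset (hold phase)
        return amp_cm
    if t < 2 * ramp_ms + hold_ms:          # smooth ramp back to the hand
        t2 = t - ramp_ms - hold_ms
        return amp_cm * 0.5 * (1 + np.cos(np.pi * t2 / ramp_ms))
    return 0.0
```

Because the cursor returns to the true hand position before the end of the movement, no correction is required to complete the task, which is what makes any evoked response involuntary rather than task-driven.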
3 Results 3.1 Experiment 1 Subjects made reaching movements while attached to the robotic manipulandum. On random movements (2 of 10 trials) a 2 cm visual perturbation of the hand cursor was applied in the middle of the reaching movements. Subjects were instructed to react as quickly as possible by moving their hand in the same direction as the visual perturbation. During the visual perturbation the hand was constrained along the line to the target with a channel. On other random movements (1 of 10 trials), the visual perturbation was not added but the same mechanical constraint was applied for comparison with the visual perturbation trials. The electromyographic activity of the arm muscles
[Fig. 2 panels: mean rectified EMG of the eight recorded arm muscles (pectoralis major, posterior deltoid, biceps brachii short and long heads, brachioradialis, triceps medialis, lateralis and longus) and the x-axis endpoint force, plotted against time (ms) for the two visual perturbation directions.]
Fig. 2. Results of Experiment 1. A. Electromyographic activity of eight arm muscles during reaching movements. The responses are shown for the control trials (black), visual perturbation to the right trials (dark grey) and visual perturbation to the left trials (light grey). Subjects were instructed to move the handle in the direction of the visual perturbation as quickly as possible. The interval of muscle activity corresponding to these instructions is shaded in light gray. Prior to this interval, an opposite effect is seen in the muscle activity. This time interval is shown with the shaded dark gray background. Timing of the visual perturbation (onset, start of hold, end of hold, and offset) is shown with dotted lines. Time is shown relative to the onset of the movement (0 ms). B. Endpoint force at the hand for the two visual perturbation conditions. Results are mean responses across all subjects.
was examined to determine the fastest onset of the voluntary response to the visual perturbation (Fig. 2). The fastest changes in the muscle activity corresponding to the instructions given to the subjects occurred at 250 ms after the onset of the visual perturbations (in the posterior deltoid). Other muscles showed slightly longer delays.
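Onset latencies such as the 250 ms voluntary response can be read off with a simple threshold-crossing detector on the smoothed rectified EMG. The detector below is one plausible sketch (the paper does not specify its detection method); the parameter values are our own assumptions.

```python
import numpy as np

def response_onset_ms(emg, fs=2000.0, baseline_ms=(0, 100), k=3.0, min_ms=25.0):
    """Earliest time (ms) at which rectified EMG exceeds the baseline mean
    plus k standard deviations for at least `min_ms` continuously.
    Returns None if no onset is found."""
    b0, b1 = (int(m * fs / 1000.0) for m in baseline_ms)
    mu, sd = np.mean(emg[b0:b1]), np.std(emg[b0:b1])
    thresh = mu + k * sd
    run = int(min_ms * fs / 1000.0)  # required consecutive supra-threshold samples
    count = 0
    for i, above in enumerate(emg > thresh):
        count = count + 1 if above else 0
        if count >= run:
            return (i - run + 1) * 1000.0 / fs
    return None
```

Requiring a sustained supra-threshold run guards against single-sample noise spikes being read as an onset.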
These were followed by the appropriate change in the endpoint force of the hand in the x-axis occurring around 300 ms after the onset of the perturbation. However, prior to these voluntary reaction times to the visual perturbation, an opposite effect in both the electromyographic activity and endpoint force was seen. This response occurred at approximately 150 ms after the onset of the visual perturbation, lasting until the voluntary response. These responses can be seen throughout the dark gray shaded intervals in Figure 2. 3.2 Experiment 2 Subjects initially adapted to each force field in which they were moving. By the end of the learning period subjects were able to make straight undisturbed movements to the target. After the learning was completed subjects continued to make movements, of which random trials were visually perturbed to either side of the hand trajectory.
[Fig. 3 panels: pectoralis major (A) and posterior deltoid (B) EMG in the NF, CF+ and CF- fields, for 180 degree and 0 degree visual perturbations of 1 cm and 2 cm amplitude, plus the baseline condition.]
Fig. 3. Electromyographic responses in two muscles (pectoralis major and posterior deltoid) to the visual perturbation shown in the three force fields. The onset, start and end of the hold phase, and offset of the visual perturbations are denoted by the dotted lines. The grey bar indicates timing of the EMG response to the visual perturbation of the hand cursor. The overall level of the EMG varies as expected in the force fields. Both muscles are low in the NF, the pectoralis major is high in the CF+ and low in the CF-, while the posterior deltoid shows the opposite effect. The onset of the motor response in the EMG occurs at 150 ms after the onset of the visual perturbation. Opposite effects are seen in the two opposing muscles. The 180 degree perturbations (to the left) produce an inhibitory response in the pectoralis major and an excitatory response in the posterior deltoid. In contrast, the 0 degree perturbations (to the right) produce an excitatory response in the pectoralis major and an inhibitory response in the posterior deltoid.
The mean rectified EMG across the ten trials with the same visual perturbation was examined for changes in the muscle activity response. The muscle activity response was compared to random trials on which the hand was constrained in the same manner but the visual display showed the exact hand position. A change in the muscle activity occurred starting at approximately 150 ms after the onset of the visual perturbation. The responses for the two shoulder muscles are shown in Figure 3. It can be seen that the response in the muscles was appropriate for compensating for the perturbed hand position to bring it back to the expected trajectory. The normal pattern of muscle activity during the same kinematic movement was different after adaptation to each force field. All muscles had fairly low muscle activation patterns in the null field. At the time of the visual perturbation, the activity of the extensor muscles was high in the CF- and low in the CF+. In contrast the activity of the flexor muscles was high in the CF+ and low in the CF-. Although the background levels of the muscle activity were different in the three force fields, the reflex responses produced by the visual perturbation were similar. No significant differences were seen across the three force fields in any of the muscles.
Fig. 4. Change in the x-axis force at the hand due to the visual perturbation, shown in the three force fields. The onset, start and end of the hold phase, and offset of the visual perturbations are denoted by the dotted lines. The onset of the change in force was consistent across all fields and visual perturbations. All of the 0 degree visual perturbations (to the right) produced a negative force in the x-axis, whereas the 180 degree visual perturbations (to the left) produced a positive force in the x-axis. These compensations were in the appropriate direction to compensate if the hand had actually been physically displaced as the visual information suggested. There were no consistent effects of the background activity on the size of the change in the measured force.
In order to examine this more clearly, the change in the endpoint force produced by the limb against the manipulandum was examined (Figure 4). The endpoint force is produced by the complete combination of the responses of all of the muscles of the arm. When the responses are examined across the three different force fields, no differences in the size or duration of the responses can be seen. Despite different levels of background muscle activity, the reflex responses produced by the visual perturbation are similar.
4 Discussion

Two experiments were performed to characterize the motor responses to a visual perturbation of the hand position. While subjects were making reaching movements to a target, the visual representation of the subject's hand was perturbed either to the left or right of the subject's trajectory. During this time, the subject's hand was constrained along the straight-line path to the target to avoid any proprioceptive contributions to the motor response. After 200 ms the visual representation of the subject's hand was moved back towards the actual hand location. In the first experiment subjects were instructed to watch for the visual perturbation and move as fast as possible in the same direction as the visual perturbation. The results revealed a motor response, delayed approximately 150 ms from the onset of the visual perturbation, which resisted the visual perturbation. A second motor response, one which moved the hand in the direction of the visual perturbation as the subjects had been instructed to do, was found at 250 ms from the onset of the visual perturbation. The first motor response was not affected by the instruction to the subjects to assist the perturbation. These results suggest that the initial response to the visual perturbation is a reflexive response which is not modified by prior instruction. The relatively short delay between the onset of the visual perturbation and this motor response agrees with this finding. The second experiment examined how this motor response scales with the background activity of the muscles. By having subjects first adapt to one of three force fields, the muscles were pre-loaded to different degrees when the visual perturbation was given. No difference in the size of the motor response (EMG or endpoint force) was seen despite the differences in the background muscle activity across the force fields. This is distinct from the dependence of stretch reflexes on the background activity of the muscles.
Stretch reflexes and their force contributions are generally found to scale with the background activity [8-10]. Stretch reflexes show increased responses to both the velocity of stretch and amplitude of stretch [10,11]. The results from this study are not clear on the effect of velocity or amplitude of visual perturbation on the size of the motor response. There was no consistent effect on the endpoint force due to the size of the visual perturbation. However, some analysis of the muscle activation does show an effect of the visual perturbation size. This work demonstrates that the visual system elicits a motor response in the arm muscles during reaching when an error in the hand position relative to the planned trajectory is detected. This response occurs even when the actual hand position is not modified. Unlike previous studies [1-3] however, the subjects were not required to respond to the change in the visual hand position in order to make a successful
1020
D.W. Franklin et al.
movement. The subject's hand was maintained along the path to the target while the visual representation of the hand was returned to the expected path prior to the end of the movement. Despite the fact that this motor response was not necessary, a clear response was found throughout the experiment. This response, although not scaled with the background muscle activity, was appropriately directed to return the hand to its desired trajectory.
Acknowledgments We thank T. Yoshioka for his assistance in running the experiments as well as for programming the PFM. We also thank G. Liaw for his comments on a previous version of the manuscript. The experimental work was funded by NICT, Japan. DWF is supported by a fellowship from NSERC, Canada.
References 1. Saunders, J.A., Knill, D.C.: Humans use continuous visual feedback from the hand to control fast reaching movements. Exp. Brain Res. 152, 341–352 (2003) 2. Saunders, J.A., Knill, D.C.: Visual feedback control of hand movements. J. Neurosci. 24, 3223–3234 (2004) 3. Saunders, J.A., Knill, D.C.: Humans use continuous visual feedback from the hand to control both the direction and distance of pointing movements. Exp. Brain Res. 162, 458–473 (2005) 4. Brenner, E., Smeets, J.B.: Fast responses of the human hand to changes in target position. J. Mot. Behav. 29, 297–310 (1997) 5. Saijo, N., Murakami, I., Nishida, S., Gomi, H.: Large-field visual motion directly induces an involuntary rapid manual following response. J. Neurosci. 25, 4941–4951 (2005) 6. Gomi, H., Kawato, M.: Human arm stiffness and equilibrium-point trajectory during multijoint movement. Biol. Cybern. 76, 163–171 (1997) 7. Burdet, E., Osu, R., Franklin, D.W., Yoshioka, T., Milner, T.E., Kawato, M.: A method for measuring endpoint stiffness during multi-joint arm movements. J. Biomech. 33, 1705– 1709 (2000) 8. Marsden, C.D., Merton, P.A., Morton, H.B.: Stretch reflex and servo action in a variety of human muscles. J. Physiol. 259, 531–560 (1976) 9. Carter, R.R., Crago, P.E., Keith, M.W.: Stiffness regulation by reflex action in the normal human hand. J. Neurophysiol 64, 105–118 (1990) 10. Stein, R.B., Hunter, I.W., Lafontaine, S.R., Jones, L.A.: Analysis of short-latency reflexes in human elbow flexor muscles. J. Neurophysiol 73, 1900–1911 (1995) 11. Lee, R.G., Tatton, W.G.: Long latency reflexes to imposed displacements of the human wrist: dependence on duration of movement. Exp. Brain Res. 45, 207–216 (1982)
Independence of Perception and Action for Grasping Positions Takahiro Fujita, Yoshinobu Maeda, and Masazumi Katayama Department of Human and Artificial Intelligent Systems, Graduate School of Engineering, University of Fukui, Japan {fujita,katayama}@h.his.fukui-u.ac.jp
Abstract. In this paper, we investigate the independence of perception and action that has been reported in previous psychophysical studies. That independence, however, may not have been investigated directly, because those studies rely on illusion tasks such as the Ebbinghaus illusion; although their findings are quite attractive, the reported independence may include characteristics specific to the illusion tasks. From this point of view, we focus on grasping positions when grasping an object, such as a tool, used in daily life. In the first experiment, we investigate the independence of the grasping positions by using the following measurement tasks: 1) a visual-estimation task, in which the grasping positions are visually estimated; 2) a pinch task, in which the subject only grasps an object without lifting it up; and 3) a lift-up task, in which the subject grasps an object and lifts it up. As a result, the grasping positions of the visual-estimation and lift-up tasks are significantly different. Thus, these results indicate the independence of perception and action for grasping movements in daily life. In addition, surprisingly, the grasping positions of the pinch and lift-up tasks are also significantly different, although both tasks can be considered action tasks. In the second experiment, for the pinch and lift-up tasks, we examine the difference between the fingertip trajectories and, moreover, the influence of visual feedback on the grasping positions. As a result, we confirmed that the above results are not affected by the visual information perceived from one's own hand and arm during the movement. Moreover, these results indicate that grasping positions are determined before movement, because the two trajectories differ from just after movement onset. Finally, our findings are quite attractive because the difference in grasping positions may be explained by Goodale's hypothesis (the "how system" and the "what system").
Keywords: Grasping, visual perception, action.
1 Introduction
We can skillfully manipulate various objects, such as tools, in our daily life. In the motor control system, perception and action are closely related because sensory information plays an important role when executing movements. From this point of view, the interaction between perception and action has been

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1021–1030, 2008.
© Springer-Verlag Berlin Heidelberg 2008
investigated. On the other hand, it is interesting to note that the independence of perception and action has also been investigated over the past decade (e.g., [1,2,3,4,5,6,7]). For example, using the Ebbinghaus illusion, Haffenden and Goodale [1] reported that grip aperture during a prehension movement is not affected by the visual illusion even when the visually perceived size of a circle clearly is (see, e.g., [2,3,4], which also use the Ebbinghaus illusion). Moreover, the independence has been reported by related studies using other visual illusions: Westwood et al. [7] used a haptic size-contrast illusion and Króliczak et al. [8] used the hollow-face illusion. Furthermore, using the size-weight illusion, Flanagan and Beltzner [5] reported that the grip force used to lift each of two objects is adjusted to the true physical weight even when weight perception is influenced by the size-weight illusion and even when subjects know that the two objects weigh the same. Those results are really attractive for investigating the integrated mechanism of perception and action. However, the neural information processing involved in executing the illusion tasks may include processing or characteristics specific to the illusion. Therefore, it should be noted that the independence of perception and action may not have been investigated directly in those studies. From this point of view, in this study, we focus on grasping movements in daily life and investigate the independence of visual estimation and motor execution for grasping position in order to examine the independence directly.
2 Measurement Experiment

2.1 Measurement System
We built a measurement system that measures grasping positions when grasping various objects, as shown in Fig. 1. The system mainly consists of a three-dimensional motion measurement device (OPTOTRAK 3020, Northern Digital, Inc.), a visual shutter (UMU Glass, Nippon Sheet Glass Co., Ltd.), an experimental chair that restrains the subject's body, and a chin support that fixes the subject's head. The motion measurement device measures the positions of objects and the trajectories of arm movements via LED markers attached to each object and to the index finger, respectively. The subject can see the object and his hand and arm while the visual shutter is turned on; they are invisible while the shutter is turned off. The visual shutter is turned on just before a measurement task is executed.

2.2 Measurement Tasks
We measured grasping positions in the following three tasks: Visual-estimation task (VT). The subject's right hand was placed at the initial position, and the subject maintained that position while executing the task. After the visual shutter was turned on, the subject estimated the grasping position of an object placed in front of him. Note that the subject did not move his
[Fig. 1 depicts the measurement setup: the OPTOTRAK above a desk, the target object on a base object 25 cm from the initial position, the UMU Glass shutter and lever, the chair, and an LED marker at the initial position.]

Fig. 1. Measurement setup
arm and hand. Before executing this task, the subject was instructed to carry out the task with the intention of grasping and lifting up the visible object. Pinch task (PT). The subject's right hand was placed at the initial position. After the visual shutter was turned on, the subject reached for the object and pinched it; however, he was not allowed to lift it up after pinching. Before executing this task, the subject was instructed to carry out the task with the intention of lifting up the grasped object. Lift-up task (LT). The subject's right hand was placed at the initial position. After the visual shutter was turned on, the subject reached for the object, grasped it, and lifted it up. Grasping positions were measured at the moment the subject started to lift up each object.

2.3 Target Objects
Nine objects were made by combining base objects of different weights, as shown in Fig. 2; thus, the centers of gravity of the combined objects differ. Each object was placed 25 cm from the initial position (see Fig. 1).

2.4 Experiment 1
The objective of Experiment 1 was to examine the differences among the grasping positions measured in the above tasks (VT, PT, and LT) in order to investigate the independence of perception and action for grasping movements in daily life.
[Fig. 2 shows the base objects (42 mm cubes) and the nine target objects Obj1-Obj9 assembled from them.]

Fig. 2. Target objects. (a) Base objects: a white cube weighs 50 g and a black cube 100 g. (b) The target objects are constructed by combining different base objects; therefore, the centers of gravity of the target objects differ. Obj5 is the control object, whose center of gravity coincides with its geometric center.
Ten right-handed subjects participated in Experiment 1. Before Experiment 1, subjects lifted up the base objects, but they were not allowed to lift up the target objects. The measurement procedure in Experiment 1 was as follows: Before learning. In the first phase, we measured grasping positions for VT and PT in turn. The objective of this phase was to measure grasping positions before learning the lift-up task (LT), because in VT and PT subjects could not know the center of gravity of each object. The nine objects shown in Fig. 2 were used, and one object at a time was displayed in random order. The grasping position was measured only once for each object. Learning phase. The second phase trained the lift-up task (LT) through iterative trials. In this learning phase, all target objects were used and one object at a time was displayed in random order. First, five trials for each object were measured (45 trials in total). Next, the subject repeated the task for ten minutes in order to learn the appropriate grasping positions of all objects. After these iterative trials, five trials for each object were measured again. After learning. In the final phase, we again measured grasping positions for VT and PT under the same conditions as in the first phase. The objective of this phase was to measure grasping positions after learning the appropriate grasping positions of all objects.

2.5 Experiment 2
The objective of Experiment 2 was to measure the movement trajectories during task execution for PT and LT and moreover to investigate the influence of visual
[Fig. 3 plots grasping position (mm) for each object (Obj1-Obj9) across the three phases: (a) before learning (VT and PT), (b) the learning phase (trials 1-5 of the first set of LT, ten minutes of practice, then trials 1-5 of the last set), and (c) after learning (VT and PT).]

Fig. 3. Grasping position of each trial in Experiment 1 (a typical subject)
feedback. Grasping positions and trajectories were measured under the following two conditions for each of PT and LT. C1. The same measurement condition as in Experiment 1. C2. The subject could not see his own hand and arm, although he could see the object during task execution. Three target objects (Obj1, Obj5, and Obj8) were used in Experiment 2. Six right-handed subjects who had not participated in Experiment 1 participated in Experiment 2. Before the measurement experiment, subjects repeated trials of the lift-up task (LT) long enough to learn the appropriate grasping positions for the three objects. The six subjects were divided into two groups: Group A and Group B. The three subjects in Group A performed in the order PT under C1, PT under C2, LT under C1, and LT under C2; the subjects in Group B performed in the order PT under C2, PT under C1, LT under C2, and LT under C1. The objects were displayed in random order, and 20 trials were measured for each object.
3 Results

3.1 Experiment 1
Fig. 3 shows grasping positions of each trial for a typical subject in Experiment 1: (a) before learning, (b) learning phase and (c) after learning. Moreover, Fig. 4 shows averaged grasping positions for all subjects before and after learning. In this figure, the results of the first set of LT in the learning phase were described
[Fig. 4 plots grasping position (0-200 mm) for each object Obj1-Obj9 and each task (VT, PT, LT), (a) before and (b) after learning.]

Fig. 4. Averaged grasping positions for all subjects in Experiment 1. (a) Before learning; LT shows the results of the first set for each object. (b) After learning; LT shows the results of the last set for each object. Horizontal bars denote the standard deviation of the grasping positions.
in Fig. 4(a), and those of the last set of LT are described in Fig. 4(b). Table 1 shows the results of the statistical tests for all subjects. For the results before learning in Fig. 4(a), the grasping positions of VT and of the first set of LT are significantly different (p < 0.01). This result shows that the visually estimated grasping positions are not the positions actually grasped when executing the lift-up task (LT). The grasping positions of VT and PT are not significantly different (p > 0.05). Moreover, the grasping positions of PT and of the first set of LT are significantly different (p < 0.01). These results show that, surprisingly, the grasping positions of PT and LT differ, although both tasks can be considered action tasks. The grasping positions of PT are almost the same as those of VT. The results after learning in Fig. 4(b) show the same tendency as those in Fig. 4(a), as shown in Table 1. Although subjects could learn the appropriate grasping positions, i.e., the centers of gravity, the grasping positions of VT and of the last set of LT are significantly different (p < 0.01), and moreover the grasping positions
Table 1. Statistical tests for all subjects (paired t-test; *: p < 0.05; **: p < 0.01; NS: not significant).

(a) Before learning:            VT-LT **   VT-PT NS   PT-LT **
(b) After learning:             VT-LT **   VT-PT NS   PT-LT **
(c) Before vs. after learning:  VT-VT **   PT-PT **   LT-LT **
of PT and of the last set of LT are significantly different (p < 0.01). Thus, subjects did not grasp at the appropriate positions even after learning. Next, as shown in Table 1(c), the grasping positions for all tasks changed through learning. For LT, the grasping positions of the first and last sets in the learning phase were compared. Thus, the grasping positions of VT and PT are influenced by the learning of LT. However, it is surprising that, for VT and PT, subjects did not indicate the appropriate positions even after learning. Consequently, these results indicate a dissociation between visually estimated positions and the positions actually grasped when lifting up the objects. Furthermore, it is interesting to note that the grasping positions of PT and LT, both action tasks, differ, and moreover that the grasping positions of PT are the same as those of VT.

3.2 Experiment 2
Averaged spatial paths of the fingertip for a typical subject in each group are shown in Fig. 5. Fig. 6 shows the averaged grasping positions under each condition for all subjects. The grasping positions of PT and LT after learning are significantly different (p < 0.01), whereas the grasping positions under C1 and C2 for each task are not (p > 0.05). Moreover, the spatial paths for PT and LT clearly differ from movement onset, as shown in Fig. 5. The spatial paths under C2 appear to be the same as those under C1 from movement onset, as shown in Fig. 5. Therefore, these results show that the visual information from the subject's arm and hand during the prehension movement does not affect the movements or the grasping positions. Thus, these results indicate that humans determine the grasping position of a target object before movement.
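The significance comparisons reported here follow the paired t-test design of Table 1. A minimal sketch using SciPy's ttest_rel; the per-subject means below are synthetic values invented for illustration, not the paper's data:

```python
import numpy as np
from scipy import stats

def compare_tasks(pos_a, pos_b, alpha=0.01):
    """Paired t-test on per-subject mean grasping positions (mm).

    pos_a, pos_b: arrays of shape (n_subjects,), one mean per subject
    for each task or condition. Returns (t, p, significant-at-alpha).
    """
    t, p = stats.ttest_rel(pos_a, pos_b)
    return t, p, bool(p < alpha)
```

For example, PT vs. LT means differing by ~30 mm per subject come out significant, while two nearly identical condition means (C1 vs. C2) do not.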
4 Discussion

4.1 Independence of Perception and Action
Visual information processing in the brain can be divided into two streams: the ventral stream and the dorsal stream. These are known, respectively, as the "what system," which recognizes target objects, and the "where system," which specifies the spatial positions of target objects [9]. On the other hand, Goodale and Milner [10] have proposed that the role of the dorsal stream is to determine how to perform an action (the "how system"). From this point of view, the independence of the two streams has been investigated using illusion tasks (e.g., [1,2,3,4,5,6,7]), as described above. The independence reported
[Fig. 5 plots y [mm] (0-300) against x [mm] (0-200) for PT C1, PT C2, LT C1, and LT C2 on Obj1, for (a) Group A and (b) Group B.]

Fig. 5. Spatial paths of the index-finger tip for PT and LT in Experiment 2 (C1: normal condition; C2: no visual feedback; results for Obj1 from typical subjects in each group)
by those studies supports Goodale's hypothesis. However, the plausibility of the hypothesis remains an open question because the neural mechanism underlying the independence is still not clear. Moreover, although this dissociation of perception and action is really attractive, it may include properties specific to the illusions. From this point of view, in this study, we investigated the dissociation for grasping movements in daily life. The visual-estimation task (VT) is considered a perception task because subjects visually estimate the grasping positions, and the lift-up task (LT) is an action task. In Experiment 1, the grasping positions of VT are significantly different from those of LT. Therefore, these results show the independence of perception and action for grasping positions. Note that this independence does not involve intrinsic properties of illusions. In addition, the difference in grasping positions in this study may be explained by Goodale's hypothesis: the grasping positions of VT may be determined via the ventral stream, and those of LT may be computed via the dorsal stream. We would like to emphasize that the grasping positions of VT and LT may be computed by different brain areas.

4.2 The Pinch Task and the Lift-Up Task as Action Tasks
If Goodale's hypothesis is correct, both the pinch task (PT) and the lift-up task (LT) should be performed by the "how system," which includes the dorsal
[Fig. 6 plots grasping position (0-200 mm, left to right) for each object (Obj1, Obj5, Obj8), each task (PT, LT), and each condition (C1, C2), for (a) Group A and (b) Group B.]

Fig. 6. Averaged grasping positions and standard deviations for each object and each condition in Experiment 2 (typical subjects in each group)
optic stream. Thus, the hypothesis predicts that the grasping positions of PT and LT should be equal. However, we found that the grasping positions of PT are significantly different from those of LT, although both PT and LT are regarded as action tasks. Note that subjects did not need to lift up the grasped object in the PT trials; the center of gravity matters only when lifting the grasped object, so subjects could have grasped at any position in PT. Nevertheless, they grasped at the same positions as in VT, which is a perception task. In terms of Goodale's hypothesis, the grasping positions of PT may be computed using the ventral optic stream, that is, the "what system." From this observation, we would like to emphasize that the grasping positions of LT and VT are computed independently in the brain, and moreover that those of PT are computed by the same mechanism as those of VT.

4.3 Planning of Grasping Position
As noted above, although both the pinch task (PT) and the lift-up task (LT) are regarded as action tasks, the grasping positions of PT are significantly different from those of LT. In Experiment 2, we confirmed that the grasping positions measured in Experiment 1 are the same as those measured under the condition in which there is no visual feedback from the subject's own arm and hand during the
movement. Moreover, the spatial paths for PT and LT also differ from movement onset. Therefore, visual feedback is not used for planning the grasping positions and reaching movements. These results indicate that the grasping positions are planned before movement.
5 Conclusion
In this paper, by focusing on the grasping positions of target objects, we investigated the independence of perception and action. Although Haffenden et al. [1], Flanagan et al. [5], and others have reported this dissociation using illusion tasks, we ascertained the dissociation between perception and action for grasping movements in daily life. In addition, these results indicate that the grasping positions of the pinch task (PT) and the lift-up task (LT), both action tasks, may be computed independently in the human brain before movement. Moreover, our results indicate that the grasping positions of the pinch task (PT) may be determined by the same neural mechanism as in the visual-estimation task (VT). These findings are quite important for investigating the integrated neural mechanism of perception and action in the human brain.
References 1. Haffenden, A.M., Goodale, M.A.: The effect of pictorial illusion on prehension and perception. J. Cogn. Neurosci. 10(1), 122–136 (1998) 2. Haffenden, A.M., Goodale, M.A.: Independent effects of pictorial displays on perception and action. Vision Res. 40, 1597–1607 (2000) 3. Haffenden, A.M., Schiff, K.C., Goodale, M.A.: The dissociation between perception and action in the Ebbinghaus illusion: nonillusory effects of pictorial cues on grasp. Curr. Biol. 11, 177–181 (2001) 4. Danckert, J.A., Nadder, S., Haffenden, A.M., Schiff, K.C., Goodale, M.A.: A temporal analysis of grasping in the Ebbinghaus illusion: planning versus online control. Exp. Brain Res. 144, 275–280 (2002) 5. Flanagan, J.R., Beltzner, M.A.: Independence of perceptual and sensorimotor predictions in the size-weight illusion. Nat. Neurosci., 737–741 (2000) 6. Ganel, T., Goodale, M.A.: Visual control of action but not perception requires analytical processing of object shape. Nature 426(6967), 664–667 (2003) 7. Westwood, D.A., Goodale, M.A.: A haptic size-contrast illusion affects size perception but not grasping. Exp. Brain Res. 153, 253–259 (2003) 8. Króliczak, G., Heard, P., Goodale, M.A., Gregory, R.L.: Dissociation of perception and action unmasked by the hollow-face illusion. Brain Research 1080(1), 9–16 (2006) 9. Mishkin, M., Ungerleider, L.G.: Contribution of striate inputs to the visuospatial functions of parieto-preoccipital cortex in monkeys. Behav. Brain Res., 57–77 (1982) 10. Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends Neurosci. 15(1), 20–25 (1992)
Handwritten Character Distinction Method Inspired by Human Vision Mechanism Jumpei Koyama 1, Masahiro Kato 2, and Akira Hirose 3
1 The University of Tokyo, 3-1, Hongo 7, Bunkyo, Tokyo, Japan [email protected] http://www.eis.t.u-tokyo.ac.jp/~koyama/ 2 Fuji Xerox Co., Ltd., Kanagawa, Japan 3 The University of Tokyo, Tokyo, Japan
Abstract. Human beings are capable of distinguishing a variety of textures almost at a glance. By modelling this mechanism, we can realize flexible texture analysis. We propose a new technique, inspired by this human early-vision ability, to distinguish handwritten character regions from machine-printed regions in document images. In the technique, we evaluate the two-dimensional power spectrum to extract feature values that reflect the fluctuations unavoidable in handwritten characters. Experiments show that a certain feature value of handwritten characters is often larger than that of machine-printed characters. We generated a map by superimposing the feature value on the document image. The map showed that our proposed method is useful for distinguishing handwritten character regions from machine-printed ones. Keywords: Human vision, Handwriting recognition, Image texture analysis, Fourier transform, Optical character recognition.
1 Introduction
The development and spread of computers have increased our opportunities to use electronic data. As a result, document analysis techniques for converting paper documents into electronic data are becoming more important. Distinguishing handwritten characters from machine-printed characters is one of the important issues. The distinction technique is useful in various situations. For example, it automates the selection of a purpose-built optical character recognition (OCR) system. It also extracts handwritten annotations, which complement document contents, from document images. Among existing methods for extracting handwritten characters, the most common approaches are typographical. Fan et al. utilized the linearity of baselines [1], obtained from vertical and horizontal histograms of a document. Zheng et al. extracted aspect ratio, stroke density, and run-length histograms [2]. In addition, pattern-matching methods on frequent function words are used for font recognition [3]. These methods need correct extraction of characters and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1031–1040, 2008.
© Springer-Verlag Berlin Heidelberg 2008
text lines from document images as preprocessing. However, actual documents contain touching characters and overlapping annotations, and it is difficult to extract characters and text lines from such documents. Human beings can distinguish handwritten character regions from machine-printed character regions almost at a glance. Because this is done in a moment, it is thought that human beings use the difference between the texture of handwritten character regions and that of machine-printed character regions for the distinction. Texture consists of statistical repetition; the pattern need not be completely periodic. Texture is quantified by the spatial correlation of pixel values in document images and is therefore represented by the autocorrelation function. The human capability for texture distinction consists of two cascaded processes [4][5]. The first process takes place in the primary visual cortex: it detects the spatial frequency and orientation of texture in local regions [6]. The second process takes place at a higher level of the early visual cortex: it compares the feature values of local regions obtained from the primary visual cortex. These processes are said to underlie human texture discrimination, and utilizing this mechanism enables flexible texture analysis. There exist many script identification techniques using texture analysis [7], [8], [9]. We have various texture analysis tools such as the two-dimensional Fourier transform, the two-dimensional wavelet transform [10], and multichannel Gabor filters [6]. These techniques achieve remarkable recognition rates for various scripts. However, these methods utilize global texture. To our knowledge, no results have been reported in which such techniques are used to extract regions from a document image containing both handwritten and machine-printed characters. We propose a novel method to distinguish handwritten character regions from machine-printed character regions.
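The link stated above between spatial correlation and the autocorrelation function can be made concrete: by the Wiener-Khinchin theorem, the circular autocorrelation of an image patch is the inverse Fourier transform of its power spectrum. A minimal NumPy sketch (an illustration, not code from the paper; it assumes a non-constant patch):

```python
import numpy as np

def autocorrelation_2d(patch):
    """Spatial autocorrelation via the Wiener-Khinchin theorem.

    The patch is mean-removed, the power spectrum |F|**2 is inverted,
    and the result is normalized so the zero-lag peak equals 1 and
    shifted so that zero lag sits at the array centre.
    """
    f = np.fft.fft2(patch - patch.mean())      # remove DC component
    ac = np.fft.ifft2(np.abs(f) ** 2).real     # inverse FFT of power spectrum
    return np.fft.fftshift(ac) / ac[0, 0]      # ac[0, 0] is the zero-lag value
```

For a strictly periodic texture the autocorrelation returns to 1 at lags equal to the period, whereas the irregular line segments of handwriting decorrelate more quickly.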
The method utilizes the texture of local regions in document images and extracts textural features reflecting the fluctuation of the line segments of characters caused by handwriting. In document images, for example, machine-printed characters have regular orientation of line segments and uniform line width; they construct characteristic texture. In the case of handwritten characters, on the other hand, the texture is affected by fluctuation caused by the handwriting mechanism. This fluctuation manifests itself as irregularity of line segments and declination of line segments from the horizontal and vertical axes. By utilizing such texture, we propose a distinction method that is not influenced by the location of characters. In other words, the method does not need precise extraction of a single character or a single text line. We present a first-step experiment to extract the feature quantities from document images, in which we find the method effective for the distinction.
2 Method of Fluctuation Extraction in Handwritten Characters Based on Vertical and Horizontal Orientation
Figure 1 shows a flowchart of our method. First, a power spectrum is obtained from an original image by computing two-dimensional Fourier transform.
Distinction Method Inspired by Human Vision Mechanism
[Flowchart: input binarized image → sweep FFT window → 2-D Fourier transform → calculation of feature value E → generation of a map of feature value E → end]
Fig. 1. Flowchart of our proposed method
Secondly, we extract a feature value of the texture, which human beings use to segregate textures, from the power spectrum. Finally, the feature values at respective points are mapped on the original image and we evaluate the effectiveness of the feature value.

2.1 Texture Representation in Fourier Domain
First, we obtain a power spectrum S(u, v) by computing the two-dimensional Fourier transform F(u, v) in a region cut out from an original image as

S(u,v) = |F(u,v)|^2 \qquad (1)

F(u,v) = \sum_{x=-M/2}^{M/2-1} \sum_{y=-N/2}^{N/2-1} f(x,y)\, w_H(x,y) \exp\!\left[-2\pi j\left(\frac{x}{M}u + \frac{y}{N}v\right)\right] \qquad (2)
where f and w_H denote the pixel value at position (x, y) in the Fourier window and the Hanning window weight, respectively. The constants M and N are the numbers of horizontal and vertical pixels. Figure 2(a) shows an example of a region cut out from an original image, while Fig. 2(b) shows its power spectrum image. The variables (u, v) denote the two-dimensional spatial frequency. The origins of (x, y) and (u, v) are chosen at the respective centers. The power spectrum S is obtained by computing the Fourier transform of the autocorrelation function; hence the power spectrum represents the texture of an image well.
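As an illustration of eqs. (1)–(2), the following sketch (our own code, not the authors'; the function name and the 64×64 window size are assumptions) computes the Hanning-weighted power spectrum of an image patch:

```python
import numpy as np

def power_spectrum(f):
    """Eqs. (1)-(2): S(u, v) = |F(u, v)|^2 of a Hanning-weighted patch f."""
    M, N = f.shape
    w_H = np.outer(np.hanning(M), np.hanning(N))  # separable 2-D Hanning window
    F = np.fft.fft2(f * w_H)
    # fftshift moves the origin (u, v) = (0, 0) to the center, as in Fig. 2(b)
    return np.abs(np.fft.fftshift(F)) ** 2

patch = np.random.rand(64, 64)   # stand-in for a 64x64 binarized document window
S = power_spectrum(patch)
print(S.shape)                   # (64, 64)
```

The Hanning weighting suppresses the spectral leakage that the sharp window borders would otherwise introduce.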
Fig. 2. (a) Character image in real space f (x, y) and (b) its power spectrum S(u, v)
We list below the features that can be obtained from the texture of a machine-printed character region in the power spectrum domain S.

– Machine-printed character regions usually have truly vertical and horizontal line segments. Therefore the power spectrum on the horizontal and vertical axes is intense.
– Machine-printed character regions often have nearly the same width and interval of line segments. This makes certain frequency bands of the power spectrum strong.

In the case of Chinese characters, these features appear remarkably. We propose a distinction method between handwritten and machine-printed character regions by quantifying them.

2.2 Extraction of a Fluctuation Feature Value Caused by Handwriting
Many machine-printed Chinese character regions have vertical and horizontal line segments. Therefore, as mentioned above, the power on the horizontal and vertical axes is intense. On the other hand, handwritten Chinese character regions frequently have line segments that contain fluctuation and skew. Therefore, the power off these axes tends to be more intense than in the case of machine-printed Chinese characters. Figure 3 shows the power presenting such intense off-axis belts. This tendency allows us to define a feature value E which evaluates the power off the vertical and horizontal axes as

E = \frac{\sum_{u \neq -1,0,1} \sum_{v \neq -1,0,1} S(u,v)^2 (u^2 + v^2)}{\sum_{u} \sum_{v} S(u,v)^2 (u^2 + v^2)} \qquad (3)
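A direct sketch of eq. (3) (our own code; it assumes the centered-spectrum convention of Fig. 2(b)) might look as follows:

```python
import numpy as np

def feature_value_E(S):
    """Eq. (3): fluctuation-to-total power ratio of a centered spectrum S(u, v)."""
    M, N = S.shape
    u = np.arange(M) - M // 2          # frequency indices after centering
    v = np.arange(N) - N // 2
    U, V = np.meshgrid(u, v, indexing="ij")
    W = S ** 2 * (U ** 2 + V ** 2)     # weighted power S(u,v)^2 (u^2 + v^2)
    off_axis = (np.abs(U) > 1) & (np.abs(V) > 1)   # u, v not in {-1, 0, 1}
    return W[off_axis].sum() / W.sum()

# Power concentrated on the axes (machine print) gives E = 0:
S_axes = np.zeros((8, 8))
S_axes[4, :] = S_axes[:, 4] = 1.0
print(feature_value_E(S_axes))         # 0.0
```

A spectrum whose power lies entirely on the axes yields E = 0, while power spread off the axes drives E toward 1.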
Fig. 3. Power spectrum including components off the vertical and horizontal axes, represented by broken lines
The denominator is the total power weighted by u² + v². The numerator is the sum of the power off the vertical and horizontal axes of the power spectrum image. The definition (3) means that E represents the fluctuation-to-total power ratio. It provides a feature value to distinguish handwritten character regions from machine-printed character regions. The smaller the feature value E is, the larger the power on the vertical and horizontal axes is, and the more machine-printed characters exist in the region. Conversely, the larger the feature value E is, the smaller the power on the axes is, suggesting that there are many handwritten characters in the region. The feature value E is more effective for Chinese characters, which have a lot of vertical and horizontal line segments.

2.3 Verification Experiment
Figure 5 shows samples for an experiment to demonstrate the effectiveness of the feature value E. We selected 200 Chinese characters with a simple structure, 200 with a complex structure and 200 with many skew line segments. Five people were asked to write the respective characters. With one machine-printed sample in addition, six samples were thus prepared for each character; consequently, the total number of samples is 3600. Figures 6, 7 and 8 show the results. Figure 6 shows the feature values obtained for samples of simple Chinese characters. The feature values obtained for machine-printed characters are smaller than those for handwritten characters. In the case of complex Chinese characters in Fig. 7, we can observe the same tendency. Figure 8 shows the feature values obtained for Chinese characters that have many skew line segments. The proposed method utilizes only vertical and horizontal
Fig. 4. Summation regions defined in (3) in the power spectrum S(u, v): the fluctuation power (numerator) and the total power (denominator)
Fig. 5. Samples of characters. Rows show three categories of Chinese characters (simple, complex, skew). Columns show categories of writers; the code after 'Handwritten character' identifies the writer (ei, hk, mk, ts, yk).
Fig. 6. Feature values of simple Chinese characters
Fig. 7. Feature values of complex Chinese characters
Fig. 8. Feature values of skew Chinese characters
orientation as a criterion. Therefore the difference between the feature values of machine-printed characters and those of handwritten characters is not so remarkable in this category. Except for this point, the feature value E is found effective for the distinction between handwritten and machine-printed Chinese characters.
3 Map of Feature Value E
The feature value E is a local evaluation of the texture in a region extracted from the whole document by computing the two-dimensional Fourier transform. To make a feature value E map of the whole document, we assign the obtained feature value E, as a function of position, to the center pixel of the region. The whole document is scanned and the feature value E is computed at every pixel. In this way, we obtain a map of feature values for a document. Figures 9(a) and (b) are sample document images. The scanner resolution is 200 dpi and the image size is 640×480 pixels. These figures have an identical document form and the same characters. However, all the characters in Fig. 9(b) are machine-printed, while some characters in Fig. 9(a) are handwritten. Figures 10(a) and (b) are maps of the feature value E superimposed on Figs. 9(a) and (b), respectively, obtained from the sample document images with the proposed method. The feature value E runs from 0.0 to 1.0 and is represented in gray scale (black represents 0.0, white represents 1.0).
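The map generation described above can be sketched as follows (our own code; the 64-pixel window and the coarser scan step are assumptions for speed, whereas the paper computes E at every pixel):

```python
import numpy as np

def feature_map(doc, win=64, step=8):
    """Sweep a win x win Fourier window over `doc` and map E at each position."""
    w_H = np.outer(np.hanning(win), np.hanning(win))
    u = np.arange(win) - win // 2
    U, V = np.meshgrid(u, u, indexing="ij")
    off_axis = (np.abs(U) > 1) & (np.abs(V) > 1)
    H, W = doc.shape
    emap = np.zeros(((H - win) // step + 1, (W - win) // step + 1))
    for i in range(emap.shape[0]):
        for j in range(emap.shape[1]):
            patch = doc[i * step:i * step + win, j * step:j * step + win]
            S = np.abs(np.fft.fftshift(np.fft.fft2(patch * w_H))) ** 2
            Wp = S ** 2 * (U ** 2 + V ** 2)            # eq. (3) weighting
            emap[i, j] = Wp[off_axis].sum() / max(Wp.sum(), 1e-12)
    return emap

emap = feature_map(np.random.rand(480, 640))   # 480x640 page as in Sect. 3
print(emap.shape)                              # (53, 73)
```

Dark (small E) regions of the resulting map indicate machine print; bright (large E) regions indicate handwriting.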
Fig. 9. Sample documents. The two documents have the same characters; however, in the left figure, most of them are handwritten.
Fig. 10. Feature value maps of the documents in Fig. 9
The smaller the feature value E is, the more vertical and horizontal line segments the region has. The regions that contain many machine-printed characters are darker; conversely, the regions that contain many handwritten characters are whiter. By comparing these figures, we find that the feature values E in areas including handwritten characters are large. For example, in Fig. 10(a), the areas framed by white double lines contain handwritten characters, and the feature values E are large. On the contrary, in Fig. 10(b), there are machine-printed characters in the areas framed by white double lines, and the feature values E are small. This result demonstrates that the proposed method yields an effective feature value for the distinction between handwritten and machine-printed Chinese character regions. However, as seen in the areas framed by white broken lines in Fig. 10(b), the feature value E is also large in regions that contain particular machine-printed characters, such as Roman letters, numerals and some Chinese characters. These characters have many skew line segments. The current method evaluates only the power generated by line segments whose orientation is not truly vertical or horizontal. In this respect, we need an improved definition of E to treat such skew line segments.
4 Summary
In this paper, we proposed the feature value E to distinguish handwritten character regions from machine-printed character regions. The feature value is inspired by a mechanism of human vision. The proposed method utilizes the power spectrum obtained by locally computing the two-dimensional Fourier transform of a document image. The feature value is defined as the fluctuation-to-total power ratio. Remarkable differences were observed between handwritten and machine-printed Chinese characters in the verification experiment. In addition, we generated maps of the feature value E and confirmed that it is effective for the distinction.
References

1. Fan, K.C., Wang, L.S., Tu, Y.T.: Classification of Machine-Printed and Handwritten Texts Using Character Block Layout Variance. Pattern Recognition 31, 1275–1284 (1998)
2. Zheng, Y., Li, H., Doermann, D.: Machine Printed Text and Handwriting Identification in Noisy Document Images. IEEE Trans. Pattern Analysis and Machine Intelligence 26, 337–353 (2004)
3. Khoubyari, S., Hull, J.J.: Font and Function Word Identification in Document Recognition. Computer Vision and Image Understanding 63, 66–74 (1996)
4. Julesz, B.: Texture and Visual Perception. Scientific American 212, 38–48 (1965)
5. Kingdom, F.A.A., Keeble, D., Moulden, B.: Sensitivity to Orientation Modulation in Micropattern-based Textures. Vision Research 35, 79–91 (1995)
6. Daugman, J.G.: Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-dimensional Visual Cortical Filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985)
7. Zhu, Y., Tan, T., Wang, Y.: Font Recognition Based on Global Texture Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 23, 1192–1200 (2001)
8. Tan, T.N.: Rotation Invariant Texture Features and Their Use in Automatic Script Identification. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 751–756 (1998)
9. Busch, A., Boles, W.W., Sridharan, S.: Texture for Script Identification. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 1720–1732 (2005)
10. Daubechies, I.: The Wavelet Transform, Time-Frequency Localization and Signal Analysis. IEEE Trans. Information Theory 36, 961–1005 (1990)
Recent Advances in the Neocognitron

Kunihiko Fukushima
Kansai University, Takatsuki, Osaka 569–1095, Japan
[email protected]
http://www4.ocn.ne.jp/~fuku k/index-e.html
Abstract. The neocognitron is a hierarchical multilayered neural network capable of robust visual pattern recognition. This paper discusses recent advances in the neocognitron, showing several types of neocognitron, to which various improvements and modifications have been made.
1 Introduction
The neocognitron, which was proposed as a neural network model for the visual system, is a hierarchical multilayered network capable of robust visual pattern recognition [1,2]. It acquires the ability to recognize patterns through learning. Since the initial proposal of the neocognitron, several modifications have been made to endow it with various abilities. This paper introduces some of them.
2 Outline of the Neocognitron
The lowest stage of the network is the input layer. There are retinotopically ordered connections between cells of adjoining layers. Each cell receives input connections that lead from cells situated in a limited area on the preceding layer. Layers of S-cells and C-cells are arranged alternately in the hierarchical network. S-cells work as feature-extracting cells. They resemble simple cells of the visual cortex. Their input connections are variable and are modified through learning. After having finished learning, an S-cell comes to respond selectively to a particular feature. Generally speaking, local features, such as edges or lines in particular orientations, are extracted in lower stages. More global features, such as parts of learning patterns, are extracted in higher stages (Fig. 1). C-cells, which resemble complex cells, are inserted in the network to allow for positional errors of features. Each C-cell receives fixed excitatory connections from a group of S-cells that extract the same feature, but from slightly scattered locations. Thus, the C-cell's response is less sensitive to the shift of the feature. We can also say that C-cells perform a blurring operation. If an S-cell is active, its output comes to be distributed to a group of C-cells. In the whole network, with its alternate layers of S-cells and C-cells, the process of feature extraction by S-cells and toleration of shift by C-cells is repeated. During this process, local features extracted in lower stages are gradually integrated into more global features. Since small amounts of positional errors of local

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1041–1050, 2008. © Springer-Verlag Berlin Heidelberg 2008
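The blurring role of C-cells can be illustrated with a minimal sketch (ours, not the paper's code; simple neighborhood averaging stands in for the fixed excitatory connections):

```python
import numpy as np

def c_cell_layer(s_map, radius=1):
    """Pool S-cell responses over a (2r+1)^2 neighborhood (a blurring operation)."""
    H, W = s_map.shape
    padded = np.pad(s_map, radius)      # zero padding at the borders
    out = np.zeros((H, W))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + H, dx:dx + W]
    return out / (2 * radius + 1) ** 2

s = np.zeros((5, 5))
s[2, 2] = 1.0                 # one active S-cell
c = c_cell_layer(s)           # its output spreads over a 3x3 patch of C-cells,
print(c[2, 2], c[2, 3])       # so a slightly shifted feature still excites them
```

Because the pooled response changes little when the feature shifts by a pixel, an S-cell in the next stage that reads these C-cells becomes shift-tolerant.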
Fig. 1. Hierarchical information processing in the layered network of the neocognitron
features are absorbed by C-cells, an S-cell in a higher stage comes to respond robustly to a specific feature even if the feature is slightly deformed or shifted. In the neocognitrons of recent versions, S-cells in intermediate stages of the network are trained with competitive learning. Among the S-cells situated in a certain small competition area, only the one responding most strongly becomes the winner. If all S-cells are silent, a new S-cell is generated and takes the place of the winner. The winner has its input connections strengthened. The amount of strengthening of each input connection to the winner is proportional to the intensity of the response of the presynaptic C-cell. Thus the input connections to each S-cell grow to be a template of the feature that made the S-cell a winner. The response of an S-cell can be expressed mathematically as follows [2,4]. Let x be a vector that represents the response of the presynaptic C-cells (namely, the input signals to the S-cell). Also let X be the input connections of the S-cell. If we define the similarity s between X and x using the inner product by

s = \frac{(X, x)}{\|X\| \cdot \|x\|} \qquad (1)

the response of the S-cell, whose threshold is θ, is proportional to (s − θ) if s ≥ θ, and zero if s < θ.
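A literal sketch of this response rule (our code; eq. (1) plus the (s − θ) rectification described in the text):

```python
import numpy as np

def s_cell_response(X, x, theta):
    """Similarity s of eq. (1), rectified at threshold theta."""
    s = np.dot(X, x) / (np.linalg.norm(X) * np.linalg.norm(x))
    return max(s - theta, 0.0)

X = np.array([1.0, 0.0])                                    # learned reference vector
print(s_cell_response(X, np.array([1.0, 0.2]), theta=0.7))  # similar input: responds
print(s_cell_response(X, np.array([0.0, 1.0]), theta=0.7))  # 0.0: below threshold
```

Raising θ sharpens the cell's selectivity: only inputs closely aligned with the learned template X elicit a response.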
3 Controlling the Blur in the Neocognitron
It might be felt that a blur decreases the information-processing ability of the visual system, but this is not always the case. On the contrary, it greatly increases the ability to process visual information, if the blur is properly controlled. By the word blur we mean the operation that is performed between S-cells and C-cells in the neocognitron, or between simple cells and complex cells in the visual cortex. The neocognitron uses this blurring operation skillfully [3].

3.1 Inhibitory Surround in the Connections to C-Cells
The blurring operation by C-cells, which usually is effective for improving robustness against deformation, sometimes makes it difficult to detect whether a lump of blurred response is generated by a single feature or by two identical features located adjacently (Fig. 2(A)(a)). For example, a single line and a pair
of parallel lines with a very narrow separation generate a similar response when they are blurred. To cover this weakness, an inhibitory surround is introduced around the excitatory input connections of a C-cell [2]. The inhibitory surround creates a non-responding zone between the two lumps of blurred responses of C-cells (Fig. 2(A)(b)). This silent zone lets S-cells of the next stage easily detect the number of original features even after blurring.
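The effect can be mimicked with a one-dimensional sketch (entirely our illustration; the box-shaped center and surround and the weight k are assumptions, not the paper's connection profile):

```python
import numpy as np

def c_cell_with_surround(s_line, center=1, surround=3, k=0.5):
    """Excitatory center minus inhibitory surround, rectified at zero."""
    out = np.zeros(len(s_line))
    for i in range(len(s_line)):
        exc = s_line[max(0, i - center):i + center + 1].mean()
        sur = s_line[max(0, i - surround):i + surround + 1].mean()
        out[i] = max(exc - k * sur, 0.0)
    return out

two_lines = np.array([0, 1, 1, 0, 0, 1, 1, 0], dtype=float)
resp = c_cell_with_surround(two_lines)
# The response dips between the two lines, keeping them separable after blurring.
print(resp)
```

Without the surround term, plain averaging would merge the two nearby lines into one lump of response.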
UC
(a) No inhibitory surround (conventional) US
UC
(a)
(b) Inhibitory surround (A) Separation of the blurred responses produced by two independent features.
(b)
(B) Response like an endstopped cell. Stimulus (a) produces a larger response than (b).
Fig. 2. The effect of inhibitory surround in the input connections to a C-cell
The concentric inhibitory surround also has another benefit. It endows the C-cells with the characteristics of end-stopped cells, so that C-cells behave like hypercomplex cells (Fig. 2(B)). In other words, an end of a line elicits a larger response from a C-cell than a middle point of the line. Bend points and end points of lines are important features for pattern recognition. The C-cells, whose input connections have inhibitory surrounds, participate in the extraction of bend points and end points of lines while performing a blurring operation.

3.2 Interpolating Vectors
To increase the recognition rate of the neocognitron further, we use interpolating vectors to analyze the response of the S-cells of the highest stage [4].

Training S-cells of the Highest Stage. Each S-cell of the highest stage is assigned a label representing the class name of a learning pattern, and its input connections are produced by a kind of labeled competitive learning. We call the input connections of an S-cell the reference vector X of the cell.
Suppose the training of the lower stages has already finished. Every time a training pattern is presented, competition occurs among all S-cells in the highest stage. The winner of the competition is the S-cell whose reference vector has the largest similarity to the training vector (see eq. (1)). If the winner has the same label as the training pattern, the winner learns the training pattern by adding the training vector to its reference vector. If the winner has a wrong label (or if all S-cells are silent), a new S-cell (or reference vector) is generated and is assigned the label of the class name of the training pattern. This process is continued until the generation of new S-cells settles down.

Method of Interpolating Vectors. Unlike conventional neocognitrons, we do not simply search for the reference vector that has the largest similarity to the test vector. We assume a situation where virtual vectors, called interpolating vectors, are densely placed along the line segments connecting every pair of reference vectors of the same label. From these interpolating vectors, we choose the one that has the largest similarity to the test vector. Actually, we do not need to generate an infinite number of interpolating vectors. We just assume line segments connecting every pair of reference vectors of the same label. We then measure the distances (based on similarity) from the test vector to these line segments and choose the nearest line segment. The label of that line segment gives the result of pattern recognition. Mathematically, this process can be expressed as follows. Let X_i and X_j be two reference vectors of the same label. An interpolating vector ξ for this pair of reference vectors is given by a linear combination of them:

\xi = p \frac{X_i}{\|X_i\|} + q \frac{X_j}{\|X_j\|} \qquad (2)
Under various combinations of p and q, the similarity between the interpolating vector ξ and a test vector x takes the maximum value

s_{\max} = \max_{p,q} \frac{(\xi, x)}{\|\xi\| \cdot \|x\|} = \max_{p,q} \frac{p s_i + q s_j}{\sqrt{p^2 + 2pq\, s_{ij} + q^2}} = \sqrt{\frac{s_i^2 - 2 s_i s_j s_{ij} + s_j^2}{1 - s_{ij}^2}} \qquad (3)

where

s_i = \frac{(X_i, x)}{\|X_i\| \cdot \|x\|}, \qquad s_j = \frac{(X_j, x)}{\|X_j\| \cdot \|x\|}, \qquad s_{ij} = \frac{(X_i, X_j)}{\|X_i\| \cdot \|X_j\|} \qquad (4)
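Computed directly, eqs. (3)–(4) can be sketched as follows (our code; it assumes the two reference vectors are not parallel, so s_ij ≠ ±1):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s_max(Xi, Xj, x):
    """Eq. (3): similarity of x to the segment between reference vectors Xi, Xj."""
    si, sj, sij = cos_sim(Xi, x), cos_sim(Xj, x), cos_sim(Xi, Xj)
    return np.sqrt((si**2 - 2 * si * sj * sij + sj**2) / (1 - sij**2))

Xi, Xj = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# A test vector lying on the segment between Xi and Xj reaches similarity ~1.0:
print(s_max(Xi, Xj, np.array([1.0, 1.0])))
```

Note that the closed form needs only s_i, s_j and the precomputed s_ij, which is why, as described below eq. (4) in the text, the extra computational cost at recognition time is small.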
We can interpret smax as representing the similarity between the test vector x and the line segment that connects a pair of reference vectors X_i and X_j (Fig. 3(a)). If some parts of the borders between classes are concave in the multi-dimensional feature space, however, some of the line segments might cross the concave borders and invade the territory of other classes. Such line segments, an example of which is illustrated by a dotted line in Fig. 3(b), will cause misclassification of test vectors. We then eliminate all line segments that cause misclassification
(a) Largest similarity between the test vector and the line segment connecting a pair of reference vectors of the same label. (b) Elimination of a line segment that crosses the concave border between classes: the training vector of class B (×) is nearer to the line segment of class A (dotted line) than to the reference vectors of class B.
Fig. 3. Principles of the method of interpolating vectors
of training vectors, by testing how the training vectors, which have been used to generate the reference vectors, are classified. To apply interpolating vectors to the neocognitron, we calculate s_ij for every pair of S-cells (or reference vectors) of the same label in advance, and store the values. Then we can easily calculate smax using eq. (3), because s_i and s_j are the responses of S-cells whose threshold θ is zero. When we applied this method to the neocognitron for handwritten digit recognition, for example, the error rate was reduced from 1.52% to 1.02% for a blind test set of 5000 digits [4]. The increase in computational cost from the use of interpolating vectors is very small, because the number of reference vectors generated in the learning is much smaller than the number of training vectors (or training patterns).

Blur and Interpolating Vectors. The method of interpolating vectors is well suited to the blurring operation used in the neocognitron. S-cells of the highest stage of the network analyze the response of the C-cell layer of the preceding stage, which has already been spatially blurred. Deformed intermediate patterns between arbitrary two reference patterns (reference vectors) can be well emulated by an interpolating vector produced by a linear combination of the two vectors, especially when the patterns are spatially blurred. The lower half of Fig. 4(a) shows a series of interpolating vectors produced by linear combinations. The upper half of the figure illustrates examples of deformed patterns generated from the same pair of reference vectors. A deformed pattern (to be more exact, a feature vector for a deformed pattern) comes to show a large similarity to one of the interpolating vectors. If the amount of blur is too small, however, interpolating vectors produced by linear combinations do not always emulate feature vectors of deformed patterns correctly enough, as is illustrated in Fig. 4(b).
Fig. 4. (a) Feature vectors of deformed patterns are well emulated by linear combinations of the pair of reference vectors. For the sake of intuitive display, feature vectors are displayed expediently by original patterns presented to input layer U0 , instead of the actual responses of the presynaptic C-cells. (b) If the blur is too small, it becomes difficult to emulate deformed patterns by interpolating vectors.
4 Incremental Learning of the Neocognitron
Since the neocognitron is a multi-layered network, the response of a layer becomes a training stimulus for the succeeding layer. If the characteristics of lower layers change with the progress of learning, S-cells in a higher layer come to receive a different training vector, even when the same training pattern is given to the input layer. If the learning speed of the lower layers is too fast, S-cells in the higher layer might fail to follow the change in the training vectors. Some S-cells that used to respond to a training pattern might cease responding, because the same training pattern now elicits a different response from the preceding layer. These silent cells cannot be removed from the layer by the competitive learning, and they remain in the network as garbage cells. In most neocognitrons of recent versions, the ability of incremental learning is sacrificed. Once the learning has finished, no additional training patterns can be accepted. Training of the network is performed from lower stages to higher stages: after the training of a lower stage has been completely finished, the training of the succeeding stage begins. In the neocognitron, each S-cell is accompanied by an inhibitory cell, called a V-cell (Fig. 5(a)). Roughly speaking, the input signals to the S-cell through the excitatory connections correspond to (X, x) in eq. (1). The response of the V-cell corresponds to ||x||, and the inhibitory connection from the V-cell corresponds to ||X||. The response of the V-cell, which is equal to the average intensity of the responses of the presynaptic C-cells, increases even when the same training pattern is presented to the input layer, if the number of C-cells has increased during the learning of the preceding layers. This excessive inhibition suppresses the response of the S-cell and turns it into a garbage cell (Fig. 5(b)).
To make the neocognitron capable of incremental learning without losing a reasonable learning speed, we proposed the use of variable input connections to V-cells [5]. The input connections to a V-cell are modified in such a way that the signals from newly generated C-cells are summed with a smaller weight than signals from older C-cells (Fig. 5(c)).
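This idea can be sketched as follows (our illustration of the mechanism in [5]; the hard zero weight for new connections and the RMS form of the V-cell response are simplifying assumptions):

```python
import numpy as np

def v_cell_response(c_responses, is_new, new_weight=0.0):
    """Weighted ||x||: connections from newly generated C-cells get weight 0."""
    w = np.where(is_new, new_weight, 1.0)
    return np.sqrt(np.sum(w * c_responses ** 2) / np.sum(w))

old = np.array([0.5, 0.8, 0.3])                  # C-cells present at learning time
grown = np.concatenate([old, [0.9, 0.7]])        # two C-cells generated later
is_new = np.array([False, False, False, True, True])
# Blocking the new connections keeps the inhibition at its original level:
print(v_cell_response(grown, is_new))
print(v_cell_response(old, np.zeros(3, dtype=bool)))
```

Because the inhibition no longer grows as new C-cells appear, a previously trained S-cell keeps responding to its learned pattern instead of becoming a garbage cell.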
(a) The S-cell is trained so as to respond maximally to pattern x(t). (b) Conventional network: when the next learning pattern x(t+1) is presented and new C-cells have been generated, excessive inputs to the V-cell inhibit the S-cell. (c) Proposed network: excessive inputs to the V-cell are blocked (wi = 0).
Fig. 5. Behavior of an S-cell during incremental learning. Comparison between the conventional and the proposed network, which is capable of incremental learning.
5 Top-Down Signals into the Neocognitron
By adding top-down (or backward) signal paths to the neocognitron, we previously proposed a Selective Attention Model [6]. When a number of patterns are presented simultaneously, the model can focus attention selectively on one of them, segment it from the rest, and recognize it. Even if noise or defects affect the stimulus pattern, the model can recognize it and recall the complete pattern, with the noise eliminated and the defects corrected. The principles of this model have been used for several applications, such as recognition of connected characters in English words, Chinese characters, and facial images. This section introduces an extended version of the selective attention model, which can recognize and restore partly occluded patterns in a similar way to our perception.

5.1 How We Perceive Partly Occluded Patterns
We can often read or recognize a letter or word contaminated by ink stains that partly occlude the letter (Fig. 6[A](a)). If the stains are completely erased and the occluded areas are changed to white, however, we usually have difficulty in reading the letter, which now has some missing parts (Fig. 6[A](b)). We can also perceive the original shapes of occluded patterns. This perception, however, sometimes depends on the shape and location of the occluding objects. In Fig. 6[B], the black parts of the patterns are identical in shape between (a) and (b), but we feel as though different black patterns are occluded.

5.2 Recognition of Partly Occluded Patterns
The model is a multi-layered hierarchical network having forward (bottom-up) and backward (top-down) signal paths. Fig. 7 illustrates how information flows
[A] Patterns partly occluded by (a) visible and (b) invisible objects. [B] An identical pattern is perceived differently by the placement of different gray objects.
Fig. 6. Recognition of partly occluded patterns
Fig. 7. Signal flow between layers of the forward and the backward paths
in the network, where U represents a layer of cells in the forward paths, and W in the backward paths. Information is processed in the network through the interaction of forward and backward signals. The forward paths mainly take charge of recognizing partly occluded patterns, and the backward paths take charge of restoring the missing portions of the occluded patterns. The forward paths of the network are almost the same as the neocognitron. A layer of contrast-extracting cells (UG) follows the input layer U0. After UG, layers of S-cells and C-cells are arranged alternately. The first layer of S-cells, US1, consists of edge-extracting S-cells and decomposes the input image into edge components of various orientations. To this network, an additional layer UM, called the masker layer, is added [8]. The masker layer UM detects and responds only to occluding objects. There are inhibitory connections from UM to US1. When a pattern is partly occluded, a number of new features, which do not exist in the original pattern, are generated near the contour of the occluding objects. These irrelevant features largely disturb correct recognition by the neocognitron, as well as by the visual system. If the occluding objects are visible, the visual system can easily distinguish relevant from irrelevant features, and can ignore the irrelevant ones. In our network, the response of the edge-extracting cells (US1) near the occluding objects is suppressed by the inhibitory signals from layer UM, and the transmission of irrelevant features toward higher stages is blocked (Fig. 8). Since the neocognitron, like the visual system, has a large tolerance to partial absence of relevant features, it can recognize the occluded patterns correctly.

5.3 Restoration of Partly Occluded Patterns
The cells in the backward (top-down) paths are arranged in the network making a mirror image of the cells in the forward (bottom-up) paths. Most of the
Fig. 8. The main stream of information in the forward paths of the model when recognizing a partly occluded pattern
connections also make a mirror image of those in the forward paths; only the direction of signal flow through the connections is reversed in the backward paths. This architecture resembles that of the old version of the Selective Attention Model [6]. In the old model, however, the backward signals are fed only from the recognition cells at the highest stage of the forward paths. If a novel pattern is presented, no recognition cell will respond, and hence top-down signals cannot start flowing. In the new model, signals are fed to the backward paths from every stage of the forward paths (Fig. 7). Although the blur in the forward paths is very useful for robust recognition of deformed patterns, some deblurring operation is required in the backward paths to restore sharp images of occluded patterns. In the backward paths of the new model, the signals from intermediate stages of the forward paths are mixed with the backward signals from higher stages. The former signals usually convey sharper image information than the latter and help reduce the blur of the latter signals. The latter signals are used mainly in places where the former signals are not available. Backward signals thus reach layer WG, which corresponds to UG in the forward paths. The positive and negative contrast components of a restored pattern are expected to appear here. Layer US1 in the forward paths receives signals from WG, as well as from UG, and extracts oriented edges. Multiple positive feedback loops are thus generated between the forward and backward paths. Layer W0 is a virtual layer, used for monitoring how the occluded pattern is perceived by the model. The model tries to restore the original pattern into W0 from the response of WG. The restoration is done by diffusing the response of WG: on-center cells of WG send positive signals to W0, and off-center cells send negative signals.
These signals are diffused to the neighboring cells, but signals of opposite polarities work as barriers to the diffusion.
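The diffusion-with-barriers restoration just described can be sketched numerically. The following 1-D toy is only an illustration of the idea, not the model's actual implementation: the function name, the clamping rule, and the periodic boundary are all simplifying assumptions made here.

```python
import numpy as np

def restore_w0(wg_on, wg_off, n_steps=50, rate=0.25):
    """Toy 1-D sketch of the W_G -> W_0 diffusion described in the text.

    On-center responses (wg_on) inject positive signal into W_0 and
    off-center responses (wg_off) inject negative signal; diffusion then
    fills the gaps, but a neighbour's signal is blocked when its polarity
    opposes the receiving cell's current value (the 'barrier').
    Periodic boundaries and the clamping rule are assumptions of this toy.
    """
    src = np.asarray(wg_on, dtype=float) - np.asarray(wg_off, dtype=float)
    w0 = src.copy()
    for _ in range(n_steps):
        left, right = np.roll(w0, 1), np.roll(w0, -1)
        # opposite-polarity neighbours contribute nothing (the barrier)
        inflow = np.where(left * w0 >= 0, left, 0.0) \
               + np.where(right * w0 >= 0, right, 0.0)
        w0 = w0 + rate * (inflow - 2.0 * w0)   # diffusion step
        w0 = np.where(src != 0, src, w0)       # cells driven by W_G stay fixed
    return w0
```

A positive region spreads into neighbouring empty cells, while a cell sitting between a positive and a negative source receives cancelling inflows and stays silent, which is the barrier behaviour the text describes.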
Fig. 9. The responses of W0 , which are restored from partly occluded patterns presented to U0
The model was simulated on a computer. The network was initially trained to recognize alphabetical characters correctly. The test patterns before occlusion are all deformed versions of the training patterns. Fig. 9 shows some examples of perceived patterns (W0) restored from partly occluded stimuli (U0). As can be seen in Fig. 9, the restored patterns are not necessarily identical in shape to the original patterns before occlusion. This is natural, however: the model, which recognizes patterns while accepting some deformation, cannot imagine the exact shapes in the occluded areas, because it has never seen them. It is noteworthy that, despite some deformation, all essential components of the patterns have been correctly restored.
References

1. Fukushima, K.: Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Networks 1, 119–130 (1988)
2. Fukushima, K.: Neocognitron for handwritten digit recognition. Neurocomputing 51, 161–180 (2003); a computer program of this neocognitron in the C language is available from the Visiome Platform: http://platform.visiome.neuroinf.jp/
3. Fukushima, K.: Analysis of the process of visual pattern recognition by the neocognitron. Neural Networks 2, 413–420 (1989)
4. Fukushima, K.: Interpolating vectors for robust pattern recognition. Neural Networks 20, 904–916 (2007)
5. Fukushima, K.: Neocognitron capable of incremental learning. Neural Networks 17, 37–46 (2004)
6. Fukushima, K.: Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics 26, 4985–4992 (1987)
7. Fukushima, K.: Restoring partly occluded patterns: a neural network model. Neural Networks 18, 33–43 (2005)
8. Fukushima, K.: Recognition of partly occluded patterns: a neural network model. Biological Cybernetics 84, 251–259 (2001)
Engineering-Approach Accelerates Computational Understanding of V1–V2 Neural Properties

Shunji Satoh and Shiro Usui

Laboratory for Neuroinformatics, RIKEN Brain Science Institute, Japan
[email protected]
Abstract. We present two computational models: (i) long-range horizontal connections and the nonlinear effect in V1, and (ii) the filling-in process at the blind spot. Both models are obtained deductively from standard regularization theory, showing that physiological evidence of V1 and V2 neural properties is essential for efficient image processing. We stress that the engineering approach should be imported to understand visual systems computationally, even though this approach usually ignores physiological evidence and its target is neither neurons nor the brain.
1 Introduction
Neuroscience will contribute to the construction of novel algorithms and architectures for current image-processing problems, because the visual system is the ultimate system for effective image processing. To understand visual functions, the late Dr. David Marr identified three levels of analysis:

– what problems vision solves and why (computational level)
– strategies that might be used in the visual system (algorithmic level), and the I/O representation (representational level)
– how the strategies are executed in neural activity (implementational level)

Neurophysiology, computational neuroscience, and psychophysics are natural approaches for exploring those three levels. From an engineering point of view, we can regard the visual system as a large-scale machine for image processing, or as a collection of many algorithms for solving various problems. Myriad new problems and computer algorithms have been posed and proposed independently of brain science; these results have been presented in IEEE Image Processing, IEEE PAMI, the International Journal of Computer Vision, and so on.
This work was partially supported by a Grant-in-Aid for Young Scientists (#17700244) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1051–1060, 2008. c Springer-Verlag Berlin Heidelberg 2008
These engineering studies rarely consider neurophysiological evidence. However, such studies actually contribute to vision science, because their efforts are also aimed at finding problems to be solved (computation) and effective methods for solving them (algorithms). The implementation levels are different, but the computational and algorithmic levels are the same as those of vision science. For that reason, neurophysiologists and computational neuroscientists can find many clues for explaining complicated visual properties in studies that are unrelated to brain science. Meanwhile, vision scientists already have novel ideas for constructing effective image-processing algorithms, because they know some algorithms of an existing, highly efficient biological machine for image processing (the visual system). We therefore expect that neurophysiology, computational neuroscience, and psychophysics are matrices of efficient image processing. In this article, two examples of computational visual models are introduced based on standard regularization theory (SRT). SRT is a powerful tool for understanding the visual system computationally (a scientific contribution) and for developing effective image-processing methods (an engineering contribution). The first target of this article is the long-range horizontal connections in V1; the second is the filling-in process and the neural properties of the blind spot.
2 Standard Regularization Theory
A short overview of standard regularization theory (SRT) is presented. The advantages of SRT are that (i) computational models derived from SRT can be regarded as optimizers of evaluation functions, and (ii) numerous algorithms for image analysis based on the optimization process have been presented from engineering perspectives. Equation (1) is an example of an evaluation function for image reconstruction. Both models presented in this article are based on eq. (1):

$$E[I] = \underbrace{\frac{\alpha}{2}\int dx\,\bigl(I(x,t)-J(x)\bigr)^2}_{E_{\mathrm{fit}}[I]} + \underbrace{\frac{\lambda}{2}\int dx\,\|\nabla I(x,t)\|^2}_{E_{\mathrm{reg}}[I]},\qquad(1)$$

where J(x) represents the luminance of an observed image contaminated by noise at spatial position x ∈ R², and α and λ are positive constant parameters. Hereafter, we assume α = 1 and x ∈ R in this section. Image reconstruction involves the inference of an original (unknown) image I from J. The functional E[I] is designed so that it takes a smaller value when I is a desired image. The dynamics of I can be derived by applying the steepest descent method to the functional E[I]:

$$\tau\frac{\partial}{\partial t}I(x,t) = J(x) - I(x,t) + \lambda\frac{\partial^2}{\partial x^2}I(x,t),\qquad(2)$$
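The steepest-descent dynamics of eq. (2) can be sketched as an explicit iteration. The following 1-D denoiser is a minimal sketch; the step size, iteration count, and boundary handling are illustrative choices, not prescribed by the text.

```python
import numpy as np

def reconstruct(J, lam=1.0, dt=0.1, n_steps=200):
    """Explicit Euler iteration of eq. (2) with alpha = tau = 1:
    dI/dt = (J - I) + lam * d^2 I / dx^2, on a 1-D signal J."""
    I = np.asarray(J, dtype=float).copy()
    for _ in range(n_steps):
        # discrete Laplacian; one-sided differences at the two boundaries
        lap = np.roll(I, 1) + np.roll(I, -1) - 2.0 * I
        lap[0] = I[1] - I[0]
        lap[-1] = I[-2] - I[-1]
        I += dt * ((J - I) + lam * lap)
    return I
```

At the steady state the fidelity term keeps I anchored to the observation J, while the smoothness term suppresses high-frequency noise, exactly the balance that E[I] in eq. (1) encodes.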
Fig. 1. A: Neural implementation of image reconstruction with single spatial resolution. B: Image processing using multi-resolution method with different discretization steps.
where τ is a time constant. Hereafter, we assume τ = 1 for simplicity of discussion. The steady state of I(x, t) is the result of reconstruction. We assume here that V1 cells devote their operation not only to edge detection, but also to image reconstruction. The output of such a V1 neuron is designated as sθ, which is selective to θ-oriented edges. We can obtain the dynamics of sθ from eq. (2) deductively, as

$$\tau\frac{\partial}{\partial t}s_\theta(x,t) = g_\theta * \Bigl(\tau\frac{\partial}{\partial t}I(x,t)\Bigr)\qquad(3)$$
$$= g_\theta * J(x) - s_\theta(x,t) + \lambda\frac{\partial^2}{\partial x^2}s_\theta(x,t),\qquad(4)$$

where ∗ is the spatial convolution and gθ is the Gaussian derivative function representing the spatial distribution of the receptive field (RF). The positional variable x signifies the retinal position of the RF. The parameter λ > 0 is the diffusion constant of the (forward-)diffusion process of sθ. A neural network for edge detection and image reconstruction is obtained by discretizing the positional variable x of eq. (4). One example is the following:

$$\tau\frac{\partial}{\partial t}s_\theta(x,t) = g_\theta * J(x) - (1+\lambda)\,s_\theta(x,t) + \lambda\bigl(s_\theta(x-1,t)+s_\theta(x+1,t)\bigr)\qquad(5)$$
The first term of the right-hand side of eq. (5) represents a masking operation that detects θ-oriented edges in a retinal image J. Figure 1A is a schematic diagram representing the neural implementation of the second and third terms of eq. (5): an excitatory intra-cortical (lateral) effect λ and a self-inhibition (1 + λ). If the computational lateral effect signified by λ is consistent with the corresponding neural property, we can conclude scientifically that one computational role of the lateral effect is image reconstruction, and that one task of the visual system is to optimize (minimize) a pre-defined function E[I].
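One Euler step of the discretized dynamics (5) can be sketched for a 1-D row of same-orientation S-cells. The feed-forward masking term gθ ∗ J is passed in pre-computed as `drive`; the function name and the periodic boundary are assumptions made for brevity.

```python
import numpy as np

def lateral_step(s, drive, lam=0.1, dt=0.1):
    """One Euler step of eq. (5) (tau = 1): excitatory lateral input
    lam*(s[x-1] + s[x+1]) and self-inhibition (1 + lam)*s, the two
    intra-cortical terms read off from Fig. 1A."""
    neigh = np.roll(s, 1) + np.roll(s, -1)
    return s + dt * (drive - (1.0 + lam) * s + lam * neigh)
```

Iterating this step to the steady state performs the reconstruction: for a spatially uniform drive d on a periodic row, the fixed point is s = d/(1 − λ).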
Fig. 2. A and B: The average number of boutons of long-range horizontal connections, and the nonlinearity of lateral effects (adapted from [1,3]). C and D: Simulation results.
3 Long-Range Horizontal Connections

3.1 Neural Evidence
Next, we consider the intra-cortical (horizontal) effect between V1 neurons, conveyed through long-range horizontal connections (LHCs). As shown in Fig. 2A, the LHCs in V1 tend to link orientation-selective cells with similar orientation preferences [1,2]. This preference is described as iso-orientation connectivity. The horizontal effect is quite complicated, as follows. As shown in Fig. 2B, changing the stimulus intensity to the pre-synaptic cell alters the balance of the evoked post-synaptic excitation and inhibition without changing the iso-orientation connectivity of the horizontal connections [3]. Figure 2B shows polarity changes in the excitatory and inhibitory post-synaptic currents (PSCs) of a V1 (layer II/III) cell as a function of the stimulus intensity provided to pre-synaptic cells. The recorded cell showed excitation when a low-amplitude stimulus was provided to pre-synaptic cells, but an inhibitory effect was dominant with a higher-amplitude stimulus. We intend to answer the following questions:

☞ What is the computational theory of LHCs?
☞ What is the advantage of the polarity change for image processing?
3.2 Computational Theory and Algorithm
We investigate the computational necessity of the polarity change of horizontal effects from an engineering point of view. As shown in Fig. 1A, the horizontal effect is formulated as λsθ(x ± 1). The polarity change implies that λ, which is a diffusion coefficient, is not constant but can take a positive or a negative value depending on the value of sθ(x ± 1). In other words, the diffusion process of sθ is forward diffusion when λ > 0 and backward diffusion when λ < 0. We can find a candidate for the computational theory and the algorithm explaining this nonlinearity with a search engine, e.g., IEEE Xplore (not PubMed!), using the keywords “forward backward diffusion image processing.” We have found two papers which suggest that a context-dependent (J-dependent) diffusion coefficient is very efficient for image reconstruction [4,5]. Therefore, we expect that the computational theory of LHCs is image reconstruction, and that the algorithm comprises forward- and backward-diffusion processes. Based on the Perona–Malik functional (evaluation function), the following functional is a strong candidate for the evaluation function for understanding the computational theory of LHCs from an engineering perspective.
$$E_H[I] = E_{\mathrm{fit}}[I] + \frac{1}{2}\int dx\,d\rho\;\Lambda\!\Bigl(\frac{\partial}{\partial\rho}J(x)\Bigr)\Bigl(\frac{\partial}{\partial\rho}I(x,t)\Bigr)^2\qquad(6)$$

In the equation, ∂/∂ρ is the directional derivative in the direction of ρ. The function Λ(a) = (λ/k)/(1 + (a/k)²) is referred to as an edge function [5]. Moreover, the multi-resolution method is introduced into eq. (6) to represent the “long-range” property of LHCs. As shown in Fig. 1B, “long-range” implies multiple spatial resolutions of spatial discretization. The multi-resolution method is realized using Gaussian blur for J and I. We therefore obtain

$$E_H[I] = E_{\mathrm{fit}}[I] + \frac{1}{2}\int dx\,d\rho\,d\sigma\;\Lambda\!\Bigl(\frac{\partial}{\partial\rho}J_\sigma(x)\Bigr)\Bigl(\frac{\partial}{\partial\rho}I_\sigma(x,t)\Bigr)^2,\qquad(7)$$

where J_σ(x) = σ{g_σ ∗ J(x)}, I_σ(x,t) = σ{g_σ ∗ I(x,t)}, and g_σ is the Gaussian function with variance σ². Assuming J ≈ I, we obtain the dynamics of orientation-selective neurons, which devote their operation not only to edge detection but also to image reconstruction using the multi-resolution method:

$$\frac{\partial}{\partial t}s_\theta(x,t) = g_\theta * J(x) - s_\theta(x,t) + \int d\rho\int dx'\;W_{\theta\rho}(x')\,\psi\bigl(s_\rho(x-x',t)\bigr),\qquad(8)$$

$$W_{\theta\rho}(x') = \int d\sigma\,\sigma^2\,\frac{\partial^2}{\partial\rho\,\partial\theta}\,g_\sigma(x'),\qquad \psi(s_\rho) = \lambda\,\frac{(s_\rho/k) - (s_\rho/k)^3}{\bigl(1+(s_\rho/k)^2\bigr)^2}.\qquad(9)$$

Equation (8) means that sθ(x,t) is affected by other V1 neurons sρ(x−x′,t) through long-range connections signified by W_{θρ}(x′); the lateral effect is the nonlinear function ψ(sρ).
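The polarity change encoded in ψ of eq. (9) can be checked directly. With the values λ = 0.1 and k = 0.07 used in the simulations of Section 3.3, ψ is excitatory (forward diffusion) for weak responses and flips to inhibitory (backward diffusion) once the response exceeds k:

```python
def psi(s, lam=0.1, k=0.07):
    """Nonlinear lateral effect of eq. (9):
    psi(s) = lam * ((s/k) - (s/k)**3) / (1 + (s/k)**2)**2."""
    r = s / k
    return lam * (r - r**3) / (1.0 + r**2) ** 2
```

This sign flip at s = k mirrors the excitation-to-inhibition switch measured physiologically in Fig. 2B and plotted computationally in Fig. 2D.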
Fig. 3. A: Input pattern. B: Initial output of orientation selective neurons. C: Steady state.
3.3 Simulation
We find that the computational W and ψ are consistent with the physiological evidence shown in Fig. 2A and Fig. 2B. Figure 2C presents the plot of the connection strength Wθρ as a function of the orientation difference θ − ρ.¹ Figure 2D shows the physiological nonlinearity of the horizontal effect (total PSC = EPSC − IPSC) and the computational nonlinearity ψ(sη) for η = θ. Figure 3 shows simulation results for eq. (8), demonstrating the efficiency of the computational model. Figure 3A is an input (retinal) image J contaminated by noise. Its size is 128 × 128 pixels, and its luminance is normalized so that 0 ≤ J ≤ 1. The parameters are λ = 0.1 and k = 0.07, with ρ, θ ∈ {0, π/4, π/2, 3π/4} (four orientations). The number of simulated V1 neurons is 4 × 128 × 128. Figure 3B depicts the initial outputs of sθ(x,t). A short line segment at each position x represents the orientation of the maximally responding neuron among the four. Figure 3C depicts the steady state of sθ(x,t). As time progresses, noise is suppressed; continuous curves are enhanced and integrated because the model neurons perform image reconstruction. The contribution of this section is summarized as follows:

Scientific contribution. We can understand the polarity change and the LHCs as a neural implementation of the forward-backward diffusion method and the multi-resolution method for image reconstruction.

Engineering contribution. We propose a new algorithm for edge detection using multi-resolution and forward-backward diffusion.
4 Filling-In at the Blind Spot

4.1 Neural Evidence
We next specifically examine the filling-in process at the blind spot (BS). The BS is the area of the visual field that corresponds to the lack of photoreceptors

¹ W_{θρ}(x′) is decomposed into two Gaussian functions, G^{exc}_{θρ}(x′) − G^{inh}_{θρ}(x′), also known as the DoG function. Figure 2C is a plot of ∫dx′ {G^{exc}_{θρ}(x′) + G^{inh}_{θρ}(x′)}.
Fig. 4. A: Schematic examples of bar stimuli. B: Retinal inputs. C: Responses of the recorded neuron. (adapted from [7]). D: Simulation result.
because of the optic disk. We do not see any black disks or strange patterns in our visual fields; rather, we perceive the colours or patterns surrounding the BS. This phenomenon is referred to as perceptual filling-in at the BS. Komatsu and colleagues reported a quantitative analysis of V1 neural responses of awake macaque monkeys to bar stimuli presented on the blind spot [6,7]. The receptive fields (RFs) of the recorded neurons overlapped with the BS area (Fig. 4A). Bar stimuli of various lengths were presented at the BS; one end of the bar stimulus was fixed, while the other end was varied (Figs. 4(a1)–(a4)). Some V1 neurons showed a marked increase in their activities when the bar end exceeded the BS (Fig. 4C), whereas those activities remained constant as long as the end was inside the BS area. These results indicate that V1 neurons perform filling-in processes as well as orientation detection. Moreover, Matsumoto and Komatsu conjectured the existence of two different pathways with different conduction velocities of visual signals: (i) fast feedforward and fast feedback connections, such as V1→V2→V1; and (ii) slow intra-cortical connections linking V1 neurons, such as V1→V1. We then discuss the computational necessities:

☞ Why are V2 neurons required for V1 filling-in?
☞ Why are the different velocities necessary for filling-in?

4.2 Computational Theory and Algorithm
As shown in Fig. 5A, the BS area and its boundary are referred to as B and ∂B, respectively. One purpose is to obtain a filled-in pattern like that depicted in Fig. 5B from the initial condition of Fig. 5A. We consider an appropriate evaluation function for the filling-in process to obtain the desired filled-in pattern. Considering the lack of input in the BS,
Fig. 5. A: Filling-in process is applied in BS area B (a gray rectangle). B: Example of a desired filling-in pattern. C: Filling-in using the diffusion equation. D: Filling-in using the proposed visual model.
it is apparent that the E_BS defined in eq. (10) is a simple evaluation function based on eq. (1):

$$E_{BS}[I] = E_{\mathrm{reg}}[I] = \frac{\lambda}{2}\int_B dx\,\|\nabla I(x,t)\|^2.\qquad(10)$$

Unfortunately, the filled-in pattern minimizing this E_BS is not a completed bar but fragmented bars, as shown in Fig. 5C. To obtain a completed bar as the result of filling-in, eq. (10) is modified by introducing physiological evidence: V2 neurons are required for V1 filling-in. Some V2 neurons are selective to angles embedded within V-shaped patterns composed of oriented line segments [8]. We therefore introduce two kinds of angular information of oriented lines, κ̄ (curvature of level-sets) and μ̄ (curvature of flow-lines), into eq. (10):

$$E_{BS}[I] = \frac{1}{2}\int_B dx\,\bigl(\bar\kappa^2(x,t) + \bar\mu^2(x,t)\bigr)\,\|\nabla I(x,t)\|^2,\qquad(11)$$

where

$$\bar\kappa = \frac{I_y^2 I_{xx} - 2I_x I_y I_{xy} + I_x^2 I_{yy}}{I_x^2 + I_y^2},\qquad \bar\mu = \frac{(I_x^2 - I_y^2)\,I_{xy} - I_x I_y\,(I_{yy} - I_{xx})}{I_x^2 + I_y^2}.\qquad(12)$$
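The two curvatures of eq. (12) can be evaluated on a discrete image with finite differences. The sketch below uses central differences; the small epsilon in the denominator is an added assumption to avoid division by zero in flat regions.

```python
import numpy as np

def curvatures(I):
    """kappa-bar (level-set curvature) and mu-bar (flow-line curvature)
    of eq. (12), with axis 0 taken as y and axis 1 as x."""
    Iy, Ix = np.gradient(I)
    Iyy, _ = np.gradient(Iy)
    Ixy, Ixx = np.gradient(Ix)
    denom = Ix**2 + Iy**2 + 1e-12
    kappa = (Iy**2 * Ixx - 2.0 * Ix * Iy * Ixy + Ix**2 * Iyy) / denom
    mu = ((Ix**2 - Iy**2) * Ixy - Ix * Iy * (Iyy - Ixx)) / denom
    return kappa, mu
```

A sanity check matching the text's interpretation: a luminance ramp has perfectly straight level-sets and flow-lines, so both curvatures vanish everywhere.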
The curvatures of the equi-luminance contours (level-sets) of an image I are evaluated using κ̄. The curvatures of the flow lines, which are perpendicular to the level-sets, are evaluated using μ̄. Smaller values of κ̄² and μ̄² represent more continuous contours of a filled-in pattern. We expect that the dynamics for filling-in can be obtained deductively by applying the steepest descent method to eq. (11). The following equation shows the resultant complex dynamics:

$$\frac{\partial}{\partial t}I = \bigl(I_y^6 I_{yyyy} I_x^2 + 3I_y^4 I_{yyyy} I_x^4 + 3I_y^2 I_{yyyy} I_x^6 + \cdots\ (74\ \text{terms})\ \cdots + 3I_y^6 I_x^2 I_{xxxx} + 3I_y^4 I_x^4 I_{xxxx} + I_y^2 I_x^6 I_{xxxx}\bigr)\big/\bigl(I_x^2 + I_y^2\bigr)^{3/2}\qquad(13)$$
Fig. 6. A: Almost half of visual information is missing. B: Restored image using the proposed visual model.
The problematic complex equation, eq. (13), includes 80 terms; it seems impossible to analyze all 80 terms to investigate their physiological meanings. We again use a physiological phenomenon as a key to solving the problem. The key is the different pathways with different conduction velocities: fast pathways via V2, and slow pathways within V1 (horizontal connections). The faster conduction velocity of the visual pathway via V2 implies faster optimization of κ̄(x,t) and μ̄(x,t) in time than that of ∇I(x,t). In an extreme situation of this velocity difference, we can assume a constant value of ∇I(x,t) with respect to time t. This assumption corresponds to variable separation and adiabatic approximation. Applying the steepest descent method to eq. (11) under this assumption, we obtain the following dynamics:

$$\frac{\partial}{\partial t}I(x,t) = \frac{\partial}{\partial\eta}\tilde\kappa + \frac{\partial}{\partial\xi}\tilde\mu - \Bigl(\frac{\partial^2}{\partial\xi^2} + \frac{\partial^2}{\partial\eta^2}\Bigr)\tilde\kappa + \beta I_{\eta\eta},\qquad(14)$$

where κ̃ = I_ξ²κ̄, μ̃ = I_ξ²μ̄, and β is a positive constant. The direction of ξ is parallel to ∇I, and η is perpendicular to ξ. The third term of eq. (14) is a consistency term, which alleviates the drawbacks of the variable separation and adiabatic approximation. Compared with eq. (13), eq. (14) is a simple, analyzable, neurally implementable equation.

4.3 Simulation and Consideration
First, numerical simulations of eq. (14) are performed to investigate whether the expected filling-in pattern is obtained using Fig. 5A as the initial value of I. The parameter is β = 0.1. Figure 5D is the steady state of I (the filled-in pattern); as expected, the broken bar of Fig. 5A is completed. We then evaluate the efficiency of our visual model in a more difficult situation of recovering a damaged image. The results are portrayed in Fig. 6, in which the region B is a checkered orange area (color results may be available in
the electronic version of this article). We find that our visual model has a high ability to recover missing information. Finally, we simulate the physiological experiments shown in Fig. 4. The dynamics of V1 neurons in the BS region are obtained by substituting eq. (14) into eq. (3). Figure 4D illustrates the steady values of sθ for various lengths of input bars. We find consistency between the physiological results and our model. The contribution of this section is summarized as follows:

Scientific contribution. V2-coded information (angles embedded within V-shaped patterns) is necessary to represent the continuity of the desired filled-in patterns at the BS region. Moreover, the different conduction velocities are useful for obtaining simple and effective dynamics for filling-in.

Engineering contribution. Novel dynamics for image inpainting (recovering missing information) can be proposed using computational vision science.
5 Summary
We have shown that many areas of study, which are apparently unrelated to brain science, are important for understanding visual systems computationally. We hope that physiologists and computational vision scientists will export their knowledge to construct novel algorithms for effective image processing, and that they will import an engineering-oriented approach to understand neural evidence as neural implementations of image processing.
References

1. Bosking, W.H., Zhang, Y., Schofield, B., Fitzpatrick, D.: Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neuroscience 17, 2112–2127 (1997)
2. Rockland, K.S., Lund, J.S.: Widespread periodic intrinsic connections in the tree shrew visual cortex. Science 215, 1532–1534 (1982)
3. Weliky, M., Kandler, K., Fitzpatrick, D., Katz, L.C.: Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron 15, 541–552 (1995)
4. Gilboa, G., Sochen, N., Zeevi, Y.Y.: Forward-and-backward diffusion processes for adaptive image enhancement and denoising. IEEE Transactions on Image Processing 11, 689–703 (2002)
5. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. PAMI 12, 629–639 (1990)
6. Komatsu, H., Kinoshita, M., Murakami, I.: Neural responses in the retinotopic representation of the blind spot in the macaque V1 to stimuli for perceptual filling-in. J. Neuroscience 20, 9310–9319 (2000)
7. Matsumoto, M., Komatsu, H.: Neural responses in the macaque V1 to bar stimuli with various lengths presented on the blind spot. J. Neurophysiology 93, 2374–2387 (2005)
8. Ito, M., Komatsu, H.: Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. J. Neuroscience 24, 3313–3324 (2004)
Recent Studies Around the Neocognitron

Hayaru Shouno

Yamaguchi University, Tokiwadai 2-16-1, Ube, Japan
[email protected]
Abstract. The Neocognitron, which was proposed by Fukushima, has recently been studied from several perspectives. In this paper, we introduce these studies from both the engineering and biological sides. From the engineering side, we discuss the pattern-classification ability of the Neocognitron and its relationship to the “convolutional net”, which has recently been well studied in the field of pattern recognition. From the biological side, we try to explain a recent biological experimental result with the Neocognitron and compare it with another model.
1 Introduction
The Neocognitron is an artificial neural network (ANN) model inspired by the biological architecture of the visual system in the brain. The architecture of the Neocognitron is a multi-layered feed-forward network. Fukushima proposed the Neocognitron in 1980 as a visual recognition system [1], and he and his colleagues mainly applied this model to hand-written character recognition [2][3][4][5][6]. Nowadays, several Neocognitron-based studies have been carried out. In this paper, we review the Neocognitron architecture from both the engineering and biological viewpoints. In Section 2, we summarize the architecture and learning method of the Neocognitron. In Section 3, we discuss the relationship with the “convolutional net” proposed by LeCun et al. [7][8][9]. In Section 4, we discuss the properties of the Neocognitron from the biological viewpoint in light of several experiments by Logothetis et al. [10][11]. In Section 5, we summarize and discuss several topics.
2 Summary of the Neocognitron

2.1 Network Architecture
The Neocognitron mainly consists of two types of cells. One is called the ‘S-cell’, which is used as a feature extractor. The other is called the ‘C-cell’, which is used for the reduction of local pattern distortion. In the Neocognitron, each type of cell is arranged in a 2-dimensional array called a ‘cell-plane’, and several cell-planes compose a ‘cell-layer’. The left of Fig. 1 shows the structure of the Neocognitron. The rectangles in each cell-layer represent cell-planes, and the S-cell layers and the C-cell layers are

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1061–1070, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Left: Structure of the Neocognitron model with 4 stages. S-cell layers and C-cell layers alternate. The rectangles in each layer show cell-planes, and the cones show the connection areas. Right: Example of the response of the C-cell layers.
arranged alternately. Usually, sub-sampling is done in the C-cell planes, so the size of the cell-planes shrinks in higher-order layers. The cone in the cell-plane shows the connection area, indicating that the connections of all cells in the Neocognitron are restricted to local areas. The right figure in Fig. 1 shows an example of the response of the C-cell layers to a pattern ‘5’. A dark point in a cell-plane shows a strong cell response. We denote the response of the S-cell in the k-th plane of the l-th layer as u_{Sl}(n,k), and that of the C-cell in the κ-th plane of the l-th layer as u_{Cl}(n,κ), where n indicates the position in the cell-plane. The S-cell response is described by the following convolutions:

$$u_{Sl}(n,k) = r_l\,\varphi\!\left[\frac{1 + \sum_{\kappa,\nu} a_l(\nu,\kappa,k)\,u_{Cl-1}(n+\nu,\kappa)}{1 + \frac{r_l}{1+r_l}\,b_l(k)\,u_{Vl}(n)} - 1\right],\qquad(1)$$

$$u_{Vl}(n) = \sqrt{\sum_{\kappa,\nu} c_l(\nu)\,u_{Cl-1}(n+\nu,\kappa)^2}.\qquad(2)$$
The C-cell response is also described as follows:

$$u_{Cl}(n,k) = \psi\!\left[\sum_{\nu} d_l(\nu)\,u_{Sl}(n+\nu,k)\right].\qquad(3)$$
In these equations, the connections a_l and b_l are learnable, while c_l and d_l are fixed (Gaussian-like kernels). The output functions are respectively denoted as φ[x] = max(x, 0) and ψ[x] = φ[x]/(1 + φ[x]). These equations show that the responses of all cells in the same cell-plane are obtained by an identical calculation except for the position n, so that the connections a_l, b_l, c_l, and d_l can be shared with the other cells in the same cell-plane. This architecture is called ‘weight-sharing’.
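A per-position sketch of eqs. (1)–(3) follows, treating the connection area as a flat vector. The array layout is an assumption of this sketch, as is writing u_Vl as a square-rooted sum (motivated by its interpretation as a norm of the input).

```python
import numpy as np

def s_cell(u_prev, a, b, c, r):
    """S-cell response of eqs. (1)-(2). u_prev: C-cell inputs in the
    connection area (flattened), a: learnable excitatory weights,
    b: learnable inhibitory weight, c: fixed V-cell weights,
    r: selectivity parameter."""
    phi = lambda x: max(x, 0.0)
    u_v = np.sqrt(np.sum(c * u_prev**2))          # eq. (2), as a norm
    num = 1.0 + np.dot(a, u_prev)
    den = 1.0 + (r / (1.0 + r)) * b * u_v
    return r * phi(num / den - 1.0)               # eq. (1)

def c_cell(u_s, d):
    """C-cell response of eq. (3): blurred, saturating pooling of
    S-cell outputs u_s with the fixed kernel d."""
    phi = lambda x: max(x, 0.0)
    x = np.dot(d, u_s)
    return phi(x) / (1.0 + phi(x))                # psi[x]
```

Applied over every position n of a cell-plane, the same `a`, `b`, `c`, and `d` are reused everywhere, which is exactly the weight-sharing property.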
The response of the S-cells, described by eq. (1), can be transformed to

$$u_{Sl}(n,k) = \gamma_l\,\varphi\!\left[\frac{\sum_{\kappa,\nu} a_l(\nu,\kappa,k)\,u_{Cl-1}(n+\nu,\kappa)}{b_l(k)\,u_{Vl}(n)} - \frac{r_l}{1+r_l}\right],\qquad(4)$$

where γ_l = r_l b_l(k) u_{Vl}(n) / (1 + (r_l/(1+r_l)) b_l(k) u_{Vl}(n)). When we consider the input components u_{Cl-1}(n+ν,κ) as the input vector x(n), u_{Vl}(n) can be regarded as the norm of the input, that is, ‖x(n)‖. In the same way, when we regard the learnable connection a_l(ν,κ,k) as the preferred pattern, called the “reference vector” or template X(k) of the k-th cell-plane, the connection b_l(k) can be approximated as ‖X(k)‖ by use of the following learning method. Hence, each S-cell extracts the feature in the sense of the direction cosine X·x/(‖X‖‖x‖). The response of the C-cell, described by eq. (3), is a blurring operation by the convolution kernel d_l(ν). During this operation, the sub-sampling that shrinks the cell-plane is usually carried out to reduce the calculation cost.

2.2 Learning Method
The learning process of the Neocognitron proceeds from the lower to the higher layers' connections. The learnable connections al and bl can be reinforced by both supervised and unsupervised learning. The unsupervised learning method of the Neocognitron is a kind of competitive learning [4][5][6]. In the competition process, the position of the strongest response in the previous layer is chosen as the winner. When the winner position (n∗, k∗) is determined, each connection is updated as follows:

$$a_l(\nu,\kappa,k^*) = a_l(\nu,\kappa,k^*) + q \cdot u_{Cl-1}(n^*+\nu,\kappa), \qquad (5)$$
$$b_l(k^*) = b_l(k^*) + q \cdot u_{Vl}(n^*), \qquad (6)$$

where q is the learning strength. The update equations show that similar features are superimposed on the weights al; thus the connection al can be regarded as a local feature template.
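The winner-take-all update above can be sketched in a few lines. The following is an illustrative toy implementation only; the array shapes, the simplified S-cell response, and all parameter values are our own assumptions, not the original code.

```python
import numpy as np

# Toy sketch of the Neocognitron's competitive update (Eqs. (5)-(6)).
# Shapes and parameters are illustrative assumptions.

rng = np.random.default_rng(0)

K, H, W = 4, 8, 8          # number of cell-planes, plane height/width
R = 3                      # receptive-field radius; window is (2R+1) x (2R+1)

u_C = rng.random((H + 2 * R, W + 2 * R))   # padded input layer (single channel)
a = np.zeros((K, 2 * R + 1, 2 * R + 1))    # learnable excitatory connections a_l
b = np.zeros(K)                            # learnable inhibitory connection b_l
q = 0.5                                    # learning strength

def s_response(a_k, b_k, u_C, n):
    """Simplified S-cell response: inner product with the local patch,
    divided by an inhibitory term; the patch norm plays the role of u_Vl(n)."""
    y, x = n
    patch = u_C[y:y + 2 * R + 1, x:x + 2 * R + 1]
    norm = np.linalg.norm(patch) + 1e-9
    return max((a_k * patch).sum() / (1.0 + b_k * norm), 0.0)

# Competition: the strongest response over all positions and planes wins.
responses = np.array([[[s_response(a[k], b[k], u_C, (y, x))
                        for x in range(W)] for y in range(H)] for k in range(K)])
k_star, y_star, x_star = np.unravel_index(responses.argmax(), responses.shape)

# Update: overlap the winning input patch onto a, and its norm onto b.
patch = u_C[y_star:y_star + 2 * R + 1, x_star:x_star + 2 * R + 1]
a[k_star] += q * patch                      # Eq. (5)
b[k_star] += q * np.linalg.norm(patch)      # Eq. (6)
```

Repeating this update over many inputs accumulates similar local features into a[k], which is how the cell-plane acquires its feature template.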
3 Engineering Viewpoint Studies of the Neocognitron
From the engineering viewpoint, the Neocognitron can be regarded as a pattern classifier and feature extractor, and several important modifications have been developed. Wake et al. proposed an unsupervised learning algorithm [4] that can create and assign a new cell-plane for a novel feature during learning, so that the number of cell-planes need not be fixed before learning. Tanigawa et al. proposed using different thresholds rl in the learning phase and in the recognition phase [5]: a high threshold in the learning phase creates many cell-planes, and a low threshold in the recognition phase improves robustness to pattern distortion. Lovell et al. also proposed an improved learning method that optimizes the S-cell selectivity [12].
1064
H. Shouno

[Fig. 2 layer sizes: Input 32x32 → C1: 28x28x6 (convolution) → S2: 14x14x6 (sub-sampling) → C3: 10x10x16 (convolution) → S4: 5x5x16 (sub-sampling) → C5: 1x1x120 → F6: 1x1x84 → Output: 1x1x10 (full connection)]
Fig. 2. LeNet-5 consists of "convolution layers" (abbreviated "C") and "sub-sampling layers" (abbreviated "S"). The network architecture is very similar to that of the Neocognitron. This figure is modified from [8].
We evaluated the ability of the improved Neocognitron on the ETL-1 database, which consists of handwritten digit patterns [6]. In this work, we used a hybrid model consisting of the Neocognitron, which plays the role of a feature extractor, and a perceptron, which classifies the extracted features. As a result, we obtained a 1.9% error rate on 1,000 blind test patterns with the Neocognitron trained on 2,000 training patterns. Satoh et al. also improved the recognition ability of the Neocognitron by applying it to rotated patterns [13]. Moreover, Fukushima recently introduced an interpolating method to measure the similarity in the S-cell calculation [14]. In the interpolating method, a distance measure is introduced between the input vector x and the vector space spanned by two reference vectors of the same category, described as X_i and X_j. This concept is similar to the Adaptive Subspace Self-Organizing Map (ASSOM) [15] and the tangent propagation method [16][17]. By use of the interpolating method, the error rate is reduced to 1.02% on a blind test set of 5,000 digit patterns.

3.1 Comparison with the "Convolutional Net"
LeCun and his colleagues have also investigated several ANNs inspired by the Neocognitron, which they call "convolutional nets". One of the most familiar of these models is LeNet-5 [8]. The structures of the lower layers of the Neocognitron and LeNet-5 are very similar (see Fig. 2). The convolutional layer, which corresponds to the S-cell layer in the Neocognitron, extracts features, and the sub-sampling layer, which corresponds to the C-cell layer, reduces the effect of local pattern distortions. On the contrary, the higher layers, described as 'C5', 'F6', and 'Output' in Fig. 2, have a different structure from that of the Neocognitron. The cells in these layers have not local but full connections. In other words, LeNet-5 is a kind of hybrid of the Neocognitron and a fully connected ANN. We consider that the most significant difference between the Neocognitron and the convolutional net lies in the learning method. The convolutional net adopts an energy-based optimization method for learning, such as
[Fig. 3 area labels: V1, V2, V3, V3A, VP, V4, VA/V4, MT, MST, LIP, VIP, DPL, TEO, TE, PIT, CIT, AIT, TF, 7a — ventral pathway]
Fig. 3. The left figure shows a schematic diagram of the ventral pathway in the brain, which corresponds to object recognition. The right figure shows a block diagram of the connections around the visual pathway, obtained from physiological knowledge.
the back-propagation [18]. In the convolutional net, the structure of the network is fixed, that is, the number of fan-in and fan-out connections for each plane is restricted, and the inter-layer weights are the only learnable parameters for back-propagation. They evaluated the ability of digit recognition using the NIST database, and obtained a 0.8% error rate. In recent years, Huang et al. have applied the network to object recognition as well as character recognition [9]. They proposed a hybrid network consisting of the convolutional net and a support vector machine (SVM), and showed good performance.
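To make the two layer types shared by the Neocognitron and the convolutional net concrete, here is a minimal, self-contained sketch of a convolution step followed by 2×2 sub-sampling. The kernel, the input image, and all sizes are illustrative assumptions, not LeNet-5's actual parameters.

```python
import numpy as np

# Sketch: feature-extracting convolution (S-/convolution-layer analogue)
# followed by 2x2 average sub-sampling (C-/sub-sampling-layer analogue).

def convolve2d_valid(img, kernel):
    """Plain 'valid' 2-D correlation, written out for self-containment."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (img[y:y + kh, x:x + kw] * kernel).sum()
    return out

def subsample(fmap, s=2):
    """Average-pool by a factor s, shrinking each spatial dimension."""
    H, W = fmap.shape
    H, W = H - H % s, W - W % s
    return fmap[:H, :W].reshape(H // s, s, W // s, s).mean(axis=(1, 3))

img = np.zeros((32, 32))
img[10:22, 15] = 1.0                             # a vertical line segment

vertical_edge = np.array([[-1., 2., -1.]] * 3)   # responds to vertical lines
fmap = np.maximum(convolve2d_valid(img, vertical_edge), 0.0)  # rectified
pooled = subsample(fmap)                         # tolerance to small shifts

print(fmap.shape, pooled.shape)                  # (30, 30) -> (15, 15)
```

The pooled map responds to the line almost equally if the input is shifted by a pixel, which is exactly the distortion tolerance that both architectures obtain from this layer pairing.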
4 Biological Viewpoint Studies of the Neocognitron
The Neocognitron is inspired by several biological findings. Recently, our group has tried to re-interpret the Neocognitron as a brain model. The ventral pathway in the brain corresponds to object recognition, since cells at the output of the ventral pathway, called the IT (inferior temporal) area, respond to complex features such as human faces. Fig. 3 shows a schematic model of the ventral pathway in the brain. The left figure shows the position of the ventral pathway, and the right one shows a connection diagram obtained from biological knowledge. The ventral pathway is considered a signal-processing path V1 → V2 → V4 → IT. Hubel & Wiesel investigated the cells in area V1, and reported that there exist two types of cells [19]. One is called the "simple cell" and the other the "complex cell". Both types of cells have a local responsive field, called the "receptive field", and also have a preferred feature to which they respond, such as line segments and edges of a specific orientation. The difference between the simple cell and the complex cell lies in the response to a translated preferred feature. The simple cell responds to the preferred feature only at a specific position, but the complex cell responds to the preferred feature at any position in the
Fig. 4. Wire-like object input for the Neocognitron. This figure is modified from [20].
receptive field. Thus, Hubel & Wiesel proposed a model of the complex cell, which has cascade connections to simple cells that share a similar preferred feature. This structure is the origin of the Neocognitron model.

4.1 The Property of Object Recognition Cells
The cells in the IT area respond to specific and complex features in the view. In the 1990s, several experiments on cells in the IT area of monkeys were carried out, and interesting characteristics were revealed. Logothetis & Sheinberg investigated the response properties of IT cells in a differentiation task with folded wire-like image stimuli [10]. The responses of IT cells tuned to a target stimulus weaken gradually under parametric stimulus changes, i.e., magnification and 3-dimensional rotations. The tuning curve over the parameter is bell-shaped, which means that IT cells code a preferred view of the stimulus with tolerance for stimulus distortion.

4.2 The Property of the Neocognitron Output Cells
Recently, Yoshizuka et al. tried to explain the IT cell properties reported by Logothetis & Sheinberg with the Neocognitron model [20]. We trained the Neocognitron with 3-dimensional wire objects created by computer graphics, as in Fig. 4. In Fig. 4, the left figure shows the target object, and the right figure shows the distractor objects. Our Neocognitron was trained on all the 3-dimensional objects, including distorted (rotated and magnified) views of the target object and the distractors. After training, we chose an output cell tuned to the target input and investigated its responses to distorted targets and distractors. Fig. 5 shows several tuning curves, that is, normalized responses of recognition cells in the Neocognitron. Fig. 5(a) shows the tuning curve for magnification of the target object. The most preferred image, indicated by a magnification rate of 1.0, is one of the trained views. The horizontal axis shows the magnification rate, and the vertical axis shows the normalized response of a tuned
Fig. 5. Tuning curves of the Neocognitron output cells for the wire-like object. Each vertical axis shows the normalized response of recognition cells. (a) shows the tuning curve for magnification; the horizontal axis shows the magnification rate. (b) shows the tuning curves of several cells for the rotated target; the horizontal axis shows the rotation degree. (c) shows the responses to several translations. (d) shows the responses to scrambled images of the target object; the horizontal axis shows the number of scrambled tiles, and the dashed line shows the response of the Riesenhuber model. Modified from [21].
cell. The cell shows tolerance up to a certain level of magnification. Fig. 5(b) shows the result for rotation; the horizontal axis shows the rotation degree. Training the Neocognitron with 3-dimensional objects creates several view-tuned cells. Almost all the tuning curves indicate tolerance to distortion of the preferred view. The horizontal line in Fig. 5(b) shows the maximum response to the distractor views, so the cell in the Neocognitron can differentiate the preferred view from the others. Fig. 5(c) shows the result for parallel translations; each bar shows the response for a parallel movement of the target object. Fig. 5(d) shows the responses to "scrambled" images of the target; the horizontal axis shows the number of pieces to
Fig. 6. The view-tuned cell model proposed by Riesenhuber et al. This model has a hierarchical structure consisting of several simple-cell and complex-cell layers. This figure is modified from [21].
scramble. These properties coincide with the experiments by Logothetis & Sheinberg. Thus, we conclude that the Neocognitron explains the experiments well.

4.3 Other IT Cell Models
Bricolo et al. proposed a model of an IT cell with an RBF (radial basis function) network [22]. Riesenhuber & Poggio also proposed an IT cell model modifying Bricolo's model [21]. Fig. 6 shows the Riesenhuber model [21]. Their model is also a layered feed-forward network. The alternating structure of simple-cell and complex-cell layers is based on the model of Hubel & Wiesel. As a result, the Riesenhuber model has a network structure similar to that of the Neocognitron. They claimed that the "MAX" operation from the S-type cells to the C-type cells is important, and that the MAX mechanism causes good robustness in object recognition. In the Neocognitron, the output modulation function ψ[·] in Eq. (3), which increases rapidly and saturates gradually, provides a similar mechanism. The major difference between the Riesenhuber model and the Neocognitron is how the receptive field size changes across layers. In the Riesenhuber model, the receptive field size diverges rapidly at the composite-feature cell layer S2. On the contrary, the receptive field size in the Neocognitron model we used diverges gradually from the low to the high layers. The divergence of the receptive field in the visual system of the brain is also considered to be gradual [23]. We consider that this difference determines whether recognition fails for scrambled images of a target object. Compared with the Riesenhuber model, the response of the cells in the highest layer of the Neocognitron drops
precipitously (see Fig. 5(d)). Even for the 4-piece scrambled image, the average response of the cell in the Neocognitron is at the distractor level. This property shows that the Neocognitron is hardly confused by scrambled images.
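The contrast between the two pooling mechanisms can be illustrated numerically. The toy snippet below is our own construction (the S-cell responses and blur weights are made up); it compares the hard MAX of the Riesenhuber model with a blurring convolution passed through a saturating function ψ[x] = ϕ[x]/(1 + ϕ[x]).

```python
import numpy as np

# Illustrative comparison (not the authors' code) of the two C-cell pooling
# mechanisms: the Riesenhuber model's hard MAX versus the Neocognitron's
# blurring convolution followed by the saturating output function psi.

def phi(x):
    """Half-wave rectification."""
    return np.maximum(x, 0.0)

def psi(x):
    """Saturating output modulation: rises quickly, saturates gradually."""
    return phi(x) / (1.0 + phi(x))

s_responses = np.array([0.0, 0.1, 0.9, 0.2, 0.0])  # S-cells in one receptive field
kernel = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # blur weights d_l (sum to 1)

max_pool = s_responses.max()                       # hard MAX pooling
blurred = psi((kernel * s_responses).sum())        # Neocognitron-style C-cell

print(max_pool, blurred)
```

The MAX output tracks only the strongest S-cell, whereas the blurred-and-saturated output still reflects the spatial spread of activity, which is one way the two models can diverge on scrambled inputs.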
5 Conclusion and Discussion
Over these decades, studies around the Neocognitron have developed steadily, both as engineering and as biological models. We consider that the Neocognitron serves well as a pattern classifier and as a rough model of the ventral pathway in the brain. As future work, we are interested in 3-dimensional object recognition using the Neocognitron from both the engineering and the biological viewpoints. By introducing the extended model of the Neocognitron called the "Selective Attention Model (SAM)", which includes feedback connections as well as feed-forward connections [24][25], we can infer the position and shape of the object that the SAM is currently recognizing. We consider that the SAM is applicable to robot vision and related areas.
References
1. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4), 193–202 (1980)
2. Fukushima, K., Miyake, S.: Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15(6), 455–469 (1982)
3. Fukushima, K.: Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks 1, 119–130 (1988)
4. Fukushima, K., Wake, N.: Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Networks 2(3), 355–365 (1991)
5. Fukushima, K., Tanigawa, M.: Use of different thresholds in learning and recognition. Neurocomputing 11(1), 1–17 (1996)
6. Shouno, H., Fukushima, K., Okada, M.: Recognition of handwritten digits in the real world by the neocognitron. In: Intelligent Techniques in Character Recognition: Practical Applications. CRC Press, Boca Raton (1998)
7. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R.: Handwritten digit recognition with a back-propagation network. In: NIPS, vol. 2 (1990)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
9. Huang, F.J., LeCun, Y.: Large-scale learning with SVM and convolutional nets for generic object recognition. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006). IEEE Computer Society, Los Alamitos (2006)
10. Logothetis, N.K., Sheinberg, D.L.: Visual object recognition. Annual Review of Neuroscience 19, 577–621 (1996)
11. Logothetis, N.K., Pauls, J., Poggio, T.: Spatial reference frames for object recognition: Tuning for rotation in depth. Technical report, M.I.T. A.I. Memo (1995)
12. Lovell, D.R., Downs, T., Tsoi, A.C.: An evaluation of the neocognitron. IEEE Transactions on Neural Networks 8(5), 1090–1105 (1997)
13. Satoh, S., Kuroiwa, J., Aso, H., Miyake, S.: A rotation-invariant neocognitron. Systems and Computers in Japan 30(4), 31–40 (1999)
14. Fukushima, K.: Interpolating vectors for robust pattern recognition. Neural Networks (2007)
15. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
16. Simard, P.Y., Victorri, B., LeCun, Y., Denker, J.: Tangent prop – a formalism for specifying selected invariances in an adaptive network. In: Touretzky, D.S., Lippman, R. (eds.) Advances in Neural Information Processing Systems, vol. 4, pp. 895–903 (1992)
17. Simard, P.Y., LeCun, Y., Denker, J.: In: Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.) Advances in Neural Information Processing Systems, vol. 5, pp. 50–58 (1993)
18. Rumelhart, D.E., McClelland, J.L., the PDP Research Group: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge (1986)
19. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160(1), 106–154 (1962)
20. Yoshizuka, T., Shouno, H., Miyamoto, H., Okada, M., Fukushima, K.: Modeling the visual ventral pathway based on the neocognitron (in Japanese). The Brain & Neural Networks (to appear, 2007)
21. Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience 2, 1019–1025 (1999)
22. Bricolo, E., Poggio, T., Logothetis, N.K.: 3D object recognition: A model of view-tuned neurons. Advances in Neural Information Processing Systems 9, 41–47 (1996)
23. Rolls, E.T.: Brain mechanisms for invariant visual recognition and learning. Behavioural Processes 33, 113–138 (1994)
24. Fukushima, K.: Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics 26(23), 4985–4992 (1987)
25. Shouno, H., Fukushima, K.: Connected character recognition in cursive handwriting using a selective attention model with bend processing. Systems and Computers in Japan 26(10), 35–46 (1995)
Toward Human Arm Attention and Recognition Using a Computer-Vision- and Neocognitron-Based Approach Takeharu Yoshizuka, Masaki Shimizu, and Hiroyuki Miyamoto Kyushu Institute of Technology, Graduate School of Life Science and Systems Engineering, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu 808-0196, Japan yoshizuka-takeharu, [email protected], [email protected]
Abstract. We aim to develop a vision system that focuses on the human arm and recognizes the arm posture in complex scenes. In this paper, we introduce a computer-vision-based approach and a neocognitron-type neural network. In the computer-vision-based approach, we propose a method for arm posture estimation using an arm model with three kinds of small areas that copes with complex backgrounds and clothing patterns. In the neocognitron-based approach, we investigate the functional similarity between the neocognitron and the ventral visual pathway, and examine extending the neocognitron to stereo vision.
1 Introduction
Human beings can selectively focus on one object in a scene and recognize it. It is very difficult to realize the same functionality with past image-processing technologies. In this study, we attempt to model a human arm by generating a three-dimensional model from images acquired with a camera. However, information along the direction of the camera's optical axis cannot be acquired with a monocular camera, and it is very difficult to obtain depth-related information. One solution to this problem is to arrange two or more cameras in a multi-aspect setup; however, such a system becomes a large-scale setup that requires a special environment. Human beings can perceive an object's depth. It is known that the cues of depth vision are binocular disparity, movement parallax, eye muscles, and pictorial information. We focus on the cue of binocular disparity and attempt to acquire depth information using a stereo-vision-based system. In order to build a system capable of flexible adjustment to image information that varies with the environment, we need to consider the mechanism of the human visual information processing system. Therefore, we focus on the neocognitron-type neural network model proposed by Fukushima [1, 2, 3]. The neocognitron is a hierarchical neural network based on knowledge of the biological visual system; it can recognize noisy, distorted, and shifted patterns. For this study, we advance the research by adopting two approaches. In Sect. 2,

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1071–1080, 2008.
© Springer-Verlag Berlin Heidelberg 2008
we describe the estimation of the arm posture realized using the computer-vision-based approach. Then, in Sect. 3, we examine the binocular selective attention model based on the neocognitron-type neural network.
2 Estimation of Human Arm Posture Based on Stereo Vision
Several researchers have previously estimated the human posture [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Their methods were often effective under certain specific conditions; however, they did not work well for complex, changing images. For the depth problem, particularly for two-dimensional images, several techniques employing special devices such as laser range finders and multi-aspect cameras were proposed; these systems were large-scale and expensive. On the other hand, the proposed stereo-vision-based system is handy and inexpensive. However, it incurs the calculation cost of the correspondence-point search in stereo vision. The proposed method does not search for the correspondence point directly and does not depend on the clothes or background. In our method, the posture of the skeleton model is determined so as to maximize the matching rate of the features (edge and color) from three kinds of small areas, where the neighborhood of the model is projected onto the right and left images. The features from small portions of the right and left images are compared. Therefore, the matching rate does not depend on the size, color, or texture of the clothes, or on the background, and the method can estimate the human arm posture. We believe that this method can be applied in several fields such as gesture recognition, welfare, sign-language recognition, monitoring systems, CG, interfaces between high-tech and high-touch, robots, and gaming consoles. This section describes the special skeleton model with three kinds of small areas that we constructed and the arm posture estimation method based on model fitting; finally, it presents the experimental results obtained using a real image.

2.1 Arm Model
We used a skeleton-type human arm model with four degrees of freedom, as shown in Fig. 1(a). The wrist, which has three degrees of freedom, was excluded from this study, although the human arm has seven degrees of freedom. It is very difficult to detect the overlap between the human arm and the model, and to isolate only the former in the image, by employing only edge-detection technologies in conjunction with the skeleton model. Therefore, as shown in Fig. 1(b), three kinds of small areas (center, outside, and hand) were effectively arranged in the vicinity of the arm model. The center areas were arranged on the camera side of the link. The outside areas were arranged on either side of the link and contained the outline of the human arm. The center and outside areas rotate around the long axis of the link; they maintain their posture by aligning their plane with that of the camera. The hand area was arranged at
Fig. 1. Arm model with three kinds of small areas (center, outside, and hand)
one end of the arm model and always kept parallel to the image plane of the stereo camera. Each area is arranged to occupy as large a region of the image as possible. Fig. 1(c) shows the state in which the model accurately matches the human arm.

2.2 Method for Estimating Human Arm Posture
The process flow for posture estimation is as follows. As a preprocessing task, the edge and skin color are detected from the images acquired by the parallel stereo camera. For the edge orientation, we take the horizontal direction to be 0 and define four orientations (0, π/4, π/2, and 3π/4). For judging the skin color, we convert each pixel into the HSV color space and define the skin color as a pixel whose H value lies in a constant range. The face position is first detected using OpenCV [15]. Then, the shoulder position is determined from the distance between the face and shoulder positions, and the shoulder position of the arm model is matched to it. The evaluation function O_U, which represents the overlap rate between the human and model upper arms, is calculated from the upper-arm angles (θ1 and θ2) of the model. The upper-arm position is fixed when O_U exceeds the threshold T_U, and the process starts searching for the forearm. In the forearm search, the evaluation function O_F is calculated from the forearm angles (θ3 and θ4) of the model. The evaluation value of the entire arm, O, is composed of the O_U and O_F values of the upper arm and forearm. Therefore, the human arm posture is acquired by searching for the joint angles that maximize the following objective functions:

$$O_U = w_{SEDGE}\, N_{SEDGE} - w_C\, S_C$$
$$O_F = w_{HSKIN}\, N_{HSKIN} + w_{HEDGE}\, N_{HEDGE} + w_{SEDGE}\, N_{SEDGE} - w_C\, S_C$$
$$O = O_U + O_F \qquad (1)$$

where S_C represents the similarity between the center areas of the left and right images; N_SEDGE, the number of edges in the outside area; N_HEDGE, the number of edges in the hand area; N_HSKIN, the number of skin-colored pixels in the
hand area; and w_HSKIN, w_HEDGE, w_SEDGE, and w_C, the weight coefficients. Details of the processing for the center, outside, and hand areas are described in the subsequent subsections.

Processing for the center area. If the arm model posture corresponds to the human arm posture, the center area is located along the center of the human arm. In this case, many pixels correspond in color and brightness between the center areas of the right and left images. If the arm model posture does not correspond to the human arm posture, the part of the center area that does not overlap the human arm is large, so only a few pixels correspond in color and brightness between the center areas of the right and left images. If the background is black and white, the differences between the left and right images are small and the corresponding rate is large; however, posture estimation does not fail in this case, because the corresponding rates of the hand and outside areas described below are measured simultaneously. The SAD (sum of absolute differences), which is the sum of the absolute differences of the RGB values over all pixels in a small area, is calculated when processing the center area:

$$S_C = \sum_{W_L(x_{Lj}+i,\,j),\ W_R(x_{Rj}+i,\,j)} \Big( |R_L(x_{Lj}+i,j) - R_R(x_{Rj}+i,j)| + |G_L(x_{Lj}+i,j) - G_R(x_{Rj}+i,j)| + |B_L(x_{Lj}+i,j) - B_R(x_{Rj}+i,j)| \Big) \qquad (2)$$
The coordinates of the left end of the small areas W_L and W_R projected onto the left and right images are assumed to be (x_Lj, j) and (x_Rj, j), respectively. S_C expresses the degree of similarity between the center areas of the right and left images. When processing S_C for the center area, the presence of edges is not considered, because edges may not exist for short sleeves and monochrome clothes.

Processing for the outside area. An outside area detects the outline of the human arm. If the arm model posture corresponds to the human arm posture, the outline of the human arm is included in the outside area; the number of corresponding edges between the left and right areas then increases. If the arm model posture does not correspond to the human arm posture, the outside area is not near the outline of the human arm but in the background, and the number of corresponding edges between the left and right outside areas decreases. The number of edges in the outside areas is calculated by the following equation:

$$N_{SEDGE} = \sum_{W_L(x_{Lj}+i,\,j),\ W_R(x_{Rj}+i,\,j)} 1_e(x_{Lj}+i,\ x_{Rj}+i,\ j) \qquad (3)$$
If all the following conditions are satisfied, the function 1e (l, r, v) returns one. Otherwise, it returns zero.
1. An edge exists in W_L(x_Lj + i, j) and W_R(x_Rj + i, j).
2. The orientation of the edge is close to the orientation of the link of the arm model.
3. The sum of the absolute differences of the RGB values between the left and right images is less than the threshold ζ_s.

Processing for the hand area. If the arm model posture corresponds to the human arm posture, the hand area is near the human hand; the number of corresponding edges between the left and right areas then increases. If the arm model posture does not correspond to the human arm posture, the hand area is not near the human hand but in the background, and the number of corresponding edges between the left and right hand areas decreases. The number of edges in the hand areas is calculated by the following equation:

$$N_{HEDGE} = \sum_{W_L(x_{Lj}+i,\,j),\ W_R(x_{Rj}+i,\,j)} 1_h(x_{Lj}+i,\ x_{Rj}+i,\ j) \qquad (4)$$
If all the following conditions are satisfied, 1_h(l, r, v) returns one; otherwise, it returns zero.

1. An edge exists in W_L(x_Lj + i, j) and W_R(x_Rj + i, j).
2. The sum of the absolute differences of the RGB values between the left and right images is less than the threshold ζ_h.

If the number of skin-colored pixels is added to the above expression, the recognition accuracy can be further improved. If such pixels are detected in both the left and right pixels (x_Lj + i, j) and (x_Rj + i, j), their number is counted:

$$N_{HSKIN} = \sum_{W_L(x_{Lj}+i,\,j),\ W_R(x_{Rj}+i,\,j)} 1_s(x_{Lj}+i,\ x_{Rj}+i,\ j) \qquad (5)$$
If the pixels of the left (l, v) and right (r, v) images are both skin-colored, the function 1_s(l, r, v) returns one; otherwise, it returns zero.

2.3 Experiment
The proposed method was tested on real images on a notebook PC (Linux, Pentium M 2.1 GHz). A parallel stereo image with a resolution of 320 × 240 pixels was acquired. Each pixel was converted to HSV, and a pixel was judged skin-colored in the range 0 < H < 15, 0.1 < S < 0.7, and 0 < V < 0.5. The parameters of the arm model were the lengths L1 = 0.2 m of link 1 and L2 = 0.2 m of link 2. The initial arm model posture was that in which the arm extends straight down from the shoulder and the palm faces the body; in this posture, each joint angle was set to 0. The range of movement of each joint was taken to be −π/4 ≤ θ1 ≤ π/2, −π ≤ θ2 ≤ 0, −π/2 ≤ θ3 ≤ π/2, and 0 ≤ θ4 ≤ 2π/3. The arm posture search looks for the combination of joint angles that maximizes the value of equation (1)
Fig. 2. Estimation results of the human arm posture by the arm fitting model
while changing each joint angle by π/15 [rad] within the movable range. Two subjects alternately made gestures for the experiment in real time. Fig. 2 shows the result. The human arm posture was estimated in approximately 0.1–0.5 s after preprocessing. Even when the background was complex and the subject wore a different shirt, the estimation succeeded without changing the parameters. Moreover, it was confirmed that the same arm postures were estimated accurately. The arm model can also be fitted to a posture in which the arm is extended back and forth, including the depth direction, in which estimation is difficult (the two results at the bottom of Fig. 2). However, when the arm is almost fully extended to the front, the error rate increases, because the center and outside areas shrink to zero along the link of the arm model and equation (1) is not satisfied (it is satisfied only in the hand area). Moreover, the clothes of the subject shown in the upper half of Fig. 2 were recognized correctly although their contrast with the wall is low.
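The two-stage search described above — stepping each joint angle by π/15 over its movable range, fixing the upper arm once O_U exceeds T_U, and then maximizing O = O_U + O_F — can be sketched as follows. The evaluate_* functions and the threshold value are hypothetical stand-ins for the image-based terms of Eq. (1), not the authors' implementation.

```python
import numpy as np
from itertools import product

# Hypothetical sketch of the posture search. Each joint angle is stepped by
# pi/15 over its movable range; evaluate_upper/evaluate_fore are placeholder
# objective functions peaking at an assumed "true" pose.

STEP = np.pi / 15
RANGES = [(-np.pi / 4, np.pi / 2),   # theta_1 (upper arm)
          (-np.pi, 0.0),             # theta_2 (upper arm)
          (-np.pi / 2, np.pi / 2),   # theta_3 (forearm)
          (0.0, 2 * np.pi / 3)]      # theta_4 (forearm)

def grid(lo, hi):
    """Discretize a joint range with step pi/15."""
    return np.arange(lo, hi + 1e-9, STEP)

def evaluate_upper(t1, t2):
    """Placeholder for O_U = w_SEDGE*N_SEDGE - w_C*S_C."""
    return -((t1 - 0.4) ** 2 + (t2 + 1.0) ** 2)

def evaluate_fore(t1, t2, t3, t4):
    """Placeholder for O_F in Eq. (1)."""
    return -((t3 - 0.2) ** 2 + (t4 - 1.0) ** 2)

T_U = -0.05   # assumed threshold on O_U before the forearm search starts

# Stage 1: search and fix the upper arm once O_U exceeds T_U.
best_u = max(product(grid(*RANGES[0]), grid(*RANGES[1])),
             key=lambda p: evaluate_upper(*p))
assert evaluate_upper(*best_u) > T_U

# Stage 2: search the forearm with the upper arm held fixed.
best_f = max(product(grid(*RANGES[2]), grid(*RANGES[3])),
             key=lambda p: evaluate_fore(*best_u, *p))

O = evaluate_upper(*best_u) + evaluate_fore(*best_u, *best_f)
print(best_u, best_f, O)
```

Fixing the upper arm first reduces the four-dimensional grid search to two consecutive two-dimensional searches, which is what keeps the reported 0.1–0.5 s estimation time feasible.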
3 Binocular Selective Attention Model Based on Neocognitron
In the previous section, we described the estimation of the human arm posture by stereo vision using a computer-vision-based approach. However, in certain cases the above system cannot suitably estimate the posture, because it relies on purely simple image-processing techniques. For example, the system described in the previous section needs to detect the face position during preprocessing, and
the face must always exist in the image. In order to construct a system that flexibly adapts to changes in scenes in a manner similar to humans, we should apply biological knowledge of the human vision processing system. We therefore consider the selective attention model proposed by Fukushima [3] to focus attention only on the human arm. The selective attention model adds a feedback path to the neocognitron; it focuses on a certain pattern, and a missing part of a pattern can be restored. We implement selective attention in our system by extending this selective attention model to binocular vision. We therefore expect that the system will be able to focus directly on the human arm area without searching for the face position, and to restore occluded parts and patterns in shadowed areas. First, we investigated whether the neocognitron can be regarded as a proper biological model of the ventral pathway [16]. Next, to examine whether the selective attention model can be applied to a stereo-based image processing system, we developed a prototype simulator of the binocular selective attention model.

3.1 Neocognitron as an IT Model
It is known that IT cells (particularly in AIT) in humans and monkeys respond to complex objects. Perrett et al. reported that AIT cells, in a higher-order area of the ventral visual pathway, respond to faces [17]. In order to examine the responsiveness of these cells, Logothetis et al. showed views of paperclip-like objects in three-dimensional space to macaques and measured the responses of IT cells using microelectrodes [18]. Additionally, they showed views in which the object was rotated around one axis, scaled, or translated, and examined the information coded by these cells. Cells that responded invariantly to the transformed views were few, while cells that responded to specific views were the most numerous. They reported that the cells exhibited a bell-shaped response property, which gradually decreases with the degree of transformation. Based on this report, Bricolo and Riesenhuber et al. proposed hierarchical models using RBF (radial basis function) networks as IT models, and explained the experimental results obtained by Logothetis et al. [19, 20, 21]. On the other hand, the neocognitron proposed by Fukushima [1,2] is a hierarchical neural network model for pattern recognition. Bricolo [19] and Riesenhuber et al. [20, 21] explained the qualitative character of the response property by using an RBF network, regarding the RBF network as the pathway from V4 to IT and the preprocessing as corresponding to V1. In contrast, the neocognitron has a homogeneous network structure whose layers can be trained with the same rules. If the response property of the neocognitron is qualitatively very similar to that obtained by Logothetis et al., it would be a better model than the IT model proposed by Bricolo and Riesenhuber et al.
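As a minimal illustration (not from the paper; the function name, preferred view, and tuning width are our own choices), the bell-shaped view tuning described above can be sketched with a single Gaussian RBF unit:

```python
import math

def view_tuned_response(theta_deg, theta_pref=0.0, sigma=30.0):
    """Response of a view-tuned RBF unit: a Gaussian bell in view angle,
    maximal at the trained view and decaying with the degree of rotation."""
    return math.exp(-(theta_deg - theta_pref) ** 2 / (2.0 * sigma ** 2))

# The response falls off monotonically as the object is rotated away from
# the trained view, reproducing the bell-shaped tuning reported for IT cells.
responses = [view_tuned_response(t) for t in (0, 30, 60, 90)]
```

An invariant cell would instead keep a near-constant response over the whole rotation range; the report above says such cells were rare compared with view-specific, bell-tuned ones.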
We investigated the functional similarity between the neocognitron and the ventral pathway by comparing the response property of the neocognitron with that of IT cells. On comparing our results with those obtained by Logothetis et al., we found that the results were very similar
Fig. 3. Outline of the binocular selective attention model (left- and right-eye selective attention models, each with a feedforward pattern-recognition path and a feedback selective-attention path over cell planes; a disparity detector inhibits the feedback paths according to far/near disparity)
qualitatively. Thus, we concluded that the neocognitron is a proper model of the ventral pathway [16].

3.2
Binocular Selective Attention Model
Uka et al. reported that IT in the ventral visual pathway contains many neurons that are selective for binocular disparity [22]. If the neocognitron is a proper model of IT in the ventral pathway, it follows that a mechanism for detecting disparity should be added to the neocognitron-type neural network in order to recognize depth. We therefore attempt to extend the neocognitron-type neural network to a binocular vision system. Many visual selective attention models have been proposed. They are roughly classified into two types: feedforward and feedback. The selective attention models that use the saliency map proposed by Itti et al. [23] are examples of the former type; the selective attention model proposed by Fukushima [3] is an example of the latter. We apply Fukushima's selective attention model to develop a system that can focus on objects with depth and recognize them. The outline of the binocular model is shown in Fig. 3. We arranged two selective attention models corresponding to the left and right eyes. In the model, we inhibit signals in the feedback path based on the relative distance information
obtained from the binocular parallax in the feedforward path. We are currently developing simulation software for the binocular selective attention model.
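The model's disparity detector is a neural circuit, but the underlying computation, recovering relative distance from the positional offset between the two eyes' images, can be illustrated with a toy block-matching sketch (the function name, window size, and data are illustrative and not part of the model):

```python
def block_match_disparity(left_row, right_row, x, window=1, max_disp=4):
    """Estimate disparity at left-image position x by matching a small window
    against horizontally shifted windows in the right image (SAD cost).
    For a fronto-parallel stereo pair, larger disparity means a nearer point."""
    def sad(d):
        return sum(abs(left_row[x + k] - right_row[x + k - d])
                   for k in range(-window, window + 1))
    return min(range(max_disp + 1), key=sad)

# An intensity feature centered at x=6 in the left row appears shifted
# two pixels to the left in the right row, i.e. a disparity of 2.
left  = [0, 0, 0, 0, 0, 5, 9, 5, 0, 0, 0, 0]
right = [0, 0, 0, 5, 9, 5, 0, 0, 0, 0, 0, 0]
disparity = block_match_disparity(left, right, x=6)
```

A far/near decision such as the one gating the feedback paths in Fig. 3 can then be made by thresholding the estimated disparity.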
4
Future Work
We are developing a practical stereo vision system based on the neocognitron-type neural network. The most important problem for such a system is the response speed: the neocognitron-type neural network is complex and incurs a considerable computational cost. However, owing to recent advances in computer hardware, it appears that this neural network can be implemented practically. The Cell microprocessor [24] is inexpensively available; we will construct the neocognitron on the Cell architecture and examine its utility in future work.
References
1. Fukushima, K.: Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics 36, 193–202 (1980)
2. Fukushima, K.: Neocognitron for Handwritten Digit Recognition. Neurocomputing 51, 161–180 (2003)
3. Fukushima, K.: Neural Network Model for Selective Attention in Visual Pattern Recognition and Associative Recall. Applied Optics 26, 4985–4992 (1987)
4. Ziegler, J., Nickel, K., Stiefelhagen, R.: Tracking of the Articulated Upper Body on Multi-view Stereo Image Sequences. In: Proc. IEEE CVPR, vol. 1, pp. 774–781 (2006)
5. Yamamoto, M., Sato, A., Kawada, S., Kondo, T., Osaki, Y.: Incremental Tracking of Human Actions from Multiple Views. In: Proc. IEEE CVPR, pp. 2–7 (1998)
6. Ohno, H., Yamamoto, M.: Gesture Recognition Using Character Recognition Techniques on Two-dimensional Eigenspace. In: ICCV, vol. 1, pp. 151–156 (1999)
7. Cheung, G., Baker, S., Kanade, T.: Shape-From-Silhouette of Articulated Objects and its Use for Human Body Kinematics Estimation and Motion Capture. In: Proc. IEEE CVPR, vol. 1, pp. 77–84 (2003)
8. Gavrila, D.M., Davis, L.S.: 3-D Model-based Tracking of Humans in Action: A Multi-view Approach. In: Proc. IEEE CVPR, pp. 73–80 (1996)
9. Date, N., Yoshimoto, H., Arita, D., Taniguchi, R.: Real-time Human Motion Sensing Based on Vision-based Inverse Kinematics for Interactive Applications. In: Proc. IEEE ICPR, vol. 3, pp. 318–321 (2004)
10. Kehl, R., Van Gool, L.: Real-time Pointing Gesture Recognition for an Immersive Environment. In: Proc. IEEE AFGR, pp. 577–582 (2004)
11. Rohr, K.: Toward Model-based Recognition of Human Movements in Image Sequences. CVGIP: Image Understanding 59, 94–115 (1994)
12. Herzog, G., Rohr, K.: Integrating Vision and Language: Towards Automatic Description of Human Movements. KI – Künstliche Intelligenz, 257–268 (1995)
13. Huber, E., Kortenkamp, D.: Using Stereo Vision to Pursue Moving Agents with a Mobile Robot. In: Proc. IEEE ICRA, vol. 3, pp. 2340–2346 (1995)
14. Kortenkamp, D., Huber, E., Bonasso, R.P.: Recognizing and Interpreting Gestures on a Mobile Robot. AAAI/IAAI 2, 915–921 (1996)
15. Intel: Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv/index.htm
16. Yoshizuka, T., Shouno, H., Miyamoto, H., Okada, M., Fukushima, K.: Modeling the Visual Ventral Pathway Based on the Neocognitron (in Japanese). The Brain & Neural Networks (to appear, 2007)
17. Perrett, D.I., Mistlin, A.J., Chitty, A.J.: Visual Neurons Responsive to Faces. Trends in Neuroscience 10, 358–364 (1987)
18. Logothetis, N.K., Sheinberg, D.L.: Visual Object Recognition. Annual Review of Neuroscience 19, 577–621 (1996)
19. Bricolo, E., Poggio, T., Logothetis, N.K.: A Model of View-tuned Neurons. In: Mozer, M., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, pp. 41–47. The MIT Press, Cambridge (1997)
20. Riesenhuber, M., Poggio, T.: Just One View: Invariances in Inferotemporal Cell Tuning. In: Jordan, M.I., Kearns, M.J., Solla, S.A. (eds.) Advances in Neural Information Processing Systems, vol. 10, pp. 215–221. The MIT Press, Cambridge (1998)
21. Riesenhuber, M., Poggio, T.: Hierarchical Models of Object Recognition in Cortex. Nature Neuroscience 2, 1019–1025 (1999)
22. Uka, T., Tanaka, H., Yoshiyama, K., Kato, M., Fujita, I.: Disparity Selectivity of Neurons in Monkey Inferior Temporal Cortex. J. Neurophysiol. 84, 120–132 (2000)
23. Itti, L., Koch, C.: Computational Modeling of Visual Attention. Nature Reviews Neuroscience 2, 194–203 (2001)
24. Kistler, M., Perrone, M., Petrini, F.: Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro 26, 10–23 (2006)
Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital Approach

Osamu Nomura¹ and Takashi Morie²

¹ Canon Inc., Human Machine Perception Laboratory, Ohta-ku, 146-8501, Japan, [email protected]
² Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Kitakyushu, 808-0196, Japan, [email protected]
Abstract. Hierarchical convolutional neural network models are considered promising for robust object detection/recognition. These models require huge computational power to perform a large number of multiply-and-accumulation (MAC) operations. In this paper, we first discuss efficient calculation schemes suitable for 2D MAC operations. We then review the related algorithms and LSI architecture proposed in our previous work, in which we use a projection-field-type network architecture with sorting of neuron outputs by magnitude. For the LSI implementation, we adopt a merged/mixed analog-digital circuit approach using a large number of analog or pulse-modulation circuits. We demonstrate the validity of our LSI architecture by testing proof-of-concept LSIs. It is essential to develop efficient and parallel A/D and D/A conversion circuits in order to connect many on-chip analog circuits with the external digital system. In this paper, we also propose such an A/D conversion circuit scheme.
1
Introduction
Convolutional neural networks (CoNNs) with a hierarchical structure are known as robust image-recognition models; a typical and famous model is the Neocognitron [1]. These models imitate the visual nervous system in the brain, and are tolerant of pattern deformations and position shifts in object detection or recognition from natural images [1,2,3,4]. The operations required for implementing CoNNs are 2-D convolutional mappings, which include a large number of multiply-and-accumulation (MAC) operations and nonlinear conversion, as in usual neural network models. Because CoNNs require huge computational power to execute these operations in real time for intelligent applications such as robot vision, their efficient LSI implementation is required. The latest digital media processors [5,6] can perform high-speed MAC operations, but they are costly because advanced process technology is used, and redundant because of their general versatility. In contrast, an analog image-filtering LSI processor suitable for CoNNs was reported, which used not
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1081–1090, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Principle of pattern detection using a CoNN (alternating feature-detection (FD) and feature-pooling (FP) layers; receptive fields integrate features, and the FP layers tolerate shifts of feature position)
so advanced process technology [7]. Because the layout area of analog circuits is generally small, parallel processing can be performed with an array of circuits on an LSI. In such analog LSI processors, however, it is difficult to store weight data and temporary calculation results, because no good analog memory device has yet been invented. On the other hand, we have reported merged/mixed analog-digital LSI processors, which achieve high calculation performance using not so advanced LSI process technology [8,9,10]; this merged/mixed analog-digital approach resolves the above difficulty. In our previous work, we designed, fabricated, and tested proof-of-concept LSIs, and verified the validity of our LSI architecture. In this paper, we first discuss efficient calculation schemes suitable for 2D MAC operations. We then review the related algorithms and LSI architecture proposed in our previous work, in which we used a projection-field-type network architecture with sorting of neuron outputs by magnitude [10]. For the LSI implementation, we adopt a merged/mixed analog-digital circuit approach using a large number of analog or pulse-modulation circuits. It is essential to develop efficient and parallel A/D and D/A conversion circuits (ADC and DAC) in order to connect many on-chip analog circuits with the external digital system. In our previous work, we proposed efficient and parallel DAC schemes; in this paper, we also propose such an ADC scheme. Finally, we present a proof-of-concept LSI for verifying the validity of our CoNN LSI architecture, and discuss a performance estimation of an LSI processor based on our architecture.
2
Hierarchical CoNN Model for Object Detection
Figure 1 shows the principle of pattern detection using a CoNN. The feature-pooling (FP) neurons are used to achieve detection tolerant to position shifts, while the feature-detection (FD) neurons integrate features. Through this hierarchically repetitive structure, local simple features (e.g., line segments) in the input image are gradually assembled into complex features. The operations between layers are 2-D convolutions, because all neurons belonging to a feature class have a receptive field with the same weight distribution. The
Fig. 2. Four 2D-MAC calculation schemes
receptive field of the FP neurons is on the same feature class of the previous FD layer. All neurons of the FP layer have the same positive-weight Gaussian-like distribution. The shifts of feature positions in the FD layers are tolerated in the FP layers by this convolution. On the other hand, the receptive fields of the FD neurons are on all feature classes of the previous FP layer. The weights of the FD neurons are obtained by training. The receptive field size becomes larger for the latter stages of the hierarchical structure than for the former stages [4].
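The shift tolerance provided by an FP layer can be sketched in one dimension: a fixed, positive, Gaussian-like kernel pooled over the FD output turns a sharply localized feature response into a broad bump, so responses to slightly shifted inputs overlap heavily (kernel values and sizes here are illustrative, not taken from the model):

```python
# 1-D analogue of an FP layer: every FP neuron applies the same fixed,
# positive Gaussian-like weight distribution to the previous FD layer.
KERNEL = [0.25, 0.5, 1.0, 0.5, 0.25]

def fp_pool(fd_output):
    n, r = len(fd_output), len(KERNEL) // 2
    pooled = []
    for i in range(n):
        s = 0.0
        for k, w in enumerate(KERNEL):
            j = i + k - r
            if 0 <= j < n:
                s += w * fd_output[j]
        pooled.append(s)
    return pooled

# The same feature detected at position 3 or at position 4 yields broad,
# strongly overlapping FP responses: a one-pixel shift is tolerated.
a = fp_pool([0, 0, 0, 1, 0, 0, 0])
b = fp_pool([0, 0, 0, 0, 1, 0, 0])
```

The trained FD neurons of the next stage then integrate several such pooled feature maps, which is how the hierarchy assembles simple features into complex ones.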
3
CoNN LSI Processor Architecture and Related Algorithm

3.1
2D-MAC Calculation Schemes for CoNN LSI Processor Architecture
We consider four 2D-MAC calculation schemes as shown in Fig. 2, where the number of presynaptic neurons is M × M , and the receptive field size is N × N . In Fig. 2, scheme (a) performs N × N -parallel MAC operations for a receptive field of one postsynaptic neuron. Scheme (b) performs M × N -parallel 1D MAC operations for one line of the presynaptic neuron array and one line of the synaptic weight array [8,9,11]. Scheme (c) performs N × N -parallel MAC operations for a projection field of one presynaptic neuron. Scheme (d) performs M × M -parallel MAC operations for presynaptic neurons which correspond to one synaptic weight. In MAC operations, if the output value of a presynaptic neuron is very small, the neuron has little influence on the calculation results. To use this characteristic
Fig. 3. Sorted projection-field model
to advantage, scheme (c) is suitable, because only the presynaptic neurons that have significant output values are considered in the MAC calculation, and the number of parallel MAC operations can thus be reduced. In contrast, in the other schemes, even if the number of presynaptic neurons considered in the calculation is reduced, the number of parallel MAC operations cannot be reduced. From this viewpoint, we employed scheme (c), the projection-field model, for our CoNN LSI processor architecture.

3.2
Sorted Projection-Field Model
We found that, in the layers of the latter stages, only a few neurons have significant output values and most neurons have very small values. Even if we omit MAC calculations with such negligible output values, the calculation result is expected to differ little from the exact one. In fact, we found from many numerical simulations of face detection that the detection accuracy hardly degrades under such operation. In addition, because a large number of MAC calculations are required for the large receptive fields in the layers of the latter stages, omitting these operations is much more efficient there. We therefore proposed a new algorithm that uses sorting to decrease the number of MAC operations. Figure 3 explains the algorithm, which we call a sorted projection-field model. The outputs of the presynaptic neurons are sorted by magnitude; each neuron has a ranking according to the sorting result. For example, if the output value of neuron j, o_j, is the largest and that of neuron k is the second largest, we set rankings r(o_j) = 1 and r(o_k) = 2. If plural neurons have the same output value, as can happen when the output values are quantized in the digital circuit, their rankings are set to the same value. Our model executes only the MAC operations related to presynaptic neuron outputs o_j with a ranking r(o_j) higher than the predefined ranking threshold r_th. The internal potential u_i of neuron i in the target layer is given by

$$u_i = \sum_{j \in R,\; r(o_j) < r_{th}} w_{ij}\, o_j, \qquad (1)$$
where w_ij is the connection weight from presynaptic neuron j to postsynaptic neuron i, and R is the set of neurons belonging to the receptive field. By applying this algorithm, we found from the numerical simulations that MAC operations were reduced to 35% in the layers of the latter stages without degradation of detection accuracy (refer to R_sort in Sect. 5). We use this advantage of sorting over thresholding in Sect. 3.4. For efficient projection-field calculation, we employ a 2D array of M × M MAC circuits. For inputting the synaptic weight data into the MAC circuit array effectively, we proposed a synaptic weight-decomposition scheme. The synaptic weight matrix W (= {w_ij}), which expresses an N × N-pixel projection field, is decomposed into products of base vectors $w_x^k \equiv {}^t(w_{x1}^k, w_{x2}^k, \cdots, w_{xN}^k)$ and $w_y^k \equiv {}^t(w_{y1}^k, w_{y2}^k, \cdots, w_{yN}^k)$:

$$W = \sum_{k=1}^{N_w} w_x^k \cdot {}^t w_y^k \quad \text{or} \quad w_{ij} = \sum_{k=1}^{N_w} w_{xi}^k\, w_{yj}^k, \qquad (2)$$
where N_w is the number of decomposed products of base vectors and ${}^t w$ indicates a transposed vector. We can calculate the products of $w_x^k$ and $w_y^k$ for all k by N_w-step operations with the MAC circuit array. In general, N_w > 1 because the weight matrix W does not have enough symmetry; we found from the numerical simulations of face detection that N_w ∼ 5. On the other hand, in the conventional row- or column-parallel operation [8], N-step operations are required for an N × N projection-field convolution. For face position detection, although the projection fields have various sizes in each layer, the average projection-field size is more than 20 × 20 in our model. Comparing N_w (∼ 5) with this average value N (∼ 20), our decomposition scheme is more than four times as efficient as the conventional scheme.

3.3
LSI Processor Architecture for a Projection-Field Model
We propose a new CoNN LSI processor architecture using the sorted projection-field model. Figure 4 shows the architecture, which implements a convolution for one feature class. Calculation for the whole hierarchical structure is achieved by time-sharing operations with repetitive use of this LSI. The LSI includes a MAC circuit array corresponding to the 2-D postsynaptic neuron array, a sorting circuit (SRT), a temporary memory (MEM), switching circuits (SWTs), a nonlinear conversion circuit (NLC), and base-vector generators (BVGx, BVGy). Each MAC circuit consists of a multiplier (MUL) and an accumulator (ACC). The operation of this circuit is as follows: (1) the MAC circuit calculates u_i by accumulation of all multiplication results, and outputs the MAC results to NLC; (2) NLC performs nonlinear conversion of the MAC results by comparing them with a nonlinear reference signal, and outputs the conversion results to SRT; (3) SRT outputs the values o_i with ranking r(o_i) higher than r_th, together with their addresses (the center positions of the projection fields) {add_xi, add_yi}, one
Fig. 4. Circuit block diagram based on our CoNN LSI processor architecture (SRT: sorting memory; MEM: temporary memory; SWT: switching circuit; NLC: nonlinear conversion circuit; BVGx, BVGy: base-vector generators; MUL: multiplier; ACC: accumulator)
by one to MEM; (4) MEM stores these data temporarily, and outputs the value o_i to BVGx and the addresses to SWTs at the next convolution calculation step; (5) BVGx and BVGy output the base vectors $w_x^k o_i$ and $w_y^k$ to the projection field on the MAC circuit array at addresses {add_xi, add_yi} through the SWTs. By repeating the above sequence, the hierarchical CoNN operation is performed.
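A behavioral software sketch may clarify the data flow of one such convolution step. This is our illustration, not the chip: sizes, weights, and outputs are made up, the weight matrix is given directly as a single base-vector pair (N_w = 1 in Eq. (2)), and only presynaptic outputs with ranking higher than r_th are scattered into their projection fields (Eq. (1)):

```python
M, N = 8, 3     # postsynaptic array size and projection-field size (toy values)
NW = 1          # number of base-vector pairs in Eq. (2); a rank-1 field here
r_th = 3        # process only rankings 1 and 2

# Decomposed weights, Eq. (2): W = wx * wy^T (a separable Gaussian-like field)
wx = [[0.5, 1.0, 0.5]]
wy = [[0.5, 1.0, 0.5]]

pre = [[0.0] * M for _ in range(M)]            # presynaptic neuron outputs o_j
pre[2][2], pre[5][5], pre[0][7] = 0.9, 0.6, 0.01

# SRT: sort outputs by magnitude; ranking 1 is the largest output
ranked = sorted(((o, y, x) for y, row in enumerate(pre)
                 for x, o in enumerate(row) if o > 0.0), reverse=True)

u = [[0.0] * M for _ in range(M)]              # internal potentials u_i, Eq. (1)
for rank, (o, y, x) in enumerate(ranked, start=1):
    if rank >= r_th:
        break              # MAC operations for low-ranked outputs are skipped
    for k in range(NW):    # one accumulation step per base-vector pair
        for dy in range(N):
            for dx in range(N):
                yy, xx = y + dy - N // 2, x + dx - N // 2
                if 0 <= yy < M and 0 <= xx < M:
                    u[yy][xx] += wy[k][dy] * (wx[k][dx] * o)
```

On the chip, the inner two loops are what the M × M MAC circuit array executes in parallel in a single step, so the total step count is governed by the number of kept rankings times N_w.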
3.4
Circuit Implementation Based on Merged/Mixed Analog-Digital Approach
If the digital and analog circuit approaches are compared under the same design rule, even assuming the most advanced VLSI technology, the analog MAC circuit can be much smaller than the digital one, and therefore many more analog MAC circuits fit in the same layout area. In contrast, it is difficult to store weight data and temporary calculation results in the analog approach, but easy if digital memory is used. We therefore employed a merged/mixed analog-digital approach, which we have been developing [8], to take advantage of both approaches. In our approach, the MAC circuit consists of an analog multiplier and an integrating capacitor C_MAC. The analog multiplier performs multiplication of $w_{ym}^k$ and $w_{xm}^k o_i$, both of which are given as voltages from BVGy and BVGx, and the integrating capacitor accumulates the multiplication results as a charge. We use digital memory for storing weight data and temporary calculation results. Because the o_i are sorted by magnitude, we can represent temporary calculation results as one-bit flag data at transitions of values in the digital memory, and can
Fig. 5. Asynchronous output scheme with parallel ADC in one row
Fig. 6. Asynchronous output circuit included in a MAC circuit
continuously compute the multiplication by using a monotonically changing signal from BVGx. This is an efficient and parallel D/A conversion scheme, which is one of the most important parts of our approach; see Ref. [10] for details. On the other hand, an efficient and parallel ADC is required for outputting data from the MAC circuit array. In our previous work, for converting the analog calculation results of the MAC circuit array into digital data, we used a conventional row-parallel outputting scheme [8,10]. This scheme, however, degrades the total speed of the CoNN operation, because it takes M cycles to output the results from all MAC circuits. We therefore propose a new scheme for high-speed data outputting from the MAC circuit array, which we call the asynchronous output scheme with parallel ADC, because the data are output at different timings on each row and column. Figure 5 shows our asynchronous output scheme in one row, and Fig. 6 shows the asynchronous output circuit included in a MAC circuit. The peripheral circuit (outside the MAC circuit array) required for this scheme is composed of reference voltage generators (RVGs), counters (CNTs), and control signal generators (CSGs). Each MAC circuit includes a comparator (CMP), switches sw1–sw4, and digital memory (ROM). In this scheme, the analog calculation results in all MAC circuits are output as digital data, with their addresses memorized in the ROMs, in only one cycle of reference voltage changes. As a result, this asynchronous output scheme greatly reduces the time required to output the analog data.
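A behavioral sketch of the idea (our simplification, not the circuit): all cells share one rising reference ramp and one counter; each cell's comparator fires once, when the ramp first reaches its stored analog value, and at that moment the current counter value is latched together with the cell's address, so every cell converts within a single ramp cycle. The resolution and voltage range below are illustrative:

```python
def ramp_parallel_adc(analog_values, v_max=1.0, steps=256):
    """Single-slope A/D conversion of many analog values in one ramp cycle.
    RVG sweeps a shared reference voltage; CNT counts the step; each cell's
    CMP latches the counter value the first time v_ref reaches its value."""
    codes = [None] * len(analog_values)
    for count in range(steps):                  # CNT output
        v_ref = v_max * (count + 1) / steps     # RVG output (rising ramp)
        for addr, v in enumerate(analog_values):
            if codes[addr] is None and v <= v_ref:
                codes[addr] = count             # latched with address addr
    return codes

codes = ramp_parallel_adc([0.10, 0.50, 0.90])
```

Contrast this with the row-parallel scheme, where the M rows must be read out one after another: here the conversion time is set by the ramp length alone, independent of M.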
4
Proof-of-Concept LSIs
As a proof of concept for our proposed architecture, we designed and fabricated two test LSIs using a 0.35 µm CMOS process: a sorting LSI including SRT (shown in Fig. 4) and a CoNN LSI including the other circuits. The sorting LSI sorts the neuron output values o_i from the CoNN LSI by magnitude, and outputs the sorting results (o_i, add_xi, add_yi) with ranking r(o_i) higher than r_th back to the CoNN LSI. The CoNN LSI performs the projection-field-type convolution operation corresponding to each neuron output o_i and its addresses add_xi, add_yi; it then performs nonlinear conversion of the convolution results u_i, and outputs the conversion results to the sorting LSI.
Fig. 7. Micro-photograph of the CoNN LSI (40 × 40 MAC circuit array with NLC, SWT, MEM, BVGx, and BVGy blocks)
Fig. 8. Measurement results of convolution operation: (a-1) test patterns generated externally for a small (3 × 3) projection field, (a-2) measurement and theoretical results of OUTNLC (normalized), (b-1) test patterns generated externally for a large (15 × 15) projection field, (b-2) measurement and theoretical results of OUTNLC (normalized)
Figure 7 shows a micro-photograph of the CoNN LSI. The die size is 9.8 mm × 9.8 mm, and the LSI includes a 40 × 40 MAC circuit array. Since BVGx and BVGy each include twenty-one element circuits, the maximum projection-field size is 21 × 21 pixels. The power consumption estimated using HSPICE was 440 mW at a 3.3 V power supply. Using only this CoNN LSI chip (without the sorting LSI), we verified convolution operations for two cases: (a) a small projection field (3 × 3 pixels) corresponding to the FD1 layer in the hierarchical CoNN, and (b) a large projection field (15 × 15 pixels) corresponding to a latter layer. The presynaptic neuron outputs and weight data, which were generated externally, are given as shown in Fig. 8(a-1) and (b-1). The measurement results for the outputs of NLC, OUTNLC, are shown in Fig. 8(a-2) and (b-2) together with the theoretical ones. Here, to verify the linearity of the convolution operation, we gave a linearly ramped voltage to NLC as a reference signal, and we used the conventional row-parallel outputting scheme. We canceled the offset voltage of the multiplier in each MAC circuit by an off-chip canceling operation; in a practical design, this will be performed with an on-chip offset-canceling mechanism. The measurement results agree well with the theoretical ones, and they demonstrate that the circuits designed based on the proposed architecture operate correctly.
5
Performance Estimation
We estimated the performance of an LSI processor based on the proposed architecture. Here, we defined GCOPS (Giga Convolutional Operations Per Second) as a performance measure, and calculated the performance P. The difference between GCOPS and GOPS (Giga Operations Per Second), a common measure for digital signal processor LSIs, is the division by N_w and R_sort, where R_sort is the ratio of the number of presynaptic neurons with r(o_j) < r_th to the number of all presynaptic neurons (the numerical simulations give R_sort ≈ 0.35). These additional factors are needed because the N_w-step operations are unnecessary if the weight-decomposition scheme is not used, and because R_sort is the ratio of computation steps skipped by the sorted projection-field model. If we design a CoNN LSI processor based on our proposed architecture using present LSI technology, we can achieve a processing time per MAC operation T_MAC of 40 ns. In addition, we can achieve a side length of the square projection field R_size of 81 by assuming the use of stacked-capacitor technology or high-dielectric materials, because the area of the capacitors is dominant in the MAC circuit layout. The performance P is therefore expected to be 187 GCOPS.
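The text does not spell out the formula behind the 187 GCOPS figure. One reconstruction that reproduces it, assuming R_size² MAC circuits operate in parallel and each MAC counts as two operations (a multiply and an add), is the following; treat the formula and the OPS_PER_MAC factor as our assumptions, not the paper's:

```python
# Reconstructed performance estimate (our reading; the exact formula is not
# given in the text). Dividing by N_W accounts for the decomposition steps
# needed per convolution, while dividing by R_SORT credits the MAC steps
# skipped by the sorted projection-field model.
T_MAC = 40e-9     # processing time for one MAC step [s]
R_SIZE = 81       # side of the square projection field (parallel MACs)
N_W = 5           # base-vector decomposition steps per convolution
R_SORT = 0.35     # fraction of presynaptic neurons actually processed
OPS_PER_MAC = 2   # multiply + accumulate (assumption)

P_gcops = OPS_PER_MAC * R_SIZE ** 2 / (T_MAC * N_W * R_SORT) / 1e9
```

Under these assumptions P comes out at roughly the 187 GCOPS quoted above.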
6
Conclusion
We discussed an efficient projection-field-type CoNN LSI architecture based on the merged/mixed analog-digital approach, and proposed an efficient and parallel A/D conversion circuit scheme. We successfully designed, fabricated and
tested proof-of-concept LSIs for verifying the validity of our LSI architecture. The measurement results of the LSIs agree well with the theoretical ones, and they demonstrate that the circuits designed based on the proposed architecture operate correctly.
References
1. Fukushima, K., Miyake, S.: Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15, 455–469 (1982)
2. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Networks 8(1), 98–113 (1997)
3. Neubauer, C.: Evaluation of convolutional neural networks for visual recognition. IEEE Trans. Neural Networks 9, 685–696 (1998)
4. Matsugu, M., Mori, K., Ishii, M., Mitarai, Y.: Convolutional spiking neural network model for robust face detection. In: Proc. Int. Conf. on Neural Information Processing (ICONIP), pp. 660–664 (2002)
5. Kyo, S., Koga, T., Okazaki, S., Uchida, R., Yoshimoto, S., Kuroda, I.: A 51.2GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 4-way VLIW processing elements. In: IEEE Int. Solid-State Circuits Conf. Dig., pp. 48–49 (2003)
6. Nakajima, M., Noda, H., Dosaka, K., Nakata, K., Higashida, M., Yamamoto, O., Mizumoto, K., Kondo, H., Shimazu, Y., Arimoto, K., Saitoh, K., Shimizu, T.: A 40GOPS 250mW massively parallel processor based on matrix architecture. In: IEEE Int. Solid-State Circuits Conf. Dig., pp. 410–411 (2006)
7. Boser, B.E., Säckinger, E., Bromley, J., Le Cun, Y., Jackel, L.D.: An analog neural network processor with programmable topology. IEEE J. Solid-State Circuits 26(12), 2017–2025 (1991)
8. Korekado, K., Morie, T., Nomura, O., Ando, H., Nakano, T., Matsugu, M., Iwata, A.: A VLSI convolutional neural network for image recognition using merged/mixed analog-digital architecture. J. Intelligent & Fuzzy Systems 15(3/4), 173–179 (2004)
9. Korekado, K., Morie, T., Nomura, O., Nakano, T., Matsugu, M., Iwata, A.: An image filtering processor for face/object recognition using merged/mixed analog-digital architecture. In: Symposium on VLSI Circuits, Digest of Technical Papers, Kyoto, pp. 222–223 (2005)
10. Nomura, O., Morie, T., Korekado, K., Nakano, T., Matsugu, M., Iwata, A.: An image-filtering LSI processor architecture for face/object recognition using a sorted projection-field model based on a merged/mixed analog-digital architecture. IEICE Trans. Electron. E89-C(6), 781–791 (2006)
11. Sakai, M., Morie, T., Mitarai, M., Korekado, K.: Design of a 2D image matching processor LSI based on merged analog/digital architecture. In: RISP International Workshop on Nonlinear Circuits and Signal Processing (NCSP 2007), Shanghai, China, pp. 81–84 (2007)
Optimality of Reaching Movements Based on Energetic Cost under the Influence of Signal-Dependent Noise

Yoshiaki Taniai and Jun Nishii

Graduate School of Science and Engineering, Yamaguchi University, 1677-1 Yoshida, 753-0824 Yamaguchi, Japan, {yosi,nishii}@bcl.sci.yamaguchi-u.ac.jp
Abstract. As candidate constraints determining human arm reaching trajectories, the criteria of minimum torque change and minimum end-point variance have been proposed, and these criteria have been shown to predict the characteristics of reaching trajectories well. In our previous work, we showed that these criteria would also suppress the energy cost of reaching movements when the motor command is affected by signal-dependent noise. In this study, we computed the trajectories that minimize the expected energy cost under signal-dependent noise, and compared them with those of human subjects. The optimal trajectories agreed well with the measured trajectories in that, when the movement duration is short, the speed profile of the hand movement takes a bell shape, and when the duration is long, the speed profile takes a collapsed shape. These results suggest that the human brain solves the redundancy problem of trajectory planning under the constraint of minimizing the expected value of the energy cost.
1 Introduction

It has been reported that human arm trajectories during reaching movements are slightly curved and that their speed profiles take bell shapes [1]. Why do reaching trajectories show such characteristics? Hogan (1984) proposed the criterion of minimum jerk [2], which assumes that kinematic smoothness is the determinant of the hand trajectory. On the other hand, Uno et al. (1989) considered the arm dynamics and proposed the criterion of minimum torque change [1]. It has been reported that these models predict the characteristics of human arm movements well, and many comparisons between them have been made [3]. Furthermore, Harris and Wolpert (1998) proposed the criterion of minimum end-point variance [4]. This criterion hypothesizes that the human brain plans reaching trajectories so as to minimize the variance of the final arm position caused by signal-dependent noise on motor commands; it also predicts human arm trajectories well. On the other hand, it has been reported that the characteristics of legged locomotor patterns can be well explained by the criterion of minimum energy cost [5]. Saving energy would be an important strategy for living bodies to survive. Some studies have examined whether the characteristics of reaching trajectories can be predicted by the criterion of minimum energy cost [6][7]; however, it has been shown that the optimal trajectory given by this criterion does not show the bell-shaped speed profile

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1091–1099, 2008.
© Springer-Verlag Berlin Heidelberg 2008
[7]. Does this result show that reaching trajectories are determined by smoothness or by minimization of the noise effect, independently of the energy cost? In our previous study, we examined the effect of noise on the trajectories given by the criterion of minimum energy cost [8]. The optimal torque planned by this criterion is large, especially at the beginning of the movement [7][8]. Such torque causes a large positional error if the variance of the noise increases with the amplitude of the control signal, as Harris and Wolpert assumed [4]. If the arm does not reach the target because of the noise, corrective sub-movements are necessary to compensate for the movement errors. Therefore, the trajectory planned by the criterion of minimum energy cost requires many corrective sub-movements to reach the target, and the total energy cost, including the cost of the sub-movements, becomes larger than that required by the criteria of minimum torque change or minimum end-point variance when the effect of noise is large [8]. Does the optimal trajectory that minimizes the expected value of the energy cost under signal-dependent noise, then, explain the characteristics of reaching trajectories? In this study, we computed the optimal trajectories and compared them with those of human subjects.
2 Experimental Methods

The subjects were one male and one female, both right-handed, aged 22 and 26 years. Figure 1 shows the experimental setup. The subjects sat on a chair with their shoulders fixed to the back of the chair by bands. To prevent the elbows from bending, they were fixed by taping. The subjects were instructed to put the index finger of the right hand on the start position, to move the arm to a circular target in a given duration, and then to keep the finger in the target for a certain period of time. The distance from the start position to the center of the target was 0.22 m, and the radius of the target was 0.03 m. The movement duration was 0.5 s or 1.5 s, and the interval was indicated by a metronome. The subjects practiced the reaching task for ten minutes before each measurement. When the subjects felt fatigued, they were allowed to take a short break. An optical marker was fixed on the tip of the index finger to record the trajectory with a three-dimensional motion capture system (Carrot/3D, Library) at a sampling frequency of 150 fps. The positional data were smoothed by a sixth-order low-pass Butterworth filter with a cutoff frequency of 10 Hz.

2.1 Computer Experiments

In the computer experiment, we computed the trajectories that minimize the expected value of the total energy cost required to reach the target (Fig. 2). The signal-dependent noise affecting a motor command signal u was assumed to be Gaussian with zero mean and variance σ² = k|u|² [4], and the optimal trajectories were computed for various values of k. The arm was modeled as a one-link, fourth-order linear model that includes inertia and viscosity [9]. The mass and length of the arm model were set to 2.5 kg and 0.55 m, respectively [10]. The damping coefficient around the joint was set to 0.2 Nms/rad [9].
The distance from the proximal end of the link to the center of mass was set to 0.29 m, estimated by the method of Winter (1990) [11].
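The signal-dependent noise model above is simple to state in code. The following sketch (our illustration, not the authors' implementation) corrupts a motor command u with zero-mean Gaussian noise of variance σ² = k|u|², so that the standard deviation scales with the command magnitude.

```python
import numpy as np

# Illustrative sketch (not the authors' code): a motor command u corrupted
# by zero-mean Gaussian noise whose variance is sigma^2 = k * |u|^2.
def noisy_command(u, k, rng):
    sigma = np.sqrt(k) * abs(u)
    return u + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
samples = [noisy_command(1.0, 0.012, rng) for _ in range(1000)]
```

Note that a zero command receives no noise at all, which is exactly why large early torques are penalized under this model: the noise amplitude tracks the control amplitude.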
[Figure: schematic of the experimental setup — two video cameras, start point, marker, and circular target (radius 0.03 m) at a distance of 0.22 m]
Fig. 1. The schematic view of the measurement experiment of arm reaching movements. The subjects sat on a chair with their shoulders fixed to the back of the chair by bands. To prevent their elbows from bending, the elbows were fixed by taping. The subjects (1) put the index finger of their right hand on the start position, (2) move the arm to a circular target in a given duration, and then (3) keep the finger in the target for a certain period of time. The movement duration was 0.5 s or 1.5 s, and the time interval was indicated by a metronome. An optical marker was fixed on the tip of the index finger of each subject, and arm trajectories were recorded at 150 fps by a motion capture system.
[Figure: computation scheme — the brain issues a motor command, noise is added, and the one-link arm (0.55 m) moves the finger (width 0.01 m) from the start position over a distance of 0.22 m to the target area (radius 0.03 m)]
Fig. 2. The computation of the optimal trajectory based on the criterion of minimum energy cost. The optimal motor command, which minimizes the expected value of the total energy cost required to reach the target, was computed by an optimization algorithm under the constraints that the hand stays within the target with a success rate of over 95 percent during a post-movement time and that the mean angular speed over the duration equals 0 rad/s. The radius of the target and the distance from the start position were 0.03 m and 0.22 m, respectively.
The movement distance D and the width W of the target area were set as 0.22 m and 0.03 m, respectively. The evaluation function for the criterion of minimum energy cost takes the form

\int_0^T \left\{ P(\tau_v, \dot{\theta}_v) + P(\tau_h, \dot{\theta}_h) \right\} dt,  (1)

where \tau_v, \tau_h, \dot{\theta}_v, \dot{\theta}_h, and T represent the torques to keep the arm in the horizontal plane and to move the hand in the plane, the angular velocities in the vertical and horizontal planes, and the movement duration, respectively. The angular velocity \dot{\theta}_v was set to zero, the torque \tau_v was constant, and the effect of signal-dependent noise on the torque \tau_v was ignored. The metabolic rate P was estimated by the following equation [6]:

P(\tau, \dot{\theta}) = \tau_{iso}(\tau, \dot{\theta}) \, \dot{\theta}_{max} \, \Phi(\dot{\theta}),  (2)
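On sampled trajectories, the integral of eq. (1) can be approximated numerically. The sketch below uses the trapezoidal rule; the function P_stub is a crude illustrative stand-in (clipped mechanical power) for the metabolic-rate model of eq. (2), which the paper takes from Alexander.

```python
import numpy as np

# Sketch of evaluating the energy-cost criterion of eq. (1) on sampled
# torque/velocity trajectories with the trapezoidal rule. P_stub is an
# illustrative stand-in for the metabolic rate P of eq. (2) (clipped
# mechanical power), NOT Alexander's ATP-based model used in the paper.
def P_stub(tau, theta_dot):
    return np.maximum(tau * theta_dot, 0.0)

def energy_cost(tau_v, thd_v, tau_h, thd_h, dt):
    P = P_stub(tau_v, thd_v) + P_stub(tau_h, thd_h)
    return float(np.sum(0.5 * (P[1:] + P[:-1])) * dt)  # trapezoidal integral
```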
[Figure: measured hand speed (m/s) versus time (s)]
Fig. 3. The hand speed profiles of arm reaching movements when the movement duration is (a) 0.5 s and (b) 1.5 s
[Figure: computed hand speed (m/s) versus time (s)]
Fig. 4. The hand speed profiles based on the criterion of minimum energy cost for the movement durations (a) 0.5 s and (b) 1.5 s. The noise parameter k was set as 0.012.
where \tau_{iso}, \dot{\theta}, \dot{\theta}_{max}, and \Phi represent the moment that the muscle exerts in isometric contraction, the angular velocity, the angular velocity corresponding to the muscle's maximum shortening speed, and the function that gives the metabolic rate of ATP consumption, respectively. The angular velocity \dot{\theta}_{max} was set to 15 rad/s, as in Alexander (1997) [6]. The optimal control signal that minimizes the expected energy cost was computed by the sequential quadratic programming (SQP) method under the constraints that the hand stays within the target with a success rate of over 95 percent during a post-movement time and that the mean angular speed over the duration equals 0 rad/s. The post-movement time was set to 0.15 s. The mean energy cost of eq. (1) was estimated by the unscented transform [12]. The time from the start of the movement to the end of the post-movement duration was divided into 50 time steps, and the optimal motor command at each time step was computed by the SQP. The computed motor command was
[Figure: torque (Nm) versus time (s)]
Fig. 5. The optimal torque based on the criterion of minimum energy cost. The command torque and the joint torque are represented by the solid line and the dotted line, respectively. The noise parameter k was set as 0.012; the movement duration was (a) 0.5 s and (b) 1.5 s.
interpolated by a third-order spline function into 100 time steps in order to compute the hand velocity and the energy cost.
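The unscented transform used to estimate the expected cost can be sketched in one dimension as follows. This is the generic textbook form of Julier and Uhlmann's method [12], not the authors' exact implementation; kappa is the usual scaling parameter, and for polynomials up to second order in a Gaussian variable the estimate is exact.

```python
import numpy as np

# Minimal 1-D unscented-transform sketch (Julier & Uhlmann [12]): estimate
# the mean of f(x) for x ~ N(mu, var) from three deterministic sigma points
# instead of Monte Carlo sampling. kappa is the standard scaling parameter.
def unscented_mean(f, mu, var, kappa=2.0):
    n = 1
    sigma = np.sqrt((n + kappa) * var)
    points = np.array([mu, mu + sigma, mu - sigma])
    weights = np.array([kappa / (n + kappa),
                        0.5 / (n + kappa),
                        0.5 / (n + kappa)])
    return float(np.dot(weights, f(points)))
```

For a quadratic cost, unscented_mean(lambda x: x**2, mu, var) returns exactly mu² + var, which is why so few evaluations suffice inside each SQP iteration.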
3 Results

Figure 3 shows typical speed profiles of the hand during reaching movements by one subject. When the movement duration was 0.5 s, the profile took a bell shape (Fig. 3(a)), but when the movement duration was 1.5 s, the speed gradually decreased in the middle of the movement and the profile took a collapsed shape (Fig. 3(b)). Similar characteristics were also observed in the other subject.
[Figure: optimal hand speed (m/s) versus time (s) for k = 0.005, 0.01, and 0.012]
Fig. 6. The optimal speed based on the criterion of minimization of energy cost. The movement duration was set as 0.5 s. The speed profiles for the noise parameter k = 0.005, 0.01, and 0.012 are represented by the solid, dotted, and dashed lines, respectively.
Figure 4 shows the speed profiles based on the criterion of minimum energy cost. The characteristics are in good agreement with the experimental ones: when the movement duration is short, the speed profile takes a bell shape, but when the movement duration is long, the speed profile takes a collapsed shape. Figure 5 shows the optimal torques based on the criterion of minimum energy cost. When the movement duration is 0.5 s, the torque changes gradually. On the other hand, when the movement duration is 1.5 s, the command torque becomes small and takes zero in the middle of the movement, where the arm shows a kind of ballistic movement against viscosity (Fig. 5(b)); this causes a gradual decrease of hand speed and makes the speed profile collapsed. Figure 6 shows the speed profiles for the movement duration T = 0.5 s and the noise parameters k = 0.005, 0.01, and 0.012. As the noise parameter becomes smaller, the speed profile approaches a collapsed shape, and as it becomes larger, the profile approaches a bell shape. These results show that when the task is difficult, e.g., when the noise effect is large or the movement duration is short, the joint torque changes gradually so as to suppress the effect of noise, as suggested in [8], and as a result the speed profile takes a bell shape. On the other hand, when the task is easy to accomplish, e.g., when the noise effect is small or the movement duration is long, a large torque is produced at the beginning of the movement, which creates a period of ballistic movement in the middle of the movement; this suppresses the energy cost and makes the speed profile collapsed, as observed in the trajectories that minimize the energy cost without considering the noise effect [7].
4 Discussion

The experimental results showed that the optimal trajectories that minimize the expected energy cost under the noise effect were in good agreement with the measured
trajectories in that when the movement duration is short, the speed profile takes a bell shape, and when the duration is long, the profile becomes collapsed. The results also show that the speed profile became closer to a collapsed shape as the noise became smaller. Most studies that have discussed the trajectory planning of reaching movements focus on movements whose duration is within about one second. Since the speed profile of such a movement takes a bell shape, this characteristic has been treated as a guide for investigating the constraint that determines reaching trajectories. On the other hand, few studies have addressed slower reaching movements, for several reasons, such as that the profile takes a somewhat different shape on every trial. Because humans usually do not perform slow reaching movements, it is possible that the subjects had not acquainted themselves with such movements. Therefore, it might be too early to attribute the constraint of slow movements to the criterion of minimum energy cost. However, it is notable that not only the bell-shaped speed profile of rapid movements but also the collapsed shape often observed in slow reaching movements can be explained by the criterion of minimum energy cost. Soechting et al. (1995) reported that the final posture of the arm is determined so as to minimize the amount of work that must be done to transport the arm from the starting position in three dimensions [15]. Miyamoto et al. (2004) proposed the maximum task achievement criterion, a combination of task achievement and energy consumption, and reported that the criterion can explain the trajectory of a tracking movement task [16]. These studies also suggest that the energy cost is an important factor in determining movement trajectories. In this study, the signal-dependent noise was assumed to be Gaussian with zero mean and variance σ² = k|u|², following Harris and Wolpert (1998). Iguchi et al.
(2005) showed that whether the movement planned by minimization of the end-point variance can predict the actual movement depends on the variance coefficient p, and noted the difficulty of estimating this coefficient in living bodies [13]. Factors other than the noise may also cause the positional error, for instance, temporal changes in muscle properties and errors in recognizing the target position. For further analysis of the optimality of reaching trajectories based on energy cost, experimental data on the relationship between these factors and the positional error are needed.
5 Conclusions

In this study, we computed the arm reaching trajectories that minimize the expected value of the total energy cost under the effect of signal-dependent noise for various movement durations. The results showed that the optimal trajectories were in good agreement with the measured trajectories in that when the movement duration is short, the speed profile takes a bell shape, and when the duration is long, the speed profile takes a collapsed shape. This result suggests that the human brain solves the redundancy problem of trajectory planning under the constraint of minimizing the expected value of the energy cost.
Acknowledgments

This work has been partially supported by a Grant-in-Aid for Scientific Research on Priority Areas "Emergence of Adaptive Motor Function through Interaction between Body, Brain and Environment" from the Japanese Ministry of Education, Culture, Sports, Science and Technology.
References
1. Uno, Y., Kawato, M., Suzuki, R.: Formation and control of optimal trajectory in human multijoint arm movement. Biol. Cybern. 61(2), 89–101 (1989)
2. Hogan, N.: Adaptive control of mechanical impedance by coactivation of antagonist muscles. IEEE Trans. Automatic Control 29, 681–690 (1984)
3. Nakano, E., Imamizu, H., Osu, R., Uno, Y., Gomi, H., Yoshioka, T.: Quantitative examinations of internal representations for arm trajectory planning: Minimum commanded torque change model. J. Neurophysiol. 81(5), 2140–2155 (1999)
4. Harris, C.M., Wolpert, D.M.: Signal-dependent noise determines motor planning. Nature 394, 780–784 (1998)
5. Nishii, J.: Legged insects select the optimal locomotor pattern based on energetic cost. Biol. Cybern. 83, 435–442 (2000)
6. Alexander, R.McN.: A minimum energy cost hypothesis for human arm trajectories. Biol. Cybern. 70(5), 97–105 (1997)
7. Nishii, J., Murakami, T.: Energetic optimality of arm trajectory. In: Proc. Int. Conf. on Biomechanics of Man, pp. 30–33 (2002)
8. Taniai, Y., Nishii, J.: An evaluation of trajectory planning models based on energy cost during reaching submovements (in Japanese). IEICE Transactions on Information and Systems J90-D(12), 3257–3264 (2007)
9. Van der Helm, F.C.T., Rozendaal, L.A.: Musculoskeletal systems with intrinsic and proprioceptive feedback. In: Winters, J.M., Crago, P.E. (eds.) Biomechanics and Neural Control of Posture and Movement. Springer, New York (2000)
10. Suzuki, K., Uno, Y.: Brain adopts the criterion of smoothness for most quick reaching movements (in Japanese). IEICE Transactions on Information and Systems J88-D-II(2), 711–722 (2000)
11. Winter, D.A.: Biomechanics and Motor Control of Human Movement. Wiley, New York (1990)
12. Julier, S., Uhlmann, J.: A new extension of the Kalman filter to nonlinear systems. In: Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, Orlando, FL (1997)
13. Iguchi, N., Sakaguchi, Y., Ishida, F.: The minimum endpoint variance trajectory depends on the profile of the signal-dependent noise. Biol. Cybern. 92, 219–228 (2005)
14. Alexander, R.McN.: Optima for Animals. Princeton Univ. Press, Princeton (1996)
15. Soechting, J.F., Buneo, C.A., Herrmann, U., Flanders, M.: Moving effortlessly in three dimensions: does Donders' law apply to arm movement? J. Neuroscience 15, 6271–6280 (1995)
16. Miyamoto, H., Nakano, E., Wolpert, D.M., Kawato, M.: TOPS (Task Optimization in the Presence of Signal-Dependent Noise) model. Systems and Computers in Japan 35, 48–58 (2004)
Influence of Neural Delay in Sensorimotor Systems on the Control Performance and Mechanism in Bicycle Riding

Yusuke Azuma and Akira Hirose
Department of Electronic Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
Abstract. Neural sensorimotor systems have unavoidable delay in their processing. In this paper, we investigate the influence of this delay on control performance and mechanism in numerical experiments with a simple network. The results suggest not only that the delay degrades the control performance of reflex motion in proportion to the amount of delay, but also that the delay influences signal generation for the related volitional motion. A moderate delay can even favor volitional motion learning. The motion-generation mechanism also appears to depend on the amount of delay: a feedback strategy changes into a feedforward one with rough prediction.
1 Introduction

Biological sensorimotor systems have unavoidable delay between sensory signals and their motor effects because of transmission and processing delays as well as the mechanical inertia of objects. In humans, the delay is typically several tens to a few hundreds of milliseconds. When we require a quicker response, we need to overcome this delay. In this sense, we have to be adaptive in a predictive manner [1]. An interesting experiment was presented in a paper by Ishida & Sawada [2]. A test subject tracks a target moving on a computer screen with a mouse (cursor) for various periodic motions. When the frequency of the target's reciprocating motion is low, the cursor follows the target to catch up with the movement. However, when the frequency exceeds a certain threshold, the cursor precedes the target and, in this case, the manner of the preceding is such that the transient locus error is minimized when the target moves abruptly and non-periodically. This fact suggests, at least, that human beings behave in a certain special way to overcome the unavoidable delay. In this paper, we investigate the delay effect on a simple sensorimotor neural network experimentally. We assume the task of riding a bicycle to observe reflex and volitional motions, in which a feedforward network is loaded with various amounts of delay. We find in experiments that the delay degrades the reflex motion performance in proportion to the delay time up to a certain amount of delay; for a larger delay, the performance deteriorates rapidly. In addition, the delay influences the performance of volitional signal generation for realizing a motion related to the previously learned reflex motion.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1100–1109, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Thereby, it is also suggested that a moderate delay sometimes results in better overall learning of the volitional motion, because the control strategy changes from a feedback method into a predictive one.
2 Modeling

2.1 Bicycle and Human Body

Figure 1(a) is a top view of a bicycle moving in a small time step Δt. The advance Δs is related to the distance between the front and rear wheels D, the handle angle φ, the curvature radius of the bicycle trajectory R, and the change in the forward movement direction Δψ as

\Delta s \sin\phi = D \sin\Delta\psi \approx D \, \Delta\psi.  (1)

Since R = \Delta s / \Delta\psi = D / \sin\phi, we obtain

\Delta\psi = v \, \Delta t / R.  (2)
[Figure: (a) top view with Δψ, φ, Δs, and D; (b) rear view with tilt angles α and β, length l, mass m, and the gravity and centrifugal forces]
Fig. 1. (a) Top view and (b) rear view of a man riding a bicycle
Figure 1(b) is a rear view of the bicycle and human body, which is analogous to a double inverted pendulum, where α and β denote the tilt angle of the bicycle to the ground and that of the human to the bicycle, respectively, m is the total weight of the bicycle and the human body, and l is the length from the ground contact point of the wheel to the center of gravity of the total mass. Since the rate of change of the angular momentum, ml^2 \ddot{\beta}, is equal to the torque, i.e., the difference between the force of gravity mg and the centrifugal force mv^2/R, we have

m l^2 \ddot{\beta} = m g l \sin\beta - l m v^2 \cos\beta / R.  (3)

We calculate the state evolution by (3). A fall is expressed as β = π/2.
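A forward-Euler integration of eq. (3) can be sketched as follows (our illustration, not the authors' code). The curvature radius R = D/sin φ comes from eqs. (1)-(2); the values D = 1.0 m, v = 3 m/s, and Δt = 1 ms match the simulation settings, while l = 0.8 m is an assumed height of the center of gravity.

```python
import math

# Sketch (not the authors' code) of integrating eq. (3),
#   m l^2 beta_ddot = m g l sin(beta) - m l v^2 cos(beta) / R,
# with a forward-Euler step; R = D / sin(phi) follows from eqs. (1)-(2).
# l = 0.8 m is an assumed height of the center of gravity.
def simulate_tilt(phi, beta0=0.01, beta_dot0=0.0, D=1.0, l=0.8,
                  v=3.0, g=9.8, dt=1e-3, steps=1000):
    beta, beta_dot = beta0, beta_dot0
    for _ in range(steps):
        R = D / math.sin(phi) if phi != 0.0 else math.inf
        beta_ddot = (g * math.sin(beta) - v**2 * math.cos(beta) / R) / l
        beta_dot += beta_ddot * dt
        beta += beta_dot * dt
        if abs(beta) >= math.pi / 2:   # the model treats |beta| = pi/2 as a fall
            break
    return beta
```

With the handle straight (φ = 0) the centrifugal term vanishes and the tilt grows like an inverted pendulum; a steady right turn (φ > 0) produces a centrifugal torque that drives the tilt the other way.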
The values of the physical parameters were chosen as D = 1.0 m, heights of both the bicycle and the body 0.8 m, body weight 70 kg, bicycle weight 10 kg, a constant bicycle velocity of 3 m/s for simplicity, and a calculation time step Δt = 1 ms.

2.2 Neural Network

We model reflex motion, which is quick and simple, and volitional motion, which may be slow and complex. When we ride a bicycle, we control our sitting posture and handle direction to balance ourselves. If we feel ourselves falling to the right, for example, we quickly turn the handle to the right and tilt our body to the left. In the simulation presented below, we control the handle and the body using a network. The bicycle velocity is kept constant for simplicity. We realize reflex control by a simple single-layer feedforward neural network. By choosing the neutral position of the body and direction of the handle as 0, we find spatial symmetry with respect to the 0 position or direction. Therefore, we employ a hyperbolic-tangent activation function, with which we represent left and right by positive and negative numbers.
[Figure: network diagrams with inputs α (and α̇), weights wθα, wφα, wθα̇, wφα̇, delay d, outputs θ and φ, and volitional inputs Aturnθ and Aturnφ in panel (c)]
Fig. 2. Neural network constructions for (a) reflex motion simulation with one input of tilt angle α, (b) that with two inputs of tilt angle α and its derivative α̇, and (c) volitional motion with additional volitional signals for the body tilt neuron, aturnθ, and the handle angle neuron, aturnφ. Neural delay is assumed to be d for both the θ and φ neurons.
Figure 2 shows the network constructions we employ in this paper to observe the influence of delay. First, as a preliminary experiment, we conduct a reflex motion experiment using Network (a) with a single input of the bicycle tilt angle α. The neural delay is expressed as d, which is assumed to apply to the control of both θ and φ in this paper. Next, using Network (b), we examine reflex motion using both α and its rate of change (time derivative) α̇. Finally, we observe a related volitional motion using Network (c), with additional inputs of volitional signals fed to the body tilt neuron aturnθ and the handle angle neuron aturnφ.
3 Reflex Motion in Bicycle Riding

3.1 Single Input Network (Preliminary Experiment)

We examine the delay influence on reflex motion for Network (a) with a single input of the bicycle tilt α only. We generate neural weights at random and observe the behavior of
[Figure: state trajectories in the α–α̇ plane]
Fig. 3. State trajectories of the single-input system for delays of (a) d = 0 and (b) d = 5 ms
the sensorimotor system. The initial state is α = 0.01 rad and α̇ = 0.01 rad/s. Typical results are illustrated in Fig. 3 as state trajectories in the α–α̇ space. Figure 3(a) is a result when the network has no delay; we find a stable trajectory. On the contrary, Fig. 3(b) shows a typical result when a small delay (5 ms) exists: the trajectory is unstable, and the bicycle finally falls. It is obvious that even a small delay influences the behavior of a sensorimotor system.

3.2 Two Input Network (Bicycle Tilt and Its Changing Rate)

Human beings also perceive the motion of the outer world visually. We simplify this fact into the availability of the time derivative of the tilt, α̇, in addition to α itself. The network construction is shown in Fig. 2(b). Again we choose neural weights at random to examine whether the bicycle fluctuation diminishes or not, where "diminish" means that both α and θ remain within ±10⁻³ rad for more than 5 s. In Fig. 4, we plot the weights wθα, wφα, wθα̇, and wφα̇ only when the fluctuation diminishes. The weights construct a four-dimensional space in total, while the chart represents them on two planes. The area where the points exist shows the parameter region of stable control. Figure 4(a) shows the result when there is no delay: the stable area is wide. Figure 4(b) is a result for a delay of 100 ms: the stable region is limited to a belt. Figure 4(c) is for a delay of 200 ms: the belt is much thinner. For a larger delay, we cannot find stable regions any more. A larger delay obviously restricts the stable parameter area. An example of a stable state trajectory is shown in Fig. 5.
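The delayed reflex loop of Fig. 2(b) can be sketched as below. This is a deliberately simplified illustration, not the paper's simulator: it controls the handle only (body-tilt control is omitted), the weights are hand-picked rather than drawn at random, and a FIFO buffer supplies the tilt and its rate as seen d samples ago to a single tanh unit.

```python
import math
from collections import deque

# Simplified sketch of the delayed reflex loop of Fig. 2(b): a single tanh
# unit maps the tilt alpha and its rate alpha_dot, each delayed by
# delay_steps samples, to the handle angle phi. Body-tilt control is
# omitted; the weights are illustrative, not learned values from the paper.
def run_reflex(w_a, w_ad, delay_steps, dt=1e-3, T=5.0,
               D=1.0, l=0.8, v=3.0, g=9.8):
    alpha, alpha_dot = 0.01, 0.01              # initial state of Sec. 3.1
    buf = deque([(alpha, alpha_dot)] * (delay_steps + 1))
    for _ in range(int(T / dt)):
        a_del, ad_del = buf.popleft()          # state seen delay_steps ago
        buf.append((alpha, alpha_dot))
        phi = math.tanh(w_a * a_del + w_ad * ad_del)
        R = D / math.sin(phi) if phi != 0.0 else math.inf
        alpha_ddot = (g * math.sin(alpha) - v**2 * math.cos(alpha) / R) / l
        alpha_dot += alpha_ddot * dt
        alpha += alpha_dot * dt
        if abs(alpha) >= math.pi / 2:          # fall
            return False
    return abs(alpha) < 1e-3                   # the 'diminish' criterion
```

With no delay, a sufficiently strong steer-into-the-fall gain (e.g. w_a = 5, w_ad = 2) stabilizes the tilt, whereas zero weights leave a bare inverted pendulum that falls; increasing delay_steps shrinks the set of stabilizing weights, in line with Fig. 4.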
4 Volitional Motion Relevant to Reflex Motion

4.1 Tasks, Conditions, and Preparatory Reflex Motion Learning

Next we investigate the delay influence on volitional behavior. We employ Network (c), shown in Fig. 2(c), which has additional inputs of volitional signals at the body tilt and handle angle neurons, aturnθ and aturnφ, respectively.
[Figure: scatter plots of stable weight pairs in the wθα–wφα and wθα̇–wφα̇ planes for d = 0, 100, and 200 ms]
Fig. 4. Neural weight maps wθα–wφα and wθα̇–wφα̇ for stable trials when the delay is (a) d = 0, (b) d = 100 ms, and (c) d = 200 ms, respectively
[Figure: stable state trajectory in the α–α̇ plane]
Fig. 5. Stable state trajectory when the time derivative of the tilt angle α̇ is available as an input in reflex motion
[Figure: (a) bicycle tilt α versus time t, with the stabilization time T marked where the peak level falls below 10%; (b) stabilization time T (s) versus delay time d (ms)]
Fig. 6. (a) Definition of the stabilization time T in the waveform of bicycle tilt α versus time t, and (b) obtained stabilization time T versus delay time d after completion of reinforcement learning of the four weights wθα, wφα, wθα̇, and wφα̇
First, we conduct a preparatory experiment of reinforcement learning of reflex motion for bicycle stabilization. It simulates the unconscious daily learning that acquires the reflex reaction. In the learning, the four weights wθα, wφα, wθα̇, and wφα̇ are adjusted to minimize the stabilization time T defined in Fig. 6(a). This process optimizes the reflex motion. Figure 6(b) shows the stabilization time T versus the neural delay time d. We find that T grows in proportion to d up to d ≈ 250 ms; for a larger d, the stabilization time T increases drastically. A delay of 250 ms is thus a critical value in the present system. After this reflex motion learning is completed, we proceed to the volitional motion learning, in which the system learns the volitional signal waveforms while the weights are fixed. The simulation schedule is similar to that in developmental learning [3]. Figure 7(a) illustrates the task in the volitional motion of turning to the right. The deviation of the actual locus from the ideal path is accumulated step by step to yield the average
locus error e, which is evaluated in the reinforcement learning. To change the body tilt and the handle direction, we feed volitional signals aturnθ(t) and aturnφ(t) to the respective neurons; these signals are variable and are optimized in the volitional-motion learning. Figure 7(b) shows the waveforms of the signals aturnθ(t) and aturnφ(t), which correspond to a half cycle of a cosine waveform, with four parameters: start time ts, end time te, signed (positive or negative) amplitude for the body Aturnθ, and that for the handle Aturnφ. The volitional signals are optimized in terms of these four parameters. That is, the system learns the optimal ts, te, Aturnθ, and Aturnφ in another stage of reinforcement learning.
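The paper does not specify the reinforcement-learning rule used in these stages. A minimal stand-in that captures the idea — repeatedly perturb the parameter vector (here the four values ts, te, Aturnθ, Aturnφ) and keep changes that reduce a scalar cost such as the average locus error E — is the stochastic hill-climb below; the cost function in the usage note is an arbitrary surrogate, not the paper's E.

```python
import random

# Stand-in for the reinforcement-learning stage (the paper does not give
# the update rule): stochastic hill-climbing over a parameter vector,
# keeping a Gaussian perturbation only when it lowers the scalar cost.
def hill_climb(cost, p0, step=0.5, iters=200, seed=0):
    rng = random.Random(seed)
    p, best = list(p0), cost(p0)
    for _ in range(iters):
        cand = [pi + rng.gauss(0.0, step) for pi in p]
        c = cost(cand)
        if c < best:                 # keep only improvements
            p, best = cand, c
    return p, best
```

For example, hill_climb(lambda p: sum((x - 1.0)**2 for x in p), [0.0]*4) drives a quadratic surrogate cost well below its initial value of 4.0.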
Fig. 7. (a) The locus error e to be accumulated and averaged in the task of turning to the right (ideal path over a 40 m × 40 m course), and (b) waveforms of the volitional signals fed to the body tilt neuron, aturnθ, and the handle angle neuron, aturnφ
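The volitional signal of Fig. 7(b) — zero outside [ts, te] and a signed half-cosine-cycle bump inside — can be written as follows. The exact parameterization of the bump is our assumption, since the text only specifies the half-cycle shape and the four parameters.

```python
import math

# Sketch of the volitional turn signal of Fig. 7(b): a smooth bump of
# signed amplitude A between start time ts and end time te, zero outside.
# Writing the bump as A*sin(pi*(t-ts)/(te-ts)) (half a cosine cycle,
# phase-shifted) is our assumption about the exact parameterization.
def volitional_signal(t, ts, te, A):
    if t <= ts or t >= te:
        return 0.0
    return A * math.sin(math.pi * (t - ts) / (te - ts))
```

The signal rises from 0 at ts, peaks at the signed amplitude A at the midpoint, and returns to 0 at te, so the learner's four parameters fully determine the waveform.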
4.2 Results of the Learning

Figure 8 shows plots of the start time ts and end time te of volitional signal injection, optimized by the reinforcement learning for various neural delay times d. Each delay value d contains several successful learning results corresponding to multiple trials. The origin of the time axis is the moment when the bicycle leaves the starting point in Fig. 7(a). When d is smaller than about 100 ms, the dispersion of ts and te is small. The signal duration (te − ts) is found to be minimum at d ≈ 120 ms. For a larger d, the dispersion gradually grows. When d > 250 ms, the dispersion becomes very large, and the learning process often fails. Figure 9 shows the average locus error E versus the neural delay d. We find that E is minimum at d ≈ 120 ms. Figure 10 shows the obtained signed amplitudes of the volitional signals, Aturnθ and Aturnφ, versus the delay time d. Their absolute values decrease as the delay time d increases.

4.3 Discussion

In Fig. 8, the volitional signal duration is minimum at around d = 120 ms. This value agrees with the minimum-error delay in Fig. 9. It is interesting that a moderate delay results in better learning of the volitional motion than a small d (< 50 ms).
Influence of Neural Delay in Sensorimotor Systems 1107
Fig. 8. Turn signal start time ts and end time te versus delay time d
Fig. 9. Average locus error E versus delay time d
In Fig. 10, for larger d, the absolute values of the signed amplitudes Aturnθ and Aturnφ approach 0. The reason lies in the slow feedback in the reflex motion for a large d: large volitional signals make the bicycle fall down because of the slow feedback. The amplitudes decrease sharply as d changes, while the error E changes only moderately. The slow feedback caused by a large d enhances the effect of the small-amplitude volitional signals. In the above experiment, we found that a moderate delay leads to better learning of the volitional motion of turning to the right while riding the bicycle. An explanation is
1108 Y. Azuma and A. Hirose
Fig. 10. Optimized signed amplitudes of turn signals Aturnθ and Aturnφ versus delay time d
given as follows. When the neural delay d is small, the sensorimotor system observes the state instantaneously, without delay, and controls the bicycle in a feedback manner by spending a certain time. Contrarily, when d is large, the system cannot observe the effect of its control and therefore cannot employ a feedback strategy. Instead, the system generates signals in a cavalier, or predictive, fashion and then simply waits for the result. This is feedforward control. Another explanation is available if we regard the learning as a class of search, as follows. In Fig. 4, we found that the successful parameter region becomes narrower as d grows in the reflex motion learning. If the delay is small, the reflex learning may stop prematurely, before reaching a better state, because the task is too easy. In contrast, if a moderate delay exists, the reflex motion learning progresses further to a much better state, which may lead to better learning of the volitional motion. A delay d that is too large (> 200 ms) makes the task fatally difficult overall.
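The general point that feedback control degrades with delay can be illustrated with a toy delayed-feedback loop. This sketch is unrelated to the bicycle model itself; the dynamics, gains, and thresholds are chosen freely for illustration.

```python
def simulate(gain, delay_steps, steps=2000, dt=0.01):
    """Euler-integrate x'(t) = -gain * x(t - delay) and report whether
    the trajectory stays bounded (a crude stability probe: the same
    proportional feedback that is stable with a short delay diverges
    when the delay grows)."""
    history = [1.0] * (delay_steps + 1)  # x held at 1 before t = 0
    x = 1.0
    for _ in range(steps):
        x += dt * (-gain * history[0])   # feedback acts on the old state
        history.append(x)
        history.pop(0)
        if abs(x) > 1e6:
            return False                 # diverged
    return True                          # stayed bounded
```

With a small delay the loop damps the state; with the same structure but an effectively large delay-gain product, the delayed correction arrives out of phase and the oscillation grows, which is why large volitional signals under slow feedback make the bicycle fall.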
5 Summary

A sensorimotor system learns how to overcome an unavoidable delay in responding to its environment. It has been shown that a moderate delay sometimes results in better learning of the relevant volitional motion. An explanation is as follows. When the neural delay is small, the sensorimotor system observes the state instantaneously and continuously, and controls the bicycle in a feedback manner. Contrarily, when the delay is moderately large, the system cannot observe the effect of its control output and therefore cannot employ a feedback strategy. Instead, the system generates signals in a cavalier, or predictive, fashion and then simply waits for the future result. This suggests that the amount of delay influences the neural learning strategy.
References

1. Miall, R., Weir, D., Stein, J.: Manual tracking of visual targets by trained monkeys. Behavioural Brain Research 20(2), 185–201 (1986)
2. Ishida, F., Sawada, Y.: Human hand moves proactively to external stimulus: An evolutional strategy for minimizing transient error. Physical Review Letters 93(16), 168105 (2004)
3. Hirose, A., Asano, Y., Hamano, T.: Developmental learning with behavioral mode tuning by carrier-frequency modulation in coherent neural networks. IEEE Transactions on Neural Networks 17(6), 1532–1543 (2006)
Global Localization for the Mobile Robot Based on Natural Number Recognition in Corridor Environment Su-Yong An, Jeong-Gwan Kang, Se-Young Oh, and Doo San Baek LG 302, POSTECH, San 31 Hyoja-dong, Namgu, Pohang, Gyungbuk, 790-784, Korea {grasshop,naroo1,syoh,videru}@postech.ac.kr
Abstract. This paper proposes global localization for a mobile robot by introducing local-goal-based navigation and model-based object recognition. In the navigation stage, the robot follows the wall while detecting doors with a laser scanner, and then sets up a local goal near each detected door. In the recognition stage, room numbers are recognized, and ambiguous room numbers are rejected by a multistage rejection method (MSRM) in order to reduce false recognition. Recognition results obtained by various methods are demonstrated, and a room number feature map is built after exploring the whole corridor of the LG research center. Keywords: Mobile robot, global localization, natural landmark, neural network.
1 Introduction

In the case of a service mobile robot that guides a user to a specific destination in an office building, the robot should know its real pose until the task is finished so that it can take an action appropriate to the current situation; otherwise, it behaves abnormally and finally arrives at a false destination. In this sense, self-localization is an essential ability for successful task achievement by a mobile robot [3]. In mobile robotics research, localization methods are generally classified into two categories: local localization and global localization [1]. Local localization is a method of tracking the pose of the robot successively, given an initial pose. However, if data association fails repeatedly, the probability of false correction increases during the localization process, and in the worst case the robot cannot track its real pose at all. On the other hand, global localization is a method of perceiving the pose of the robot with respect to its own representation of the world, without any prior knowledge of the initial pose; therefore, the kidnapped robot problem, as well as the case of local localization failure, can be solved through global localization. Global localization can be realized with a range sensor or a vision sensor. Thrun et al. showed global localization using Monte Carlo Localization (MCL) with a laser scanner in a common office building [1], [2]. But when the robot detects an object which is not in the global map (e.g., moving people), or when the environment is simple like a corridor, global localization is hard to accomplish in a short time. For example, a corridor is a very symmetric environment, so certain sensor readings correspond not to a single pose but to multiple poses in the global map. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1110–1119, 2008. © Springer-Verlag Berlin Heidelberg 2008
Therefore, most state-of-the-art algorithms use a vision sensor for global localization. By associating features detected in the current scene image with pre-registered features in a global map, the absolute pose of the robot can be estimated. In this case, features are divided into artificial and natural landmarks [8]. Using artificial landmarks has the disadvantage that the environment has to be modified. By contrast, SIFT features, as one kind of natural landmark, are widely used for global localization. Stephen et al. considered global localization as a recognition problem, matching the distinctive SIFT features detected in the current frame to a pre-built SIFT database map [4], [5]. However, the computational complexity of extracting SIFT features is quite high, and the possibility of false feature matching increases if there are already many SIFT features in the database. Meng et al. used the room number as a natural landmark, which offers sufficient information for global localization of a humanoid mobile robot [7]. Tomono et al. implemented recognition of doors and room numbers as research on global localization and object-recognition-based navigation using visual landmarks [6]. They detected a door by using edges and pixel intensities of a scene image and recognized the room number by employing template matching. This paper proposes global localization for a mobile robot by introducing model-based (i.e., room number model) object recognition, and also presents an efficient navigation strategy for detecting and recognizing room numbers.
2 Detection of the Door

A vision sensor is usually used to recognize a door, and the borders of the door can be represented as vertical edges in an indoor scene image. By detecting these vertical edges, door candidates can be extracted, and the color properties (e.g., pixel values) of the candidates compared with those of the door model. However, color properties are susceptible to lighting conditions, and vertical edges appear at the corners of walls and at pillars as well as at the borders of doors, so the number of candidates to be verified increases. Furthermore, motion blur occurs in the scene image when the robot moves fast, and thus additional constraints related to the velocity of the robot are imposed. Therefore, we use a laser scanner to detect the door, even though it provides less rich information than a vision sensor.
Fig. 1. (a) 2D door model, (b) 3D door model, (c) Clusters and detected door
2.1 How to Detect the Door

Fig. 1(a) and (b) show the 2D and 3D door models. wl is the minimum length for being classified as a wall, and dl is the width of a door. The 3D door model includes the position of the number plate, which is located at (-p, 0, h) when one of the end points of the door is taken as the origin. The robot continuously gathers laser scan data and clusters it so that walls and doors are detected. A laser scan pattern must satisfy the conditions below in order to be detected as a door.

• At least two clusters with a length of wl exist.
• The distance between the end point of cluster ci and the start point of cj is within 0.8m~1.1m.
• The absolute values of the angle differences, |ai - aj|, |aj - ak|, |ak - ai|, are lower than a certain threshold.

Here {cn| n = i, j} is a cluster, dk is a detected door, and {an| n = i, j, k} is the angle of a cluster or door with respect to the local frame; these are shown in Fig. 1(c). The space between the end point of cluster ci and the start point of cj can be a candidate for the door, and these two points offer basic information for local-goal-based navigation and room number recognition.

2.2 Local Goal Based Navigation

A flow chart of the navigation strategy for room number recognition is shown in Fig. 2. At the beginning of the procedure, the robot goes straight until it confronts a wall; if a wall is detected in the current scan data, the robot aligns itself parallel with it. The robot performs a wall-following motion until a door is detected, while setting a local goal 1m perpendicularly apart from the wall. After detecting a door, the robot sets up two local goals using the end points of the door and the relationship between the door and the number plate in the 3D door model. The method of setting the local goal is depicted in Fig. 3(a), and Fig. 3(b) shows the local path made by the local goals. A point P1 which is p apart from one end point (e.g., D1) of the detected door is the candidate position of the number plate.
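The three door conditions above can be sketched roughly as follows. Only the 0.8 m to 1.1 m gap is taken from the text; the wall length wl, the angle tolerance, and the cluster representation are illustrative placeholders.

```python
import math

def is_door(cluster_i, cluster_j, wl=1.0, gap_range=(0.8, 1.1), angle_tol=0.1):
    """Check whether the gap between two scan clusters looks like a door.
    Each cluster is a dict with 'start', 'end' (2D points) and 'angle'
    (radians); wl and angle_tol are assumed values."""
    def length(c):
        return math.dist(c['start'], c['end'])

    # Condition 1: both neighbours are long enough to count as walls.
    if length(cluster_i) < wl or length(cluster_j) < wl:
        return False
    # Condition 2: the gap between the clusters matches a door width.
    gap = math.dist(cluster_i['end'], cluster_j['start'])
    if not (gap_range[0] <= gap <= gap_range[1]):
        return False
    # Condition 3: the two walls are roughly collinear.
    return abs(cluster_i['angle'] - cluster_j['angle']) <= angle_tol
```

The two gap end points that pass this test are exactly the points the paper reuses for local-goal placement and number-plate search.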
By using an angle of the wall θw, we can find R which is p apart from the
Fig. 2. Flow chart of navigation strategy for recognition of room number
robot, and then calculate the candidate position P1 by means of the already known points R, D1, and the origin. Finally, the point G1, which is g vertically from the wall, is one of the local goals.

\[ G_1 = \frac{1}{2}(D_1 - R) + R_{\theta_{w\perp}} T, \tag{1} \]

where \( R = \begin{bmatrix} p\cos\theta_w \\ p\sin\theta_w \end{bmatrix} \), \( T = \begin{bmatrix} -g \\ 0 \end{bmatrix} \), and \( R_{\theta_{w\perp}} = \begin{bmatrix} \cos\theta_{w\perp} & -\sin\theta_{w\perp} \\ \sin\theta_{w\perp} & \cos\theta_{w\perp} \end{bmatrix} \). T is the factor that determines the distance between the local goal and the wall, and \( \theta_{w\perp} \) is the angle of the line perpendicular to the wall. The remaining local goal, G2, can be calculated in the same way. If two doors are detected in the current laser scan data, then all four goals are set up and sorted according to their distance from the origin of the local frame; linking these local goals sequentially forms the local path shown in Fig. 3(b). While following the local path, the velocity of the robot is reduced to 0.3m/s so as to diminish the motion blur effect in the scene image.
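A hypothetical transcription of Eq. (1) in code follows. The reconstruction of the equation from the original layout is uncertain (in particular the 1/2 factor), so treat this as a sketch of the geometry, not the authors' implementation.

```python
import math

def local_goal(d1, theta_w, p, g):
    """Local goal G1 as read from Eq. (1).
    d1: door end point (x, y) in the robot's local frame,
    theta_w: wall angle, p: plate offset along the wall, g: clearance
    from the wall. All names are assumptions from the surrounding text."""
    theta_perp = theta_w + math.pi / 2            # direction normal to the wall
    r = (p * math.cos(theta_w), p * math.sin(theta_w))
    t = (-g, 0.0)
    # Rotate T by theta_perp (the 2x2 rotation matrix in Eq. (1)).
    tx = math.cos(theta_perp) * t[0] - math.sin(theta_perp) * t[1]
    ty = math.sin(theta_perp) * t[0] + math.cos(theta_perp) * t[1]
    return (0.5 * (d1[0] - r[0]) + tx, 0.5 * (d1[1] - r[1]) + ty)
```

For a wall along the x-axis (theta_w = 0), the rotated T term simply pushes the goal a distance g off the wall, matching the role Eq. (1) assigns to T.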
3 Recognition of Room Number

In the work of Tomono et al., in order to recognize the room number, the robot first searches for a predefined number plate and then inspects the room number within the found plate. However, this method has the constraint that the robot must keep prior knowledge of the model of the number plate as well as of the room number. That is to say, before recognizing the room number, the robot must recognize the number plate; therefore, the full shape of the number plate must appear in the current scene image. Otherwise, the robot cannot recognize the room number even when it appears in the current scene image. Thus, this paper proposes a method of estimating the pose of the robot using only a model of the room number, which is shown in Fig. 4.
Fig. 3. (a) Setting up the local goal, (b) Generated path from the local goals
Fig. 4. Room number model
3.1 Extraction of Room Number Candidates

In order to recognize the room number in a scene image, we first separate room number candidates from the background. The details of the procedure are described below, and the processed image at each step is shown in Fig. 5.

a. After edge detection, the rotation angle θr is calculated by using a horizontal edge, and the scene image is rotated by θr so that the horizontal edge is aligned parallel with the x coordinate of the image frame.
b. By applying an adaptive threshold to the original image, we get a binary image including numbers, characters, edges, etc.
c. Number candidates are obtained by filtering out noise using size filtering (e.g., the width/height ratio and size of each candidate).
d. Number candidates are grouped by applying a dilation operator to the image of (c). A group which corresponds to the model of the room number is regarded as a candidate.
e. The room number candidate is reconstructed by an affine transformation using the rotation angle θr.

An image from the camera is distorted by perspective projection, so the room number candidate needs to be rectified. But, because the camera is linked to two servo
Fig. 5. Each shows resulting image of steps (a)-(e): (a) Edge detection, (b) Adaptive threshold, (c) Filtering out noise, (d) Room number candidate, (e) Affine transformation
motors and the direction of the camera is adjustable by pan & tilt control, the camera can view the front of the number plate, and consequently the skew and slant [9] factors in the room number candidate are small. In addition, if we assume a weak-perspective camera model, an extra process to rectify the room number candidate can be omitted; i.e., an affine transformation is sufficient to get a rectified image. The upper room number image of Fig. 5(e) is obtained by rotating by θr from step (a), and the lower image is obtained by the affine transformation using θr. There must be at least three corresponding points to obtain an affine transform matrix. R2 and R3 correspond to R'2 and R'3, respectively, without any coordinate transformation, and R1 is calculated from the height of the room number hm and θr.

3.2 Recognition of Room Number Using the Multistage Rejection Method (MSRM)

If the robot stays in front of a certain number plate, it can perform the recognition process several times on the same room number. It can then vote on the recognition results, so that the most highly voted room number is assumed to be the correct one. However, this voting method cannot be applied to the given problem, because the robot has to recognize the room number while moving, so the number of scene images containing the room number is small. Therefore, in order to reduce false recognition, ambiguous room numbers have to be rejected by inspecting them in stages. This paper proposes recognition of room numbers using the Multistage Rejection Method (MSRM). MSRM consists of three stages: template matching, a neural network, and a heuristic method. Template matching, the first stage, is used to extract an accurate number region, and the neural network outputs the recognized room number in the second stage. In the third stage, the heuristic method, which uses the geometric shape of each number, is applied to verify whether the room number recognized in the second stage is correct. The block diagram of MSRM is shown in Fig. 6(a).
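The staged flow of MSRM can be sketched as follows. The three stage functions are placeholders for the paper's template matcher, trained network, and opening-angle test; the short-circuiting on 1, 4, and 7 reflects the later observation that these digits are resolved by shape alone.

```python
def msrm(candidate, template_match, neural_net, heuristic_check):
    """Multistage Rejection Method sketch. Each stage may reject the
    candidate; only a digit that survives all applicable stages is
    reported (None means 'rejected')."""
    digit = template_match(candidate)   # stage 1: coarse match / noise gate
    if digit is None:
        return None                     # rejected as noise
    if digit in (1, 4, 7):
        return digit                    # distinctive shapes: no further stages
    digit = neural_net(candidate)       # stage 2: refine the reading
    if not heuristic_check(candidate, digit):
        return None                     # stage 3: geometric veto
    return digit
```

Whenever a stage rejects, the remaining stages are skipped and the next candidate is processed, mirroring the suspension rule described below.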
3.2.1 Extraction of Number Using Template Matching

Template matching is applied to the room number candidates obtained in Section 3.1. There are cases where a room number candidate and the number recognized by means of
Fig. 6. (a) Block diagram of Multistage Rejection Method, (b) Segmented room number and structure of neural network
template matching differ, due to the motion blur effect, camera-specific noise, etc. The numbers 1, 4, and 7 are easily distinguished by their geometric shapes, but the other numbers are not; thus, the numbers other than 1, 4, and 7 are recognized by the neural network in the second stage. If the correlation between a number candidate and the templates is lower than a certain threshold value, the candidate is rejected as noise. The correlation is given by
\[ R(x, y) = \frac{\sum_{x'}\sum_{y'} T(x', y')\, I(x + x', y + y')}{\sqrt{\sum_{x'}\sum_{y'} T(x', y')^{2} \cdot \sum_{x'}\sum_{y'} I(x + x', y + y')^{2}}}, \tag{2} \]
where T and I are the template and the input image, respectively. Once one of the three numbers is rejected, the remaining recognition procedures are suspended to reduce the false recognition rate, and the next number candidate is processed.

3.2.2 Recognition of Number Using Neural Network

The pixel intensities of a number candidate become the input of the neural network. The number of input nodes is equal to the size of the candidate image (e.g., 14 by 15), and the numbers of hidden and output nodes are 50 and 7, respectively. An input image contains noise arising from motion blur and camera characteristics. The motion-blur noise stretches and blurs the image; combined with the camera characteristics, the whole noise is represented as in Fig. 6(b). By reflecting this noise factor in the training set when training the network, we can reduce the false recognition rate.

3.2.3 Heuristic Method for Rejection of Recognized Number

The heuristic method verifies the recognized number using its geometric shape. First, the number candidate is given a constant line width by applying Zhang and Suen's thinning algorithm [10] and a dilation operator. The number candidate is divided into two parts, upper and lower, and we calculate the average position of the black pixels (Cup and Cdown) in each part. From Cup and Cdown, we draw a line that passes through one of the boundary pixels of the image; if at least one white pixel exists on that line, the line is regarded as 'blocked'. By inspecting all possible lines that link Cup and Cdown to boundary pixels, the angles of the lines that are not 'blocked' are averaged and defined as the opening angle. Thus each number except 1, 4, and 7 has two opening angles (θu and θd in Fig. 7(b)).

Table 1. Opening angle of each number

Number    0    2    3    5    6    8    9
θu (deg)  -    192  192  5    5    -    -
θd (deg)  -    25   177  177  -    -    192
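Eq. (2) is the standard normalized cross-correlation used in template matching; a minimal sketch of evaluating it at one offset (x, y):

```python
import numpy as np

def ncc(template, image, x, y):
    """Normalized correlation of Eq. (2) at offset (x, y):
    sum(T * patch) / sqrt(sum(T^2) * sum(patch^2))."""
    h, w = template.shape
    patch = image[y:y + h, x:x + w].astype(float)
    num = np.sum(template * patch)
    den = np.sqrt(np.sum(template ** 2) * np.sum(patch ** 2))
    return num / den if den > 0 else 0.0
```

A value of 1.0 means a perfect match up to a positive scale factor; candidates whose best correlation falls below the stage-1 threshold are rejected as noise.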
Fig. 7. (a) Reformation of number with constant line width, (b) Calculation of opening angle
The numbers 0 and 8 have no opening angle because all lines are blocked. However, these two are easily distinguished by noting that if a white pixel exists on the line connecting Cup with Cdown, the number is 8; otherwise it is 0. The opening angles of each number except 1, 4, and 7 were obtained not from the model of the room number but from several experiments, and are listed in Table 1; (-) means that there is no opening angle.
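The stage-3 veto can then be sketched as a lookup against the measured opening angles. The angle values below are as read from Table 1 (the table's column order is reconstructed, so treat them as illustrative), and the tolerance is an assumed parameter not given in the paper.

```python
# Opening angles in degrees, as read from Table 1; None means every
# line from the centroid is blocked.
OPENING_ANGLES = {
    0: (None, None), 2: (192, 25), 3: (192, 177), 5: (5, 177),
    6: (5, None), 8: (None, None), 9: (None, 192),
}

def verify(digit, theta_u, theta_d, tol=30):
    """Accept the recognized digit only if the measured opening angles
    (None = blocked) match the stored ones within `tol` degrees."""
    ref_u, ref_d = OPENING_ANGLES[digit]
    ok_u = (ref_u is None) == (theta_u is None)
    ok_d = (ref_d is None) == (theta_d is None)
    if ok_u and ref_u is not None:
        ok_u = abs(theta_u - ref_u) <= tol
    if ok_d and ref_d is not None:
        ok_d = abs(theta_d - ref_d) <= tol
    return ok_u and ok_d
```

The 0-versus-8 tie-break (a white pixel on the Cup-Cdown line means 8) would be a separate check, since both digits have every line blocked.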
4 Experiments

The mobile robot 'MORIS' has a laser scanner for detecting doors, a gyro sensor, and a pinhole camera for room number recognition. The camera is mounted on an independent pan & tilt module and its direction can be controlled toward an arbitrary direction, so the robot can cover a wide area in which recognition is possible. The region where a room number can be recognized is shown in Fig. 8(a); θr is the orientation of the robot and θc is the angle of the camera when viewing the number plate. Fig. 8(b) shows scene images for different values of θc. Experiments were carried out on the third floor of the LG research center, a corridor environment whose corridor width is 2m. Number plates are attached on one side of the doors at intervals of 5m. For static scene images, the false recognition rate (FRR) according to θc and the recognition method is presented in Table 2 and Fig. 9(a). This experimental result covers only the case of recognizing the specific room number '132'.
Fig. 8. (a) Possible region to recognize room number, (b) Scene images at different camera angle
Table 2. False recognition rate according to different camera angle

θc (deg)  TM (FR/TR, FRR)  TM+NN (FR/TR, FRR)  MSRM: TM+NN+HM (FR/TR, FRR)
50        86/167, 0.515    8/53, 0.151          0/27, 0
66        87/300, 0.29     4/213, 0.019         1/182, 0.005
90        2/96, 0.021      0/95, 0              0/95, 0
In Table 2, TM denotes template matching, NN the neural network, HM the heuristic method, FR false recognitions, and TR true recognitions. The room number is distorted by perspective projection when θc is small, so TM alone gives no reliable recognition result (e.g., its FRR is 0.515). With TM+NN, the FRR decreases; however, there is a trade-off between the FRR and the number of TR. With the MSRM proposed in this paper, the FRR can be reduced to almost 0, so MSRM is an efficient method once this trade-off is set aside. Because reducing p(FR|R) is more important than simply increasing p(R) in terms of global localization, we can exclude the trade-off and focus on the FRR (i.e., p(FR|R)) when evaluating the performance of each recognition method. The trajectory of the robot obtained by following the walls of the LG research center, and the recognized room numbers with their positions in the global frame, are shown in Fig. 9(b). The total travel distance is approximately 120m, and there are 16 number plates in total near the path of the robot. 14 room numbers were recognized correctly; the room number located at (14, 31) was omitted and the one at (21, 12) was recognized as '111', so 2 out of 16 room numbers failed to be recognized.
Fig. 9. (a) FRR by means of various recognition methods, (b) Room number feature map of LG research center
5 Conclusion

We have proposed an efficient navigation strategy and a room number recognition method for global localization. The robot searches the environment to find its own position, and the room number offers essential information, because it can be a
direct cue to the robot's position. Room number recognition is accomplished through a navigation stage and a recognition stage. In the navigation stage, the robot enters the area in which the room number exists using wall-following and local-goal-approaching motions. In the recognition stage, the room number is recognized or rejected by the various recognition methods. We have also shown that the FRR can be further reduced by MSRM. The real experiments showed that a room number feature map can be built from the recognized room numbers, and this feature map can be applied to simultaneous localization and map building (SLAM) as well as to global localization.
Acknowledgement

This work was supported by grant No. RT104-02-06 from the Regional Technology Innovation Program of the Ministry of Commerce, Industry and Energy (MOCIE).
References

1. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. The MIT Press, Cambridge (2005)
2. Thrun, S., Fox, D., Burgard, W., Dellaert, F.: Robust Monte Carlo Localization for Mobile Robots. Artificial Intelligence 128(1-2), 99–141 (2001)
3. Weber, J., Franken, L., Jorg, K.-W., Schmitt, K., von Puttkamer, E.: An integrative framework for global self-localization. In: International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 73–78 (2001)
4. Stephen, S., Lowe, D., Little, J.: Global localization using distinctive visual features. In: International Conference on Intelligent Robots and Systems, pp. 226–231 (2002)
5. Stephen, S., Lowe, D., Little, J.: Vision-based global localization and mapping for mobile robots. IEEE Trans. Robotics and Automation 21, 364–375 (2005)
6. Tomono, M., Yuta, S.: Mobile robot navigation in indoor environments using object and character recognition. In: International Conference on Robotics and Automation, pp. 313–320 (2000)
7. Qinghao, M., Bischoff, R.: Rapid door-number recognition by a humanoid mobile robot. In: International Conference on Robotics, Intelligent Systems and Signal Processing, pp. 71–76 (2003)
8. Mata, M., Armingol, J.M., de la Escalera, A., Salichs, M.A.: A visual landmark recognition system for topological navigation of mobile robots. In: International Conference on Robotics and Automation, pp. 1124–1129 (2001)
9. Yamaguchi, T., Nakano, Y., Maruyama, M., Miyao, H., Hananoi, T.: Digit classification on signboards for telephone number recognition. In: Seventh International Conference on Document Analysis and Recognition, pp. 359–363 (2003)
10. Lam, L., Lee, S.W., Suen, C.Y.: Thinning Methodologies – a Comprehensive Survey. IEEE Trans. Pattern Analysis and Machine Intelligence 14, 869–885 (1992)
A System Model for Real-Time Sensorimotor Processing in Brain Yutaka Sakaguchi University of Electro-Communications, Graduate School of Information Systems Laboratory for Human Informatics, Chofu, Tokyo 182-8585, Japan
Abstract. The present paper addresses a general diagram for investigating the real-time parallel computation mechanism in the brain, using the idea of a "Gantt chart." This diagram explicitly represents the temporal relationship between the computations running in various functional modules in the brain, and helps us to understand how brain computation proceeds over time. The author illustrates how this diagram can be utilized, taking a motor planning model of reaching movement as an example. Moreover, the author discusses the mechanism of intra- and inter-module computations on this diagram and addresses a tentative view that can explain the relationship between movement variability and reaction time.
1 Introduction
Every second of daily life, our brain receives a vast amount of sensory information and controls a motor system with many degrees of freedom (i.e., joints and muscles). It is astonishing that our brain accomplishes this complex sensorimotor processing in real time; revealing the underlying mechanism is one of the challenging topics in computational brain research. In the present paper, the author proposes a novel diagram for representing the parallel computation mechanism of sensorimotor processing. In it, the computations in different functional modules are placed along the time axis, and their temporal relationship is explicitly represented. The author addresses how to utilize this chart, taking an on-line motor planning model as an example. Through a discussion of the mechanisms of intra- and inter-module computations, moreover, the author gives a tentative explanation of how the trade-off between movement accuracy and reaction time is determined.
2 Computational Approaches to the Brain Functions
Marr [1] classified the computational approaches to brain function into three levels: "theory", "representation and algorithm", and "implementation." He put somewhat more emphasis on the theory level among the three, and indeed many sophisticated theories have been proposed that successfully explain the essential features of human behavior. However, they indicate only what problem must be solved to realize a given function; they do not tell us how to solve the problem. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1120–1129, 2008. © Springer-Verlag Berlin Heidelberg 2008
A System Model for Real-Time Sensorimotor Processing in Brain
1121
For example, the hand trajectories of planar reaching movements are beautifully explained by theories based on optimization criteria, such as minimum torque change [2] and TOPS [3]. However, it is still unclear how our brain finds such optimal motor commands. More specifically, the theories do not discuss how the brain determines motor commands within a specific time, nor how much cost (e.g., time, memory, and resources) the brain incurs for the computation. In order to discuss these issues, we inevitably have to adopt the "representation and algorithm" approach. The objective of the present paper is to provide a helpful scheme for developing and examining representation/algorithm-level brain models.
3 System Models for Sensorimotor Mechanism
A block diagram is one of the common ways to represent the computational structure of a complex system, and this is also true in the field of brain modeling. Figure 1 shows two examples, (a) a schematic model of the feed-forward control mechanism of voluntary reaching movement[4] and (b) a control model of eye movement system[5]. A chart indicating the anatomical connections (e.g., van Essen’s chart) can be regarded as a sort of this diagram.
[Figure 1(a): a feed-forward control chain, Target Position → Trajectory Generation → Trajectory in Visual Coordinates → Coordinate Transform → Trajectory in Joint Coordinates → Command Generation → Command Pattern; Figure 1(b): a control model of the eye movement system]
Fig. 1. Two examples of system models of motor control[4, 5]
1122
Y. Sakaguchi
In a block diagram, the system is represented by a set of distinct functional modules whose input-output relations are indicated by links between them. This diagram is convenient for seeing how the whole processing is divided into sub-processes (or modules). Its limitation, however, is that it cannot explicitly represent the temporal relationship between the sub-processes performed in different modules. For example, the schematic model shown in Fig. 1(a) does not tell us the temporal order of the computations in the modules: it is unknown whether the second module starts its computation after the first module has finished, or whether the computations in these modules can proceed in parallel. On the other hand, the control model in Fig. 1(b) is appropriate for representing a peripheral motor system whose characteristics are time-invariant, but it is not suitable for modeling the dynamic computational process performed in the central brain. The point is that the processing in our brain is never uniform over either the spatial or the temporal dimension; some parts of the computation may be performed in a synchronized manner while other parts are done independently. In order to represent the temporal relationship of these computations, we have to introduce a time axis into our diagram and place them along it. Here, the author proposes to adopt the "Gantt chart" [6] as a tool satisfying this requirement. A Gantt chart is a graphical bar chart, originally proposed for illustrating a project schedule. It presents the start and finish timing of activities (or jobs) together with their dependency relationships. In the computer science field, this chart is commonly used to show the job assignment to processing elements (PEs) in multi-processor systems (Fig. 2). Bars in this chart represent the jobs of the PEs, whose left and right ends show when they start and finish, respectively.
Thus, we can readily see the time spent on each job and the logical/causal relationships between different jobs. Moreover, this chart tells us how efficiently the system utilizes its computational resources: if most of the chart is filled by bars, the system is making full use of its resources.
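The schedule a Gantt chart encodes (start times from dependencies, finish time as the makespan) can be computed with a small sketch; the job names and format are illustrative, not from the paper.

```python
def makespan(jobs):
    """Earliest-start schedule for jobs with dependencies.
    jobs: {name: (duration, [names of prerequisite jobs])}.
    Returns (start_times, total_finish_time): exactly the left bar
    edges and the overall right edge one reads off a Gantt chart."""
    start = {}

    def start_of(name):
        if name not in start:
            duration, deps = jobs[name]
            # A job starts when its last prerequisite finishes.
            start[name] = max((start_of(d) + jobs[d][0] for d in deps),
                              default=0.0)
        return start[name]

    for name in jobs:
        start_of(name)
    finish = max(start[n] + jobs[n][0] for n in jobs)
    return start, finish
```

For instance, if "plan" and "monitor" both depend on "sense", they start in parallel at the same moment, and the interval from the first sensory job to the last motor job corresponds to the reaction time discussed below.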
Fig. 2. A Gantt chart for a multi-processor system (jobs of PE1-PE4 shown as bars from start to finish along the time axis)
A System Model for Real-Time Sensorimotor Processing in Brain
The author's proposal is to make use of this chart as a tool to understand the computational structure of sensorimotor processing. To the author's knowledge, surprisingly, this kind of chart has never been adopted to illustrate the progress of computation in the brain. Below, the author addresses how to utilize this chart for modeling brain function.
4 Gantt Chart for Brain Computation

4.1 General Structure
Figure 3(a) shows an example of the diagram. The chart consists of three parts arranged vertically. The central part represents the inside of the brain, where a number of functional modules are placed.1 The upper and lower parts show the sensory and motor events, respectively; the time-invariant sensory/motor organs (e.g., retina and muscles) are placed in these regions.2 The horizontal axis represents physical time. More concretely, bars in the central part show the computational activities of the functional modules. The gradation in each bar represents the progress of computation, and the left and right ends roughly represent the start and finish timing (this point is discussed later). Broken ellipses show the communication (or coupling) between different modules. Thus, we can see how the computations in different modules are related to each other and to sensory and motor events. Moreover, we can read the reaction time (the time between a sensory trigger and the motor response) by measuring the interval between the sensory input and the motor output. Looking at this from the other side, the chart tells us how long each module can spend on a given calculation. Given this framework, our next task is to describe how the intra- and inter-module computations proceed in the model. Before going into this subject, however, we first see how to utilize the diagram, taking a hypothetical model of reaching movement as an example.

4.2 An Example: On-line Motor Planning of Reaching Movement
It is an interesting question how people can start a reaching movement within a few hundred milliseconds after the target's visual information is provided. How does the brain calculate the motor commands in such a short time? One possible answer is that the brain keeps calculating the commands concurrently with the movement execution, rather than finishing the planning in advance of the movement onset. Here, the author would like to see how this on-line planning model works, using the proposed diagram.

1 Each module roughly corresponds to a certain brain area, but the grain size of the modular structure can be determined according to the research target.
2 Since sensory events and motor outputs are related by a causal relationship (indicated as an arrow in the figure), it would be better to draw this chart on a cylindrical surface on which the sensory and motor parts adjoin each other.
Y. Sakaguchi

Fig. 3. An example of a Gantt chart for brain computation. (a) Sensory events (body movement, sensory stimuli, cue information for the next task) at the top and motor events (motor commands, muscle contraction, resultant movement) at the bottom flank the brain system, in which sensory, association, and motor modules are dynamically coupled and decoupled along the time axis, allowing parallel processing of different tasks. (b) Computation within a single module: when the entropy is reduced to some level, coupling is triggered; decoupling is triggered by the end of the computation, after which the module is ready to prepare the next computation; two coupled modules operate as a unified system.
Figure 4 illustrates a typical scenario of this model during a reaching movement. When a target is presented, the visual system acquires its position and creates its internal representation. Receiving this information, the command planning module starts to calculate the command pattern. From this point,
Fig. 4. A Gantt chart for an on-line command planning model of reaching movement. Sensory events (visual stimuli and eye movements: saccade target, fixated target, next target) appear at the top; the brain system (a set of modules) comprises visual pre-processing, target memory (forming and updating the target representation), body image (updating the body image), and command planning (generating motor commands on-line and reflecting updated information in the issued commands); motor events (muscle contraction and the resultant hand trajectory) appear at the bottom, along the time axis.
the planning module continues to calculate and issue motor commands until the end of the movement.3 At the same time, the body image module updates the estimated body posture in cooperation with the planning module. Meanwhile, a saccadic eye movement is triggered to capture the target within the fovea (i.e., the center of the visual field). When the saccade is completed, the visual system acquires new target information and updates its internal representation. The command planning module reflects this updated information in the motor plan and the subsequent commands. This chart, furthermore, illustrates the possibility that different modules can be engaged in different tasks in parallel. When we move the hand sequentially among multiple targets, our gaze often moves to the next target before the hand reaches the present one. This implies that the computation for the next action starts during the execution of the present action. This situation is represented

3 Some researchers propose that motor commands are produced by central pattern generators (CPGs) in the downward pathway and that the brain need not calculate the detailed commands [11]. Even if CPGs are essential for generating the final commands, however, it is still true that higher brain areas have to control them (i.e., activate CPGs at proper timings) to achieve purposive movements [12].
in the top-right part of the diagram: the target representation module starts to handle the information of the next target while the command planning module is still calculating the motor commands for the current movement. Thus, the proposed diagram helps us to understand the spatio-temporal structure of sensorimotor processing in human motor control.

Finally, digressing from our subject for a moment, the author would like to point out the advantages of the on-line planning model introduced above over the in-advance planning model. First, from the viewpoint of efficient usage of computational resources, it is desirable that the command planning module keep working throughout the movement (recall the discussion of the multi-processor system). If the planning had been finished by the movement onset, this module would have nothing to do during the movement execution, which seems inefficient. Moreover, if the commands were planned in advance, the brain would have to prepare a buffer for holding the planned commands until the end of the movement. Therefore, on-line motor planning seems more efficient than in-advance planning. Second, with the on-line planning strategy our brain can reflect the latest sensory information in the planning of the on-going movement, as mentioned in the above scenario. This can explain why people can modify a movement without sensory feedback after making a saccade to the target: it was reported that if the target position was shifted during the saccade, the endpoint shifted to the new target position even though no visual feedback was provided during the movement [10]. The in-advance planning model cannot explain this experimental fact.
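The difference between the two strategies can be sketched as a toy feedback loop (our own illustration; the gain, step count, and the timing of the mid-movement target shift are arbitrary assumptions). Because each command is computed from the latest target estimate, an intra-saccadic target shift is absorbed without explicit replanning:

```python
import numpy as np

def online_reach(target, steps=50, gain=0.2):
    """Toy on-line planner: each step recomputes the next command from the
    latest target estimate, so a mid-movement target shift is absorbed."""
    hand = np.zeros(2)
    trajectory = [hand.copy()]
    for t in range(steps):
        if t == steps // 2:                 # hypothetical intra-saccadic shift
            target = target + np.array([0.1, 0.0])
        command = gain * (target - hand)    # command from the latest information
        hand = hand + command
        trajectory.append(hand.copy())
    return np.array(trajectory)

traj = online_reach(np.array([1.0, 0.5]))
print(np.round(traj[-1], 3))   # endpoint lands near the shifted target (1.1, 0.5)
```

An in-advance planner, in contrast, would compute the full command sequence from the pre-shift target and could not correct the endpoint without feedback.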
5 Computation Flow in the Brain

Now, we go back to the detailed mechanisms of the intra- and inter-module computations in the brain. In this section, the author develops some speculative discussion of these mechanisms.

5.1 Computation and Variability of Neural Activities
In a multi-processing computer system, the activity of a processing element (PE) is determined by whether or not a program is running on it. In the brain model, the activity of a module corresponds to neural activity in a specific brain area. In this sense, a Gantt chart may look like a time series of activity maps obtained in recent neuro-imaging studies. However, we should keep in mind that computation in the neural system can proceed even if the average neural activity is maintained at an identical level: it has been suggested that the variability of the activity, rather than the mean activity, may reflect the progress of computation [7].4 Below, the author

4 The cited paper [7] showed that the inter-trial variability of the activity of single neurons (not population variability) in the monkey's premotor area diminished during the reaction-time period, and suggested that the movement was started when the variability decreased to some threshold level.
would like to discuss further the relationship between neural computation and the variability of neural activities.

This variability-based computation view is attractive in several respects. First, since "variability" or "uncertainty" is measured by "entropy" in information theory, we can relate neural computation to such information measures. Second, the motor planning of voluntary actions is essentially a search problem in which the brain tries to choose the best command sequence for a given task from a number of possible sequences. In other words, motor planning is a process that reduces the set of possible motor commands. Identifying the variability of neural activities with the set of possible motor commands, finding the answer of motor planning may therefore correspond to reducing the variability of the neural activities.

This can be paraphrased as follows. The entropy of a neural module must be high when its component neurons are activated in a random manner (i.e., spontaneous firing). In this situation, no computation proceeds in the module. Once a cue signal arrives, however, the activity is updated into a more organized one, which reduces the entropy. When the network reaches an equilibrium state and the entropy reaches its minimum value, the network has finished the computation. That is, the progress of computation can be measured by the variability (or entropy) of the neural activities.

This view provides an explanation of a behavioral property related to reaction time (RT). RT is the time required to initiate an action in response to a trigger sensory signal, and it can be regarded as the time spent solving the search problem. On this view, RT should depend on the complexity (or size) of the search problem. Concretely, RT should be shortened if the number of possible motor commands (i.e., the size of the search space) is reduced before the trigger signal is provided.
This corresponds to the empirical fact that RT is shortened when the response variety is limited and when richer task information is provided in advance.

Furthermore, the variability-based computation view can explain the relationship between movement variability and reaction time. It is well known that movement variability increases when people are asked to make a quicker action (e.g., Fitts' law). According to the variability-based view, one possible interpretation is that the brain is forced to produce a motor output before the module reaches the optimal answer: motor outputs obtained by such incomplete computation vary from trial to trial because they are generated before the variability of the neural activities has become sufficiently low, which results in larger inter-trial variability. The variability-based view thus suggests that the trade-off between computational time and completeness determines the relationship between RT and movement accuracy in human behavior. This view is also meaningful in that it suggests that computational variability can be a source of movement variability, in addition to sensory/perceptual and motor noises [3].
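As a toy illustration of this view (our own construction, not a model from the cited work), one can track the Shannon entropy of a population-activity distribution that sharpens as evidence accumulates, and read off an "RT" as the moment the entropy falls below a threshold; a smaller initial search space then yields a shorter RT:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def reaction_time(n_commands, threshold=1.0, rate=0.3):
    """Steps until the entropy of a softmax over candidate motor commands
    drops below `threshold`, as evidence for one command accumulates."""
    scores = np.zeros(n_commands)
    for t in range(1, 1000):
        scores[0] += rate                 # evidence favors one command
        p = np.exp(scores - scores.max())  # numerically stable softmax
        p /= p.sum()
        if entropy(p) < threshold:
            return t
    return None

# A larger search space (more candidate commands) takes longer to reach
# the same entropy threshold, i.e., a longer "RT".
print(reaction_time(4), reaction_time(64))
```

Lowering `threshold` trades speed for completeness, mirroring the RT-accuracy trade-off discussed above.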
5.2 Information Processing through Inter-module Coupling
Finally, the author would like to discuss the inter-module interaction. In a multi-processor computer system, PEs operate separately and their communication is explicitly controlled by the system. In a neural system, in contrast, many brain areas are closely interconnected through bidirectional connections, and neurons in different areas often show similar response properties. Therefore, we should think of a mechanism of inter-module communication applicable to the brain model that is essentially different from that of the computer system.

An important point is that the brain does not work as a statically unified network even though its component modules are closely interconnected: some modules can operate cooperatively while others operate independently, and such cooperative relations are formed and dissolved temporarily. To be more specific, adjacent modules are coupled as a cooperative network to achieve a specific computation, and dissolve the coupling when the computation is finished (see Fig. 3(b)). Decoupled modules can behave independently and be engaged in different computations in parallel. A series of such couplings and decouplings mediates the information flow in the brain and finally brings an answer.

This coupling/decoupling mechanism is essentially different from the communication in a multi-processor system in that the coupling creates a larger computational unit rather than simply exchanging information. In this sense, coupling is computation per se.5 Here, again, the entropy (or the variability of the activity) plays a key role in indicating the progress of computation: the entropy decreases as the coupled modules get closer to the answer, and it reaches its minimum value when the computation is finished. Therefore, the author's tentative view is that the brain achieves complex computation with a dynamic coupling/decoupling mechanism.
The primary structure of the coupling/decoupling sequence is presumably set by an executive system, because the primary process flow should depend on the task. However, the timing of each coupling and decoupling must depend on the actual progress of computation, and this could determine the RT of the motor action. Of course, the relationship between RT and movement accuracy discussed in the previous section also holds in the coupled network.
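A minimal event-driven sketch of this scheme (our own toy formalization; the module names, entropy values, and the joint entropy-decay rule are illustrative assumptions): coupled modules reduce their entropy jointly, and decoupling is triggered when the joint entropy falls below a "computation finished" level.

```python
import itertools

class Module:
    def __init__(self, name, entropy=4.0):
        self.name, self.entropy, self.partner = name, entropy, None

def couple(a, b):
    a.partner, b.partner = b, a              # two modules act as one unit

def step(modules, decay=0.5, done=0.5):
    """Coupled pairs reduce entropy jointly; a pair decouples (i.e., the
    computation finishes) once its joint entropy falls below `done`."""
    seen = set()
    for m in modules:
        if m.partner and m.name not in seen:
            p = m.partner
            seen.update({m.name, p.name})
            m.entropy = p.entropy = (m.entropy + p.entropy) / 2 * decay
            if m.entropy < done:
                m.partner = p.partner = None   # decoupling: both are free again

planning, body_image = Module("planning"), Module("body image", entropy=2.0)
couple(planning, body_image)
for t in itertools.count(1):
    step([planning, body_image])
    if planning.partner is None:
        print(f"decoupled after step {t}")   # prints "decoupled after step 3"
        break
```

After decoupling, either module could immediately be coupled to another partner, which is how a chain of such events would mediate the information flow.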
6 Concluding Remarks

In the present paper, the author first proposed a general diagram for real-time sensorimotor processing in the brain, based on the Gantt chart. Then, the author explained how to utilize this diagram by following the computational process of

5 The "cell assembly" or "dynamic cell assembly" hypothesis [8, 9] is a possible implementation of this scheme. This hypothesis says that a group of neurons forms a temporary organization and integrates information through the dynamical formation of bidirectional interactions, which corresponds to our coupling/decoupling scheme.
on-line motor planning on the diagram. In addition, the author gave some speculative discussion of the relation between the progress of computation and the variability of neural activity, and of the relationship between reaction time and movement accuracy in human motor behavior. The author believes that the proposed diagram will be helpful in investigating the spatio-temporal computational mechanisms of the brain. An ultimate goal of computational brain research is to design all parts of this diagram so that the model's temporal behavior agrees with that observed in behavioral experiments. To this end, we should link various physiological, behavioral, and imaging data to this diagram, forming a unified platform that integrates various findings on our sensorimotor functions: it is desirable that such a platform work as an environment in which experimental and computational studies are examined together.
References

[1] Marr, D.: Vision. Freeman, New York (1980)
[2] Uno, Y., Kawato, M., Suzuki, R.: Formation and control of optimal trajectory in human multijoint arm movement. Minimum torque-change model. Biol. Cybern. 61, 89–101 (1989)
[3] Harris, C.M., Wolpert, D.M.: Signal-dependent noise determines motor planning. Nature 394, 780–784 (1998)
[4] Kawato, M., Furukawa, K., Suzuki, R.: A hierarchical neural-network model for control and learning of voluntary movement. Biol. Cybern. 57, 169–185 (1987)
[5] Robinson, D.A.: Models of the saccadic eye movement control system. Kybernetik 14, 71–83 (1973)
[6] Gantt, H.L.: Organizing for Work. Harcourt, Brace, and Howe, New York (1919)
[7] Churchland, M.M., Yu, B.M., Ryu, S.I., Santhanam, G., Shenoy, K.V.: Neural variability in premotor cortex provides a signature of motor preparation. J. Neurosci. 26, 3697–3712 (2006)
[8] Tsukada, M., Ichinose, N., Aihara, K., Ito, H., Fujii, H.: Dynamical cell assembly hypothesis – theoretical possibility of spatio-temporal coding in the cortex. Neural Networks 9, 1303–1350 (1996)
[9] Sakurai, Y.: How do cell assemblies encode information in the brain? Neurosci. Biobehav. Rev. 23, 785–796 (1999)
[10] Pelisson, D., Prablanc, C., Goodale, M.A., Jeannerod, M.: Visual control of reaching movements without vision of the limb. II. Evidence of fast unconscious processes correcting the trajectory of the hand to the final position of a double-step stimulus. Exp. Brain Res. 62, 303–311 (1985)
[11] Taga, G.: Emergence of bipedal locomotion through entrainment among the neuro-musculo-skeletal system and the environment. Physica D 75, 190–208 (1994)
[12] Sakaguchi, Y., Ikeda, S.: Motor planning and sparse motor command representation. Neurocomputing 70, 1748–1752 (2007)
Perception of Two-Stroke Apparent Motion and Real Motion

Qi Zhang and Ken Mogi

Sony Computer Science Laboratories, Inc., Takanawa Muse Bldg., 3-14-13 Higashigotanda, Shinagawa-ku, Tokyo 141-0022, Japan
[email protected], [email protected]
Abstract. Motion perception is an important visual ability of human beings. We can perceive motion not only from real motion but also from apparent motion. Usually, the direction of a perceived apparent motion reflects the real spatial shift between two successive images, but sometimes the perceived direction is distorted. Two-stroke apparent motion is one such phenomenon; it is induced by two pattern frames from a motion sequence and a blank inter-stimulus interval (ISI). We studied two-stroke apparent motion and found that there are temporal limits to its perception. We measured brain activity using fMRI while subjects observed the two-stroke apparent motion, real motion, and still images. Our fMRI results show that MT/V5 is activated for both apparent and real motion perception, but more activity in the prefrontal/frontal and parietal cortex is observed for apparent motion perception.

Keywords: Motion perception, two-stroke apparent motion, real motion, fMRI, human brain activity.
1 Introduction

Motion perception is an important ability of human beings. We can perceive motion not only from a real moving object, but also from a series of images displayed in succession; this is called apparent motion. Much research has been reported on apparent motion perception, both in monkeys [1, 2] and in human beings [3, 4]. Usually, the direction of a perceived apparent motion reflects the real spatial shift between two successive images, but sometimes the perceived direction is distorted. Two-stroke apparent motion [5, 6] is a phenomenon induced by two pattern frames from a motion sequence and a blank ISI. When the two frames are presented in alternation, only a back-and-forth motion is perceived. When a blank ISI is inserted after the second frame, however, the ISI reverses the backward motion, making the motion sequence appear unidirectional; thus, a continuous forward movement is perceived. Elucidating the brain mechanism of motion perception, for both real motion and apparent motion such as two-stroke apparent motion, is an important issue for understanding the human visual processing of moving objects in the real world. So

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1130–1139, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Two-stroke apparent motion. (a) Two frames from a continuous motion sequence; when displayed in alternation, back-and-forth motion is perceived. (b) Two-stroke apparent motion of a natural car-driving scene; by adding a blank frame after the second image, the apparent motion is perceived as unidirectional. (c) Two-stroke apparent motion of a rotating ring.
far, research on the neural correlates of apparent motion has used only simple position-changing light spots as stimuli, and the perceived motion has been consistent with the physical stimulus change [1-4]. Here, we study two-stroke apparent motion using stimuli of a natural car-driving scene. We studied the temporal limits of this perception and measured brain activity using fMRI while subjects observed the two-stroke apparent motion, real motion, and still images.
2 Perception of Two-Stroke Apparent Motion

2.1 Phenomenon of Two-Stroke Apparent Motion

Two-stroke apparent motion is a new illusion reported by George Mather at ECVP 2005, which won second prize at the first international "Illusion of the Year"
Fig. 2. Experimental configuration for measuring the temporal limits of two-stroke apparent motion perception. The visual stimulus (width 25 cm for car driving, 18.75 cm for ring rotation; height 18.75 cm) and a fixation point were viewed from a distance of 50 cm, so that the stimulus subtended 28° or 21° of visual angle.
competition. The illusion contains two pattern frames depicting a moving image (hence "two-stroke") which are displayed using a technique that creates an impression of continuous unidirectional movement. When the two frames are presented in alternation (Fig. 1a), only a back-and-forth motion is perceived: the switch from frame 1 to frame 2 creates forward apparent motion, while the switch from frame 2 back to frame 1 would normally create backward apparent motion. When a blank inter-stimulus interval (ISI) is inserted after the second frame (Fig. 1b, 1c), however, the blank ISI reverses the backward motion, making the motion sequence appear unidirectional; thus, a continuous forward movement is perceived.

2.2 Temporal Limits for the Perception of Two-Stroke Apparent Motion

We found that the perception of two-stroke apparent motion disappeared when the ISI or the image-frame display duration was very short. We therefore conducted experiments to measure the temporal limits of this perception.
2.2.1 Experimental Methods for Measuring the Temporal Limits

We used two types of stimuli for the two-stroke apparent motion: a natural car-driving scene (Fig. 1b) and a rotating ring consisting of a series of rectangles (Fig. 1c). During the experiments, the visual stimuli were displayed on an EIZO FlexScan L771 LCD display with a refresh rate of 60 Hz. Six subjects participated in the experiments. While fixating the red fixation point, subjects adjusted the duration of the ISI (or image frame) using two keys to lengthen or shorten it, respectively. They pressed a determination key when the two-stroke apparent motion perception disappeared, and the current durations of the ISI and image frame were recorded. There were three conditions with different image-frame (or ISI) durations (75, 100, 125 ms) while subjects adjusted the ISI (or image-frame) duration, with 10 trials for each condition. The experimental configuration is shown in Fig. 2. We measured the limits under two conditions, in which either the ISI or the image-frame duration was fixed.

2.2.2 Experimental Results on the Limit of the Shortest ISI

In the first experiment, subjects adjusted the duration of the ISI while the image display duration was fixed. The results are shown in Fig. 3. We observed that a limit on the shortest ISI duration exists for two-stroke apparent motion perception. The ISI limit for car driving is shorter than that for ring rotation, which means that the ISI limit depends on the complexity of the image: a shorter ISI can eliminate this perception for a more complex image, while a simpler image needs a longer ISI. In addition, there is a trend that a longer image display time requires a longer ISI duration.
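The adjustment procedure described in Sect. 2.2.1 can be summarized in a short sketch (the key handling, step size, and data layout are our own illustration, not the authors' code); one natural step size is a single frame of the 60 Hz display:

```python
from statistics import mean

STEP_MS = 1000 / 60          # one frame at the 60 Hz refresh rate (~16.7 ms)

def run_condition(fixed_frame_ms, responses):
    """Method of adjustment for one trial: the subject lengthens ('+') or
    shortens ('-') the ISI by one display frame per key press and
    confirms ('!') when the two-stroke percept disappears."""
    isi_ms = 5 * STEP_MS     # arbitrary starting ISI
    for key in responses:
        if key == "+":
            isi_ms += STEP_MS
        elif key == "-":
            isi_ms = max(0.0, isi_ms - STEP_MS)
        elif key == "!":
            return {"frame_ms": fixed_frame_ms, "isi_limit_ms": isi_ms}
    return None              # trial never confirmed

# Hypothetical key sequences for 3 trials under the 100 ms frame condition;
# the recorded ISI limits are then averaged per condition.
trials = [run_condition(100, r) for r in ["---!", "--!", "----!"]]
print(mean(t["isi_limit_ms"] for t in trials))
```

In the actual experiment each condition had 10 such trials, and the same loop applies with the roles of ISI and image frame swapped.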
Fig. 3. Experimental results on the limit of the shortest ISI when the image display duration is fixed. The ISI limit (ms) is plotted against the image duration (75, 100, 125 ms) for the car-driving and ring-rotation stimuli.
Fig. 4. Experimental results on the limit of the shortest image-frame duration when the ISI is fixed. The image-duration limit (ms) is plotted against the ISI duration (75, 100, 125 ms) for the car-driving and ring-rotation stimuli.
2.2.3 Experimental Results on the Limit of the Shortest Image-Frame Duration

In the second experiment, subjects adjusted the duration of the image frame while the ISI was fixed. The results are shown in Fig. 4. Again, a limit on the shortest image display duration exists for two-stroke apparent motion perception. In contrast with the previous experiment, the limit is longer for the car-driving stimulus than for the ring rotation. This means that the limit on image duration also depends on the complexity of the image: a more complex image must be displayed longer, while a simpler image can be displayed more briefly. In addition, there is a trend that a longer image display duration is necessary to destroy the perception when the ISI is shorter.
3 fMRI Study of Two-Stroke Apparent Motion

So far, we have introduced the phenomenon of two-stroke apparent motion and the temporal limits of its perception. Naturally, one might ask: what is the brain mechanism of this perception, and what is the difference between two-stroke apparent motion and real motion? To answer these questions, we conducted fMRI experiments to measure human brain activity while subjects perceived the motion.

3.1 Experimental Methods

3.1.1 Stimuli

In the fMRI experiments, we used stimuli of two-stroke apparent motion, real motion, and static images. Both the car-driving and ring-rotation stimuli
were used. The stimulus for the real motion was a single 7.5 s video clip. The stimulus for the two-stroke apparent motion consisted of two image frames and a blank ISI; each was displayed for 100 ms, and the triplet was repeated 25 times, forming a sequence lasting 7.5 s, the same as the duration of the real-motion clip. Each real-motion clip and apparent-motion sequence was followed by a 2.5 s blank interval for rest. As for the static images, the two frames that induced the apparent motion were each displayed for 2 s followed by a 0.5 s ISI, repeated 4 times to form a 10 s static-image block. All three types of stimuli were repeated 6 times in one session, and all blocks were presented in random order. Two sessions were performed for each subject.

3.1.2 Subjects and Experimental Protocols

Seven healthy right-handed volunteers (aged 20-26 years; 3 male and 4 female) with normal vision took part in the fMRI experiments. All subjects gave informed written consent for participation.
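The block timing described above can be checked with a quick calculation (a sanity check we added; the constants come from the text):

```python
# Durations in milliseconds, taken from the stimulus description above.
frame_ms = 100                                       # each image frame and the blank ISI
apparent_block_ms = (2 * frame_ms + frame_ms) * 25   # (frame1 + frame2 + ISI) x 25
real_block_ms = 7500                                 # single 7.5 s video clip
static_block_ms = (2000 + 500) * 4                   # (2 s image + 0.5 s ISI) x 4

print(apparent_block_ms)   # 7500 -> matches the real-motion clip duration
print(static_block_ms)     # 10000 -> the 10 s static-image block
```

With the 2.5 s rest after each motion block, all three block types occupy 10 s, which is what allows them to be interleaved in random order within a session.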
Fig. 5. fMRI experimental results of the brain activity for the two-stroke apparent motion against the static image perception
Subjects clicked a left or right button each time a stimulus was presented. The task was to determine whether the stimulus was in unidirectional motion: if yes, subjects clicked the left button with the right index finger; otherwise they clicked the right button with the right middle finger. The button assignment was announced again before each session began. One session consisted of all 6 blocks repeated 6 times and lasted 6 minutes.

3.1.3 fMRI Data Acquisition and Analysis

EPI images were acquired with a Shimadzu-Marconi ECLIPSE 1.5T PowerDrive 250 fMRI system, with a TR of 2.5 s and a TE of 55 ms. T2*-weighted functional
images with 5 mm thick slices were obtained at a resolution of 3 mm × 3 mm, with 25 slices acquired per scan. The EPI data were analyzed using the SPM5 software (Wellcome Department of Cognitive Neurology, University College London) [7]. All subjects were analyzed individually first, and then all the data were pooled together. A fixed-effect general linear model was applied for the group comparisons. Significant increases were tested with t statistics and displayed as statistical parametric maps.

3.2 Experimental Results

3.2.1 Two-Stroke Apparent Motion vs. Static Images

The fMRI results are shown in Figs. 5-7. We first compare the brain activity for the two-stroke apparent motion with that for the static images (Fig. 5). A wide range of the cortex is activated, including the occipital/temporal visual cortex (Brodmann areas (BA) 18, 19 (MT) and BA 37, 21 (IT)), the frontal motor cortex (BA 6, 9), the prefrontal cortex (BA 45, 47), the parietal cortex (BA 7, 39, 40), and the limbic cortex (BA 32, 24 (ACC)).

Next, we compare the brain activity for the real motion with that for the static images (Fig. 6). Here, the activity in the frontal cortex has mostly disappeared. The main activated areas are located in the occipital and parietal cortex, including the occipital/temporal visual cortex (BA 18, 19 (MT) and BA 37, 21 (IT)) and the parietal cortex (BA 7, 39, 40).

To locate the differences between the two-stroke apparent motion and the real motion clearly, we contrasted the activities for them directly (Fig. 7). We find that, besides the differences in the occipital/temporal visual cortex (BA 18, 19 (MT); BA 37, 21, 20 (IT)), there are many significant differences distributed in the frontal/prefrontal cortex,
Fig. 6. fMRI experimental results of the brain activity for the real motion against the static image perception
such as the frontal motor cortex (BA 6, 9), the prefrontal cortex (BA 45, 47, 11), the parietal cortex (BA 7, 39, 40), and the limbic cortex (BA 32, 24 (ACC)). This implies that the two-stroke apparent motion involves higher-level brain areas in the frontal/prefrontal cortex. In contrast, fewer areas are more activated for the real motion than for the two-stroke apparent motion; these areas are located in the right parietal cortex (BA 7, 39, 40) and bilateral MT. When comparing the results for the car-driving and ring-rotation stimuli, we found that the complex car-driving scene recruited more of the frontal/prefrontal cortex. In addition, the brain activity for apparent motion induced by a simple stimulus is located mainly in V1 and MT, consistent with previous research on simple apparent motion.
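The fixed-effect GLM analysis described in Sect. 3.1.3 can be illustrated with a minimal single-voxel sketch (synthetic data of our own; a real SPM analysis additionally includes HRF convolution, drift regressors, and multiple-comparison correction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Boxcar regressors for a block design: apparent motion, real motion, static.
n_scans = 120
X = np.zeros((n_scans, 3))
for i in range(3):
    X[i::6, i] = 1.0                             # crude interleaving of blocks
X = np.column_stack([X, np.ones(n_scans)])       # constant (baseline) column

# Synthetic voxel time series responding to the apparent-motion condition only.
beta_true = np.array([2.0, 0.0, 0.0, 10.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n_scans)

# Ordinary least-squares fit and a t statistic for the contrast
# "apparent motion > baseline" (c picks out the first beta).
beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
dof = n_scans - np.linalg.matrix_rank(X)
sigma2 = res[0] / dof
c = np.array([1.0, 0.0, 0.0, 0.0])
t = (c @ beta) / np.sqrt(sigma2 * c @ np.linalg.pinv(X.T @ X) @ c)
print(round(float(t), 1))   # a large t -> a "significant increase" at this voxel
```

A statistical parametric map is simply this t value computed at every voxel and thresholded.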
Fig. 7. fMRI experimental results of the brain activity for the two-stroke apparent motion against real motion perception
4 Discussion

We have studied the perception of two-stroke apparent motion, which is induced by a blank ISI. Another remarkable phenomenon induced by a blank ISI is change blindness, in which the change between alternating displays of an original and a modified scene is extremely difficult to detect [8-11]. Regarding change blindness, Rensink [10] pointed out that when the duration of the blank ISI decreases, the change blindness disappears. In this study, we examined the temporal features of two-stroke apparent motion and found that there is likewise a temporal limit on its perception as the duration of the blank ISI decreases.

We argue that the blank ISI acts as noise on the real images and triggers the brain's visual system to create a representation consistent with that before the ISI; the temporal limit is the threshold of this noise. When the duration is shorter than the
Q. Zhang and K. Mogi
threshold, the noise does not affect visual processing, and the visual percept is identical to the real stimuli. When the duration becomes longer, however, the brain tries to remove the noise by complementing the missing information, tending to keep the overall percept uniform unless focused attention is paid. Therefore, the two stationary images in change blindness tend to be perceived as identical, and the change is missed. Similarly, in two-stroke apparent motion, the brain creates a direction reversal during the blank ISI so that the preceding motion sequence appears to continue. These results imply that elucidating the brain mechanism for removing noise such as the blank ISI is an important issue for understanding how human consciousness seeks coherence in an environment full of noise.
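The interpretation above amounts to a simple threshold rule, which can be stated compactly. The piecewise notation below is ours, added only for illustration; the symbols d and θ do not appear in the original analysis:

```latex
% Illustrative formalization (our notation, not the authors'):
% d is the blank-ISI duration, \theta the temporal limit found in the experiments.
\mathrm{percept}(d) =
\begin{cases}
  \text{veridical, stimulus-driven} & d < \theta \quad \text{(noise has no effect)} \\
  \text{complemented by the brain}  & d \ge \theta \quad \text{(change missed; motion direction reversed)}
\end{cases}
```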
5 Conclusions
We studied two-stroke apparent motion and found that temporal limits exist for both the ISI and the image display duration. We argue that the ISI acts as noise for brain processing when it exceeds this limit; consequently, two-stroke apparent motion recruits higher-level brain areas in the frontal/prefrontal cortex to process this noise. Our fMRI results show that MT/V5 is activated for both apparent-motion and real-motion perception. In addition, more activity in the prefrontal, frontal, and parietal cortex is observed for apparent-motion perception, while real motion recruits more activity in the right parietal cortex and bilateral MT. We also examined the apparent motion induced by a simple stimulus consisting of rectangles along a rotating ring; this apparent motion activated less of the prefrontal and frontal cortex. Elucidating the brain mechanism for removing noise such as the blank ISI is an important issue for understanding how human consciousness seeks coherence in an environment full of noise.
Acknowledgements
The fMRI experiments were conducted at ATR, Kyoto, Japan. We thank Akiko Callan, Yasuhiro Shimada, Ichiro Fujimoto, Yuko Shakudo, Shizuhiko Deji, Shinobu Masaki, and Kayoko Nakagawa at the Brain Activity Imaging Center for their support in acquiring the fMRI data and their helpful advice on experiment design and data analysis.
References
1. Newsome, W.T., Mikami, A., Wurtz, R.: Motion Selectivity in Macaque Visual Cortex. III. Psychophysics and Physiology of Apparent Motion. Journal of Neurophysiology 55, 1340–1351 (1986)
2. Williams, Z.M., Elfar, J.C., Eskandar, E.N., Toth, L.J., Assad, J.A.: Parietal Activity and the Perceived Direction of Ambiguous Apparent Motion. Nature Neuroscience 6, 616–623 (2003)
3. Muckli, L., Kriegeskorte, N., Lanfermann, H., Zanella, F.E., Singer, W., Goebel, R.: Apparent Motion: Event-Related Functional Magnetic Resonance Imaging of Perceptual Switches and States. The Journal of Neuroscience 22, RC219 (2002)
4. Muckli, L., Kohler, A., Kriegeskorte, N., Singer, W.: Primary Visual Cortex Activity along the Apparent-Motion Trace Reflects Illusory Perception. PLoS Biology 3, e265 (2005)
5. Mather, G.: Two-Stroke Apparent Motion. ECVP (2005)
6. Mather, G.: http://www.lifesci.sussex.ac.uk/home/George_Mather/TwoStroke.htm
7. Friston, K., Holmes, A., Worsley, K., Poline, J., Frith, C., Frackowiak, R.: Statistical Parametric Maps in Functional Imaging: A General Linear Approach. Human Brain Mapping 2, 189–210 (1995)
8. Simons, D.J., Levin, D.T.: Change Blindness. Trends in Cognitive Sciences 1, 261–267 (1997)
9. Rensink, R., O'Regan, J., Clark, J.: To See or Not to See: The Need for Attention to Perceive Changes in Scenes. Psychological Science 8(5), 368–373 (1997)
10. Rensink, R., O'Regan, J., Clark, J.: On the Failure to Detect Changes in Scenes Across Brief Interruptions. Visual Cognition 7(1/2/3), 127–145 (2000)
11. Beck, D., Rees, G., Frith, C., Lavie, N.: Neural Correlates of Change Detection and Change Blindness. Nature Neuroscience 4, 645–650 (2001)
Author Index
Agashe, Harshavardhan A. I-151 Ahmed, F. I-517 Ahn, Chang-Jun I-299 Aibe, Noriyuki II-637 Aihara, Kazuyuki I-170 Akiyama, Kei II-77 Amari, Shun-ichi I-781 Amemiya, Yoshihito II-117 An, Su-Yong I-1110 Aoki, Takaaki I-426 Aonuma, Hitoshi II-905 Aoyagi, Toshio I-426 Asai, Tetsuya II-117 Asakawa, Shin-ichi II-749 Aston, John A.D. I-126 Aubert-Vazquez, Eduardo I-802 Aucouturier, Jean-Julien II-647 Awai, Kazuo I-993 Awano, Takaaki II-637 Azuma, Yusuke I-1100 Babic, Jan II-214 Bacciu, Davide I-497 Bacic, Boris II-416 Baek, Doo San I-1110 Ban, Sang-Woo I-953, II-940, II-1055 Ban, Tao II-264 Bando, Takashi I-604 Banik, Sajal Chandra II-147 Barczak, Andre L.C. II-386 Bari, Md. Faizul II-453 Barros, Allan Kardec II-21, II-529 Bayard, Jorge Bosch I-802 Belanche-Muñoz, Lluís A. I-328, I-683 Benso, Victor Alberto Parcianello II-577 Benuskova, Lubica II-406 Biesiada, Jacek II-285 Blazquez, Pablo M. I-902 Boo, Chang-Jin II-127 Boyle, Jordan H. I-37 Brammer, Michael I-477 Bryden, John I-37 Burdet, Etienne I-913 Burkitt, Anthony N. I-102
Câteau, Hideyuki I-142 Cavalcante, André B. II-21 Chakraborty, P. I-517 Chan, Hsiao-Lung I-48 Chandra, B. II-366 Chang, Jyh-Yeong II-837 Chao, Pei-Kuang I-48 Chen, Yen-Wei I-821 Chen, Zhe I-527 Cheng, Gordon II-214 Cheng, Philip E. I-126, I-365 Cheng, Wei-Chen II-254, II-683 Cho, Minkook I-547 Cho, Sung-Bae II-856, II-950, II-1007 Choi, Kyuwan II-987 Chou, Wen-Chuang I-209 Chuang, Cheng-Hung I-365 Chung, I-Fang II-866 Cichocki, Andrzej I-112, I-527, I-781, I-811, II-519 Cohen, Netta I-37 Colliaux, David I-160 Corrêa Silva, Aristófanes II-529 Costa, Daniel Duarte II-529 Cunningham, John P. I-586 Danno, Mikio I-199 da Silva, Cristiane C.S. II-529 Dauwels, Justin I-112 Doi, Kunio I-993 Doi, Shinji I-7 Dolinský, Ján I-248 Doya, Kenji I-596, I-614, II-167 Duch, Wlodzislaw II-285 Edelman, Gerald M. II-157 Eggert, Julian I-653, I-963 Einecke, Nils I-653 Erdogmus, Deniz II-488 Esaki, Hirotake II-559 Faloutsos, Christos I-791 Filik, U. Basaran II-703 Finnie, Gavin II-478
Franklin, David W. I-913, I-1002, I-1012 Freire, R.C.S. II-21 Frolov, Alexander I-861 Fujii, Hiroshi I-170 Fujimaki, Norio II-895 Fujimura, Hirotada I-983 Fujimura, Kikuo II-274 Fujisaka, Hisato I-299 Fujisumi, Takeshi I-199 Fujita, Takahiro I-1021 Fukai, Tomoki I-142 Fukuda, Eric Shun II-117 Fukumi, Minoru II-444 Fukushima, Kunihiko I-1041, II-1 Funatsu, Yuki II-40 Furuichi, Teiichi II-884 Furukawa, Tetsuo II-1075 Gaurav II-366 Gedeon, Tamás (Tom) Domonkos II-666 Geibel, Peter II-738, II-779 Gerek, Ömer Nezih II-460 Gilson, Matthieu I-102 González, Félix F. I-683 Goto, Atsuko I-733 Grayden, David B. I-102 Gu, Xiaodong II-549 Gust, Helmar II-738, II-779 Haeiwa, Kazuhisa I-299 Hagiwara, Katsuyuki I-537 Hagiwara, Masafumi II-539 Hale, Joshua II-214 Hamada, Takashi I-219 Han, Min I-415 Hardoon, David R. I-477 Hartono, Pitoyo II-434 Hasegawa, Mikio I-693 Hasegawa, Osamu I-338 Hasegawa, Ryohei P. II-997 Hasegawa, Yukako T. II-997 Hasuo, Takashi II-1065 Hatori, Yasuhiro I-18 Hayami, Jiro I-279 Hayashi, Akira II-326, II-336 Hayashida, Yuki I-1, I-64, I-135 Hellbach, Sven I-653 Herbarth, Olf II-876 Hidaka, Akinori II-598
Highstein, Stephen M. I-902 Hikawa, Hiroomi I-983, II-137 Hiraoka, Kazuyuki I-487 Hirata, Yutaka I-902 Hirayama, Jun-ichiro I-742 Hirose, Akira I-1031, I-1100 Hirose, Hideaki II-987 Hirose, Takeshi I-228 Hirose, Tetsuya II-117 Ho, Cheng-Jung II-346 Hocaoğlu, Fatih Onur II-460 Hong, Jin-Hyuk II-856 Honkela, Antti II-305 Horikoshi, Tetsumi II-559 Hosoe, Shigeyuki I-923, II-77 Hou, Zeng-Guang II-376 Hsieh, Shih-Sen I-723 Hu, Xiaolin I-703 Hu, Yingjie II-846 Huang, Chung-Hsien I-723 Huang, Kaizhu I-841 Hudson, Nigel II-921 Húsek, Dušan I-861 Hussain, Zakria I-258 Hyun, Hyo-Young II-97 Hyvärinen, Aapo I-752 Ichikawa, Kazuhisa II-895 Ichiki, Akihisa I-93 Ifeachor, Emmanuel II-921 Iijima, Toshio II-884, II-987 Ikeda, Kazushi II-295 Ikegami, Takashi II-647 Ikeguchi, Tohru I-673 Ikeno, Hidetoshi II-884, II-905 Ikoma, Norikazu I-507 Ikuta, Koichi II-569 Ilin, Alexander I-566 Imamizu, Hiroshi II-1027 Inagaki, Kayichiro I-902 Inamura, Tetsunari II-193 Inayoshi, Hiroaki II-588 Inohira, Eiichi I-395 Inoue, Hirotaka I-762 Inouye, Yujiro II-498 Isa, Tadashi II-884 Ishida, Takayuki I-993 Ishii, Shin I-604, I-713, I-742, II-817 Ishikane, Hiroshi II-884 Ishikawa, Masumi II-1075
Ishikawa, Satoru II-185 Ishikawa, Seiji I-993 Ishiki, Tatsuya I-7 Islam, Md. Monirul I-317, II-453 Isokawa, Teijiro II-759 Itai, Yoshinori I-993 Ito, Masanori II-509 Ito, Shin-ichi II-444 Ito, Yoshifusa I-238 Iwasa, Kaname I-199, II-577 Iwata, Akira I-199, II-577 Iwata, Azusa I-436 Iwata, Kazunori II-295 Iwata, Kenji II-628 Izumi, Hiroyuki I-238 Izumi, Kiyotaka II-147 Jang, Young-Min I-953, II-1055 Jankovic, Marko I-527, I-781 Jeong, Jaeseung II-921, II-930 Jia, Zhenhong II-426 Jimoh, A.A. II-713, II-721 Jordaan, Jaco II-693, II-713, II-721 Jwa, Chong-Keun II-127 Kabir, Md. Monirul I-374, I-517, II-1017 Kadi-allah, Abdelhamid I-913 Kadone, Hideki II-203 Kage, Hiroshi II-569 Kameyama, Keisuke I-851, II-608 Kamio, Takeshi I-299 Kamitani, Yukiyasu II-979 Kamiura, Naotake II-759 Kamiyama, Yoshimi II-884 Kanda, Hisashi II-222 Kang, Jeong-Gwan I-1110 Kang, Min-Jae II-127, II-468 Kang, Sang-Soo II-127 Kanzaki, Ryohei II-905 Karhunen, Juha I-566, II-305 Kasabov, Nikola II-396, II-406, II-416, II-846 Kashimori, Yoshiki I-27 Katayama, Hirokazu I-892 Katayama, Masazumi I-892, I-1021 Kato, Masahiro I-1031 Katsuragawa, Shigehiko I-993 Kawamoto, Mitsuru II-498, II-509 Kawamura, Masaki I-733
Kawano, Hideaki I-507 Kawasaki, Takeshi II-50 Kawashita, Ikuo I-993 Kawato, Mitsuo I-913, I-1002, I-1012, II-214, II-979, II-1027 Kihato, Peter K. II-274 Kim, Ho-Chan II-127, II-468 Kim, Hyoungkyu II-930 Kim, Hyoungseop I-993 Kim, Kyung-Joong II-950, II-1007 King, Irwin I-841 Kini, B. Venkataramana II-11 Kinjo, Mitsunaga II-730 Kitano, Katsunori I-142 Ko, Hee-Sang II-127, II-468 Kobayashi, Takumi II-628 Kobayashi, Takuya II-598 Kohlmorgen, Jens I-556 Kohno, Kiyotaka II-498 Koike, Yasuharu II-960, II-987 Komatani, Kazunori II-222 Kondo, Tadashi I-882 Kong, Jae-Sung II-97 Körner, Edgar I-653 Kotani, Kazuhiko II-274 Koyama, Jumpei I-1031 Krichmar, Jeffrey L. II-157 Kudo, Hiroaki II-30 Kugimiya, Kaori II-137 Kugler, Mauricio I-199, II-577 Kühnberger, Kai-Uwe II-738, II-779 Kumagai, Sadatoshi I-7 Kundu, Gourab II-453 Kurata, Koji I-426 Kurban, Mehmet II-460, II-703 Kurimoto, Kenichi II-243 Kurita, Takio II-588, II-598, II-798 Kuroe, Yasuaki II-807 Kurogi, Shuichi II-40 Kuroyanagi, Susumu I-199, II-577 Kurozawa, Yoichi II-274 Kyuma, Kazuo II-569 Lai, Shang-Hong I-126 Lai, Weng Kin I-625 Latchoumane, Charles-Francois Vincent II-921 Lee, Jiann-Der I-723 Lee, Minho I-953, II-940, II-1055 Lee, Sangkyun II-915
Lee, Shih-Tseng I-48 Lee, Shin-Tseng I-723 Lee, Soo-Young II-915 Leung, Chi Sing I-289, I-456 Li, Jing I-973 Li, Xuelong I-643, I-791 Li, Yongtao I-179 Liaw, Gary I-913 Lim, Chee Peng I-625 Lin, Chin-Teng II-866 Lin, Ming-An I-48 Liou, Cheng-Yuan I-365, II-254, II-346, II-683 Liou, Jiun-Wei I-365 Liou, Michelle I-126, I-209, I-365 Liu, Li-Chang I-723 Liu, Meigen II-979 Liu, Xiuling II-426 Loy, Chen Change I-625 Lu, Bao-Liang I-973, II-827 Luo, Zhi-wei II-77 Lyu, Michael R. I-841 Ma, Jia II-376 MacDonell, Stephen II-416 Maebashi, Kumiko II-326 Maeda, Hiroshi I-507 Maeda, Shin-ichi I-713 Maeda, Yoshinobu I-1021 Maehara, Yousuke I-771 Majewski, Pawel II-769 Maniwa, Yoshio II-274 Mansour, Ali II-509 Martinez, Pablo I-527 Masaki, Shinobu II-895 Matsuda, Yoshitatsu I-635 Matsui, Nobuyuki II-759, II-905 Matsumoto, Tetsuya II-30 Matsunaga, Kaoru I-135 Matsuo, Takami I-83 Matsuura, Takafumi I-673 Maybank, Stephen J. I-791 Mehler, Alexander II-779 Mehta, Vrushank D. II-386 Miki, Tsutomu I-358 Mishima, Taketoshi I-487 Mitsubori, Kunihiko I-299 Mitsukura, Yasue II-444 Mitsunaga, Kouichi I-83 Miwakeichi, Fumikazu I-802
Miyagawa, Eiji I-873 Miyaji, Masahiro I-199 Miyakawa, Hiroyoshi II-884 Miyamoto, Hiroyuki I-1071 Miyamura, Hiroko Nakamura II-444 Miyazaki, Mari II-1047 Mizuhara, Hiroaki I-802 Mizunami, Makoto II-905 Mogi, Ken I-1130 Molter, Colin I-151, I-160 Moravec, Pavel I-861 Mori, Kenji I-299 Mori, Takehiro II-807 Mori, Yoshihiro II-807 Morie, Takashi I-1081 Morii, Fujiki II-57 Morita, Masahiko II-1065 Motomura, Tamami I-64 Mour˜ ao-Miranda, Janaina I-477 Munir, Sirajum II-453 Murase, Kazuyuki I-317, I-374, I-517, II-453, II-1017 Murata, Satoshi II-979 Murayama, Nobuki I-1, I-64, I-135 Muroga, Takeo II-969 Muslim, M. Aziz II-1075
Nagao, Soichi II-884 Nagata, Kenji II-67 Nagata, Yugo II-185 Nakagawa, Masahiro I-189 Nakajima, Koji II-730 Nakamura, Katsuki II-960 Nakamura, Kiyomi II-50 Nakamura, Yoshihiko II-203 Nakanishi, Ryoji I-135 Nakatomi, Masashi I-742 Nara, Shigetoshi I-179 Narihisa, Hiroyuki I-762 Narita, Hiroyuki II-336 Nedachi, Naoko II-40 Nicolae, D.V. II-713, II-721 Niki, Kazuhisa II-895 Nishii, Jun I-1091 Nishikawa, Jun I-54 Nishimura, Haruhiko II-759 Nitta, Katsumi I-446 Nomura, Osamu I-1081
Author Index Ogai, Yuta II-647 Ogata, Tetsuya II-222 Ogata, Yuki II-657 Oh, Se-Young I-1110 Ohkita, Masaaki II-274 Ohnishi, Noboru II-30, II-509 Oka, Nozomi II-608 Okada, Hiroyuki II-185 Okada, Kazuhiro I-219 Okada, Masato I-54 Okanoya, Kazuo I-54 Okayama, Mayuko II-608 Okuno, Hiroshi G. II-222 Okuno, Hirotsugu II-107 Omori, Takashi II-185 Onisawa, Takehisa II-657 Onishi, Masaki II-77 Oonishi, Hiromasa I-395 Osanai, Makoto I-7 Osu, Rieko I-1002, I-1012, II-979 Ota, Kaiichiro I-426 Otaka, Yohei II-979 Otsu, Nobuyuki II-628 Oyama, Takashi I-923 Ozawa, Seiichi II-396 Ozertem, Umut II-488 Oztop, Erhan II-214 Pal, Nikhil Ranjan II-866 Pang, Shaoning II-396, II-416 Panuku, Lakshmi Narayana I-73 Park, Hyeyoung I-547 Peng, Gang II-376 Peters, Jan II-233 Phan, Anh Huy I-811 Playne, Daniel P. II-386 Polyakov, Pavel I-861 Ponzi, Adam I-269, I-309 Pustylnikov, Olga II-779 Raiko, Tapani I-566, II-305 Reyes, Napoleon H. II-386 ˇ Rezankov´ a, Hana I-861 Richter, Matthias II-876 Roeder, Stefan W. II-876 Saeki, Takashi I-358 Saglam, Murat I-1, I-135 Sahani, Maneesh I-586 Saika, Yohei I-663
Saito, Masahiro II-539 Saito, Takafumi II-444 Saito, Toshimichi I-873 Sakaguchi, Yutaka I-1120 Sakai, Hideaki II-315 Sakai, Ko I-18, I-348 Sakai, Yuuichi I-983 Sakumura, Yuichi II-817 Sakurai, Akito I-436 Sakurai, Yoshio II-987 Samejima, Kazuyuki I-596 Samsudin, Mohamad Faizal Bin I-228 Santana, Ewaldo II-21 Santos, Marcio de O. II-21 Sato, Akihiro I-338 Sato, Masa-aki I-576, II-1027 Sato, Shigeo II-730 Sato, Yasuomi D. I-385 Satoh, Shunji I-943, I-1051 Satoh, Yutaka II-628 Sattar, Md. Abdus I-317 Sawamura, Yasumasa II-336 Schaal, Stefan II-233 Schubert, Markus I-556 Segraves, Mark A. II-997 Sekhar, C. Chandra I-73, II-11 Sekino, Masashi I-446 Shahjahan, Md. I-374, I-517, II-1017 Shawe-Taylor, John I-258, I-477 Shen, Jialie I-791 Shenoy, Krishna V. I-586 Shi, Yi-Xiang II-837 Shibata, Katsunari I-228 Shibata, Tomohiro I-604, II-193 Shiino, Masatoshi I-93 Shimizu, Masaki I-1071 Shimizu, Ryohei I-348 Shimizu, Shohei I-752 Shin, Jang-Kyoo II-97 Shinozawa, Yoshihisa I-436 Shioya, Hiroyuki I-771 Shouno, Hayaru I-1061 Shyu, Jia-Jie II-837 Siti, M.W. II-713, II-721 Snášel, Václav I-861 So, Udell I-1002, I-1012 Soga, Mitsuya I-27 Srinivasan, Cidambi I-238 Starita, Antonina I-497 Su, Hong-Ren I-126, I-209
Sudo, Akihito I-338 Suematsu, Nobuo II-326 Sugiyama, Koichi II-243 Sum, Pui Fai I-289 Sumi, Kazuhiko II-569 Sun, Han II-1037 Sun, Jimeng I-791 Sung, Dong-Kyu II-97 Suzuki, Ryoji II-884, II-895 Szymański, Julian II-769 Taji, Kouichi II-77 Takagi, Hideyuki I-248 Takahata, Masakazu II-905 Takano, Hironobu II-50 Takemoto, Atsushi II-960 Takenouchi, Takashi I-742 Takeuchi, Yoshinori II-30 Takizawa, Hotaka II-618 Tan, Kay Sin I-625 Tan, Min II-376 Tanaka, Ken-ichi II-569 Tanaka, Satoshi I-7 Tanaka, Yoshiyuki I-933 Taniai, Yoshiaki I-1091 Tao, Dacheng I-643, I-791 Thomas, Doreen A. I-102 Tiňo, Peter I-405 Toda, Akihiro II-1027 Tokutaka, Heizo II-274 Torikai, Hiroyuki II-87 Tornio, Matti II-305 Totoki, Yusuke I-83 Tovar, Gessyca Maria II-117 Tsai, Cheng-Fa II-356 Tsai, Yu-Shuen II-866 Tsubone, Tadashi II-243, II-969 Tsuboyama, Manabu I-338 Tsuda, Ichiro I-170 Tsuji, Toshio I-933 Tsukada, Yuki II-817 Tsutsui, Kiyotaka II-969 Uchibe, Eiji II-167 Uchida, Masato I-771 Umeno, Ken I-693 Umezaki, Taizo II-559 Uno, Yoji I-923, II-77 Usui, Shiro I-943, I-1051, II-884, II-895, II-905 Utama, Nugraha P. II-960
Valdes-Sosa, Pedro A. I-802 van Hemmen, J. Leo I-102 Vanstone, Bruce II-478 van Wyk, Anton II-693 van Wyk, Ben II-693 Vialatte, François I-112 von der Malsburg, Christoph I-385 Wada, Yasuhiro II-243, II-969, II-1027 Wagatsuma, Hiroaki I-151, I-160, II-177 Wagatsuma, Nobuhiko I-348 Wakuya, Hiroshi II-1047 Wang, Jun I-703 Wang, Lipo II-789 Wang, Shir Li I-625 Wang, Shuen-Ping I-723 Watanabe, Jobu I-802 Watanabe, Keigo II-147 Watanabe, Kenji II-798 Watanabe, Sumio I-466, II-67 Watchareeruetai, Ukrit II-30 Wei, Ru I-415 Weiler, Daniel I-963 Wimalaratna, Sunil II-921 Wolff, Christian I-385 Wolfrum, Philipp I-385 Won, Woong-Jae I-953 Wong, Kok Wai II-675 Wong, Kwok-Wo I-456 Wong, Tien-Tsin I-289 Wu, Tony I-48 Wu, Xindong I-791 Wysoski, Simei Gomes II-406 Xu, Rui I-821 Xu, Yong I-456 Xu, Zenglin I-841 Yagi, Tetsuya I-7, II-107 Yamaguchi, Kazunori I-635 Yamaguchi, Yoko I-151, I-160, I-802, II-177 Yamaguchi, Yoshiki II-637 Yamamoto, Yorihisa II-637 Yamane, Ken II-1065 Yamauchi, Koichiro I-279 Yamazaki, Keisuke I-466 Yang, Jie I-643 Yang, Jun-Mei II-315
Yang, Tao II-376 Yang, Wenlu I-831 Yang, Yang II-827 Yasunaga, Moritoshi II-637 Yasuyama, Kouji II-905 Yen, Chia-Chen II-356 Yoda, Ikushi II-628 Yokohari, Fumio II-905 Yokoi, Hirokazu I-395 Yokoyama, Ayami II-185 Yoon, Dongwoo I-547 Yoshida, Manabu I-487 Yoshihara, Ikuo II-637 Yoshimoto, Junichiro I-614 Yoshioka, Taku I-576
Yoshizuka, Takeharu I-1071 Yu, Byron M. I-586 Zdunek, Rafal I-781, I-811, II-519 Zhang, Chenli I-338 Zhang, Li-Qing I-811 Zhang, Liqing I-831, II-1037 Zhang, Qi I-1130 Zhang, Tianhao I-643 Zhihua, Wu I-151 Zhou, Nina II-789 Zhou, Ting II-426 Zhu, Jianke I-841 Zhu, Wenjun I-831