George A. Tsihrintzis, Maria Virvou, Robert J. Howlett, and Lakhmi C. Jain (Eds.)
New Directions in Intelligent Interactive Multimedia
Studies in Computational Intelligence, Volume 142

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw, Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com
George A. Tsihrintzis Maria Virvou Robert J. Howlett Lakhmi C. Jain (Eds.)
New Directions in Intelligent Interactive Multimedia
Prof. George Tsihrintzis
Department of Informatics
University of Piraeus
80, Karaoli & Dimitriou St.
Piraeus 18534, Greece
E-mail: [email protected]

Prof. Maria Virvou
Department of Informatics
University of Piraeus
80, Karaoli & Dimitriou St.
Piraeus 18534, Greece
E-mail: [email protected]

Prof. Robert Howlett
School of Engineering Research Centre
University of Brighton
Moulsecoomb, Brighton, BN2 4GJ, UK
E-mail: [email protected]

Prof. Lakhmi C. Jain
KES Centre
School of Electrical and Information Engineering
University of South Australia
Adelaide, Mawson Lakes Campus
South Australia SA 5095, Australia
E-mail: [email protected]
ISBN 978-3-540-68126-7
e-ISBN 978-3-540-68127-4
DOI 10.1007/978-3-540-68127-4 Studies in Computational Intelligence
ISSN 1860-949X
Library of Congress Control Number: 2008926411

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com
Sponsoring Institutions
University of Piraeus – Rector’s Council
University of Piraeus – Research Center
University of Piraeus – Graduate Program of Studies in Informatics
Ministry of National Education and Religious Affairs
County of Piraeus
Eurobank EFG, S.A.
Preface to: New Directions in Intelligent Interactive Multimedia

George A. Tsihrintzis¹, Maria Virvou¹, Robert J. Howlett², and Lakhmi C. Jain³

¹ Department of Informatics, University of Piraeus
² Center for Smart Systems, University of Brighton
³ School of Electrical & Information Engineering, University of South Australia
Multimedia systems is the term chosen to refer to the coordinated and secure storage, processing, transmission and retrieval of multiple forms of information, such as audio, image, video, animation, graphics, and text. During the last decade, multimedia systems have become a vibrant field of research and development worldwide, and multimedia services built on such systems have made significant progress in recent times. Multimedia systems and services have been developed to address needs in various areas including, but not limited to, advertisement, art, business, education, entertainment, engineering, medicine, mathematics, scientific research and spatiotemporal applications. The growth rate of multimedia services has become explosive, as technological progress strives to keep pace with consumer demand for content.

In our times, computers are more widespread than ever, and computer users range from highly qualified scientists to non-computer-expert professionals, and may include people with special needs. Thus, interactivity, personalization and adaptivity have become a necessity in modern multimedia systems and services. Modern intelligent multimedia systems need to be interactive not only through classical modes of interaction, where the user inputs information through a keyboard or mouse; they must also support other modes of interaction, such as visual or lingual computer-user interfaces, which render them more attractive, more user-friendly, more human-like and more informative. On the other hand, "one-size-fits-all" solutions are no longer applicable to wide ranges of users of various backgrounds and needs. Therefore, one important goal of many intelligent multimedia systems is their ability to provide personalized service and to adapt dynamically to their users.

To achieve these goals, intelligent interactive multimedia systems and services (IIMSS) need to evolve at all levels of processing. Specific sub-areas requiring further research include:

1. Advances in Multimedia Data Analysis
2. New Reasoning Approaches
3. More efficient Infrastructure for Intelligent Interactive Multimedia Systems and Services
4. Development of innovative Multimedia Application Areas
5. Improvement of the Quality of Interactive Multimedia Services
This book summarizes the works and new research results presented at the First International Symposium on Intelligent Interactive Multimedia Systems and Services (KES-IIMSS 2008), organized by the University of Piraeus and its Department of Informatics in conjunction with KES International (Piraeus, Greece, July 9–11, 2008). The aim of the symposium was to provide an internationally respected forum for scientific research into the technologies and applications of intelligent interactive multimedia systems and services.

Besides the Preface, the book contains sixty-four (64) chapters. The first four (4) chapters are printed versions of the keynote addresses of the invited speakers of KES-IIMSS 2008. Besides the invited speaker chapters, the book contains fifteen (15) chapters on recent Advances in Multimedia Data Analysis, eleven (11) chapters on Reasoning Approaches, nine (9) chapters on Infrastructure of Intelligent Interactive Multimedia Systems and Services, fourteen (14) chapters on Multimedia Applications, and eleven (11) chapters on Quality of Interactive Multimedia Services.

More specifically, Chapter 1 by Germano Resconi is on "Morphic Computing." Chapter 2 by Mike Christel is on "Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces." Chapter 3 by Alfred Kobsa is on "Privacy-Enhanced Personalization." Chapter 4 by Paul Brna is on "Narrative Interactive Multimedia Learning Environments: Achievements and Challenges."

Chapters 5, 6, 7, and 8 cover various aspects of Image and Video Analysis, while Chapters 9, 10, 11, and 12 are devoted to Fast Methods for Intelligent Image Recognition. Chapters 13, 14, 15, and 16 are devoted to Audio Analysis and Chapters 17, 18, and 19 present new results in Time Series Analysis in Financial Services. Chapters 20, 21, and 22 present new results in Multimedia Information Clustering and Retrieval, while Chapters 23, 24, and 25 are devoted to Decision Support Services. Additionally, Chapters 26, 27, 28, 29, and 30 are devoted to Reasoning-based Intelligent Information Systems. Chapters 31, 32, 33, and 34 are devoted to Wireless and Web-based Multimedia and Chapters 35, 36, 37, 38, and 39 present Techniques and Applications for Multimedia Security. Chapters 40, 41, 42, 43, and 44 are devoted to Tutoring Systems, while Chapters 45, 46, and 47 are devoted to Geographical Multimedia Services. Chapters 48 and 49 present multimedia applications in Interactive TV and Chapters 50, 51, 52, and 53 are devoted to Intelligent and Interactive Multimedia in Bioinformatics and Medical Informatics. Chapters 54, 55, and 56 present new results in Affective Multimedia, while Chapters 57, 58, 59, 60, and 61 present Multimedia Techniques for Ambient Intelligence. Finally, Chapters 62, 63, and 64 are devoted to approaches for the Evaluation of Multimedia Services.

We wish to express our gratitude to the authors of the various chapters and to the reviewers for their wonderful contributions. For their help with organizational issues of KES-IIMSS 2008, we express our thanks to Ms. Paraskevi Lampropoulou, Ms. Lina Stamati, Mr. Efthimios Alepis and Mr. Konstantinos Patsakis, doctoral students at the
University of Piraeus, and Mr. Peter Cushion of KES International. Thanks are due to Springer-Verlag for its editorial support, and we would also like to express our sincere thanks to Mr. Thomas Ditzinger for his wonderful editorial support.

We believe that this book will help create interest among researchers and practitioners towards realizing human-like interactive multimedia services. The book should prove useful to researchers, professors, research students and practitioners, as it reports novel research work on challenging topics in the area of intelligent interactive multimedia systems and services. Moreover, special emphasis has been put on highlighting issues concerning the development process of such complex systems and services, thus revisiting the difficult issue of knowledge engineering of such systems. In this way, the book aims at providing the readers with a better understanding of how intelligent interactive multimedia systems and services can be successfully implemented to incorporate recent trends and advances in theory and applications of intelligent systems.

George A. Tsihrintzis
Maria Virvou
Robert J. Howlett
Lakhmi C. Jain
Contents
Morphic Computing
Germano Resconi . . . . . 1
Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces
Michael G. Christel . . . . . 21
Privacy-Enhanced Personalization
Alfred Kobsa . . . . . 31
Narrative Interactive Multimedia Learning Environments: Achievements and Challenges
Paul Brna . . . . . 33
A Support Vector Machine Approach for Video Shot Detection
Vasileios Chasanis, Aristidis Likas, Nikolaos Galatsanos . . . . . 45
Comparative Performance Evaluation of Artificial Neural Network-Based vs. Human Facial Expression Classifiers for Facial Expression Recognition
I.-O. Stathopoulou, G.A. Tsihrintzis . . . . . 55
Histographic Steganographic System
Constantinos Patsakis, Nikolaos Alexandris . . . . . 67
Moving Object Detection and Tracking for the Purpose of Multimodal Surveillance System in Urban Areas
Andrzej Czyzewski, Piotr Dalka . . . . . 75
Image Similarity Search in Large Databases Using a Fast Machine Learning Approach
Smiljan Šinjur, Damjan Zazula . . . . . 85
Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets
Mitja Lenič, Boris Cigale, Božidar Potočnik, Damjan Zazula . . . . . 95
Fast and Intelligent Determination of Image Segmentation Method Parameters
Božidar Potočnik, Mitja Lenič . . . . . 107

Fast Image Segmentation Algorithm Using Wavelet Transform
Tomaž Romih, Peter Planinšič . . . . . 117

Musical Instrument Category Discrimination Using Wavelet-Based Source Separation
P.S. Lampropoulou, A.S. Lampropoulos, G.A. Tsihrintzis . . . . . 127

Music Perception as Reflected in Bispectral EEG Analysis under a Mirror Neurons-Based Approach
Panagiotis Doulgeris, Stelios Hadjidimitriou, Konstantinos Panoulas, Leontios Hadjileontiadis, Stavros Panas . . . . . 137

Automatic Recognition of Urban Soundscenes
Stavros Ntalampiras, Ilyas Potamitis, Nikos Fakotakis . . . . . 147

Low Bitrate Coding of Spot Audio Signals for Interactive and Immersive Audio Applications
Athanasios Mouchtaris, Christos Tzagkarakis, Panagiotis Tsakalides . . . . . 155

Extracting Input Features and Fuzzy Rules for Forecasting Exchange Rate Using NEWFM
Sang-Hong Lee, Hyoung J. Jang, Joon S. Lim . . . . . 165

Forecasting Short-Term KOSPI Time Series Based on NEWFM
Sang-Hong Lee, Hyoung J. Jang, Joon S. Lim . . . . . 175

The Convergence Analysis of an Improved Artificial Immune Algorithm for Clustering
Jianhua Tong, Hong-Zhou Tan, Leiyong Guo . . . . . 185

Artificial Immune System-Based Music Genre Classification
D.N. Sotiropoulos, A.S. Lampropoulos, G.A. Tsihrintzis . . . . . 191

Semantic Information Retrieval Dedicated to Multimedia Systems: A Platform Based on Conceptual Graphs
Xavier Aimé, Francky Trichet . . . . . 201
Interactive Cluster-Based Personalized Retrieval on Large Document Collections
Petros Belsis, Charalampos Konstantopoulos, Basilis Mamalis, Grammati Pantziou, Christos Skourlas . . . . . 211

Decision Support Services Facilitating Uncertainty Management
Sylvia Encheva . . . . . 221

Efficient Knowledge Transfer by Hearing a Conversation While Doing Something
Eiko Yamamoto, Hitoshi Isahara . . . . . 231

On Managing Users' Attention in Knowledge-Intensive Organizations
Dimitris Apostolou, Stelios Karapiperis, Nenad Stojanovic . . . . . 239

Two Applications of Paraconsistent Logical Controller
Jair Minoro Abe, Kazumi Nakamatsu, Seiki Akama . . . . . 249

Encoding Modalities into Extended Petri Net for Analyzing Discrete Event Business Process
Takashi Hattori, Hiroshi Kawakami, Osamu Katai, Takayuki Shiose . . . . . 255

Paraconsistent Before-After Relation Reasoning Based on EVALPSN
Kazumi Nakamatsu, Jair Minoro Abe, Seiki Akama . . . . . 265

Image Representation with Reduced Spectrum Pyramid
Roumen Kountchev, Roumiana Kountcheva . . . . . 275

Constructive Logic and the Sorites Paradox
Seiki Akama, Kazumi Nakamatsu, Jair Minoro Abe . . . . . 285

Resource Authorization in IMS with Known Multimedia Service Adaptation Capabilities
Tomislav Grgic, Vedran Huskic, Maja Matijasevic . . . . . 293

Visualizing Ontologies on the Web
Ioannis Papadakis, Michalis Stefanidakis . . . . . 303

Performance Analysis of ACL Packets Using Turbo Code in Bluetooth Wireless System
Il-Young Moon . . . . . 313

Design and Implementation of Remote Monitoring System for Supporting Safe Subways Based on USN
Seok Cheol Lee, Chang Soo Kim . . . . . 321
Evaluation of PC-Based Real-Time Watermark Embedding System for Standard-Definition Video Stream
Takaaki Yamada, Yoshiyasu Takahashi, Hiroshi Yoshiura, Isao Echizen . . . . . 331

User Authentication Scheme Using Individual Auditory Pop-Out
Kotaro Sonoda, Osamu Takizawa . . . . . 341

Combined Scheme of Encryption and Watermarking in H.264/Scalable Video Coding (SVC)
Su-Wan Park, Sang-Uk Shin . . . . . 351

Evaluation of Integrity Verification System for Video Content Using Digital Watermarking
Takaaki Yamada, Yoshiyasu Takahashi, Yasuhiro Fujii, Ryu Ebisawa, Hiroshi Yoshiura, Isao Echizen . . . . . 363

Improving the Host Authentication Mechanism for POD Copy Protection System
Eun-Jun Yoon, Kee-Young Yoo . . . . . 373

User Stereotypes Concerning Cognitive, Personality and Performance Issues in a Collaborative Learning Environment for UML
Kalliopi Tourtoglou, Maria Virvou . . . . . 385

Intelligent Mining and Indexing of Multi-language e-Learning Material
Angela Fogarolli, Marco Ronchetti . . . . . 395

Classic and Multimedia Based Activities to Teach Colors for Both Teachers and Their Pre-school Kids at the Kindergarten of Arab Schools in South of Israel
Mahmoud Huleihil, Huriya Huleihil . . . . . 405

TeamSim: An Educational Micro-world for the Teaching of Team Dynamics
Orazio Miglino, Luigi Pagliarini, Maurizio Cardaci, Onofrio Gigliotta . . . . . 417

The Computerized Career Gate Test K.17
Theodore Katsanevas . . . . . 427

Fuzzy Logic Decisions and Web Services for a Personalized Geographical Information System
Constantinos Chalvantzis, Maria Virvou . . . . . 439
Design Rationale of an Adaptive Geographical Information System
Katerina Kabassi, Georgios P. Heliades . . . . . 451

Multimedia, User-Centered Design and Tourism: Simplicity, Originality and Universality
Francisco V. Cipolla Ficarra, Miguel Cipolla Ficarra . . . . . 461

Dynamically Extracting and Exploiting Information about Customers for Knowledge-Based Interactive TV-Commerce
Anastasios Savvopoulos, Maria Virvou . . . . . 471

Caring TV as a Service Design with and for Elderly People
Katariina Raij, Paula Lehto . . . . . 481

A Biosignal Classification Neural Modeling Methodology for Intelligent Hardware Construction
Anastasia Kastania, Stelios Zimeras, Sophia Kossida . . . . . 489

Virtual Intelligent Agents to Train Abilities of Diagnosis in Psychology and Psychiatry
José Gutiérrez-Maldonado, Ivan Alsina-Jurnet, María Virginia Rangel-Gómez, Angel Aguilar-Alonso, Adolfo José Jarne-Esparcia, Antonio Andrés-Pueyo, Antoni Talarn-Caparrós . . . . . 497

The Role of Neural Networks in Biosignals Classification
Stelios Zimeras, Anastasia Kastania . . . . . 507

Medical Informatics in the Web 2.0 Era
Iraklis Varlamis, Ioannis Apostolakis . . . . . 513

Affective Reasoning Based on Bi-modal Interaction and User Stereotypes
Efthymios Alepis, Maria Virvou, Katerina Kabassi . . . . . 523

General-Purpose Emotion Assessment Testbed Based on Biometric Information
Jorge Teixeira, Vasco Vinhas, Eugenio Oliveira, Luis Paulo Reis . . . . . 533

Realtime Dynamic Multimedia Storyline Based on Online Audience Biometric Information
Vasco Vinhas, Eugenio Oliveira, Luis Paulo Reis . . . . . 545

Assessing Separation of Duty Policies through the Interpretation of Sampled Video Sequences: A Pair Programming Case Study
Marco Anisetti, Valerio Bellandi, Ernesto Damiani, Gabriele Gianini . . . . . 555
Trellis Based Real-Time Depth Perception Chip Using Interline Constraint
Sungchan Park, Hong Jeong . . . . . 565

Simple Perceptually-Inspired Methods for Blob Extraction
Paolo Falcoz . . . . . 577

LOGOS: A Multimodal Dialogue System for Controlling Smart Appliances
Theodoros Kostoulas, Iosif Mporas, Todor Ganchev, Nikos Katsaounos, Alexandros Lazaridis, Stavros Ntalampiras, Nikos Fakotakis . . . . . 585

One-Channel Separation and Recognition of Mixtures of Environmental Sounds: The Case of Bird-Song Classification in Composite Soundscenes
Ilyas Potamitis . . . . . 595

Evaluating the Next Generation of Multimedia Software
Ray Adams . . . . . 605

Evaluation Process and Results of a Middleware System for Accessing Digital Music Libraries in Mobile Services
P.S. Lampropoulou, A.S. Lampropoulos, G.A. Tsihrintzis . . . . . 615

Interactive Systems, Design and Heuristic Evaluation: The Importance of the Diachronic Vision
Francisco V. Cipolla Ficarra, Miguel Cipolla Ficarra . . . . . 625

Author Index . . . . . 635
Author Index
Abe, Jair Minoro 249, 265, 285
Adams, Ray 605
Aguilar-Alonso, Angel 497
Aimé, Xavier 201
Akama, Seiki 249, 265, 285
Alepis, Efthymios 523
Alexandris, Nikolaos 67
Alsina-Jurnet, Ivan 497
Andrés-Pueyo, Antonio 497
Anisetti, Marco 555
Apostolakis, Ioannis 513
Apostolou, Dimitris 239
Bellandi, Valerio 555
Belsis, Petros 211
Brna, Paul 33
Cardaci, Maurizio 417
Chalvantzis, Constantinos 439
Chasanis, Vasileios 45
Christel, Michael G. 21
Cigale, Boris 95
Cipolla Ficarra, Francisco V. 461, 625
Cipolla Ficarra, Miguel 461, 625
Czyzewski, Andrzej 75
Dalka, Piotr 75
Damiani, Ernesto 555
Doulgeris, Panagiotis 137
Ebisawa, Ryu 363
Echizen, Isao 331, 363
Encheva, Sylvia 221
Fakotakis, Nikos 147, 585
Falcoz, Paolo 577
Fogarolli, Angela 395
Fujii, Yasuhiro 363
Galatsanos, Nikolaos 45
Ganchev, Todor 585
Gianini, Gabriele 555
Gigliotta, Onofrio 417
Grgic, Tomislav 293
Guo, Leiyong 185
Gutiérrez-Maldonado, José 497
Hadjidimitriou, Stelios 137
Hadjileontiadis, Leontios 137
Hattori, Takashi 255
Heliades, Georgios P. 451
Huleihil, Huriya 405
Huleihil, Mahmoud 405
Huskic, Vedran 293
Isahara, Hitoshi 231
Jang, Hyoung J. 165, 175
Jarne-Esparcia, Adolfo José 497
Jeong, Hong 565
Kabassi, Katerina 451, 523
Karapiperis, Stelios 239
Kastania, Anastasia 489, 507
Katai, Osamu 255
Katsanevas, Theodore 427
Katsaounos, Nikos 585
Kawakami, Hiroshi 255
Kim, Chang Soo 321
Kobsa, Alfred 31
Konstantopoulos, Charalampos 211
Kossida, Sophia 489
Kostoulas, Theodoros 585
Kountchev, Roumen 275
Kountcheva, Roumiana 275
Lampropoulos, A.S. 127, 191, 615
Lampropoulou, P.S. 127, 615
Lazaridis, Alexandros 585
Lee, Sang-Hong 165, 175
Lee, Seok Cheol 321
Lehto, Paula 481
Lenič, Mitja 95, 107
Likas, Aristidis 45
Lim, Joon S. 165, 175
Mamalis, Basilis 211
Matijasevic, Maja 293
Miglino, Orazio 417
Moon, Il-Young 313
Mouchtaris, Athanasios 155
Mporas, Iosif 585
Nakamatsu, Kazumi 249, 265, 285
Ntalampiras, Stavros 147, 585
Oliveira, Eugenio 533, 545
Pagliarini, Luigi 417
Panas, Stavros 137
Panoulas, Konstantinos 137
Pantziou, Grammati 211
Papadakis, Ioannis 303
Park, Su-Wan 351
Park, Sungchan 565
Patsakis, Constantinos 67
Planinšič, Peter 117
Potamitis, Ilyas 147, 595
Potočnik, Božidar 95, 107
Raij, Katariina 481
Rangel-Gómez, María Virginia 497
Reis, Luis Paulo 533, 545
Resconi, Germano 1
Romih, Tomaž 117
Ronchetti, Marco 395
Savvopoulos, Anastasios 471
Shin, Sang-Uk 351
Shiose, Takayuki 255
Šinjur, Smiljan 85
Skourlas, Christos 211
Sonoda, Kotaro 341
Sotiropoulos, D.N. 191
Stathopoulou, I.-O. 55
Stefanidakis, Michalis 303
Stojanovic, Nenad 239
Takahashi, Yoshiyasu 331, 363
Takizawa, Osamu 341
Talarn-Caparrós, Antoni 497
Tan, Hong-Zhou 185
Teixeira, Jorge 533
Tong, Jianhua 185
Tourtoglou, Kalliopi 385
Trichet, Francky 201
Tsakalides, Panagiotis 155
Tsihrintzis, G.A. 55, 127, 191, 615
Tzagkarakis, Christos 155
Varlamis, Iraklis 513
Vinhas, Vasco 533, 545
Virvou, Maria 385, 439, 471, 523
Yamada, Takaaki 331, 363
Yamamoto, Eiko 231
Yoo, Kee-Young 373
Yoon, Eun-Jun 373
Yoshiura, Hiroshi 331, 363
Zazula, Damjan 85, 95
Zimeras, Stelios 489, 507
Morphic Computing

Germano Resconi
Catholic University, Brescia, Italy
[email protected]
Abstract. In this paper, we introduce a new type of computation called "Morphic Computing". Morphic Computing is based on Field Theory and, more specifically, on Morphic Fields. Morphic Fields were first introduced by Rupert Sheldrake [1981] based on his hypothesis of formative causation, which makes use of the older notion of Morphogenetic Fields. Rupert Sheldrake [1981] developed his famous theory, Morphic Resonance, on the basis of the work of the French philosopher Henri Bergson. Morphic Fields and their subset, Morphogenetic Fields, have been at the center of controversy for many years in mainstream science, and the hypothesis is not accepted by some scientists, who consider it pseudoscience. We claim that Morphic Computing is a natural extension of Holographic Computation, Quantum Computation, Soft Computing, and DNA Computing. All natural computations bounded by the Turing Machine can be formalised and extended by our new type of computation model – Morphic Computing. In this paper, we introduce the basis for our new computing paradigm – Morphic Computing – and its extensions, such as Quantum Logic and Entanglement in Morphic Computing, Morphic Systems and Morphic System of Systems (M-SOS). We also present its applications to the field of computation by words as an example of Morphic Computing, Morphogenetic Fields in neural networks and Morphic Computing, Morphic Fields – concepts and Web search, and agents and fuzziness in Morphic Computing.

Keywords: Morphic Computing, Morphogenetic Computing, Morphic Fields, Morphogenetic Fields, Quantum Computing, DNA Computing, Soft Computing, Computing with Words, Morphic Systems, Morphic Network, Morphic System of Systems.
1 Introduction

Inspired by the work of the French philosopher Henri Bergson, Rupert Sheldrake [1981] developed his famous theory, Morphic Resonance. His work on Morphic Fields, which is based on the Morphic Resonance theory, was published in his well-known book "A New Science of Life: The Hypothesis of Morphic Resonance" (1981, second edition 1985). The Morphic Fields of Rupert Sheldrake [1981] are based on his hypothesis of formative causation, which makes use of the older notion of Morphogenetic Fields. Morphic Fields and their subset, Morphogenetic Fields, have been at the centre of controversy for many years in mainstream science, and the hypothesis is not accepted by some scientists, who consider it pseudoscience. The Morphogenetic Field is a hypothetical biological field, used by environmental biologists since the 1920s, which deals with living things. However, Morphic Fields are more general than Morphogenetic Fields and are defined as universal information for both organic (living) and abstract forms. Sheldrake
defined Morphic and Morphogenetic Fields in his book The Presence of the Past [1988] as follows: "The term [Morphic Fields] is more general in its meaning than Morphogenetic Fields, and includes other kinds of organizing fields in addition to those of morphogenesis; the organizing fields of animal and human behaviour, of social and cultural systems, and of mental activity can all be regarded as Morphic Fields which contain an inherent memory." – Sheldrake [1988].

Our new computation paradigm – Morphic Computing – is based on Field Theory and, more specifically, on Morphic Fields. We claim that Morphic Computing is a natural extension of Holographic Computation, Quantum Computation, Soft Computing, and DNA Computing. We also claim that all natural computations bounded by the Turing Machine can be formalised and extended by our new type of computation model – Morphic Computing. In this paper, we first introduce the basis for our new computing paradigm – Morphic Computing – based on Field Theory. We then introduce its extensions, such as Quantum Logic and Entanglement in Morphic Computing, and Morphic Systems and Morphic System of Systems (M-SOS). Next, Morphic Computing's applications to the field of computation by words are given. Finally, we present Morphogenetic Fields in neural networks and Morphic Computing, Morphic Fields – concepts and Web search, and agents and fuzziness in Morphic Computing.
2 Morphic Computing and Field Theory: Classical and Modern Approach

2.1 Fields

In this paper, we assume that computing is not always related to symbolic entities such as numbers, words or other symbolic entities. Fields as entities are more complex than any symbolic representation of knowledge. For example, Morphic Fields include the universal database for both organic (living) and abstract (mental) forms.

In classical physics, we represent the interaction among particles by local forces that are the cause of the movement of the particles. In classical physics it is also more important to know at any moment the individual values of the forces than the structure of the forces; this approach considers the particles to be independent of one another under the effect of the external forces. But with the further development of particle physics, researchers discovered that forces are produced by intermediate entities that are not located at one particular point of space but are at every point of a specific space at the same time. These entities are called "fields". Based on this new theory, the structure of the fields is more important than the value of the force itself at any point. In this representation of the universe, any particle in any position is under the effect of the fields. Therefore, the fields connect all the particles of the universe into one global entity.

However, if any particle is under the effect of the other particles, every local invariant property disappears, because every system is open and it is not possible to close any local system. To solve this invariance problem, scientists discovered that the local invariant can be conserved
with deformation of the local geometry and the metric of the space. The change of local geometry compensates the change of the invariant due to the field. We can assume that the action of the fields can be compensated by a deformation of the reference at every point. So we cannot use only one global reference; instead, we have an infinite number of references, one at every point. Any particle is under the action of the fields; however, the references that we have chosen change in space and time in a way that compensates the action of the field. In this case, all the reference spaces have been changed and the action of the field is completely compensated: the action of the field is hidden in the change of the reference. We thus have different reference spaces whose geometry, in general, is non-Euclidean. With quantum phenomena, the problem becomes more complex, because the particles are correlated with one another in a far more hidden way, without any physical interaction through fields. This correlation or entanglement generates a structure inside the universe for which the probability to detect a particle is a virtual or conceptual field that covers the entire Universe.

2.2 Morphic Computing: Basis for Quantum, DNA, and Soft Computing

Gabor [1972] and H. Fatmi and Resconi [1988] discovered the possibility of computing images made of a huge number of points as output from objects, themselves sets of a huge number of points, as input, by means of reference beams or lasers (holography). It is also known that a set of particles can have a huge number of possible states given by the positions and momenta of the particles. In classical physics it is impossible to have two different states at the same time: one or more particles cannot at the same time have different positions and different momenta, so the states are separate from one another. In quantum mechanics, by contrast, one can have a superposition of all states: all states are present in the superposition at the same time. It is also very important to note that one cannot separate the superposed states as individual entities; one must consider all the states together as one entity. Thanks to this very peculiar property of quantum mechanics, one can change all the superposed states at the same time. This type of global computation is the conceptual principle by which we think one can build quantum computers. A similar phenomenon can be used to develop DNA computation, where a huge number of DNA strands, as a field of DNA elements, are transformed (replication) at the same time and filtered (selection) to solve non-polynomial problems. In addition, soft computing or computation by words extends the classical local definition of the true and false values of a logic predicate to a field of degrees of truth and falsehood over the space of all possible values of the predicates. In this way, the computational power of soft computing is extended similarly to the computational power found in quantum computing, DNA computing, and holographic computing. In conclusion, one can expect that all the previous approaches and models of computing are examples of a more general computation model called "Morphic Computing", where "morphic" means "form" and is associated with the ideas of holism, geometry, field, superposition, globality and so on.

2.3 Morphic Computing and Conceptual Fields – Non-physical Fields

Morphic Computing changes or computes non-physical, conceptual fields. One example is the representation of the semantics of words. In this case, a field is
generated by a word or a sentence as its sources. For example, in a library, the reference space would be where the documents are located. For any given word, we define the field as a map from the positions of the documents in the library to the number of occurrences (values) of the word in each document. The word or source is located at one point of the reference space (query), but the field (answer) can be located in any part of the reference space. Complex strings of words (structured queries) generate a complex field, or complex answer, whose structure can be obtained by the superposition of the fields of the individual words as sources with different intensities. Any field is a vector in the space of the documents. A set of basic fields spans a vector space and forms a concept. We break with the traditional idea that a concept is one word in the conceptual map: the internal structure (entanglement) of the concept is the relation of dependence among the basic fields. An ambiguous word is the source (query) of a fuzzy set (field or answer).

2.4 Morphic Computing and Natural Languages – Theory of Generalized Constraint

In a particular case, we know that a key assumption in computing with words is that the information conveyed by a proposition expressed in a natural language, or by a word, may be represented as a generalized constraint of the form "X isr R", where X is a constrained variable, R is a constraining relation, and r is an indexing variable whose value defines the way in which R constrains X. Thus, if p is a proposition expressed in a natural language, then "X isr R" represents the meaning of p or, equivalently, the information conveyed by p. The generalised constraint model can therefore be represented by field theory in this way: the meaning of any natural proposition p is given by the space X of the fields that form a concept in the reference space, or objective space, and by a field R in the same reference space. We note that a concept is not only a word, but a domain or context X where the proposition p, represented by the field R, is located. In the new image, the word is not a passive entity but an active one: the word is the source of the field. We can also use the idea that the word, as an abstract entity, is a query, and the field, as the set of instances of the query, is the answer.

2.5 Morphic Computing and Agents – Non-classical Logic

In the agent image, where only one word (query) as a source is used for any agent, the field generated by the word (answer) is a Boolean field (the values at all points are true or false). Therefore, we can compose the words by logic operations to create a complex Boolean expression or complex Boolean query. This query generates a Boolean field for any agent. The set of agents creates a set of elementary Boolean fields whose superposition is the fuzzy set, represented by a field with fuzzy values. The field is the answer to the ambiguous structured query whose source is the complex expression p. Fields with fuzzy values for complex logic expressions are coherent with traditional fuzzy logic, with greater conceptual transparency, because they are founded on agents and a Boolean logic structure. As [Nikravesh, 2006] points out, the Web is a large unstructured, and in many cases conflicting, set of data. So in the World Wide Web, fuzzy logic and fuzzy sets are an essential part of queries, and also of finding appropriate searches to
obtain the relevant answer. In the agent interpretation of fuzzy sets, the net of the Web is structured as a set of conflicting, and in many cases irrational, agents whose task is to create each concept. Agents produce actions to create answers for ambiguous words in the Web. A structured query in RDF can be represented as a graph of three elementary concepts – subject, predicate and complement – in a conceptual map. Every word and relationship in the conceptual map is a variable whose values are fields, and their superposition gives the answer to the query. Because we are more interested in the meaning of the query than in how we write the query itself, we are more interested in the field than in how we produce the field by the query. In fact, different linguistic representations of the query can give the same field or answer.

In the construction of the answer by a query, we use symbolic words as sources of semantic fields with different intensities. Given a query as a chain of words, we activate the words as instruments to detect semantic sources in a context of documents, for example in the Web. The words of the query are diffused in many different documents. From the sources, as a semantic hologram in the Web, we can activate a process by which we generate other words located in the same documents. The fields of words in the same documents are superposed in a way that generates the answer to the query. The localisation of the words of the query inside the Web can be denoted as a WRITE process: we represent, or write, each individual word of the query inside the space of the Web as a semantic field, one for each word. Afterwards we have the READ process, by which we generate other semantic fields for other words from the field of the query words inside the Web. The READ process gives us the answer. In analogy with holography, the WRITE process is the construction of the hologram when we know the light field of the object; the READ process is the construction of the light-field image from the hologram. In holography, the READ process uses a beam of coherent light, such as a laser, to obtain the image. In our structured query, the words inside the text are activated at the same time; the words as sources are coherent in the construction, by superposition, of the desired answer or field. The field image of the computation by words, in a crisp and fuzzy interpretation, prepares the implementation of the Morphic Computing approach to computation by words. In this way, we have presented an example of the meaning of the new type of computation, "Morphic Computing".

2.6 Morphic Computing: Basic Concepts

Morphic Computing is based on the following concepts:

1) The concept of field in the reference space.
2) The fields as points or vectors in the N-dimensional Euclidean space of the objects (points).
3) A set of M ≤ N basis fields in the N-dimensional space. The set of M fields are vectors in the N-dimensional space and form a non-Euclidean subspace H (context) of the space N. The coordinates S_α in M of the field X are the contravariant components of the field X. The components of X in M are also the intensities of the sources of the basis fields. The superposition of the basis fields with different intensities gives us the projection Q of X, or Y = QX, into the space H. When M < N, the projection operator of X into H defines a constraint, or relation, among the components of Y.
4) With tensor calculus on the components S_α of the vector X, or on the components of more complex entities such as tensors, we can generate invariants for any unitary transformation of the object space or change of the basis fields.
5) Given two projection operators Q1, Q2 on two spaces H1, H2 with dimensions M1 and M2, we can generate the space of dimension M = M1 M2, with the product of Y1 and Y2, or Y = Y1 Y2. Any projection Q into the space H, or Y = QX, of the product of the basis fields generates Y. When Y ≠ Y1 Y2, the output Y is in an entanglement state and cannot be separated into the two projections Q1 and Q2.
6) The logic of the Morphic Computing Entity is the logic of the projection operators, which is isomorphic to quantum logic.
The information can be coded inside the basis fields by the relations among the basis fields. In Morphic Computing, this relation is represented by a non-Euclidean geometry whose metric, or expression for the distance between two points, shows the relation. The projection operator is similar to the measurement in quantum mechanics, and it can introduce constraints among the components of Y. The sources are the instrument with which we control the image Y in Morphic Computing. There is a deep analogy between Morphic Computing and computation by holography and by secondary sources (Jessel) in physical fields.

The computation of Y from X by the projection operator Q, which projects X into the space H, gives its result when Y is similar to X. In this case, the sources S are the solution of the computation. We see the analogy with a neural network, where the solution is to find the weights w_k at the synapses; in this paper, we show that the weights are sources in Morphic Computing. It is also possible to compose different projection operators in a network of Morphic Systems, and it is natural to consider such a system as a System of Systems.

Any Morphic Computation is always context-dependent, where the context is H. With the projection operator Q we project the input query X into the context H. The projection operator in holography is the process by which, from the object X, we obtain the image Y via the hologram. The context H is the space of the possible images that are generated by the hologram as sources. We remark that the images are a subset of the objects; in the image we lose information that is located in the object. We can control the context in a way that produces the desired results.

When any projection QX of X is regarded as a measurement, in analogy with quantum mechanics, any projection operator loses information, but its result can be seen by the instruments. Also in holography, the image is the projection of the object: we lose information, but from the image we can obtain information, as from an instrument, about the properties of the object. In the measurement analogy, any measurement depends on the previous measurements; that is, any measurement depends on the path of measurements, or projection operators, realised before it – the history. So we can say that different projection operators form a story (see Roland Omnès's stories, or consistent histories, in quantum mechanics). As in any story, we lose information, but we can use the story to obtain information about, and properties of, the original phenomena or objects.

The measurement analogy also gives another intuitive idea of Morphic Computing. A measurement becomes a good measurement when it gives us an image Y of the real phenomenon X that is similar to X – when the rules internal to X are not destroyed in
the measurement process, the measurement is a good one. The same holds for Morphic Computing: the computation is a good computation when the projection operator does not destroy the internal relations of the input field X. The projection operator, as in holography, also gives us a model of the object. In fact, the image is produced by a construction process in which we use sources to generate the image; a change or transformation of the sources gives new images that we can regard as computed by the sources and the transformation. So the sources give us the model of the object, by which we can generate new images of the same object. The analogy with measurement in quantum mechanics is also useful to explain the concept of Morphic Computing, because the instrument in a quantum measurement is the fundamental context that interferes with the physical phenomenon, as H interferes with the input field X. A deeper connection exists between the lattice of projection operators that represents quantum logic and Morphic Computing processes (see Eddie Oshins).

Because any fuzzy set is a scalar field of membership values on the factors (the reference space) (Wang and Sugeno), and any concept can be viewed as a fuzzy set in the factor space, we can apply to fuzzy sets all the processes and concepts that we utilise in Morphic Computing. Through the relation between concept and field, we introduce into field theory an intrinsic fuzzy logic. So in Morphic Computing we have an external logic of the projection, or measurement (quantum logic), and a possible internal fuzzy logic from the fuzzy interpretation of the fields. Finally, because we also use superpositions of agents to define fuzzy sets and fuzzy rules, we can again use Morphic Computing to compute the agents' inconsistency and irrationality. Fuzzy sets and fuzzy logic are thus part of the more general computation denoted Morphic Computing.
3 Reference Space, Space of the Objects, and Space of the Fields in Morphic Computing

Given the n-dimensional reference space (R_1, R_2, …, R_n), any point P = (R_1, R_2, …, R_n) is an object. We now create the space of the objects, whose dimension is equal to the number of points and whose coordinate values are the values of the field at those points. We call this space of points the "space of the objects". Inside the space of the objects, we can locate any type of field as a vector. In field theory, we assume that any complex field can be considered as a superposition of prototype fields whose models are well known. The prototype fields are vectors in the space of the objects that form a new reference, or field space. In general, the field space is a non-Euclidean space. In conclusion, any complex field Y can be written in this way:

Y = S_1 H_1(R_1, …, R_n) + S_2 H_2(R_1, …, R_n) + … + S_n H_n(R_1, …, R_n) = H(R) S    (1)
In equation (1), H_1, H_2, …, H_n are the basic, or prototype, fields and S_1, S_2, …, S_n are the weights, or source values, of the basic fields. We assume that any basic field is generated by a source, and that the intensity of a prototype field is proportional to the intensity of the source that generates it.

3.1 Example of the Basic Fields and Sources

In Figure 1, we show an example of two different basic fields in a two-dimensional reference space (x, y). The general equation of the fields is
F(x, y) = S e^(−h((x − x_0)² + (y − y_0)²))    (2)
The parameters of the field F_1 are S = 1, h = 2, x_0 = −0.5 and y_0 = −0.5; the parameters of the field F_2 are S = 1, h = 2, x_0 = 0.5 and y_0 = 0.5.
Fig. 1. Two different basic fields in the two dimensional reference space (x,y)
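As a concrete numerical illustration of Eq. (2) and of the superpositions discussed next, the following sketch evaluates the two Gaussian prototype fields on a sampled grid and superposes them with given source intensities. This is a minimal sketch assuming NumPy; the grid extent and resolution are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

def prototype_field(x, y, S=1.0, h=2.0, x0=0.0, y0=0.0):
    """Gaussian prototype field of Eq. (2): F(x, y) = S * exp(-h((x-x0)^2 + (y-y0)^2))."""
    return S * np.exp(-h * ((x - x0) ** 2 + (y - y0) ** 2))

# Sample the two-dimensional reference space (x, y) on a grid.
x, y = np.meshgrid(np.linspace(-2, 2, 101), np.linspace(-2, 2, 101))

F1 = prototype_field(x, y, S=1.0, h=2.0, x0=-0.5, y0=-0.5)  # first field of Fig. 1
F2 = prototype_field(x, y, S=1.0, h=2.0, x0=0.5, y0=0.5)    # second field of Fig. 1

F_a = F1 + F2        # superposition with sources S1 = 1, S2 = 1 (Fig. 2)
F_b = F1 + 2.0 * F2  # superposition with sources S1 = 1, S2 = 2 (Fig. 2)
```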
For the sources S_1 = 1 and S_2 = 1, the superposition field F shown in Figure 2 is F = F_1 + F_2. For the sources S_1 = 1 and S_2 = 2, the superposition field F, also shown in Figure 2, is F = F_1 + 2 F_2.

3.2 Computation of the Sources
To compute the sources S_k, we represent the prototype fields H_k and the input field X in Table 1, where the objects are the points and the attributes are the fields. The values in Table 1 are represented by the following matrices:
H = [ H_{1,1}  H_{1,2}  …  H_{1,N} ]
    [ H_{2,1}  H_{2,2}  …  H_{2,N} ]
    [   ⋮        ⋮      ⋱    ⋮     ]
    [ H_{M,1}  H_{M,2}  …  H_{M,N} ] ,      X = [ X_1, X_2, …, X_M ]^T
Fig. 2. Example of superposition of the elementary fields F_1, F_2

Table 1. Field values for M points in the reference space

Point | H_1     | H_2     | … | H_N     | Input field X
P_1   | H_{1,1} | H_{1,2} | … | H_{1,N} | X_1
P_2   | H_{2,1} | H_{2,2} | … | H_{2,N} | X_2
…     | …       | …       | … | …       | …
P_M   | H_{M,1} | H_{M,2} | … | H_{M,N} | X_M
The matrix H is the relation between the prototype fields H_k and the points P_h. At this point, we are interested in computing the sources S that give the best linear model of X from the elementary field values. We therefore have the superposition expression
Y = S_1 [ H_{1,1}, H_{2,1}, …, H_{M,1} ]^T + S_2 [ H_{1,2}, H_{2,2}, …, H_{M,2} ]^T + … + S_N [ H_{1,N}, H_{2,N}, …, H_{M,N} ]^T = H S    (3)
Then we compute the best sources S, such that the difference Y − X has minimum distance over all possible choices of the set of sources. It is easy to show that the best sources are obtained by the expression

S = (H^T H)^{−1} H^T X    (4)
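Numerically, Eq. (4) is an ordinary least-squares problem. The following is a minimal sketch assuming NumPy, with the M×N matrix H holding the prototype-field values of Table 1; the random field values here are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 100, 5                      # M points, N prototype fields (M > N)
H = rng.standard_normal((M, N))    # column k holds prototype field H_k at the M points
X = rng.standard_normal(M)         # input field X sampled at the same M points

# Best sources per Eq. (4): S = (H^T H)^{-1} H^T X
S = np.linalg.solve(H.T @ H, H.T @ X)

# The same solution via a numerically safer least-squares routine.
S_lstsq, *_ = np.linalg.lstsq(H, X, rcond=None)
assert np.allclose(S, S_lstsq)

Y = H @ S                          # output field Y = H S = Q X
```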
Given the previous discussion and field presentation, the elementary Morphic Computing element is given by the input-output system as shown in Figure 3.
[Figure: input field X → sources S = (H^T H)^{−1} H^T X computed over the prototype fields H(R) → output field Y = H S = Q X]

Fig. 3. Elementary Morphic Computing
[Figure: prototype-field blocks H with sources S_1, S_2, S_3 connecting the input field X to the output field Y]

Fig. 4. The network of Morphic Computing
Figure 4 shows a network of elementary Morphic Computing elements with three sets of prototype fields and three types of sources, with one general field X as input, one general field Y as output, and intermediary fields between X and Y.
When H is a square matrix, we have Y = X, with

S = H^{−1} X and Y = X = H S.
Now, for any elementary computation in Morphic Computing, we have the following three fundamental spaces:

1) the reference space;
2) the space of the objects (points);
3) the space of the prototype fields.

Figure 5 shows a very simple geometric example in which the number of objects is three (P_1, P_2, P_3) and the number of prototype fields is two (H_1, H_2). The space whose coordinates are the two fields is the space of the fields.
[Figure: the vectors H_1 and H_2 drawn in the three-dimensional space of the points P_1, P_2, P_3]

Fig. 5. The fields H_1 and H_2 span the space of the fields. The coordinates of the vectors H_1 and H_2 are the values of the fields at the three points P_1, P_2, P_3.
Please note that the output Y = H S is the projection of X into the space H:

Y = H (H^T H)^{−1} H^T X = Q X,

with the property Q² X = Q X. Therefore, the input X can be separated into two parts,

X = Q X + F,

where the vector F is perpendicular to the space H, as we can see in the simple example given in Figure 6.
[Figure: X decomposed into Q X = Y inside the plane spanned by H_1, H_2 and a perpendicular component F]

Fig. 6. Projection operator Q and output of the elementary Morphic Computing. We see that X = Q X + F, where the sum is the vector sum.
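Continuing the sketch above, the projection operator Q = H (H^T H)^{−1} H^T can be formed explicitly to check the idempotence Q² X = Q X and the decomposition X = Q X + F with F perpendicular to the space H. This reuses the H, X, S and Y of the previous sketch.

```python
# Projection operator Q = H (H^T H)^{-1} H^T onto the space of the prototype fields.
Q = H @ np.linalg.solve(H.T @ H, H.T)

assert np.allclose(Q @ (Q @ X), Q @ X)  # idempotence: Q^2 X = Q X
assert np.allclose(Q @ X, Y)            # Y = H S is exactly the projection Q X

F = X - Q @ X                           # the part of X outside the space H
assert np.allclose(H.T @ F, 0)          # F is perpendicular to every prototype field
```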
Now we try to extend the expression for the sources in the following way. Given G(Γ) = Γ^T Γ and G(H) = H^T H, let

S* = [ G(H) + G(Γ) ]^{−1} H^T X.

So, writing S* = (H^T H)^{−1} H^T X + Ω_α = S_α + Ω_α, we have

G(Γ) S_α + [ G(H) + G(Γ) ] Ω_α = 0,

and

S* = S + Ω = (H^T H + Γ^T Γ)^{−1} H^T X,

where Ω is a function of S through the equation G(Γ) S + [ G(H) + G(Γ) ] Ω = 0.

For a non-square and/or singular matrix, we can use the generalized model given by Nikravesh [] as follows:

S* = (H^T Λ^T Λ H)^{−1} H^T Λ^T Λ X = ( (ΛH)^T (ΛH) )^{−1} (ΛH)^T Λ X,

where we transform the input and the references H by Λ. The value of the variable D (the metric of the space of the fields) is computed by the expression

D² = (H S)^T (H S) = S^T H^T H S = S^T G S = (Q X)^T Q X    (5)
For a unitary transformation U, for which U^T U = I, the prototype fields change as H′ = U H, so that

G′ = (U H)^T (U H) = H^T U^T U H = H^T H = G

and

S′ = [ (U H)^T (U H) ]^{−1} (U H)^T Z = G^{−1} H^T U^T Z = G^{−1} H^T (U^{−1} Z).

For Z = U X we have S′ = S, and the variable D is invariant under the unitary transformation U.
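The invariance of the sources S and of the metric D under a unitary (here, real orthogonal) transformation U can be checked numerically under the same assumptions as the sketches above; U is generated from the QR decomposition of a random Gaussian matrix, an arbitrary way to obtain an orthogonal matrix.

```python
# Random orthogonal U (U^T U = I) from the QR decomposition of a Gaussian matrix.
U, _ = np.linalg.qr(rng.standard_normal((M, M)))

H_u, X_u = U @ H, U @ X                          # transformed references and input
S_u = np.linalg.solve(H_u.T @ H_u, H_u.T @ X_u)  # sources in the rotated frame
assert np.allclose(S_u, S)                       # S' = S: the sources are invariant

D2   = S   @ (H.T   @ H)   @ S                   # D^2 = S^T G S, Eq. (5)
D2_u = S_u @ (H_u.T @ H_u) @ S_u
assert np.isclose(D2, D2_u)                      # the metric D is invariant as well
```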
We remark that G = H^T H is a quadratic matrix that gives the metric tensor of the space of the fields. When G is a diagonal matrix, all the elementary fields are independent of one another. But when G has non-diagonal elements, the elementary fields depend on one another: among the elementary fields there is a correlation, or relationship, and the geometry of the space of the fields is non-Euclidean.
4 Quantum Logic and Entanglement in Morphic Computing

In Morphic Computing, we can compute on the contexts H as we compute on a Hilbert space, and we have an algebra among the contexts, or spaces, H. In fact, we have H = H_1 ⊕ H_2, where ⊕ is the direct sum. For example, given

H_1 = (h_{1,1}, h_{1,2}, …, h_{1,p}), where the h_{1,k} are the basis fields in H_1,
H_2 = (h_{2,1}, h_{2,2}, …, h_{2,q}), where the h_{2,k} are the basis fields in H_2,

we have

H = H_1 ⊕ H_2 = (h_{1,1}, h_{1,2}, …, h_{1,p}, h_{2,1}, h_{2,2}, …, h_{2,q}).

The intersection of the contexts is H = H_1 ∩ H_2: the space H is the subspace common to H_1 and H_2. In fact, for the sets V_1 and V_2 of vectors

V_1 = S_{1,1} h_{1,1} + S_{1,2} h_{1,2} + … + S_{1,p} h_{1,p},
V_2 = S_{2,1} h_{2,1} + S_{2,2} h_{2,2} + … + S_{2,q} h_{2,q},

the space or context H = H_1 ∩ H_2 includes all the vectors in V_1 ∩ V_2. Given a space H, we can also build the orthogonal space H⊥ of the vectors that are orthogonal to every vector in H. We then have the following logic structure:

Q(H_1 ⊕ H_2) = Q_1 ∨ Q_2 = Q_1 OR Q_2,

where Q_1 is the projection operator on the context H_1 and Q_2 is the projection operator on the context H_2;

Q(H_1 ∩ H_2) = Q_1 ∧ Q_2 = Q_1 AND Q_2;

Q(H⊥) = ID − Q = ¬Q = NOT Q.

In fact, we know that Q X − X = (Q − ID) X is orthogonal to Y and so orthogonal to H; the operator ID − Q is therefore the NOT operator.
Now it is easy to show [9] that the logic of the projection operators is isomorphic to quantum logic and forms an operator lattice for which the distributive law (interference) does not hold. In Figure 7 we show an expression in the projection lattice for Morphic Computing.
Fig. 7. An expression for the projection operator from the Morphic Computing entity: the query field X, the sources S, the prototype fields H = ( H1 ∨ H2 ) ∧ H3, and the output Q X = [ ( Q1 ∨ Q2 ) ∧ Q3 ] X = Y
Now we give an example of the projection logic and lattice. Given the elementary field references

H1 = [ 1 ; 0 ],   H2 = [ 0 ; 1 ],   H3 = [ 1/2 ; 1/2 ],

we have the projection operators

Q1 = H1 ( H1^T H1 )^{-1} H1^T = [ 1 0 ; 0 0 ],
Q2 = H2 ( H2^T H2 )^{-1} H2^T = [ 0 0 ; 0 1 ],
Q3 = H3 ( H3^T H3 )^{-1} H3^T = [ 1/2 1/2 ; 1/2 1/2 ].

With the lattice logic we have
H1,2 = H1 ⊕ H2 = [ 1 0 ; 0 1 ],   Q1 ∨ Q2 = H1,2 ( H1,2^T H1,2 )^{-1} H1,2^T = [ 1 0 ; 0 1 ],

H1,3 = H1 ⊕ H3 = [ 1 1/2 ; 0 1/2 ],   Q1 ∨ Q3 = H1,3 ( H1,3^T H1,3 )^{-1} H1,3^T = [ 1 0 ; 0 1 ],

H2,3 = H2 ⊕ H3 = [ 0 1/2 ; 1 1/2 ],   Q2 ∨ Q3 = H2,3 ( H2,3^T H2,3 )^{-1} H2,3^T = [ 1 0 ; 0 1 ],

and

H1 ∩ H2 = H1 ∩ H3 = H2 ∩ H3 = [ 0 ; 0 ],   so   Q1 ∧ Q2 = Q1 ∧ Q3 = Q2 ∧ Q3 = [ 0 0 ; 0 0 ].

In conclusion we have the lattice

    Q1 ∨ Q2 = Q1 ∨ Q3 = Q2 ∨ Q3
       /        |        \
     Q1        Q2        Q3
       \        |        /
    Q1 ∧ Q2 = Q1 ∧ Q3 = Q2 ∧ Q3 = 0

We remark that ( Q1 ∨ Q2 ) ∧ Q3 = Q3, but ( Q1 ∧ Q3 ) ∨ ( Q2 ∧ Q3 ) = 0 ∨ 0 = 0. When we try to separate Q1 from Q2 in the second expression, the result changes. Between Q1 and Q2 there is a connection or relation (Q1 and Q2 generate the two-dimensional space) that we destroy when we separate one from the other. In fact Q1 ∧ Q3 projects onto the zero point, and a union of zero points cannot create the two-dimensional space. The non-distributive property means that among the projection operators there is an entanglement, or relation, which we destroy when we separate the operators from one another.
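This failure of distributivity can be verified numerically. In the sketch below (our own illustrative code, not part of the chapter), the join is realised as the projection onto the sum of the two ranges and the meet as the projection onto their intersection:

import numpy as np

def proj(H):
    # Projection onto the column space of H (assumes full column rank).
    return H @ np.linalg.inv(H.T @ H) @ H.T

def join(Qa, Qb):
    # Qa v Qb: projection onto the sum of the two ranges.
    U, sv, _ = np.linalg.svd(np.hstack([Qa, Qb]))
    B = U[:, :np.sum(sv > 1e-10)]
    return B @ B.T

def meet(Qa, Qb):
    # Qa ^ Qb: projection onto the intersection of the two ranges.
    # v lies in both ranges iff (I - Qa) v = 0 and (I - Qb) v = 0.
    n = Qa.shape[0]
    A = np.vstack([np.eye(n) - Qa, np.eye(n) - Qb])
    _, sv, Vt = np.linalg.svd(A)
    N = Vt[np.sum(sv > 1e-10):].T          # basis of the common subspace
    return N @ N.T if N.size else np.zeros((n, n))

Q1 = proj(np.array([[1.0], [0.0]]))
Q2 = proj(np.array([[0.0], [1.0]]))
Q3 = proj(np.array([[0.5], [0.5]]))

lhs = meet(join(Q1, Q2), Q3)               # (Q1 v Q2) ^ Q3
rhs = join(meet(Q1, Q3), meet(Q2, Q3))     # (Q1 ^ Q3) v (Q2 ^ Q3)

assert np.allclose(lhs, Q3)                # left side is Q3 ...
assert np.allclose(rhs, 0.0)               # ... but right side is the zero projector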
Given two references or contexts H1, H2, the tensor product H = H1 ⊗ H2 is the composition of the two independent contexts into one. We can prove that the projection operator of the tensor product H is the tensor product of Q1 and Q2, so we have

Q = H ( H^T H )^{-1} H^T = Q1 ⊗ Q2.

The sources are Sαβ = Sα1 Sβ2, so we have

Y = Y1 ⊗ Y2 = ( H1 ⊗ H2 ) Sαβ.

The two Morphic Systems are independent of one another, and the output is the product of the outputs of the two Morphic Systems. Now we give an example:
H1 =
  [  1   1/2 ]
  [ 1/2   1  ]
  [  1   1/2 ],
H2 =
  [ 1/2   1  ]
  [  0   1/2 ]
  [ 1/2  1/2 ],

H1 ⊗ H2 = H1α,β H2γ,δ =
  [    1·H2   (1/2)·H2 ]
  [ (1/2)·H2     1·H2  ]
  [    1·H2   (1/2)·H2 ]  = Hα,β,γ,δ
To every value of the basis fields in H1 we associate the basis field H2 multiplied by that value. Now, because we have
Q1 = H1 ( H1^T H1 )^{-1} H1^T =
  [ 1/2  0  1/2 ]
  [  0   1   0  ]
  [ 1/2  0  1/2 ],

Q2 = H2 ( H2^T H2 )^{-1} H2^T =
  [ 2/3   1/3   1/3 ]
  [ 1/3   2/3  −1/3 ]
  [ 1/3  −1/3   2/3 ],

Q = Q1 ⊗ Q2 =
  [ (1/2)·Q2   0·Q2   (1/2)·Q2 ]
  [   0·Q2     1·Q2     0·Q2   ]
  [ (1/2)·Q2   0·Q2   (1/2)·Q2 ],

X1 = [ 1/3 ; 0 ; 1/3 ],   X2 = [ 1/4 ; 0 ; 1/4 ],   X = X1 ⊗ X2 = [ (1/3)·X2 ; 0·X2 ; (1/3)·X2 ],

S1 = ( H1^T H1 )^{-1} H1^T X1 = [ 4/9 ; −2/9 ],   S2 = ( H2^T H2 )^{-1} H2^T X2 = [ −3/2 ; 4/3 ],

and

Sαβ = S1 ⊗ S2 = [ (4/9)·S2 ; (−2/9)·S2 ].
For

Y1 = H1 S1 = [ 1/3 ; 0 ; 1/3 ],   Y2 = H2 S2 = [ 7/12 ; 2/3 ; −1/12 ],

Y = Y1 ⊗ Y2 = [ (1/3)·Y2 ; 0·Y2 ; (1/3)·Y2 ].

In conclusion, the computation of Q, S and Y from H and X can be obtained purely from the results Q1, S1, Y1 and Q2, S2, Y2 computed independently. When H and X cannot be written as tensor products of H1, X1 and H2, X2, the context H and the input X are not separable into simpler entities, and so they are entangled. Thus, with the tensor product we can know whether two measures or projection operators are dependent on or independent of one another.
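These factorisation identities hold for any full-column-rank H1, H2 and any separable input X = X1 ⊗ X2, and can be checked with Kronecker products. The sketch below (our own code) does so for the prototype fields reconstructed above, computing all sources directly from the formulas:

import numpy as np

H1 = np.array([[1.0, 0.5], [0.5, 1.0], [1.0, 0.5]])
H2 = np.array([[0.5, 1.0], [0.0, 0.5], [0.5, 0.5]])
X1 = np.array([1/3, 0.0, 1/3])
X2 = np.array([0.25, 0.0, 0.25])

def proj(H):
    return H @ np.linalg.inv(H.T @ H) @ H.T

def sources(H, X):
    return np.linalg.inv(H.T @ H) @ H.T @ X

H = np.kron(H1, H2)                      # composed context H = H1 ⊗ H2
X = np.kron(X1, X2)                      # separable input X = X1 ⊗ X2

Q1, Q2 = proj(H1), proj(H2)
S1, S2 = sources(H1, X1), sources(H2, X2)

assert np.allclose(proj(H), np.kron(Q1, Q2))                # Q = Q1 ⊗ Q2
assert np.allclose(sources(H, X), np.kron(S1, S2))          # S = S1 ⊗ S2
assert np.allclose(proj(H) @ X, np.kron(H1 @ S1, H2 @ S2))  # Y = Y1 ⊗ Y2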
So, for the tensor product of contexts, all the other elements of the entity are obtained by the tensor product, as summarized in Figure 8: the query field X = X1 ⊗ X2, the sources S = S1 ⊗ S2, the prototype fields H = H1 ⊗ H2, and the output Q X = Q ( X1 ⊗ X2 ) = Y1 ⊗ Y2 = Y.

Fig. 8. Tensor product for independent projection operators or measures
5 Conclusion

In this paper we present a new type of computation, denoted Morphic Computing. This new type of computation extends and improves the principles of optical computation by holography. In the holographic process we have one object and one image, and the image is considered as the projection of the object. We now give a formal description
of the projection operator with its hidden logic. Based on this new logic we can implement a new type of computation. The logic of the projection is similar to quantum logic, where we lose the distributive rule under interference or superposition among the states. In fact, we know that in quantum mechanics we lose the exclusion principle by which a particle can assume only one position or momentum at a time; in quantum mechanics, one particle in a superposition state can have different positions or momenta at the same time. This generates a new type of state, the superposed state, which cannot be found in classical physics. The new property of superposition changes the computation process dramatically and gives us a new type of computer, denoted the quantum computer. Morphic Computing extends the quantum computing process to any type of context and any type of query; quantum mechanics is the prototype system for Morphic Computing. In Morphic Computing we move beyond any identification with the physics of the particle domain, and we argue that Morphic Computing includes neural computing, soft computing and genetic computing.
References
1. Zadeh, L.A., Nikravesh, M.: Perception-Based Intelligent Decision Systems. Office of Naval Research, Summer 2002 Program Review, Covel Commons, University of California, Los Angeles, July 30–August 1 (2002)
2. Zadeh, L.A., Kacprzyk, J. (eds.): Computing With Words in Information/Intelligent Systems 1: Foundations. Physica-Verlag, Germany (1999)
3. Zadeh, L.A., Kacprzyk, J. (eds.): Computing With Words in Information/Intelligent Systems 2: Applications. Physica-Verlag, Germany (1999)
4. Resconi, G., Jain, L.C.: Intelligent Agents. Springer, Heidelberg (2004)
5. Nikravesh, M.: Intelligent Computing Techniques for Complex Systems. In: Soft Computing and Intelligent Data Analysis in Oil Exploration, pp. 651–672. Elsevier, Amsterdam (2003)
6. Gabor, D.: Holography 1948–1971. Proc. IEEE 60, 655–668 (1972)
7. Fatmi, H.A., Resconi, G.: A New Computing Principle. Il Nuovo Cimento 101B(2), 239–242 (February 1988)
8. Omnès, R.: The Interpretation of Quantum Mechanics. Princeton Series in Physics (1994)
9. Oshins, E., Ford, K.M., Rodriguez, R.V., Anger, F.D.: A comparative analysis: classical, fuzzy, and quantum logic. In: 2nd Florida Artificial Intelligence Research Symposium, St. Petersburg, Florida, April 5, 1989. In: Fishman, M.B. (ed.) Advances in Artificial Intelligence Research, vol. II. JAI Press, Greenwich, CT (1992). Most Innovative Paper Award, FLAIRS 1989
10. Jessel, M.: Acoustique Théorique. Masson et Cie Éditeurs (1973)
11. Wang, P.Z., Sugeno, M.: The factor fields and background structure for fuzzy subsets. Fuzzy Mathematics 2, 45–54 (1982)
12. Sheldrake, R.: A New Science of Life: The Hypothesis of Morphic Resonance (1981; second edition 1985)
13. Sheldrake, R.: The Presence of the Past (1988)
14. Resconi, G., Nikravesh, M.: Morphic Computing: Concepts and Foundation. In: Nikravesh, M., Zadeh, L.A., Kacprzyk, J. (eds.) Forging New Frontiers: Fuzzy Pioneers I. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2007)
15. Resconi, G., Nikravesh, M.: Morphic Computing: Quantum Field. In: Nikravesh, M., Zadeh, L.A., Kacprzyk, J. (eds.) Forging New Frontiers: Fuzzy Pioneers II. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2007)
16. Resconi, G., Nikravesh, M.: Morphic Computing. Applied Soft Computing Journal (July 2007)
17. Resconi, G., Nikravesh, M.: Morphic Computing Part I: Foundation. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529. Springer, Heidelberg (2007)
18. Resconi, G., Nikravesh, M.: Morphic Computing Part II: Web Search. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529. Springer, Heidelberg (2007)
Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces Michael G. Christel School of Computer Science, Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213, USA
[email protected]
Abstract. Years of international participation and collaboration in TRECVID have shown that interactive multimedia systems, those with a user in the search loop, have consistently outperformed fully automated systems. Interface capabilities like querying by text, by image, and by semantic concept and storyboard layouts have led to significant performance improvements on provided search tasks. Lessons learned for TRECVID shot-based video retrieval are presented. In the real world, however, video collection users may focus on story threads instead of shots, or may not be provided with a clear stated search task. The paper also discusses users facing situations where they lack the knowledge or contextual awareness to formulate queries and having a genuine need for exploratory search systems supporting serendipitous browsing. Interfaces promoting various views for navigating complex information spaces can help with exploratory search and investigation into video corpora ranging from documentaries to broadcast news to oral histories. Work is presented using The HistoryMakers oral history archive as well as TRECVID international broadcast news to discuss the utility of various query and presentation mechanisms emphasizing people, time, location, and visual attributes. The paper leads into a discussion of how exploratory interfaces for video extend beyond storyboards, with a series of user studies referenced as empirical data in support of the presented conclusions. Keywords: Video browsing, digital video retrieval, TRECVID, Informedia, user studies.
1 Introduction

The Informedia research project at Carnegie Mellon University (CMU) has worked since 1994 on various issues related to digital video understanding, tackling search, retrieval, visualization and summarization in both contemporaneous and archival video content collections through speech, image, and natural language understanding [1]. As the interface designer, developer, and evaluator on the Informedia team, the author’s role has been to iterate through a number of deployments that leverage the latest advances in machine learning techniques and other approaches to automated video metadata creation. Benchmarking user performance with digital video retrieval to chart progress became much easier with the creation of a TREC video retrieval track in 2001, the subject of Section 2. A number of Informedia user studies have taken place through the years, most often with CMU students and staff as the participants. These studies were surveyed in a 2006 paper reporting on how they can provide a user pull complementing the
technology push as automated video processing advances [2]. Section 3 overviews a few studies, reporting empirical conclusions on video summarization and browsing. Section 4 reports on recent studies on two types of video corpora: international broadcast news as used in TRECVID, and video oral histories from The HistoryMakers. Section 5 presents conclusions and opportunities for future work.
2 TRECVID Interactive Video Search Benchmarking

The Text REtrieval Conference (TREC) was started in 1992 to support the text retrieval industry by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. The same needs for the video retrieval community led to the establishment of the TREC Video Track in 2001. Now an independent evaluation, TRECVID began with the goal to promote progress in content-based retrieval from digital video via open, metrics-based evaluation. The corpora have ranged from documentaries to advertising films to broadcast news, with international participation growing from 12 to 54 companies and academic institutions from 2001 to 2007 [3]. A number of tasks are defined in TRECVID, including shot detection, semantic feature extraction, rush video summarization, and information retrieval.

The Cranfield paradigm of retrieval evaluation is based on a test collection consisting of three components: a set of documents, a set of information need statements called topics, and a set of relevance judgments. The relevance judgments are a list of the “correct answers” to the searches: the documents that should be retrieved for each topic. Success is measured based on quantities of relevant documents retrieved, in particular the metrics of recall and precision. The two are combined into a single measure of performance, average precision, which measures precision after each relevant document is retrieved for a given topic. Average precision is then itself averaged over all of the topics to produce a mean average precision (MAP) metric for evaluating a system’s performance.

For TRECVID video searches, the individual “documents” retrieved are shots, where a shot is defined as a single continuous camera operation without an editor’s cut, fade or dissolve – typically 2-10 seconds long for broadcast news. The TRECVID search task is defined as follows: given a multimedia statement of information need (a topic) and the common shot reference, return a ranked list of up to 1000 shots from the reference which best satisfy the need. For the interactive search task, the user can view the topic, interact with the system, see results, and refine queries and browsing strategies interactively while pursuing a solution. The interactive user has no prior knowledge of the search test collection or topics. The topics are defined by NIST to reflect many of the sorts of queries real users pose, based on query logs against video corpora like the BBC Archives and other empirical data [3, 4].

Three TRECVID test sets are used in studies cited in Section 3: TRECVID 2004 test set holds 128 broadcasts, 64 hours, of ABC News and CNN video from 1998, consisting of 33,367 reference shots. TRECVID 2005 is 140 international broadcasts (85 hours) of English language, Arabic, and Chinese news from 2004, consisting of 45,765 reference shots. TRECVID 2006 is similar but with more data: 165 hours of U.S., Arabic, and Chinese news with 79,484 reference shots.
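To make the average precision and MAP metrics concrete, here is a small illustrative sketch; the ranked shot lists and relevance judgments are invented for the example and are not TRECVID data:

def average_precision(ranked, relevant):
    # AP: mean of the precision values observed at each rank where a
    # relevant item (here: a shot) appears, divided by the total number
    # of relevant items for the topic.
    hits, precisions = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# One AP value per topic; MAP averages AP over all topics.
ap_topic1 = average_precision(["s3", "s7", "s1", "s9"], {"s3", "s9", "s5"})
ap_topic2 = average_precision(["s2", "s4"], {"s4"})
map_score = (ap_topic1 + ap_topic2) / 2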
An ACM SIGMM retreat on the future of multimedia information retrieval praised the contributions of TRECVID for international community benchmarking, noting that “repeatable experiments using published benchmarks are required for the field to progress” [5]. The NIST TRECVID organizers are clearly cognizant of issues of ecological validity: the extent to which the context of a user study matches the context of actual use of a system, such that it is reasonable to suppose that the results of the study are representative of actual usage and that the differences in context are unlikely to impact the conclusions. TRECVID user studies can make use of the TRECVID community effort to claim ecological validity in most regards: the data set is real and representative, the tasks (topics) are representative based on prior analysis of BBC and other empirical data, and the processing efforts are well communicated with a set of rules for all to follow. The remaining question of validity is whether the subject pool represents a broader set of users, with university students and staff for the most part comprising the subject pool for many research groups because of their availability. TRECVID has provided metrics showing the benefits of automated tool support in combination with human manipulation and interpretation for video information retrieval. Without automated tools to support browsing and summarization, the human user is swamped with too many possibilities as the quantity and diversity of video proliferate. Ignoring the human user, though, is a mistake, for through the history of TRECVID, fully automated systems involving no human user have consistently and significantly underperformed compared to interactive human-in-the-loop search systems [3]. Over the years, Informedia TRECVID experiments have confirmed the utility of storyboards showing matching thumbnails across multiple video documents [6], the differences in expert and novice search behavior when given TRECVID topics [7], the utility of transcript text for news video topics [8], and users’ tendency to overlook concept filters (e.g., include or exclude all shots having the “roads” concept or “outdoors” concept) for reducing the shot space [6, 8, 9]. These studies are surveyed in the next section.
3 Evaluation and User Studies with Respect to Video Summarization and Browsing

Video summaries have many purposes, summarized well by Taskiran et al. [10] and including the following: intriguing the user to watch the whole video (movie trailers), deciding if the program is worth watching (electronic program guide), locating specific regions of interest (lecture overview), collapsing highly redundant footage into the subset with important information (surveillance executive summary). For most applications video summaries mainly serve two functions [10]: an indicative function, where the summary is used to indicate what topics of information are contained in the original program; and the informative function, where the summaries are used to cover the information in the source program as much as possible, subject to the summary length. This paper focuses on indicative summaries, i.e., the assessment of video surrogates meant to help users better judge the relevance of the source program for their task at hand.
A 1997 Informedia study with 30 high school and college students and a documentary corpus found that a single thumbnail image chosen from query context represents a source document well [11]. It produces faster, more accurate, more satisfying retrieval performance compared to straight text representations or a context-independent thumbnail menu, in which each document is always represented by the same selection strategy of taking the first shot in the document. Figure 1 shows a number of views into TRECVID 2006 data following a query on any of “tornado earthquake flood hurricane volcano.” As an example of a query-based thumbnail, the tornado story as the eleventh result in Figure 1 (3rd row, 3rd thumbnail in segment grid at lower left) starts off with anchorperson and interview shots in a studio that are much less informative visually than the tornado/sky shot shown in Figure 1, with the tornado shot chosen automatically based on the user’s query.
Fig. 1. 417 segments returned from text query against TRECVID 2006 data set, shown in 4 views: Segment Grid; Storyboard of shots filtered to non-map, non-anchor outdoor shots; Common Phrases; and VIBE visualization-by-example plot showing volcano by itself and hurricane aligned with flood
The automatic breakdown of video into component shots has received a great deal of attention by the image processing community [2, 5, 8]. TRECVID has had a shot detection task charting the progress of automatic shot detection since 2001, and has shown it to be one of the most realizable tasks for video processing with accuracy in excess of 90% [3]. A thumbnail image representing each shot can be arranged into a single chronological display, a storyboard surrogate, which captures the visual flow of a video document along with the locations of matches to a query. From Figure 1’s interface, clicking the filmstrip icon in the segment grid displays a storyboard surrogate for just that one segment. The storyboard interface is equivalent to drilling into a document to expose more of its visual details before deciding whether it should be viewed. Storyboards are also navigation aids, allowing the user to click on an image to seek to and play the video document from that point forward. Informedia storyboards were evaluated primarily through discount usability techniques, two of which were heuristic evaluation and think-aloud protocol, working in the context of TRECVID tasks [6]. Storyboards were found to be an ideal roadmap into a video possessing a number of shots, and very well suited to the TRECVID interactive search task emphasizing the retrieval of shots relevant to a stated task, as evidenced in annual empirical studies run with CMU students and staff as participants [6, 7, 9, 12]. As for ecological validity, in practice users were struggling more with the task of finding the right shot from a collection of videos, rather than just finding the right shot within a single video, once the corpus grew from tens to hundreds to thousands of hours. The obvious interface extension for presentations like Figure 1 is to present all of the shots for a set of video segments in a multiple document storyboard, e.g., all of the shots for the 417 segments of Figure 1. If all shots are shown, though, for even this small set of 417 segments, the storyboard would contain 11503 shots, a much greater number of thumbnails than is likely to be scanned efficiently. Hence, a major difficulty with storyboards is that there are often too many shots to display in a single screen [11, 13]. Rather than show all the shots, only those shots containing matches could be included in a representation for a collection of video, so that rather than needing to show 11503 shots, 661 matching shots could be shown to represent the 417 segments returned in the query shown in Figure 1. Such a storyboard is shown in the upper right, with triangle notches at the top of thumbnails communicating some match context: what matched and where for a given query against this selected video. Storyboards have achieved great success for TRECVID interactive search tasks. Worring et al. from MediaMill report on three alternate forms of shot thumbnail displays for video: the CrossBrowser, SphereBrowser, and GalaxyBrowser [14], with the CrossBrowser evaluating well for TRECVID interactive search [15]. In the CrossBrowser, two strips of thumbnails are shown rather than a storyboard grid, with the vertical strip corresponding to a visual concept or search engine ranked ordering and the horizontal strip corresponding to temporal shot order. In the Informedia storyboard interface, the thumbnails are kept the same size and in a packed temporal grid, with the dense layout allowing over two thousand shots to be visually reviewed within the 15-minute TRECVID task time limit, with high task performance [12]. For TRECVID evaluations from 2002 through 2006, storyboard interfaces from Informedia and the MediaMill team have consistently and overwhelmingly produced the best interactive search performance [6, 7, 9, 12, 15]. When given a stated need, a short period of time to fulfill that need, many answer candidates, and an average precision metric to measure success, storyboards produce the best results. Storyboards are the most frequently employed interface into video libraries seen today, but that does not mean that they are sufficient. On the contrary, a 2007 workshop involving the BBC [16] witnessed discussion over the shortcomings of storyboards and the need for playable, temporal summaries and other forms of video surrogates for review and interactive interfaces for control.
A BBC participant stated that the industry is looking to the multimedia research community for the latest advances into video summarization and browsing. The next section overviews a few Informedia studies showing the need to move beyond storyboards and TRECVID tasks when addressing opportunities with various user communities.
4 Opportunities with Real World Users

For some video, like an hour video of a single person talking, the whole video is a single shot of that person’s head, and a storyboard of that one shot provides no navigational value. Such video is typical of oral history interviews. For this material, storyboards are not as useful as other views, such as those shown in Figure 2. These views are interactive, allowing users to browse and explore, e.g., to filter down into a subset of interest as shown in Figure 3. Along with the nature of the video corpus, the nature of the task can modify whether storyboards and other widgets are effective and sufficient interfaces. The HistoryMakers is a non-profit institution headquartered in Chicago whose purpose is to record, preserve and disseminate the content of video oral history interviews highlighting the accomplishments of individual African Americans and African-American-led groups and movements. Life oral histories were used from The HistoryMakers (18,254 stories, 913 hours of video) addressing the question as to whether video adds value beyond transcripts and audio (full details in [17]). The question was operationalized by two variants of the same user interface: one with a still image with the audio and one with the video track with the audio.
Fig. 2. Multiple access strategies into an oral history collection like the Common Phrases, Map, and Event Timeline let the user browse attributes of the set of 725 segments produced from a query on “chemistry biology science”
These two interfaces were used in two user studies conducted with primarily CMU and University of Pittsburgh students. In the first study, 24 participants conducted a treasure hunt task (find the good video for 12 topics), similar to TRECVID topics in that the information need is targeted and expressed to the participant under tight time constraints on performance. There were no statistical differences on a range of metrics (performance and satisfaction), and participants did not seem to even notice the differences between the two systems. This is somewhat surprising and disappointing in a way: the video offered no value with this particular task.
Fig. 3. Working from the same data of Figure 2, the views encourage exploration; 1 story discusses Antarctica, the shown video with Walter Massey
In a follow-up study that considered the effect of task, when the same user interfaces were tested with 14 participants on an exploratory search task (find several stories for a report), there was a significant subjective preference for the video interface. Unlike what occurred with the first study, the subjects were aware of the interface differences and strongly preferred the video treatment over still image with transcript for the exploratory task. These two studies together show the subtle dynamics of task and multimedia stimuli, as discussed in a full report on the work [17]. Reflecting on ecological validity again, what are the real world tasks that history students and other users of The HistoryMakers corpus actually conduct? If more of the stated fact-finding nature, video plays little or no role. If more of the exploratory generate-research-report nature, then video is much preferred. Ongoing work is taking place to investigate the utility of the views from Figures 2 and 3 with history students, showing a preference for text views like Common Text over more complex views like the Map View or VIBE View on initial interactions with the system.
Returning to Figure 1, TRECVID 2004-2006 tasks, and broadcast news corpora, what real-world users are there, and how do their tasks compare to the TRECVID topics for interactive search? Six intelligence analysts were recruited to participate in an experiment as representatives of a user pool for news corpora: people mining open broadcast sources for information as their profession. These analysts, compared to the university students participating in prior referenced studies, were older, more familiar with TV news, just as experienced with web search systems and frequent web searchers, but less experienced digital video searchers. Their expertise was in mining text sources and text-based information retrieval rather than video search. More details and a full discussion of the experiments appear in [18], with the results of a TRECVID study showing that the analysts do not even attempt performance on relatively easy sports topics, and in general stop filling in answers well before the 15-minute time limit was reached. For the analysts, sports topics were irrelevant and not meaningful or interesting, and the TREC metric of MAP at a depth of 1000 shots is also unrealistic: they were content with finding 30 answer shots. Importantly, these real-world users did successfully make use of all three query strategies: query-by-text, query-by-image, and query-by-concept (using semantic concepts like “road” or “people” for video retrieval; filtering storyboards by such concepts is shown in Figure 1). When given an expressed information need, the TRECVID topic, this community performed better with and favored a system with image and concept query capabilities over an exclusive text-search system [18]. Analyst activity is creative and exploratory as well, where the information need is discovered and evolves over time based on interplay with data sources. Likewise, video search activity can be creative and exploratory where the information need is discovered and evolves over time. Evaluating tools for exploratory, creative work is difficult, as acknowledged by Shneiderman and Plaisant [19], with this subject being the current work of the author in the context of both broadcast news sources and life oral histories. The conference talk will discuss the very latest work, building from some exploratory task work done with these same six analysts using views like Fig. 1.
5 Lessons Learned and Future Directions

TRECVID provides a public corpus with shared metadata to international researchers, allowing for metrics-based evaluations and repeatable experiments along with other advantages [8]. An evaluation risk with over-relying on TRECVID is tailoring interface work to deal solely with the genre of video in the TRECVID corpus. This risk is mitigated by varying the TRECVID corpus genre. Another risk is the topics and corpus drifting from being representative of real user communities and their tasks, which the TRECVID organizers hope is addressed by continually soliciting broad researcher and consumer involvement in topic and corpus definitions. An area that so far has remained outside of TRECVID evaluation has been the exploratory browsing interface capabilities supported by multiple views into video data as illustrated in Figures 1-3. The HistoryMakers study [17] hints that for TRECVID stated search topics and time limits, exploratory search is not needed and perhaps the video modality itself is not needed. What is the point of video for a user community, and what are their tasks with that video? The tasks drive the metadata and presentation requirements. The work with the intelligence analysts [18] shows that if the task does not fit the users’ expectations, there will be measurable changes in performance; for the analysts there was a significant drop-off in gathering more than 30 shots per topic and in working diligently on the sports topics. From the cited user studies and Informedia interface work in general through the years, the following lessons learned and strategic directions are offered:
• Leverage context (improve the video surrogate based on user activity)
• Provide easy user tuning of precision vs. recall (e.g., some users may want to see just correct shots, others want to get all of them; through widgets like dynamic query sliders as shown with Figure 1 for “outdoor” the user is left in control)
• Exploratory search, the new frontier for video repositories as cheap storage and transmission allows for huge corpora (discussed in [17]; evaluation in [19])
• Augment automatically produced metadata with human-provided descriptors (take advantage of what users are willing to volunteer, and in fact solicit additional feedback from humans through motivating games that allow for human computation, a research focus of Luis von Ahn at CMU)
Of course, these directions are not independent. Fielding an interactive system supporting each of these features can gather operational data and feedback for yet more improvements in video information seeking. For example, a web video news service offered by a broadcaster could:
• Track user interests to gauge that field footage of weather-related stories was a typical context,
• Note that users wanted many possible rather than a few definite shots to peruse,
• Streamline exploration along desired date, location, and reporter dimensions, and
• Solicit additional feedback, recommendations, and tags from a willing social network user community.
As video corpora grow on the web and their user bases grow as well, sophisticated personalization mechanisms can couple with automatically derived metadata for video to allow rich, engaging interfaces supporting effective exploration.
Acknowledgements

This work is supported by the National Science Foundation under Grant No. IIS-0205219 and Grant No. IIS-0705491. The HistoryMakers, CNN, and others’ video contributions are gratefully acknowledged, with thanks to NIST and the TRECVID organizers for enabling video evaluation work through the years.
References 1. Informedia Research at Carnegie Mellon University, http://www.informedia.cs.cmu.edu 2. Christel, M.: Evaluation and User Studies with Respect to Video Summarization and Browsing. In: Chang, E.Y., Hanjalic, A., Sebe, N. (eds.) Proceedings of SPIE, Multimedia Content Analysis, Management, and Retrieval 2006, vol. 6073 (2006), doi:10.1117/12.642841
3. NIST TREC Video Retrieval Evaluation Online Proceedings (2001-2007), http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html 4. Enser, P.G.B., Sandom, C.J.: Retrieval of Archival Moving Imagery - CBIR Outside the Frame? In: Lew, M.S., Sebe, N., Eakins, J.P. (eds.) CIVR 2002. LNCS, vol. 2383, pp. 206–214. Springer, Heidelberg (2002) 5. Rowe, L.A., Jain, R.: ACM SIGMM Retreat Report on Future Directions in Multimedia Research. ACM Trans. Multimedia Computing, Comm., & Applications 1, 3–13 (2005) 6. Christel, M.G., Moraveji, N.: Finding the Right Shots: Assessing Usability and Performance of a Digital Video Library Interface. In: Proc. ACM Multimedia, pp. 732– 739. ACM Press, New York (2004) 7. Christel, M.G., Conescu, R.: Mining Novice User Activity with TRECVID Interactive Retrieval Tasks. In: Sundaram, H., et al. (eds.) CIVR 2006. LNCS, vol. 4071, pp. 21–30. Springer, Berlin (2006) 8. Hauptmann, A.G., Christel, M.G.: Successful Approaches in the TREC Video Retrieval Evaluations. In: Proc. ACM Multimedia, pp. 668–675. ACM Press, New York (2004) 9. Christel, M.G., Conescu, R.: Addressing the Challenge of Visual Information Access from Digital Image and Video Libraries. In: Proc. Joint Conference on Digital Libraries, pp. 69– 78. ACM Press, New York (2005) 10. Taskiran, C.M., Pizlo, Z., Amir, A., Ponceleon, D., Delp, E.J.: Automated Video Program Summarization Using Speech Transcripts. IEEE Trans. on Multimedia 8, 775–791 (2006) 11. Christel, M.G., Winkler, D., Taylor, C.R.: Improving Access to a Digital Video Library. In: Howard, S., Hammond, J., Lindgaard, G. (eds.) Human-Computer Interaction: INTERACT 1997, pp. 524–531. Chapman and Hall, London (1997) 12. Christel, M., Yan, R.: Merging Storyboard Strategies and Automatic Retrieval for Improving Interactive Video Search. In: Proc. CIVR 2007, pp. 69–78. ACM Press, New York (2007) 13. Lienhart, R., Pfeiffer, S., Effelsberg, W.: Video Abstracting. Comm. ACM 40(12), 54–62 (1997) 14. Worring, M., Snoek, C., et al.: Mediamill: Advanced Browsing in News Video Archives. In: Sundaram, H., et al. (eds.) CIVR 2006. LNCS, vol. 4071, pp. 533–536. Springer, Heidelberg (2006) 15. Snoek, C., Worring, M., Koelma, D., Smeulders, A.: A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval. IEEE Trans. Multimedia 9, 280–292 (2007) 16. ACM: Proc. Int’l Workshop on TRECVID Video Summarization (in conjunction with ACM Multimedia). ACM Press, New York (2007) ISBN: 978-1-59593-780-3 17. Christel, M.G., Frisch, M.H.: Evaluating the Contributions of Video Representation for a Life Oral History Collection. In: Proc. Joint Conference on Digital Libraries. ACM Press, New York (2008) 18. Christel, M.G.: Establishing the Utility of Non-Text Search for News Video Retrieval with Real World Users. In: Proc. ACM Multimedia, pp. 706–717. ACM Press, New York (2007) 19. Shneiderman, B., Plaisant, C.: Strategies for Evaluating Information Visualization Tools: Multi-dimensional In-depth Long-term Case Studies. In: Proc. ACM BELIV 2006 Workshop, Advanced Visual Interfaces Conference, pp. 1–7. ACM Press, New York (2006)
Privacy-Enhanced Personalization Alfred Kobsa Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA 92697-3440, U.S.A.
[email protected] http://www.ics.uci.edu/~kobsa
Personalized interaction with computer systems can be at odds with privacy since it necessitates the collection of considerable amounts of personal data. Numerous consumer surveys revealed that computer users are very concerned about their privacy online. The collection of personal data is also subject to legal regulations in many countries and states. This talk presents work in the area of Privacy-Enhanced Personalization that aims at reconciling personalization with privacy through suitable human-computer interaction strategies and privacy-enhancing technologies.
References 1. Kobsa, A.: Privacy-Enhanced Personalization. Communications of the ACM 50(8), 24–33 (2007) 2. Kobsa, A.: Privacy-Enhanced Web Personalization. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 628–670. Springer, Heidelberg (2007)
Narrative Interactive Multimedia Learning Environments: Achievements and Challenges Paul Brna School of Informatics, the University of Edinburgh, Edinburgh EH8 9LW, Scotland
[email protected] http://webmail.mac.com/paulbrna/home/
Abstract. The promise of Narrative Interactive Learning Environments is that, somehow, the various notions of narrative can be harnessed to support learning in a manner that adds significantly to the effectiveness of learning environments. Here, we briefly review what has been achieved, and seek to identify the most productive paths along which researchers need to travel if we are to make the most of the insights that are currently available to us.
1 The Promise of Narrative As a (broad) working definition, a Narrative Interactive Learning Environment (NILE) is an environment which has been designed with an explicit awareness of the advantages for learning to be cast as a process of setting up a challenge, seeking to overcome some obstacles and achieving a (partial) resolution. The notion of a Narrative Interactive Learning Environment is attractive, in part, because of the potential for stories to engage the reader. There seems to be an implicit promise that NILEs will have, for example, the intensity of seeing an exciting film, or reading an absorbing book. Because of the association with the purpose of ILEs as promoting effective learning, there is also the suggestion that the learning experience will be enhanced by the use of narrative (in some sense). The notion of narrative is increasingly utilised in the rhetoric of current designers of interactive learning environments. Narrative is seen as one key ingredient in the search for providing environments that strongly motivate the learner. It is also seen as a key ingredient in making sense of personal experience — and hence of value in seeking to communicate meaning. Dickey, for example, comes to the problem of developing motivating learning environments from the perspective of edutainment; an approach based on games with a strong story line [1]. While the argument is often accepted with little criticism, there is an implicit assumption that engagement = motivation to learn, but it is equally possible that the motivation is just “to play”. While there is genuine value in some kinds of play, for the most part, the trick is to make learning environments that encourage enjoyable learning. G.A. Tsihrintzis et al. (Eds.): New Direct. in Intel. Interac. Multimedia, SCI 142, pp. 33–44, 2008. c Springer-Verlag Berlin Heidelberg 2008 springerlink.com
Turning to the issue of making sense, Davis and her colleagues are working with autistic children to help them participate with others to overcome some of the deficits traditionally associated with autism [2]. Davis is also hopeful that there is value for the autistic child in working through narrative to build some skills at making sense of experience, and communicating this to others. We provide a simple framework to present which aspects of learning have received the most attention from designers. We will then consider the main achievements with the help of some existing systems. Finally we present some possible future developments. The conclusion, in brief, is that NILEs can help designers refocus their attentions upon the whole person.
2 A Framework

The simple framework presented here does not do full justice to the underlying complexity of narrative in all its forms. However, it allows us to focus on the main roles of narrative in learning. We classify the uses of narrative into five primary contexts that relate to the main purpose of learning:

Preconditions for learning: arranging for the learner to be prepared for some learning experience in the fullest possible manner. This preparation can include cognitive, affective and conative factors, as well as situational and social ones.

Postconditions for learning: ensuring that appropriate activities take place that encourage the learner to consolidate their learning, maintain or increase their motivation, reflect on their experience and so on.

Learning content: the actual interaction with material to be understood, mastered and applied.

Design: prior to any learning there is the planning that is needed to set in motion the events that lead to the experience. Design involves, amongst other things, setting up the design team, deciding on the design methodology, working with stakeholders, choosing the kinds of interactions with the learner that are to be provided, and the narrative within which the learner is to be placed.

Evaluation: both during and after the learning experience there is a need to examine the learning experience with a view to making sure that the learner benefited. This process might well lead to suggestions for improving the experience.

The above five contexts are relevant to the design of NILEs, but learners can be (self) motivated towards achieving many kinds of goals such as ones connected with: becoming more knowledgeable and skilled; becoming more creative; improving one’s sense of personal identity (so as to become more confident etc.); improving one’s social identity (e.g. receiving the approval of others); and improving relationships. These aims are neither mutually exclusive nor independent. Nor are they necessarily the “designed-in” aims of any specific NILE. By listing them we seek to emphasise the importance of taking into account the widest possible range of human wants and needs.
2.1 Preconditions for Learning
Preparing learners for learning is a key idea. Ausubel, in his persuasive approach to this problem, introduced the idea of advance organisers [3]. Ausubel was not just concerned with preparing the learner to learn by providing the vocabulary to be used, or even making sure that certain preconditions were met:

Advance organisers bridge the gap between what the student already knows and the new material at a higher level of abstraction, generality and inclusiveness. [3]

Other approaches focus on motivational and attitudinal aspects of learning — most famously, Keller and his colleagues [4], who introduced the ARCS model involving attention (A), relevance (R), confidence (C) and satisfaction (S). The ARCS model is used by many to support the design of ILEs that are intended to engage learners. Narrative is also used to set up the situation in which learning takes place. From a situational perspective, setting up the social and physical context in which action will happen requires some engagement with the concerns that are central to narrative design. Perhaps the most natural usage is in the exposition of the ‘back story’. Generating such a narrative is by no means straightforward but, if done well, can provide the context and help to motivate learners.

2.2 The Postconditions of Learning
While Ausubel is well known for his notion of advance organisers, the idea of post organisers is less familiar, yet there is evidence as to their effectiveness [5]. Hall, Hall and Saling examined the effectiveness of knowledge maps — a form of concept map — when used as a post organiser. Their empirical work suggested some advantages to this approach. Much of the recent work that covers related ideas has tended to draw on Schön’s notion of reflection-on-action [6]. In the last few years there seems to have been a noticeable increase in interest in systems that encourage the learner to reflect on their learning or on the way that they learn.

2.3 Learning the Content
Designers frequently embed a set of tasks into a suitable story - real or imaginary. For example, Waraich’s Binary Arithmetic Tutor (BAT) system consists of a set of tasks connected with binary arithmetic embedded within a motivating story [7]. His MOSS (Map reading and Ordnance Survey Skills) system, developed as a pilot for his approach to narrative centred informant design, was aimed at teaching map reading skills to children [7]. The MOSS system weaves the narrative together with the tasks. The key challenges are how to make the narrative aspects of an ILE work to achieve the learning aims and goals, and how to ensure that the learning goals do not undercut the narrative aspects in such a way that the whole endeavour is compromised.

2.4 Design
Traditional methods can be used for the design of NILEs. Learner centred design has been increasingly favoured, and there have been some notable developments in the approaches taken. These include Scaife, Rogers, Aldrich and Davies in their development of informant design [8], as well as Chin, Rosson and Carroll’s scenario-based design [9], as illustrated by Cooper and Brna in their work on the development of T’rrific Tales [10]. A recent special issue of the International Journal of Artificial Intelligence in Education (Volume 16, number 4) provides a good overview of the ways in which learner centred design is conceived as a general means of developing ILEs. In particular, Luckin, Underwood, du Boulay, Holmberg, Kerawalla, O’Connor, Smith and Tunley provide a detailed account of their learner centred methodology for software development [11]. They term their approach an example of Human Centred Design. Their approach seeks to identify the stakeholders and then involves working with those stakeholders using, for example, storyboards and interviews. The process is cyclical. It is claimed that the experience of working through this cycle leads to an increasingly rich understanding of the needs of the learners.

2.5 Evaluation
If we want to know whether a particular NILE is delivering the goods then we need to evaluate the current state of the design in relation to the kinds of people who have an interest in the outcomes — learners, parents, schools, educational policy makers and designers themselves. Since NILEs work on so many levels — cognitive, affective and conative, as well as on self-identity and personal relationships — the methods needed for any evaluation are very varied. We are not ‘simply’ looking for learning gains on standardised tests; we are also looking for more elusive gains. Self identity, for example, is something that can be examined throughout one’s life, and there are no simple metrics that can identify changes in some absolutely ‘right’ direction. In the case of evaluations of NILEs we find both standard methods from experimental psychology, and ones that are qualitative. While there is no obvious requirement to evaluate the effectiveness of a NILE using ‘narrative’ methods, there is a place to use methods that can loosely be described under the heading of “narrative inquiry” [12], which seeks to take the stories of participants very seriously.
3 The Achievements

We select some clear exemplars that demonstrate distinctive qualities for the first four primary contexts — but the fifth primary context, that of evaluation, does not have any strong exemplar. However, some promising approaches are outlined.
3.1 Preparing the Learner
Robertson’s thesis work on Ghostwriter [13] is a good example of the use of narrative to prepare the learner to take part in a learning experience. Ghostwriter is based on the idea that children who find it difficult to write stories could be provided with a stimulating experience which would then provide them with the germs of ideas that they could turn into stories. The environment was designed to avoid the point-and-shoot style of many games in order to encourage imaginative characterisation. Ghostwriter involves two participants who need to help each other complete the task given to them [13]. The participants cannot avoid making value-based decisions about other characters in the game. After the children finish discussing their experiences, they are encouraged to write a new story. The empirical evidence obtained from this work is impressive: learners were motivated, developed good working relationships with each other, identified with the characters in the game [13] and, importantly, their new stories featured a greater use of the relationships between characters [14]. The “Ghostwriter” scenario is a clear use of a NILE to give learners an experience to prepare them for an educational task. The designer’s aim with Ghostwriter may have been that it should be a preparatory experience that supports the development of a learner’s story writing skills, but the experience-in-itself also seems to have been a success.

3.2 Reflecting on the Experience
Machado’s Support And Guidance Architecture (SAGA) was a significant attempt to develop a more learner-centred support architecture for collaborative learning [15]. The aim of the work was to produce a kind of plug-in component that could be included in a variety of systems designed for story creation. It has been tested with Teatrix, a 3D collaborative story creation environment designed for eight year olds. Perhaps the most significant educational innovation was the inclusion of a method for encouraging reflection through “hot seating”, derived from the approach developed by Dorothy Heathcote to the use of drama in education [16]. The reflection engine is the component that generates a ‘reflection moment’ consisting of a request for the learner to stop their work in the learning environment and review their character’s behaviour and the story’s development. Heathcote makes it clear that such a move in the classroom is more than just a means of generating reflection. She sees this as a failure saver as well as a slower down into experience [16].

3.3 Learning to Manage Difficult Situations
While there are many ILEs that are designed for learning science and mathematics, modern NILEs featuring role playing immersive environments are often targeted at procedural training, topics in the humanities, or social and psychological aspects. This latter class of systems is of great value, given a growing awareness in some countries of the urgent need to socialise young people into good relationships with each other, older people and various social institutions (not least, the schools themselves). FearNot! is one of the most significant NILEs of recent years [17]. The VICTEC (Virtual ICT with Empathic Characters) project which developed the FearNot! system took an approach intended to help children understand the nature of bullying and, perhaps, learn how to deal with bullies. The system was targeted at 8-12 year olds in the UK, Portugal and Germany. When using FearNot!, the child views a sequence of incidents that features direct bullying. The characters in the scenes are synthetic, and the overarching context is a school. The characters needed to be believable because it was intended that children should be engaged with the situations, and care about the outcomes for the characters involved. It was also intended that the children using the system should both empathise with the characters and help them through proffering advice between incidents. This advice affects the synthetic character’s emotional state, which in turn affects the actions the character selects in the next episode. Hence, in two ways, the child is intended to be kept at a “safe” emotional distance from the incidents — through an empathic relationship (rather than one in which the child identifies directly with the characters), and by trying out ways of dealing with bullying through offering advice.

3.4 Bringing Narrative into the Design Process
Waraich takes an informant design approach in his work on designing motivating narratives [7]. Informant design seeks to draw on the expertise of various stakeholders. However, when working with children in particular, it can be very difficult to extract the key information from the contributions being made. Informant design seeks to recognise the different kinds of contributions made by different contributors. Waraich explicitly introduces narrative concepts into the process of structuring the contributions from the informants. Such an approach focuses on helping informants work on the problem of generating software which is engaging in terms of theme, setting, characterisation and plot structure. Providing informants with sufficient background in the understanding of narrative is challenging. Not only do different learners have different needs in order to participate constructively, but the need is conditioned to some extent by the curricular system in which learners grew up.

3.5 Evaluating the Experience
As pointed out above, there is a need to be very flexible about the manner in which NILEs benefit learners. The approach needs to be suited to the kind of outcomes in which we are interested. For example, suppose we are interested in how engaged students are when using a NILE. Webb and Mallon used focus groups to examine engagement when students used narrative adventure and role-play games [18]. The method yielded some very useful guidelines which demonstrated some ways in which narrative devices could work within a game scenario. Another interesting approach was taken by Marsh who, in his thesis work, turned the notion in VR of “being there” on its head, and examined the notion of “staying there” [19]. He developed an evaluation method for three categories of experience — the vicarious, the visceral and the voyeuristic. The voyeuristic experience is associated with sensations, aesthetics, atmosphere and a sense of place, the visceral experience with thrills, attractions and sensations, and the vicarious with empathy and emotional information [20]. The engaging experience that a NILE is supposed to provide needs such evaluations: the division proposed by Marsh is one way of categorising the kinds of experience that need to be evaluated. However, the approach needs to be fleshed out further. Knickmeyer and Mateas have sought to evaluate the Façade system [21]. They were particularly interested in responses to interaction breakdown. Their method used a form of retroactive protocol analysis combined with a post experience interview; their analysis was based on a coding scheme developed from Malone’s identification of the importance of challenge, fantasy and curiosity [22].
4 Ways Forward

There are three areas which I believe will be important in the future and need attention. Progress in each of these areas has the potential to improve learning environments, and for each area there is some evidence that progress is possible:
– Empathic design
– Personal development/relationships
– Narrative pedagogy

We can also expect some significant developments in traditional learning environments, e.g. environments designed to train people (to drive, fight fires, play football etc.) or ones designed to deliver learning outcomes that are found in the school curriculum (solve equations, learn French etc.). Various researchers will no doubt produce learning environments that increasingly blend systems for learning and systems for delivering a strong narrative experience. While SAGA is one of the few systems for planning narratives designed for educational purposes, it does not blend the work on narrative with that of ILEs in an explicit manner. Riedl, Lane, Hill and Swartout at the Institute for Creative Technologies, University of Southern California [23] have been seeking to study how narrative and teaching objectives might be managed. The area of planning narrative experiences with the potential to be highly productive is that exemplified by Machado's reflection tool, which emphasises the learner's role in engaging with the narrative. While this approach takes one out of the narrative being constructed in order to reflect on it, it is this that makes
Machado's work attractive. There are two major pathways: "narrative as motivation" and "narrative as the way in which we approach difficult ideas and experiences". Machado uses both approaches, but it must be clear that, in any situation in which a NILE is used, the capability to move in and out of the engaging experience and to take stock of what has been learned and how the experience can be built upon is at the heart of the educational uses of NILEs.
4.1 Empathic Design
In terms of designing for NILEs, there is a further issue worth mentioning which is implicit in much that has been done: that of Empathic Design [24]. This is connected with the designer's duty of care to the learner. In the artificial intelligence in education community, John Self argued that caring for learners involves responding to their needs [25]. Caring for learners goes beyond the effective and efficient learning of the specific content being considered, and looks to the wider picture, both in terms of the content and in terms of personal development. Designers of educational software need to factor empathy into the design process adequately, to ensure that issues connected with management and the curriculum do not dominate. Empathy can be defined in a number of distinct ways, all of which have some bearing on the problem of utilising empathy in the design of NILEs. Preston and de Waal's process model makes "empathy a superordinate category that includes all sub-classes of phenomena that share the same mechanism. This includes emotional contagion, sympathy, cognitive empathy, helping behaviour, etc." [26]. Emotional contagion is the notion that, for example, seeing a person smile literally evokes a muscular response emulating a smile, or that seeing a child in a state of fear evokes fear in the child's mother. I feel that this notion has been exploited, knowingly or unknowingly, in many agent-based systems. The educational community is perhaps more interested in cognitive empathy, a conscious, rational assessment of another's situation, as found in Rogers' work [27]. Designing empathy into learning environments should almost certainly aim at working both at the conscious and the unconscious level. If we take into consideration that all learning environments 'stand in' in some way for the teacher, then how does a good teacher express empathy? The empathic teacher treats children as individuals, seeking to discover each pupil's existing skills and to help them develop. The empathic teacher knows the child as a person, and knows their confidence levels as well as their knowledge. The empathic teacher also nurtures each child's sense of self, supports their academic progress, and seeks to develop each child's awareness of themselves [28]. So I take the position that:
– A strong sense of empathy is a valuable, probably essential, characteristic for designers of learning systems.
– Good teachers demonstrate a set of empathic characteristics that provide a starting point for the development of better quality interactions between the learning environment and the learner.

Empathic design supplements methods derived from informant design [8]. Waraich's NCID adds a narrative dimension as an aspect of design that can be confronted explicitly within an informant design framework; but there is also scope for extending the (often unconscious) use of empathy within a design team to become a far more explicit component of the design process.
4.2 Personal Development/Relationships
While some environments such as FearNot! focus on personal experiences of bullying and the development of ways in which bullying might be managed, others have looked at the ways in which people can seek to grow or restore their sense of personal worth and their relationships with others [29]. The emphasis of such systems is on telling stories which evoke connections with the learner and give insights into their own personal circumstances. Each learning experience may generate a story from which something is taken, memorised and learned. We do not need a NILE for this to happen, but a NILE might well facilitate it by encouraging learners to 'tell' their own stories, whether within the NILE or elsewhere.
4.3 Narrative Pedagogy
What about the future of pedagogy in relation to the design of NILEs? In many of the systems discussed, the underlying pedagogy is obscure: sometimes because this is not seen as important by the authors/designers of the NILEs, sometimes because the pedagogy is taken for granted. On the other hand, some of the systems have a clear pedagogy in mind even if there are other ways of using the specific NILEs. Rather than dwell on 'standard' pedagogies, there is an approach with significant potential for system designers. Diekelmann has introduced and advocated the use of narrative pedagogy within nursing education [30]. This approach has also found application within teacher education. Narrative Pedagogy stands somewhat in tension with standard approaches to learning in that the emphasis is on the generation of person-centred descriptions of situations and their interpretation within a community. Narrative pedagogy downplays the importance of being absolutely right or absolutely wrong, and of assessing learners through objective tests. It is similar to Narrative Inquiry in that argumentation is aimed at mutual understanding rather than winning [31]. For some, this will appear anathema (i.e. more or less taboo). For others, taking a strong constructivist approach, it is not such an alien way of thinking about learning. For school learners, the methods of Narrative Pedagogy may not always be appropriate, but there is resonance with some movements in education connected
with inclusion, the promotion of self-esteem, the recognition of different kinds of individual achievement and so on. I would also argue that the underlying philosophy can be used as the theoretical grounding for future work on NILEs. If NILEs can embody complex situations that encourage learners to generate their own responses to the challenges found in a situation, whether in relation to understanding physics or responding to bullying, then we are part way to an approach that could support Narrative Pedagogy. What is evidently missing from most NILEs is the pooling of learners' interpretations and the opportunity for a learning community to work with these interpretations to form a new understanding.
5 Consequences

I would like to suggest, in line with the notion of empathic design [24], that it becomes increasingly important to understand the system designer in terms of their relationship with every learner. In some cases, it will prove worthwhile to realise this relationship as a two-way one [32]. System designers can make a valuable contribution to re-establishing the importance of personal relationships at a time when these are under stress, in a world in which an impersonal, functional view of people seems to be dominant. This is the hope for the future design of NILEs: that such systems will be of use in sustaining the personal development of learners in terms of building and supporting quality relationships with, amongst others, parents, teachers and fellow learners. We might also hope that the NILEs of the future will be used to help learners attain a wide range of competencies.
Acknowledgements

This paper is based on another article [33]. I thank Atif Waraich for his comments.
References

1. Dickey, M.D.: Game design narrative for learning: Appropriating adventure game design narrative devices and techniques for the design of interactive learning environments. Educational Technology Research and Development 54(3), 245–263 (2006)
2. Davis, M., Dautenhahn, K., Nehaniv, C.L., Powell, S.D.: The narrative construction of our (social) world: steps towards an interactive learning environment for children with autism. Universal Access in the Information Society 6(2), 145–157 (2007)
3. Ausubel, D., Novak, J., Hanesian, H.: Educational Psychology: A Cognitive View. Holt, Rinehart and Winston, New York (1978)
4. Keller, J.M.: Motivational design of instruction. In: Reigeluth, C.M. (ed.) Instructional-design theories and models: an overview of their current status. Lawrence Erlbaum Associates, Hillsdale (1983)
5. Hall, R.H., Hall, C.R., Saling, C.B.: The effects of graphical post organization strategies on learning from knowledge maps. Journal of Experimental Education 67(2), 101–112 (1999)
6. Schön, D.A.: Educating the Reflective Practitioner. Jossey-Bass, San Francisco (1987)
7. Waraich, A.: Designing Motivating Narratives for Interactive Learning Environments. PhD thesis, Computer Based Learning Unit, Leeds University (2003)
8. Scaife, M., Rogers, Y., Aldrich, F., Davies, M.: Designing for or designing with? Informant design for interactive learning environments. In: CHI 1997: Proceedings of Human Factors in Computing Systems, pp. 343–350. ACM, New York (1997)
9. Chin, G.J., Rosson, M., Carroll, J.: Participatory analysis: Shared development requirements from scenarios. In: Pemberton, S. (ed.) Proceedings of CHI 1997: Human Factors in Computing Systems, pp. 162–169 (1997)
10. Cooper, B., Brna, P.: Classroom conundrums: The use of a participant design methodology. Educational Technology & Society 3(4), 85–100 (2000)
11. Luckin, R., Underwood, J., du Boulay, B., Holmberg, J., Kerawalla, L., O'Connor, J., Smith, H., Tunley, H.: Designing educational systems fit for use: A case study in the application of human centred design for AIED. International Journal of Artificial Intelligence in Education 16(4), 353–380 (2006)
12. Clandinin, D.J., Connelly, F.M.: Narrative Inquiry: Experience and Story in Qualitative Research. Jossey-Bass, San Francisco (2000)
13. Robertson, J.: The effectiveness of a virtual role-play environment as a story preparation activity. PhD thesis, Edinburgh University (2001)
14. Robertson, J., Good, J.: Using a collaborative virtual role-play environment to foster characterisation in stories. Journal of Interactive Learning Research 14(1), 5–29 (2003)
15. Machado, I., Brna, P., Paiva, A.: Learning by playing: Supporting and guiding story-creation activities. In: Moore, J.D., Redfield, C.L., Johnson, W.L. (eds.) Proceedings of the 10th International Conference on Artificial Intelligence in Education AI-ED 2001, pp. 334–342. IOS Press, Amsterdam (2001)
16. Heathcote, D.: Drama and learning. In: Johnson, L., O'Neill, C. (eds.) Collected Writings on Education and Drama, pp. 90–102. Northwestern University Press, Evanston, Illinois (1991)
17. Hall, L., Woods, S., Aylett, R.: FearNot! Involving children in the design of a virtual learning environment. International Journal of Artificial Intelligence in Education 16(4), 327–351 (2006)
18. Mallon, B., Webb, B.: Stand up and take your place: Identifying narrative elements in narrative adventure and role-play games. Computers in Entertainment 3(1) (2005)
19. Marsh, T.: Staying there: an activity-based approach to narrative design and evaluation as an antidote to virtual corpsing. In: Riva, G., Davide, F., IJsselsteijn, W. (eds.) Being There: Concepts, effects and measurement of user presence in synthetic environments, pp. 85–96. IOS Press, Amsterdam (2003)
20. Marsh, T.: Presence as experience: Film informing ways of staying there. Presence 12(5), 538–549 (2003)
21. Knickmeyer, R.L., Mateas, M.: Preliminary evaluation of the interactive drama Façade. In: CHI 2005. ACM, New York (2005)
22. Malone, T.: Towards a theory of intrinsically motivating instruction. Cognitive Science 5(4), 333–369 (1981)
23. Riedl, M., Lane, H., Hill, R., Swartout, W.: Automated story direction and intelligent tutoring: Towards a unifying architecture. In: AI and Education 2005 Workshop on Narrative Learning Environments, Amsterdam, The Netherlands (July 2005)
24. Brna, P.: On the role of self esteem, empathy and narrative in the development of intelligent learning environments. In: Pivec, M. (ed.) Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches, pp. 237–245. IOS Press, Amsterdam (2006)
25. Self, J.: The defining characteristics of intelligent tutoring systems research: ITSs care, precisely. International Journal of Artificial Intelligence in Education 10(3-4), 350–364 (1999)
26. Preston, S.D., de Waal, F.B.M.: Empathy: Its ultimate and proximate bases. Behavioral and Brain Sciences 25, 1–72 (2001)
27. Rogers, C.: Empathic: An unappreciated way of being. The Counselling Psychologist 5(2), 2–10 (1975)
28. Cooper, B., Brna, P., Martins, A.: Effective affective in intelligent systems - building on evidence of empathy in teaching and learning. In: Paiva, A. (ed.) Affect in Interactions: Towards a New Generation of Computer Interfaces, pp. 21–34. Springer, Heidelberg (2000)
29. Sharry, J., Brosnan, E., Fitzpatrick, C., Forbes, J., Mills, C., Collins, G.: 'Working Things Out': a therapeutic interactive CD-ROM containing the stories of young people overcoming depression and other mental health problems. In: Brna, P. (ed.) Proceedings of Narrative and Interactive Learning Environments NILE 2004, pp. 67–74 (2004)
30. Diekelmann, N.: Narrative Pedagogy: Heideggerian hermeneutical analyses of lived experiences of students, teachers, and clinicians. Advances in Nursing Science 23(3), 53–71 (2001)
31. Conle, C.: The rationality of narrative inquiry in research and professional development. European Journal of Teacher Education 24(1), 21–33 (2001)
32. Sims, R.: Interactivity or narrative? A critical analysis of their impact on interactive learning. In: Proceedings of ASCILITE 1998, Wollongong, Australia, pp. 627–637 (1998)
33. Brna, P.: In search of narrative interactive learning environments. In: Virvou, M., Jain, L.C. (eds.) Intelligent Interactive Systems in Knowledge-based Environments, pp. 47–74. Springer, Berlin (2008)
A Support Vector Machine Approach for Video Shot Detection

Vasileios Chasanis, Aristidis Likas, and Nikolaos Galatsanos

Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece
{vchasani,arly,galatsanos}@cs.uoi.gr
Abstract. The first step towards indexing and content-based video retrieval is video shot detection. Existing methodologies for video shot detection are mostly threshold-dependent; because no prior knowledge of the video content is available, such methods are sensitive to it. To ameliorate this shortcoming we propose a learning-based methodology using a set of features that are specifically designed to capture the differences among hard cuts, gradual transitions and normal sequences of frames simultaneously. A Support Vector Machine (SVM) classifier is trained both to locate shot boundaries and to characterize transition types. Numerical experiments using a variety of videos demonstrate that our method is capable of accurately detecting and discriminating shot transitions in videos with different characteristics.

Keywords: Abrupt cut detection, Dissolve detection, Support Vector Machines.
1 Introduction

In recent years there has been a significant increase in the availability of high-quality digital video, as a result of the expansion of broadband services and the availability of large-volume digital storage devices. Consequently, there has been an increase in the need to access this huge amount of information and a great demand for techniques that will provide efficient indexing, browsing and retrieval of video data. The first step towards this direction is to segment the video into smaller "physical" units in order to proceed with indexing and browsing. The smallest physical segment of a video is the shot, defined as an unbroken sequence of frames recorded from one camera. Shot transitions can be classified into two categories. The first, which is the most common, is the abrupt cut. An abrupt or hard cut takes place between consecutive frames due to a camera switch; in other words, a different or the same camera is used to record a different aspect of the scene. The second category concerns gradual transitions such as dissolves, fade-outs followed by fade-ins, wipes, and a variety of video effects which stretch over several frames. A dissolve takes place when the initial frames of the second shot are superimposed on the last frames of the first shot. A formal study of the shot boundary detection problem is presented in [20]. In [11] the major issues to be considered for the effective solution of the
shot-boundary detection problem are identified. A comparison of existing methods is presented in ([4], [10], and [15]). There are several approaches to the shot-boundary detection task, most of which involve the determination of a predefined or adaptive threshold. A simple way to declare a hard cut is pair-wise pixel comparison [22]. This method is very sensitive to object and camera motion, thus many researchers propose the use of a motion-independent characteristic, namely the intensity or color histogram, global or local ([17], [22]). The use of second-order statistical characteristics of frames, in a likelihood ratio test, has also been suggested ([12], [22]). In [21] an algorithm is presented based on the analysis of entering and exiting edges between consecutive frames. This approach works well on abrupt changes, but fails in the detection of gradual changes. In [5] mutual information and joint entropy between frames are used for the detection of cuts, fade-ins and fade-outs. An original approach to partitioning a video into shots based on a foveated representation of the video is proposed in [3]. A quite interesting approach is presented in [20], where the detection of shot boundaries is posed as a graph partitioning problem and support vector machines with active learning are employed to declare boundaries and non-boundaries. A Support Vector Machine classifier with color and motion features is also employed in [7]. In [8] the authors propose as features for SVMs wavelet coefficient vectors within sliding windows. A variety of methods have been proposed for gradual transition detection, but they are still inadequate due to the complicated nature of such transitions. In [22], a twin-comparison technique is proposed for hard cut and gradual transition detection, applying different thresholds based on differences in color histograms between successive frames. In [18] a spatiotemporal approach was presented for the detection of a variety of transitions. There is also research specifically aimed at the dissolve detection problem. In [16], the problem of dissolve detection is treated as a pattern recognition problem. Another direction, followed in ([9], [11], and [14]), is to model the transition types by presupposing probability distributions for the feature difference metrics and performing a posteriori shot change estimation. It is worth mentioning that the organization of the TREC video shot detection task [19] provides a standard performance evaluation and comparison benchmark. In summary, the main drawback of most previous algorithms is that they are threshold-dependent. As a result, if there is no prior knowledge about the visual content of a video that we wish to segment into shots, it is rather difficult to select an appropriate threshold. In order to overcome this difficulty, we propose in this paper a supervised learning methodology for the shot detection problem. In other words, the proposed approach does not use thresholds and can detect shot boundaries in videos with totally different visual characteristics. Another advantage of the proposed approach is that we can detect hard cuts and gradual transitions at the same time, in contrast with existing approaches. For example, in [7] the authors propose a Support Vector Machine classifier only for abrupt cut detection. In [20], features for abrupt cuts and dissolves are constructed
separately and two different SVM models are trained. In our approach, we define a set of features designed to discriminate hard cuts from gradual transitions. These features are obtained from color histograms and describe both the variation between adjacent frames and the contextual information at the same time. Because gradual transitions spread over several frames, frame-to-frame differences are not sufficient to characterize them; thus, we also use the differences between non-adjacent frames in the definition of the proposed features. These features are used as inputs to a Support Vector Machine (SVM) classifier. A set of nine different videos with over 70K frames from TV series, documentaries and movies is used to train and test the SVM classifier. The resulting classifier achieves content-independent correct detection rates greater than 94%. The rest of this paper is organized as follows: in Sections 2 and 3 the features proposed in this paper are described. In Section 4 the SVM method employed for this application is briefly presented. In Section 5 we present numerical experiments and compare our method with three existing methods. Finally, in Section 6 we present our conclusions and suggestions for future research.
2 Feature Selection

2.1 Color Histogram and χ² Value
Color histograms are the most commonly used features to detect shot boundaries. They are robust to object and camera motion, and provide a good trade-off between accuracy of detection and implementation speed. We have chosen to use normalized RGB histograms. For each frame a normalized histogram is computed, with 256 bins for each of the R, G and B components, defined as H^R, H^G and H^B respectively. These three histograms are concatenated into a 768-dimensional vector representing the final histogram of each frame:

H = [H^R H^G H^B] .    (1)
To decide whether two shots are separated by an abrupt cut or a gradual transition we need a difference measure between frames. In our approach we use a variation of the χ² value to compare the histograms of two frames, in order to enhance the difference between the two histograms. The difference between two images I_i, I_j based on their color histograms H_i, H_j is given by:

d(I_i, I_j) = \frac{1}{3} \sum_{k=1}^{768} \frac{(H_i(k) - H_j(k))^2}{H_i(k) + H_j(k)} ,    (2)

where k denotes the bin index.
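As an illustration, the following Python sketch (our own minimal rendering of equations (1) and (2), not code from the paper) computes the normalized concatenated RGB histogram and the χ²-style dissimilarity; the small `eps` guard against empty bins is an implementation detail we add:

```python
import numpy as np

def rgb_histogram(frame):
    """Normalized 768-bin histogram H = [H^R H^G H^B] of eq. (1).
    `frame` is an HxWx3 uint8 array."""
    hists = [np.bincount(frame[..., c].ravel(), minlength=256) for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / frame[..., 0].size   # each channel histogram sums to 1

def chi2_distance(h_i, h_j, eps=1e-12):
    """Chi-square-style dissimilarity d(I_i, I_j) of eq. (2)."""
    return ((h_i - h_j) ** 2 / (h_i + h_j + eps)).sum() / 3.0
```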
2.2 Inter-frame Distance
The dissimilarity value given in equation (2) can be computed for any pair of frames within the video sequence. We compute the value not only between adjacent frames, but also between frames with time distance l, where l is called the inter-frame distance, as suggested in ([1], [11]). We compute the dissimilarity value d(I_i, I_{i+l}) for three values of the inter-frame distance l:

– l=1. This is used to identify hard cuts between two consecutive frames, so the dissimilarity values are computed for l=1, Fig. 1(a).
– l=2. During a gradual transition two consecutive frames may be the same or very similar to each other, so the dissimilarity value will tend to zero and the sequence of dissimilarity values can have the form shown in Fig. 1(b). The computation for l=2 usually results in a smoother curve, which is more useful for our further analysis.
– l=6. A gradual transition stretches along several frames, while the difference between consecutive frames is small, so we are interested not only in the difference between consecutive frames, but also between frames that are a specific distance apart from each other. As the inter-frame distance increases, the curve becomes smoother, as can be observed in the example of Fig. 1(c). Of course, the maximum useful inter-frame distance is rather small: it should be less than the minimum length of all transitions in the video set in order to capture the form of the transition. Most of the gradual transitions in our set of videos have lengths between 7 and 40 frames.
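A minimal sketch of how the three dissimilarity sequences might be assembled (it reuses the hypothetical `rgb_histogram` and `chi2_distance` helpers sketched above):

```python
def dissimilarity_sequence(hists, l):
    """D^l = [d(I_1, I_{1+l}), ..., d(I_{N-l}, I_N)] for inter-frame distance l."""
    return [chi2_distance(hists[i], hists[i + l]) for i in range(len(hists) - l)]

# hists = [rgb_histogram(f) for f in frames]   # one histogram per frame
# D = {l: dissimilarity_sequence(hists, l) for l in (1, 2, 6)}
```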
3 Feature Vector Selection for Shot-Boundary Classification

The dissimilarity values defined in Section 2 are not going to be compared with any threshold, but will be used to form feature vectors based on which an SVM classifier will be constructed.

3.1 Definition of Feature Vectors
The feature vectors selected are the normalized dissimilarity values calculated in a temporal window centered at the frame of interest. More specifically, the dissimilarity values computed in Section 2 form three vectors, one for each of the three inter-frame distances l:

D^{l=1} = [d(I_1, I_2), ..., d(I_i, I_{i+1}), ..., d(I_{N-1}, I_N)]
D^{l=2} = [d(I_1, I_3), ..., d(I_i, I_{i+2}), ..., d(I_{N-2}, I_N)]    (3)
D^{l=6} = [d(I_1, I_7), ..., d(I_i, I_{i+6}), ..., d(I_{N-6}, I_N)]
Fig. 1. Dissimilarity patterns for (a) l=1, (b) l=2 and (c) l=6
Moreover, for each frame we define a window of length w that is centered at this frame and contains the dissimilarity values. As a result, for the i-th frame the following three vectors are composed:

W^{l=1}(i, 1:w) = [D^{l=1}(i − w/2), ..., D^{l=1}(i), ..., D^{l=1}(i + w/2 − 1)]
W^{l=2}(i, 1:w) = [D^{l=2}(i − w/2), ..., D^{l=2}(i), ..., D^{l=2}(i + w/2 − 1)]    (4)
W^{l=6}(i, 1:w) = [D^{l=6}(i − w/2), ..., D^{l=6}(i), ..., D^{l=6}(i + w/2 − 1)]
To obtain the final features we normalize the dissimilarity values in equation (4), dividing each dissimilarity value by the sum of the values in the window. This provides the normalized, "magnitude"-independent features:

W̃^{l=k}(i, j) = \frac{W^{l=k}(i, j)}{\sum_{j'=1}^{w} W^{l=k}(i, j')} ,   k = 1, 2, 6 .    (5)
The size of the window used is w=40. In our experiments we also considered windows of length 50 and 60 in order to capture longer transitions. The 120-long vector resulting from the concatenation of the normalized dissimilarities for the three windows, given by

F(i) = [W̃^{l=1}(i) W̃^{l=2}(i) W̃^{l=6}(i)] ,    (6)

is the feature vector corresponding to frame i. In what follows we show examples of the feature vectors for a hard cut and a dissolve in Fig. 2.
Fig. 2. Feature vectors for transitions: (a) hard cut, (b) dissolve
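A sketch of how the 120-dimensional feature vector of equations (4)-(6) could be assembled (names are ours; frames closer than w/2 to either end of the video would need padding or skipping, which we gloss over):

```python
import numpy as np

def feature_vector(D, i, w=40):
    """F(i): concatenation of the window-normalized dissimilarities, eqs. (4)-(6).
    `D` maps each inter-frame distance l to its dissimilarity sequence D^l."""
    parts = []
    for l in (1, 2, 6):
        window = np.asarray(D[l][i - w // 2 : i + w // 2])  # eq. (4)
        parts.append(window / window.sum())                 # eq. (5)
    return np.concatenate(parts)                            # eq. (6): 3w = 120 values
```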
4 Support Vector Machine Classifier

After the feature definition, an appropriate classifier has to be used in order to categorize each frame into three categories: normal sequences, abrupt cuts and gradual transitions. For this purpose we selected the Support Vector Machine (SVM) classifier [6], which provides state-of-the-art performance and scales well with the dimension of the feature vector, which is relatively large (equal to 120) in our problem. The classical SVM classifier finds an optimal hyperplane which separates data points of two classes. More specifically, suppose we are given a training set of l vectors x_i ∈ R^n, i=1,...,l, and a vector y ∈ R^l with y_i ∈ {1,−1} denoting the class of vector x_i. We also assume a mapping function φ(x) that maps each training vector to a higher-dimensional space, and the corresponding kernel function (eq. (9)). Then the SVM classifier [6] is obtained by solving the following primal problem:

min_{w,b,ξ} (1/2) w^T w + C \sum_{i=1}^{l} ξ_i
subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i ,    (7)
ξ_i ≥ 0, i = 1, ..., l .
The decision function is:

sgn( \sum_{i=1}^{l} w_i K(x_i, x) + b ) ,  where K(x_i, x_j) = φ^T(x_i) φ(x_j) .    (8)
A notable characteristic of SVMs is that, after training, most of the training patterns x_i usually have w_i = 0 in eq. (8); in other words, they do not contribute to the decision function. Those x_i for which w_i ≠ 0 are retained in the SVM model and called Support Vectors (SVs). In our approach the commonly used radial basis function (RBF) kernel is employed:

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²) ,    (9)
where γ denotes the width of the kernel. It must be noted that in order to obtain an efficient SVM classifier the parameters C (eq. (7)) and γ (eq. (9)) must be carefully selected, usually through cross-validation.
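In a modern toolkit this training stage might look as follows; the sketch below uses scikit-learn, and the particular grid of (C, γ) values is our assumption, since the paper only states that the parameters are selected through cross-validation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_shot_classifier(X, y):
    """X: one 120-dim feature vector per frame; y: 0 = normal sequence,
    1 = abrupt cut, 2 = gradual transition (an assumed label encoding)."""
    grid = {"C": [1, 10, 100, 1000], "gamma": [1e-3, 1e-2, 1e-1, 1.0]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)  # 3-fold CV over (C, gamma)
    search.fit(X, y)
    return search.best_estimator_
```

Note that `SVC` handles the three-class problem with a one-vs-one scheme internally, so no extra machinery is needed for the multi-class case.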
5 Experiments

5.1 Data and Performance Criteria
The video sequences used for our data set were taken from TV series, documentaries and educational films. Nine videos (70000 frames) were used, containing 355 hard cuts and 142 dissolves, all manually annotated. To evaluate the performance of our method we used the following criteria [2]:

Recall = \frac{N_c}{N_c + N_m} ,  Precision = \frac{N_c}{N_c + N_f} ,  F1 = \frac{2 × Rec × Prec}{Rec + Prec} ,    (10)

where N_c stands for the number of correctly detected shot boundaries, N_m for the number of missed ones and N_f for the number of false detections. During our experiments we calculate the F1 value for the cuts (F1^C) and the dissolves (F1^D) separately. The final performance measure is then given by:

F1 = \frac{α}{α + b} F1^C + \frac{b}{α + b} F1^D ,    (11)

where α is the number of true hard cuts and b the number of true dissolves.
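For concreteness, the criteria of equations (10) and (11) translate directly into code (a trivial sketch, with variable names of our choosing):

```python
def f1(recall, precision):
    """F1 of eq. (10)."""
    return 2 * recall * precision / (recall + precision)

def combined_f1(f1_cuts, f1_dissolves, n_cuts, n_dissolves):
    """Weighted F1 of eq. (11); n_cuts is alpha (true cuts), n_dissolves is b."""
    total = n_cuts + n_dissolves
    return (n_cuts / total) * f1_cuts + (n_dissolves / total) * f1_dissolves
```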
5.2 Results and Comparison
In our experiments, 8 videos are used for training and the 9th for testing; therefore, 9 "rounds" of testing were conducted. In order to obtain good values of the parameters γ and C (in terms of providing high F1 values), in each "round" we applied 3-fold cross-validation using the 8 videos of the corresponding training set. A difficulty of the problem under consideration is that the training set is imbalanced: it contains few positive examples and a huge number of negative ones. In our approach we sample negative examples uniformly, reducing their number to 3% of the total number of examples. More specifically, our training set contains 440 positive examples (transitions) and 2200 negative examples (no transitions) on average. Finally, each model of the training procedure generated on average 1276 support vectors for normal sequences, 101 support vectors for gradual transitions and 152 support vectors for abrupt transitions. We also tested our method using larger windows of width w = 50 and w = 60. In Tables 1-3 we provide the classification results for the different window lengths. We notice that the performance improves as the size of the window increases. False boundaries are reduced, since larger windows contain more information; larger windows also help the detection of dissolves that last longer. In order to reduce the size of the feature vector, we have also considered training the SVM classifier on feature vectors obtained by concatenating only the features extracted for l=2 and l=6. It can be observed (Table 4) that even with the shorter feature vector the proposed algorithm gives very good results that are only slightly inferior to the ones obtained with the longer feature vector.
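The evaluation protocol described above might be sketched as follows (a hypothetical rendering of the nine leave-one-video-out rounds with uniform subsampling of the negative class; `train_fn` and `eval_fn` stand in for the training and scoring steps):

```python
import numpy as np

def leave_one_video_out(videos, train_fn, eval_fn, neg_keep=0.03, seed=0):
    """`videos` is a list of (X, y) pairs, one per video; y == 0 marks
    negative (no-transition) frames, kept with probability neg_keep."""
    rng = np.random.default_rng(seed)
    scores = []
    for held_out in range(len(videos)):
        X_parts, y_parts = [], []
        for v, (X, y) in enumerate(videos):
            if v == held_out:
                continue
            keep = (y != 0) | (rng.random(len(y)) < neg_keep)  # keep all transitions
            X_parts.append(X[keep])
            y_parts.append(y[keep])
        model = train_fn(np.concatenate(X_parts), np.concatenate(y_parts))
        scores.append(eval_fn(model, *videos[held_out]))
    return scores
```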
Table 1. Performance results for w = 40, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              351   4    9    98.87%    97.50%      98.18%
DISSOLVES         127   15   33   89.44%    79.38%      84.11%
AVERAGE           -     -    -    96.18%    92.32%      94.21%
Table 2. Performance results for w = 50, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              352   3    8    99.15%    97.78%      98.46%
DISSOLVES         130   12   25   91.55%    83.87%      87.54%
AVERAGE           -     -    -    96.98%    93.80%      95.37%
Table 3. Performance results for w = 60, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              353   2    4    99.44%    98.88%      99.16%
DISSOLVES         127   15   25   89.44%    83.55%      86.39%
AVERAGE           -     -    -    96.58%    94.50%      95.53%
Table 4. Performance results for w = 50, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              350   5    5    98.59%    97.49%      98.04%
DISSOLVES         129   13   21   90.85%    86.00%      88.36%
AVERAGE           -     -    -    96.38%    94.21%      95.28%
To demonstrate the effectiveness of our algorithm and its advantage over threshold-dependent methods, we implemented three methods that use thresholds in different ways. More specifically, we implemented pair-wise comparison of successive frames [22], the likelihood ratio test ([12], [22]) and the twin-comparison method [22]. The first two methods can only detect cuts, while the third can identify both abrupt and gradual transitions. The obtained results indicate that our algorithm outperforms the three threshold-dependent methods. In Table 5 we provide the recall, precision and F1 values for our algorithm and the three methods under consideration. For our algorithm we present the results using w = 50, for the best values of (C, γ), using all features (l=1, l=2 and l=6) and fewer features (l=2 and l=6). The thresholds used in the three reference methods were calculated in different ways: we used adaptive thresholds in the pair-wise comparison algorithm, cross-validation in the likelihood ratio method and, finally, a global adaptive threshold in the twin-comparison method. Especially for dissolve detection, our algorithm provides far better results than the twin-comparison algorithm.
Table 5. Comparative results using Recall, Precision and F1 measures

                             CUTS                            DISSOLVES
METHOD                       Recall   Precision   F1         Recall   Precision   F1
w = 50, l=1, l=2 and l=6     99.15%   97.78%      98.46%     91.55%   83.87%      87.54%
w = 50, l=2 and l=6          98.59%   97.49%      98.04%     90.85%   86.00%      88.36%
PAIR-WISE COMPARISON [22]    85.07%   84.83%      84.95%     -        -           -
LIKELIHOOD RATIO [22]        94.37%   86.12%      90.05%     -        -           -
TWIN-COMPARISON [22]         89.30%   88.05%      88.92%     70.42%   64.94%      67.57%
6 Conclusions and Future Work

In this paper we have proposed a method for shot-boundary detection and for discriminating between hard cuts and dissolves. Features that describe the variation between adjacent frames and the contextual information were derived from color histograms using a temporal window. These feature vectors become inputs to an SVM classifier which categorizes transitions of the video sequence into normal sequences, hard cuts and gradual transitions. This categorization provides an effective segmentation of any video into shots and is thus a valuable aid to further analysis of the video for indexing and browsing. The main advantage of this method is that it is not threshold-dependent. As future work, we will try to improve the performance of the method by extracting other types of features from the video sequence.
Acknowledgments

This research project (PENED) is co-financed by the E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%).
References

1. Bescós, J., Cisneros, G., Martínez, J.M., Menéndez, J.M., Cabrera, J.: A Unified Model for Techniques on Video-Shot Transition Detection. IEEE Trans. Multimedia 7(2), 293–307 (2005)
2. Bimbo, A.D.: Visual Information Retrieval. Morgan Kaufmann Publishers, Inc., San Francisco (1999)
3. Boccignone, G., Chianese, A., Moscato, V., Picariello, A.: Foveated Shot Detection for Video Segmentation. IEEE Trans. Circuits and Systems for Video Technology 15(3), 365–377 (2005)
4. Boreczky, J.S., Rowe, L.A.: Comparison of Video Shot Boundary Detection Techniques. In: Proc. SPIE Storage and Retrieval for Image and Video Databases, vol. 2664, pp. 170–179 (1996)
5. Cernekova, Z., Pitas, I., Nikou, C.: Information Theory-Based Shot Cut/Fade Detection and Video Summarization. IEEE Trans. Circuits and Systems for Video Technology 16(1), 82–91 (2006)
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
7. Dalatsi, C., Krinidis, S., Tsekeridou, S., Pitas, I.: Use of Support Vector Machines based on Color and Motion Features for Shot Boundary Detection. In: International Symposium on Telecommunications (2001)
8. Feng, H., Fang, W., Liu, S., Fang, Y.: A new general framework for shot boundary detection and key-frame extraction. In: Proc. 7th ACM SIGMM Int. Workshop on Multimedia Inf. Retrieval, pp. 121–126 (2005)
9. Fernando, W.A.C., Canagarajah, C.N., Bull, D.R.: Fade and dissolve detection in uncompressed and compressed video sequences. In: Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 299–303 (1999)
10. Gargi, U., Kasturi, R., Strayer, S.H.: Performance characterization of video-shot-detection methods. IEEE Trans. Circuits and Systems for Video Technology 10(1), 1–13 (2000)
11. Hanjalic, A.: Shot-boundary detection: Unraveled and resolved? IEEE Trans. Circuits and Systems for Video Technology 12(2), 90–105 (2002)
12. Kasturi, R., Jain, R.: Dynamic Vision. In: Kasturi, R., Jain, R. (eds.) Computer Vision: Principles, pp. 469–480. IEEE Computer Society Press, Washington (1991)
13. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
14. Lelescu, D., Schonfeld, D.: Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream. IEEE Trans. Multimedia 5(1), 106–117 (2003)
15. Lienhart, R.: Comparison of automatic shot boundary detection algorithms. In: Proc. SPIE Storage and Retrieval for Image and Video Databases VII, San Jose, CA, vol. 3656, pp. 290–301 (1999)
16. Lienhart, R.: Reliable dissolve detection. In: Proc. SPIE Storage and Retrieval for Media Databases 2001, vol. 4315, pp. 219–230 (2001)
17. Nagasaka, A., Tanaka, Y.: Automatic video indexing and full-video search for object appearances. In: Knuth, E., Wegner, L.M. (eds.) Visual Database Systems II, pp. 113–127. Elsevier, Amsterdam (1995)
18. Ngo, C.W., Pong, T.C., Chin, R.T.: Video partitioning by temporal slice coherence. IEEE Trans. Circuits and Systems for Video Technology 11(8), 941–953 (2001)
19. NIST: Homepage of TRECVid Evaluation. [Online] http://www-nlpir.nist.gov/projects/trecvid/
20. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A Formal Study of Shot Boundary Detection. IEEE Trans. Circuits and Systems for Video Technology 17(2), 168–186 (2007)
21. Zabih, R., Miller, J., Mai, K.: Feature-Based Algorithms for Detecting and Classifying Production Effects. Multimedia Systems 7(2), 119–128 (1999)
22. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. Multimedia Systems 1(1), 10–28 (1993)
Comparative Performance Evaluation of Artificial Neural Network-Based vs. Human Facial Expression Classifiers for Facial Expression Recognition

I.-O. Stathopoulou and G.A. Tsihrintzis

Department of Informatics, University of Piraeus, Piraeus 185 34, Greece
{iostath,geoatsi}@unipi.gr
Abstract. Towards building new, friendlier human-computer interaction and multimedia interactive services systems, we developed a neural network-based image processing system (called NEU-FACES), which first determines automatically whether or not there are any faces in given images and, if so, returns the location and extent of each face. Next, NEU-FACES uses neural network-based classifiers, which allow the classification of several facial expressions from features that we develop and describe. In the process of building NEU-FACES, we conducted an empirical study in which we specified related design requirements and statistically studied the expression recognition performance of humans. In this paper, we evaluate the performance of our NEU-FACES system against the expression recognition performance of humans.
1 Introduction

Facial expressions are particularly significant in communicating information in human-to-human interaction and interpersonal relations, as they reveal information about the affective state, cognitive activity, personality, intention and psychological state of a person, and this information may, in fact, be difficult to mask. In the design of advanced human-computer interfaces, the variations of the emotions of human users during the interaction should be taken into consideration and the computer made able to react accordingly. Images that contain user faces are thus instrumental in the development of more effective and friendlier methods in human-computer interaction, since facial expressions corresponding to the "neutral", "smile", "sad", "surprise", "angry", "disgust" and "bored-sleepy" psychological states arise very commonly during a typical human-computer interaction session. The task of processing facial images generally consists of two steps: a face detection step, which determines whether or not there are any faces in an image and, if so, returns the location and extent of each face, and a facial expression classification step, which attempts to recognize the expression formed on a detected face. These problems are quite challenging because faces are non-rigid and have a high degree of
variability in size, shape, color and texture. Furthermore, variations in pose, facial expression, image orientation and imaging conditions add to the level of difficulty of the problem. The task of determining the true psychological state of a person from an image of his/her face is complicated further by the problem of pretence, i.e. the case of the person's facial expression not corresponding to his/her true psychological state. The difficulties in facial expression classification by humans and some indicative classification error percentages are illustrated in [1]. Previous attempts to address similar problems of face detection and facial expression classification in images have followed two main directions in the literature: (1) methods based on face features [2-5] and (2) image-based representations of the face [6-8]. Our system follows the first direction (feature-based approach) and has been developed over a time period of approximately two years; it is, therefore, an evolution of its previous versions [9-13]. Specifically, the face detection algorithm currently used was developed and presented in [9], while the facial expression classification algorithms are evolved and extended versions of those gradually developed and presented in [10-14]. In this paper we present a performance evaluation of NEU-FACES [14], a fully automated neural network-based face detection and facial expression classification system, with emphasis on the Facial Expression Classification Subsystem. Our studies have identified seven emotions which arise very commonly during a typical human-computer interaction session; vision-based human-computer interaction systems that recognize them could guide the computer to "react" accordingly and attempt to better satisfy its user's needs. Specifically, these emotions are: "neutral", "happy", "sadness", "surprise", "anger", "disgust" and "bored-sleepy". NEU-FACES is able to recognize these emotions satisfactorily, and in this paper we present its performance evaluation, compared to the answers given in empirical studies where humans were asked to classify the corresponding emotions from a subject's image. More specifically, the paper is organized as follows: in Section 2, we present our empirical studies on human subjects, where we constructed two types of questionnaires and describe the structure of each one of them. In Section 3, we present our NEU-FACES system, concentrating on the facial expression classification module. In Section 4, we evaluate the performance of our system versus the humans' performance. We draw conclusions and point to future work in Sections 5 and 6, respectively.
2 Empirical Studies on Human Subjects

2.1 The Questionnaire Structure

In order to understand how a human classifies someone else's facial expression and to set a target error rate for automated systems, we developed a questionnaire in which we asked 300 participants to state their thoughts on a number of facial expression-related questions and images. Specifically, the questionnaire consisted of three different parts:
1. In the first part, the observer was asked to identify an emotion from the facial expressions that appeared in 14 images. Each participant could choose from the 7 most common emotions that we pointed out earlier, namely "anger", "happiness", "neutral", "surprise", "sadness", "disgust" and "boredom-sleepiness", or specify any other emotion that he/she thought appropriate. Next, the participant had to state the degree of certainty (from 0-100%) of his/her answer. Finally, he/she had to state which features (such as the eyes, the nose, the mouth, the cheeks etc.) had helped him/her make that decision. A typical question of the first part of the questionnaire is depicted in Figure 1.
2. When filling in the second part of the questionnaire, each participant had to identify an emotion from parts of a face. Specifically, we showed them the "neutral" facial image of a subject and the corresponding image of some other expression. In this latter image pieces were cut out, leaving only certain parts of the face, namely the "eyes", the "mouth", the "forehead", the "cheeks", the "chin" and the "brows". This is typically shown in Figure 2. Again, each participant could choose from the 7 most common emotions or specify any other emotion that he/she thought appropriate, had to state the degree of certainty (from 0-100%) of his/her answer, and, finally, had to specify which features had helped him/her make that decision.
3. In the final (third) part of our study, we asked the participants to supply information about their background (e.g. age, interests, etc.). Additionally, each participant was asked to provide information about:
• The level of difficulty of the questionnaire with regards to the task of emotion recognition from face images
• Which emotion he/she thought was the most difficult to classify
• Which emotion he/she thought was the easiest to classify
• The percentage to which a facial expression maps into an emotion (0-100%)
Fig. 1. The first part of the questionnaire
Fig. 2. The second part of the questionnaire
2.2 The Participant and Subject Backgrounds

There were 300 participants in our study. All the participants were Greek, and thus familiar with Greek culture and the Greek ways of expressing emotions. They were mostly undergraduate or graduate students and faculty in our university, and their ages varied between 19 and 45 years.
3 Facial Expression Classification Subsystem

In order for our system to be fully automated, we first locate and extract the face using the face detection subsystem. The face data is then fed to the facial expression classification subsystem, which preprocesses it in order to extract facial features of high discriminating power, namely: (1) left eye dimension ratio, (2) right eye dimension ratio, (3) mouth dimension ratio, (4) face dimension ratio, (5) forehead texture, (6) texture between the brows, (7) left eye brow direction, (8) right eye brow direction, and (9) mouth direction. These features constitute the input to a two-layer neural network. The network produces a 7-dimensional output vector which can be regarded as the degree of membership of the face image in each of the 'neutral', 'happiness', 'surprise', 'anger', 'disgust-disapproval', 'sadness' and 'bored-sleepy' classes. An illustration of the network architecture can be seen in Figure 3.
Fig. 3. The Facial Expression Neural Network Classifier
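As a rough functional stand-in for the classifier of Figure 3 (not the authors' implementation; the hidden-layer size and training settings below are assumptions, and only the 9 inputs and 7 outputs are taken from the text):

```python
from sklearn.neural_network import MLPClassifier

EXPRESSIONS = ["neutral", "happiness", "surprise", "anger",
               "disgust-disapproval", "sadness", "bored-sleepy"]

# 9 inputs (the dimension ratios, textures and directions listed above)
# feeding a two-layer network with a 7-dimensional membership output.
clf = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic", max_iter=2000)
# clf.fit(features, labels)           # features: array of shape (n_samples, 9)
# memberships = clf.predict_proba(x)  # 7 class-membership degrees per face
```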
3.1 Discriminating Features for Facial Expressions

For the classification task, we gathered and studied a dataset of 1400 images of facial expressions, which corresponded to 200 different persons forming the "neutral" expression and the six emotions "happiness", "sadness", "surprise", "anger", "disgust" and "bored-sleepy". We use the "neutral" expression as a template, which can somehow be deformed into the other expressions. From our study of these images, we identified significant variations between the "neutral" and other expressions, which can be quantified into a classifying feature vector. Typical such variations are shown in Table 1.

Table 1. Formation of facial expressions via deformation of the neutral expression
Variations between facial expressions:

Happiness
• Bigger-broader mouth
• Slightly narrower eyes
• Changes in the texture of the cheeks
• Occasionally, changes in the orientation of the brows

Surprise
• Longer head
• Bigger-wider eyes
• Open mouth
• Wrinkles in the forehead (changes in the texture)
• Changes in the orientation of the eyebrows (the eyebrows are raised)

Anger
• Wrinkles between the eyebrows (different textures)
• Smaller eyes
• Wrinkles in the chin
• The mouth is tight
• Occasionally, wrinkles over the eyebrows, in the forehead

Boredom-Sleepiness
• Head slightly turned downwards
• Eyes slightly closed
• Occasionally, wrinkles formed in the forehead and a different direction of the brows

Sadness
• Changes in the direction of the mouth
• Wrinkles formed on the chin (different texture)
• Occasionally, wrinkles formed in the forehead and a different direction of the brows

Disgust-Disapproval
• The distance between the nostrils and the eyes is shortened
• Wrinkles between the eyebrows and on the nose
• Wrinkles formed on the chin and the cheeks
3.2 The Feature Extraction Algorithm

Fig. 4. The extracted features (gray points), the measured dimensions (gray lines) and the regions (orthogonals) of the face

The feature extraction process in NEU-FACES converts pixel data into a higher-level representation of the shape, motion, color, texture and spatial configuration of the face and its components. We extract such classification features on the basis of observations of facial changes that arise during the formation of various facial expressions, as indicated in Table 1. Specifically, we locate and extract the corner points of specific regions of the face, such as the eyes, the mouth and the eyebrows, and compute variations in size or orientation from the "neutral" expression to another one. Also, we extract specific regions of the face, such as the forehead or the region between the eyebrows, so as to compute variations in texture. The extracted features are illustrated in Figure 4. Specifically, the feature extraction algorithm works as follows:
1. Search the binary face image and extract its parts (eyes, mouth and brows) into a new image of the same dimensions and coordinates as the original image.
2. In each image of a face part, locate corner points using relationships between neighboring pixel values. This results in the determination of 18 facial points, which are subsequently used to form the classification feature vector.
3. Based on these corner points, extract the specific regions of the face (e.g. forehead, region between the eyebrows). The extracted corner points and regions can be seen in the third column of Table 2 in the Results Section, as they correspond to the facial expressions of the same person shown in the first column. Although these regions are located in the binary face image, their texture measurement is computed from the corresponding region of the detected face image ('window pattern') in the second column.
4. Compute the Euclidean distances between these points, depicted with gray lines in Figure 4, and certain specific ratios of these distances. Compute the orientation of the brows and the mouth.
5. Finally, compute a measure of the texture for each of the specific regions based on the texture of the corresponding "neutral" expression. The results of the previous steps form the feature vector, which is fed into the neural network.
3.3 Training and Results

After computing the feature vector, we use it as input to an artificial neural network to classify facial images according to the expression they contain. To train the neural network we used a training set of 1050 images, which consisted of 150 persons forming the seven facial expressions. During training, the neural network reached an error rate of 10^-2.
Some of the results obtained by our neural network can be seen in Table 2. Specifically, in the first column we see a typical input image, whereas in the second column we see the results of the Face Detection Subsystem. The extracted features are shown in the third column and, finally, the Facial Expression Classification Subsystem's response is shown in the fourth column. According to the requirements set, when the window pattern represents a 'neutral' facial expression, the neural network should produce an output value of [1.00; 0.00; 0.00; 0.00; 0.00; 0.00; 0.00] or so. Similarly, for the "smile" expression, the output must be [0.00; 1.00; 0.00; 0.00; 0.00; 0.00; 0.00], and so on for the other expressions. The output value can be regarded as the degree of membership of the face image in each of the [Neutral; Happiness; Surprise; Anger; Disgust-Disapproval; Sadness; Bored-Sleepy] classes in the corresponding position.
Table 2. Face Detection and Feature Extraction (columns: input image, Detected Face, Extracted Features, Expression Classification; the images are not reproduced here)

Neutral              [1.00; 0.00; 0.00; 0.00; 0.00; 0.00; 0.00]
Happiness            [0.12; 0.83; 0.00; 0.00; 0.00; 0.05; 0.00]
Surprise             [0.00; 0.00; 0.93; 0.00; 0.07; 0.00; 0.00]
Anger                [0.00; 0.13; 0.00; 0.63; 0.34; 0.00; 0.00]
Disgust-Disapproval  [0.00; 0.00; 0.00; 0.22; 0.61; 0.01; 0.16]
Sadness              [0.00; 0.00; 0.00; 0.23; 0.00; 0.66; 0.11]
Bored-Sleepy         [0.00; 0.00; 0.00; 0.29; 0.00; 0.23; 0.58]
4 Evaluation of Performance

The NEU-FACES system managed to classify the emotions shown on a person's face quite satisfactorily. We tested NEU-FACES with 20 subjects forming the 7 facial expressions corresponding to 7 equivalent emotions, for a total of 140 images. The results are summarized in Table 3. In the first three columns we show the results from our empirical studies on humans, specifically the first part of the questionnaire in the first column, the second part in the second column and the mean success rate in the third. In the fourth column we depict the success rate of NEU-FACES for the corresponding emotion. As we can observe, NEU-FACES achieved higher success rates for most of the emotions than the success rates achieved by humans, with the exception of the "anger" emotion, where it achieved only 55%. This is mostly because, first, of the pretence that may accompany such an emotion and, second, of the difficulty humans have in showing such an emotion fully. The second point is further validated by the fact that the majority of the face images depicting "anger" that were erroneously classified by our system were misclassified as "neutral". Generally, NEU-FACES achieves very good results on positive emotions, such as "happiness" and "surprise", where it achieved 90% and 95%, respectively.
Table 3. Success rates in the two parts of the questionnaire and of the NEU-FACES system

                     Questionnaire results                  NEU-FACES
Emotion           1st Part    2nd Part    Mean Value    System Results
Neutral           39.25%      ----        61.74%        80%
Happiness         68.94%      96.21%      82.57%        90%
Sadness           34.09%      82.58%      58.33%        60%
Disgust           18.74%      13.64%      16.19%        65%
Boredom-Sleepy    50.76%      78.03%      64.39%        75%
Anger             76.14%      69.70%      72.92%        55%
Surprise          89.77%      95.45%      92.61%        95%
5 Conclusions
Automatic face detection and expression classification in images is a prerequisite for the development of novel human-computer interaction modalities. However, the development of integrated, fully operational detection/classification systems of this kind is known to be non-trivial, a fact that was corroborated by our own statistical results regarding expression classification by humans. Towards building such systems, we developed a neural network-based system, called NEU-FACES, which first determines whether or not there are any faces in given images and, if so, returns the location and extent of each face. Next, we described features which allow the classification of several facial expressions and presented neural network-based classifiers which use them. The proposed system is fully functional and integrated, in that it consists of modules which capture face images, estimate the location and extent of faces, and classify facial expressions. Therefore, the present or improved versions of our system could be incorporated into advanced human-computer interaction systems and multimedia interactive services.
6 Future Work
In the future, we will extend this work in the following three directions: (1) we will improve our system by using wider training sets so as to cover a wider range of poses and cases of low image quality; (2) we will investigate the need for classifying into more than the currently available facial expressions, so as to obtain more accurate estimates of a computer user's psychological state, which in turn may require the extraction and tracing of additional facial points and corresponding features; (3) we
plan to apply our system to the expansion of human-computer interaction techniques, such as those that arise in mobile telephony, in which the quality of the input images is too low for existing systems to operate reliably. Another extension of the present work, of longer-term interest, will address several problems of ambiguity concerning the emotional meaning of facial expressions by processing contextual information that a multi-modal human-computer interface may provide. For example, complementary research projects are being developed [15-17] that address the problem of perceiving the emotions of users through their actions (mouse, keyboard, commands, system feedback) and through spoken words. This and other related work will be presented on future occasions.
Acknowledgement This work has been sponsored by the General Secretary of Research and Technology of the Greek Ministry of Development as part of the PENED basic research program.
References
[1] Stathopoulou, I.-O., Tsihrintzis, G.A.: Facial Expression Classification: Specifying Requirements for an Automated System. In: 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems, Bournemouth, United Kingdom, October 9-11 (2006)
[2] Ekman, P., Friesen, W.: Unmasking the Face: A Guide to Recognizing Emotions from Facial Expressions. Prentice-Hall, Englewood Cliffs (1975)
[3] Terzopoulos, D., Waters, K.: Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6), 569–579 (1993)
[4] Essa, I., Pentland, A.: Coding, analysis, interpretation and recognition of facial expressions. IEEE Pattern Analysis and Machine Intelligence 19(7), 757–763 (1997)
[5] Black, M.J., Yacoob, Y.: Recognizing facial expressions under rigid and non-rigid facial motions. In: Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pp. 12–17. IEEE Press, Los Alamitos (1995)
[6] Lisetti, C.L., Schiano, D.J.: Automatic Facial Expression Interpretation: Where Human-Computer Interaction, Artificial Intelligence and Cognitive Science Intersect. Pragmatics and Cognition (Special Issue on Facial Information Processing: Multidisciplinary Perspective) 8(1), 185–235 (2000)
[7] Dailey, M.N., Cottrell, G.W., Adolphs, R.: A six-unit network is all you need to discover happiness. In: Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society, pp. 101–106. Erlbaum, Mahwah (2000)
[8] Rosenblum, M., Yacoob, Y., Davis, L.: Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks 7(5), 1121–1138 (1996)
[9] Stathopoulou, I.-O., Tsihrintzis, G.A.: A new neural network-based method for face detection in images and applications in bioinformatics. In: Proceedings of the 6th International Workshop on Mathematical Methods in Scattering Theory and Biomedical Engineering, September 17-21 (2003)
[10] Stathopoulou, I.-O., Tsihrintzis, G.A.: A neural network-based facial analysis system. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisboa, Portugal, April 21-23 (2004) [11] Stathopoulou, I.-O., Tsihrintzis, G.A.: An Improved Neural Network-Based Face Detection and Facial Expression Classification System. In: IEEE International Conference on Systems, Man, and Cybernetics 2004, October 10-13. The Hague, The Netherlands (2004) [12] Stathopoulou, I.-O., Tsihrintzis, G.A.: Pre-processing and expression classification in low quality face images. In: 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic, June 29 – July 2 (2005) [13] Stathopoulou, I.-O., Tsihrintzis, G.A.: Evaluation of the Discrimination Power of Features Extracted from 2-D and 3-D Facial Images for Facial Expression Analysis. In: 13th European Signal Processing Conference, Antalya, Turkey, September 4-8 (2005) [14] Stathopoulou, I.-O., Tsihrintzis, G.A.: Detection and Expression Classification Systems for Face Images (FADECS). In: IEEE Workshop on Signal Processing Systems (SiPS 2005), Athens, Greece, November 2–4 (2005) [15] Virvou, M., Alepis, E.: Mobile educational features in authoring tools for personalised tutoring. The journal Computers & Education (to appear, 2004) [16] Virvou, M., Katsionis, G.: Relating Error Diagnosis and Performance Characteristics for Affect Perception and Empathy in an Educational Software Application. In: Proceedings of the 10th International Conference on Human Computer Interaction (HCII) 2003, Crete, Greece, June 22-27 (2003) [17] Alepis, E., Virvou, M., Kabassi, K.: Affective student modeling based on microphone and keyboard user actions. In: 6th IEEE International Conference on Advanced Learning Technologies 2006 (ICALT 2006), pp. 139–141 (2006) ISBN:0-7695-2632-2
Histographic Steganographic System Constantinos Patsakis and Nikolaos Alexandris Department of Informatics, University of Piraeus Abstract. In this paper we propose a new steganographic algorithm for JPEG images named HSS, with very good statistical properties. The algorithm is based on a previous work of Avidan and Shamir on image resizing. One of the key features of the algorithm is its capability of hiding the message according to the cover image properties, making the hidden message as untraceable as possible.
1 Introduction
Data hiding has always been an important issue. It has two meanings: hiding the meaning of the data, or hiding their very existence. Nowadays, due to the growth of technology, two sciences involving data hiding have been developed, cryptography and steganography. The main purpose of the first is to hide the contents of data so that only authenticated entities can have access to them. The latter, steganography, has been developed for embedding data in other media, such as text, sound or image, in order to hide the very existence of the hidden data. Perhaps the model that best describes steganography is the prisoners' problem, given by [1] and [2], where two prisoners Alice and Bob, from now on A and B respectively, want to escape prison. Both of them have to exchange information without their warden, from now on W, knowing that they are up to something. If W finds out that there is something peculiar in the way A and B are talking to each other, then he will put them in separate wards and their escape plan is doomed. The algorithm that we propose, HSS, embeds data in JPEG images so that only A and B are able to know of their existence. The image is processed in such a way that the original image does not differ much from the stego image visually, and its statistical properties do not reveal that there is data infiltration. In any case, both symmetric and asymmetric algorithms can be used, as we will show later on. The paper is organized as follows: after this short introduction, we give some necessary background material. We then present the algorithm and show some facts about its performance as well as its advantages over other algorithms. Finally, we conclude with a summary of what has been achieved and directions for future work.
2 Background
In a recent work, Avidan and Shamir [3] proposed a new algorithm for image resizing based on the energy weight of the pixels to be removed.
Fig. 1. The classic model of steganography
They find seams, paths of small energy, and remove them from each row or column, depending on how the image has to be resized. We will now give the necessary definitions and tools for defining our algorithm. For the sake of simplicity, we use the definitions of Avidan and Shamir.

Definition 1. Let I be an n×m image and define a vertical seam to be:

s^x = \{s_i^x\}_{i=1}^{n} = \{(x(i), i)\}_{i=1}^{n}

such that \forall i, |x(i) - x(i-1)| \le 1, where x is a mapping x : [1, ..., n] \to [1, ..., m].

If we try to imagine the visual meaning of a vertical seam, we can say that it is a vertical path from the top of the image to the bottom. Similarly we have the horizontal seam, which is of course a horizontal path from the left side of the picture to the right.

Definition 2. Let I be an n×m image and define a horizontal seam to be:

s^y = \{s_j^y\}_{j=1}^{m} = \{(j, y(j))\}_{j=1}^{m}

such that \forall j, |y(j) - y(j-1)| \le 1, where y is a mapping y : [1, ..., m] \to [1, ..., n].

We will denote the pixels of a seam s by I(s).

Definition 3. Given an energy function e, we define the cost of a seam as:

E(s) = E(I_s) = \sum_{i=1}^{n} e(I(s_i))
Opposite to Avidan and Shamir, we will regard the optimal seam s^* as the seam that maximizes the seam cost:

s^* = \max_s E(s) = \max_s \sum_{i=1}^{n} e(I(s_i))

Fig. 2. Vertical and horizontal seams
Fig. 3. Original picture, its energy map and gradient
The energy function that we are going to use is e_{HoG}:

e_{HoG} = \frac{\left|\frac{\partial}{\partial x} I\right| + \left|\frac{\partial}{\partial y} I\right|}{\max(HoG(I(x, y)))}

where HoG(I(x,y)) is taken to be a histogram of oriented gradients at every pixel. We use an 8-bin histogram computed over an 11×11 window around the pixel. Thus, taking the maximum of the HoG in the denominator attracts the seams to edges in the image, while the numerator makes sure that the seam will run parallel to the edge and will not cross it. As they propose, the optimal seam can be found using dynamic programming.
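The following Python sketch illustrates such a dynamic-programming search for a maximum-cost vertical seam; it mirrors classic seam carving, but with max() in place of min(), as this paper requires. The function name and the plain nested-loop formulation are ours.

import numpy as np

def optimal_vertical_seam(e):
    # e: (n, m) array of pixel energies (e.g. e_HoG); unlike seam carving,
    # the seam of maximum total energy is sought here, hence max() not min()
    n, m = e.shape
    M = e.astype(float).copy()          # M[i, j]: best cost of a seam ending at (i, j)
    for i in range(1, n):
        for j in range(m):
            lo, hi = max(j - 1, 0), min(j + 1, m - 1)
            M[i, j] += M[i - 1, lo:hi + 1].max()
    seam = [int(M[-1].argmax())]        # backtrack from the best cell in the last row
    for i in range(n - 1, 0, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 1, m - 1)
        seam.append(lo + int(M[i - 1, lo:hi + 1].argmax()))
    return seam[::-1]                   # column index x(i) for every row i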
3 The Algorithm
The main idea of the algorithm is to hide data in areas of the image that have much energy and do not affect the way the picture is shown. In this way, the bias of the DCT will be ignored by W, since it is normal for high-energy areas to have big changes around them. Moreover, since these areas have so much energy, they capture the eye, so that it cannot detect many differences. It is obvious that the resizing algorithm of Avidan and Shamir tries to keep these areas untouched, as they are the true carriers of information. Furthermore, these changes do not distort the picture and do not add noise to the whole of it, which would make it suspicious to W. Finally, since these areas need more information to be stored, the impact of their distortion will go unnoticed through the compression. We assume that both parties share a secret key K, which they will use in order to exchange their hidden information. We will now divide the picture into parts according to optimal seams. The encryption algorithm that we are going to use is AES [4, 5], so the key sizes that can be used are 128, 192 and 256 bits. The key size depends on the decision of entities A and B. In order to track possible malicious or accidental changes to the image we will use the SHA-1 hash function [6, 7]. The algorithm as we present it here is in its simplest form, yet it can easily be altered to meet more special needs. Let M be the secret message that A wants to transfer to B, and I the image of n by m pixels that will be the carrier of M. Entity A computes C = Enc_K(M || SHA(M)). We pad C in order to obtain a C' such that len(C') mod n = 0; for the sake of simplicity, we pad C with zeros. Now we have to compute h = 3·len(C')/(8n) vertical seams that will hide our data inside them; the same thing can be done using horizontal seams. Using dynamic programming we find h optimal vertical seams that do not intersect, and we increase their energy with the smallest possible value, so that after the infiltration all h seams have the biggest possible values. We now take these h vertical seams and, using the LSB method, we embed C' into their last two bits. It is apparent that by reversing the steps of the HSS algorithm one can extract the embedded message from the stego image.
Fig. 4. Original picture and stego image
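A minimal sketch of the embedding step is given below. It operates on plain greyscale pixel values for illustration (the paper targets JPEG images), and the function name and data layout are hypothetical.

import numpy as np

def embed_along_seams(img, seams, payload_bits):
    # img          : (n, m) uint8 greyscale image (pixel-domain illustration)
    # seams        : list of vertical seams, each a list of column indices per row
    # payload_bits : iterable of 0/1 bits of the padded ciphertext C'
    stego = img.copy()
    bits = iter(payload_bits)
    for seam in seams:
        for row, col in enumerate(seam):
            try:
                two = (next(bits) << 1) | next(bits)   # next two message bits
            except StopIteration:
                return stego                           # message exhausted
            stego[row, col] = (stego[row, col] & 0b11111100) | two
    return stego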
4 Performance and Security
Since the algorithm does not create new images in order to hide the message but detects the parts of the picture that carry the most information, the general performance of the algorithm is rather fast. Furthermore, due to the nature of the algorithm, the stego image and the original picture share almost equal histograms, something that is apparent in figures 6 and 7. Moreover, the algorithm does not add information that can be detected by the human eye, or that can be traced with automated tools. None of the tests of stegdetect [9] of Provos has recognized the presence of a hidden message in the tested images.
Fig. 5. The stego image, its energy map and gradient
Fig. 6. The histogram of the original image
Fig. 7. The histogram of the stego image
One of the main properties of HSS is its security. The use of AES and SHA-1 as parts of the algorithm improves the performance of the algorithm and the security of the whole infrastructure. The message is encrypted with a well-known algorithm, and we can test whether the message has been altered using the hash function.
5 Conclusion
In this paper we introduced the HSS algorithm, a steganographic algorithm that embeds hidden messages in images, adapting each time to the properties of the cover image. This makes the hidden message less traceable by steganographic attacks, as the DCT coefficients remain the same, the histogram of the stego image is almost the same as the histogram of the original image, and the stego image does not have any obvious differences from the original one. HSS uses modern cryptographic algorithms, providing extra security for the embedded message together with good statistical behavior. In some cases, the algorithm can even retrieve a part of the hidden message from images that have been tampered with, using the hash function, or retrieve everything up to the destroyed area. Perhaps the only drawback compared to other algorithms is the reduced steganographic capacity due to the use of seams.
References 1. Kharrazi, M., Sencar, H.T., Memon, N.: Image steganography: Concepts and practice. In: WSPC. Lecture Notes Series (2004) 2. Simmons, G.J.: The prisoners problem and the subliminal channel. In Advances in Cryptology: Proceedings of Crypto 1983, pp. 51–67. Plenum Press (1984)
3. Avidan, S., Shamir, A.: Seam Carving for Content-Aware Image Resizing. ACM Transactions on Graphics 26(3); SIGGRAPH 2007 (2007)
4. FIPS PUB 197: Advanced Encryption Standard
5. Daemen, J., Rijmen, V.: The Block Cipher Rijndael. In: Schneier, B., Quisquater, J.-J. (eds.) CARDIS 1998. LNCS, vol. 1820, pp. 277–284. Springer, Heidelberg (2000)
6. RFC 3174, US Secure Hash Algorithm 1 (SHA-1)
7. FIPS 180-2: Secure Hash Standard (SHS)
8. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision & Pattern Recognition, vol. 2, pp. 886–893 (2005)
9. Provos, N.: Stegdetect, http://www.outguess.org/download.php
Moving Object Detection and Tracking for the Purpose of Multimodal Surveillance System in Urban Areas Andrzej Czyzewski and Piotr Dalka Multimedia Systems Department, Gdansk University of Technology Gdansk, Poland
[email protected]
Abstract. A background subtraction method based on a mixture of Gaussians was employed to detect all regions in a video frame denoting moving objects. Kalman filters were used for establishing relations between the regions and the real moving objects in a scene and for tracking them continuously. The objects were represented by rectangles. The coupling of objects with adequate regions, including the relation of many-to-many, was studied experimentally employing Kalman filters. The implemented algorithm provides a part of an advanced audio-video surveillance system for security applications, which is described briefly in the paper. Keywords: moving object detection and tracking, background subtraction, mixture of Gaussians, Kalman filters.
1 Introduction
Video surveillance systems are very often used for monitoring many public places in every agglomeration. Such systems usually utilize dozens of cameras and produce large amounts of video streams that cannot be analyzed effectively by human operators. Furthermore, commonly found systems do not utilize other rich sources of information, like acoustic signals. At the same time, surveillance systems are required to be very reliable and effective. Thus it is necessary to implement an autonomous system combining visual and acoustic data, which would be able to detect, classify, record and report unusual and potentially dangerous events. Such a multimodal surveillance system would be an invaluable asset for various administrative authorities and safety agencies, as well as for private companies interested in securing their facilities. The outcome of the system would have a positive influence on increasing the global safety of citizens. We started this kind of experiments in Poland by building a system for monitoring urban noise employing a multimedia approach [1][8]. The paper presents a fragment of our research concerning advanced surveillance system elements. We restrict this paper to our experiments devoted to detecting visual objects and tracking them in motion, being a part of the more complex surveillance system. The main goal of this multimodal surveillance system is to automatically detect events, classify them and alarm an operator if an unusual or dangerous activity is detected. Events are detected by many universal monitoring units which are placed in the monitored area. Results are sent to the central surveillance server, which stores them, analyses, classifies and notifies an operator if needed.
2 Moving Object Detection
Moving object detection and segmentation is an important part of video-based applications, including video surveillance. In the latter application, results of detection and segmentation of objects in video streams are required to determine the type of an object and to classify events regarding the object. Most video segmentation algorithms usually employ spatial and/or temporal information to generate binary masks of objects [2]. Spatial segmentation is basically image segmentation, which partitions the frame into homogeneous regions with respect to their colours or intensities. This method can typically be divided into three approaches. Region-based methods rely on spatial similarity in colour, texture or other pixel statistics to identify separate objects, while boundary-based approaches primarily use a differentiation filter to detect image gradient information and extract edges. In the third, classification-based approach, a classifier trained with a feature vector extracted from the feature space is employed to combine different cues such as colour, texture and depth [3]. Temporal segmentation is based on change detection followed by motion analysis. It utilizes intensity changes produced by moving objects to detect their boundaries and locations. Although temporal segmentation methods are usually more computationally effective than spatial approaches, they are sensitive to noise and lighting variations [2]. There are also methods combining both spatial and temporal video characteristics, thus leading to spatio-temporal video segmentation. The final outcome is a 3-D surface encompassing object position through time, called an object tunnel [4]. The solution presented in the paper utilizes spatial segmentation to detect moving objects in video sequences. The most popular region-based approach is background subtraction [5], which generally consists of three steps. First, a reference (background) image is calculated. Then, the reference image is subtracted from every new image frame. Finally, the resulting difference is threshold-filtered. As a result, binary images denoting foreground objects in each frame are obtained. The simplest method to acquire the background image is to calculate a time-averaged image. However, this method suffers from many drawbacks (e.g., limited adaptation capabilities) and cannot be effectively used in a surveillance system. A popular and promising technique of adaptive background subtraction is modelling pixels as mixtures of Gaussians and using an on-line approximation to update the model [6][7]. This method has proved useful in many applications, as it is able to cope with illumination changes and adapt the background model according to the changes in the scene, e.g., motionless foreground objects eventually become a part of the background. Furthermore, the background model can be multi-modal, allowing regular changes in the pixel colour. This makes it possible to model such events as trees swinging in the wind or traffic light sequences. Thus this method was used for moving object detection in our multimodal surveillance system. In this method, each image pixel is described by a mixture of K Gaussian distributions [13]. The probability that a pixel has value x_t at the time t is given as:

p(x_t) = \sum_{i=1}^{K} w_t^i \, \eta(x_t, \mu_t^i, \Sigma_t^i)    (1)
where w_t^i denotes the weight and \mu_t^i and \Sigma_t^i are the mean vector and the covariance matrix of the i-th distribution at the time t, and \eta is the normal probability density function. The number of distributions K is usually a small number (3 or 5) and is limited by the available computational power. For simplicity, and to reduce memory consumption, it is assumed that the RGB colour components are independent, but their variances are not restricted to be identical as in [6]. In this way the covariance matrix \Sigma is a diagonal matrix with the variances of the RGB components on its main diagonal. It is assumed that each Gaussian distribution represents a different background colour of a pixel. The longer a particular colour is present in the video stream, the higher the value of the weight and the lower the values in the covariance matrix of the corresponding distribution. With every new video frame, the parameters of the distributions for each pixel are updated according to the previous values of the parameters, the current pixel value and the model learning rate \alpha. The higher \alpha, the faster the model adjusts to changes in the scene background (e.g. caused by gradual illumination changes), although moving objects remaining still for a longer time (e.g. vehicles waiting at traffic lights) would become a part of the background more quickly. When determining whether the current pixel is a part of a moving object, only distributions characterized by high weights and low values in covariance matrices are used as the background model. If the current pixel matches one of the distributions forming the background model, it is classified as the background of the scene; otherwise it is considered a part of a foreground object. Moving object detection is supplemented with a shadow detection module, which is required for every outdoor video processing application, especially in the field of surveillance. The shadow of a moving object is always present, moves together with the object and as such is detected as a foreground object by a background removal application. The shadow detection method is based on the idea that, while the chromatic component of a shadowed background object remains generally unchanged, its brightness is significantly lower. This makes it possible to separate the RGB colour space used in the model into chromatic and brightness components. Only pixels recognized as a part of a foreground object during the background subtraction process are checked as to whether they are part of a moving shadow. A binary mask denoting pixels recognized as belonging to foreground objects in the current frame is the result of the background subtraction. The mask is morphologically refined in order to allow object segmentation [7][8]. Morphological processing of a binary mask consists of removing regions (connected components) having too few pixels, morphological closing and filling holes in regions.
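A minimal sketch of this detection chain, using the Gaussian-mixture background model available in OpenCV, is shown below. The parameter values, the area threshold and the input file name are illustrative assumptions, not the authors' settings.

import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

cap = cv2.VideoCapture('traffic.avi')            # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)               # 255 = foreground, 127 = shadow
    mask[mask == 127] = 0                        # discard detected moving shadows
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # drop connected components with too few pixels
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < 50:
            mask[labels == i] = 0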
3 Applying Kalman Filters to Object Tracking in Motion
Kalman filtering [9] provides a useful approach to tracking objects in video, and thus numerous papers have discussed this subject. A good review of older publications is provided in Funk's study [10]. Hongshan Yu et al. present the application of Kalman filtering in an advanced framework for multi-moving target tracking [11]. Many interesting concepts as well as newer literature citations can be found in some
papers published after the PETS'06 workshop, e.g. [12], concerning tracking with multiple cameras. In the process of tracking, each of the detected moving objects has its own Kalman filter (a so-called tracker) created that represents it. The Kalman filter is most of all used to establish the proper relation between the detected regions (blobs) that map to moving objects in the current frame and the real moving objects under analysis. In our initial experiments, we planned to follow some ideas presented by a research team of the Institute for Robotics and Intelligent Systems, University of Southern California [13]; however, the Kalman filter application is only mentioned in that paper, so it was necessary to build the algorithm from scratch. As a result of applying the previously implemented algorithms for moving object extraction [1][8] mentioned in the previous section, we obtained moving objects represented by rectangles. The experiments used two types of trackers (Kalman filters). In the first type, the state of the moving object (vector x^8) is described by 8 parameters (which is denoted by the superscript value in the vector's indicator), and in the second version the state vector x^6 has 6 elements:

x^8 = [x, y, w, h, dx, dy, dw, dh]^T    (2)

x^6 = [x, y, w, h, dx, dy]^T    (3)
where x and y denote the location of the object (actually the coordinates of the upper-left corner of the rectangle that represents it), w and h are the width and height of the rectangle, dx and dy indicate the change in the object's location during subsequent time intervals, and dw and dh are the changes in the width and height of the rectangle during subsequent time intervals. The two additional parameters, which distinguish the 8-element state vector from the 6-element one, express the dynamics of changes in an object's dimensions. This means that an 8-element Kalman filter tracks faster and greater changes in an object's (rectangle's) dimensions than the 6-element one, which allows such changes only at the stage of the vector correction phase. In both cases the measurement vector adopts the following form:

z = [x_b, y_b, w_b, h_b]^T    (4)
which includes the location, width and height of the region holding the pixels of the moving object associated with the current tracker. The transition matrix A and observation matrix H of the Kalman filter were binary matrices in the form appropriate for the state and observation vectors defined above. The applied model does not require any control inputs, which results in an input matrix B equal to 0. The process of tracking a moving object has several phases. Each newly detected object is assigned a new tracker with the state vector based on the parameters measured for the blob, according to the following equation:

x_{-1}^8 = [x_{-1}^b, y_{-1}^b, w_{-1}^b, h_{-1}^b, 0, 0, 0, 0]^T ,  x_{-1}^6 = [x_{-1}^b, y_{-1}^b, w_{-1}^b, h_{-1}^b, 0, 0]^T    (5)
In the following time interval (namely, in the next frame), the state vector is updated once more, based upon the parameters corresponding to the newly-created object:

x_0^8 = [x_0^b, y_0^b, w_0^b, h_0^b, x_0^b - x_{-1}^b, y_0^b - y_{-1}^b, w_0^b - w_{-1}^b, h_0^b - h_{-1}^b]^T
x_0^6 = [x_0^b, y_0^b, w_0^b, h_0^b, x_0^b - x_{-1}^b, y_0^b - y_{-1}^b]^T    (6)
The vector x_0 constitutes the initial estimate x̂_0 of the state vector. In the following time intervals, first the forward prediction of the state vector of all Kalman filters assigned to the currently existing objects is made. This is done in order to obtain the a priori estimate of the location of objects belonging to the current image frame. The next step is to purge trackers whose a priori estimate of the state vector contains non-positive or too small values of the object's width and height (a situation which is possible for the 8-element state vectors if errors occur during background subtraction). Then, it is decided which blob of the current frame is assigned to which one of the tracked objects. In the final phase, the Kalman filter state vectors of each object are corrected. This is done based on the measured parameters of the regions holding the pixels of the respective moving objects detected in the current frame.
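The sketch below shows what one such 8-element tracker could look like in Python. The noise covariances are illustrative assumptions, as the paper does not report the exact matrices.

import numpy as np

class Tracker:
    # state x = [x, y, w, h, dx, dy, dw, dh]^T; noise levels are illustrative
    def __init__(self, blob):                    # blob = (x, y, w, h)
        self.x = np.array([*blob, 0, 0, 0, 0], dtype=float)
        self.P = np.eye(8) * 10.0                # state covariance
        self.A = np.eye(8)
        self.A[:4, 4:] = np.eye(4)               # position/size += velocity each frame
        self.H = np.hstack([np.eye(4), np.zeros((4, 4))])   # we observe x, y, w, h
        self.Q = np.eye(8) * 1e-2                # process noise
        self.R = np.eye(4)                       # measurement noise

    def predict(self):                           # a priori estimate
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def correct(self, z):                        # measurement update with blob z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P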
4 Establishing Relations between Moving Objects and Regions
The key action of the tracking algorithm is to properly associate trackers with the blobs resulting from background subtraction in the current frame. For this purpose, a binary matrix M that depicts the relations is created. In this matrix, each tracker-blob pair (where the a priori estimate x̂_k^- of the object's state represents the tracker and the measurement vector z_k relates to the blob) is assigned zero or one, depending on whether the rectangles enclosing the region and the estimated object location have a common part (i.e. whether they overlap). As a result, an i × j relations matrix M is created for i trackers and j detected blobs of the current frame. This way of creating the matrix provides a vital simplification of hitherto used procedures. There are some basic types of relations between trackers and regions, each of them requiring different actions to be taken [13]. If a certain blob is not associated with any tracker, a new tracker (Kalman filter) is created and initialised in compliance with this region. If a certain tracker has no relation to any of the blobs, then the phase of measurement update is not carried out in the current frame. If the tracker fails to relate to a proper region within several subsequent frames, it is deleted. The predictive nature of trackers assures that moving objects whose detection through background subtraction is temporarily impossible are not "lost" (e.g., when a person passes behind an opaque barrier). One of the most desirable types of relation is a one-to-one relationship between a tracker and a blob. In this case, the tracker is updated with the results of the region measurements. Another type of association is a many-to-one relationship, meaning that in the current image frame there are several trackers related to the same blob. Thus, each of these trackers is updated with the parameter measurements of this same region. Similar circumstances correspond to the situation in which two humans,
previously moving separately (mapping to one-to-one tracker-blob relations), start walking together, possibly causing one to hide behind the other, which makes their trackers start relating to the same region. Another type of object-region relation is the one-to-many relation, i.e., one tracker is associated with many blobs. In this case, the tracker is updated with the parameters of the rectangle covering all of these regions. Such an approach is supposed to assure the cohesion of the traced object in the case of faulty background subtraction that divides the object and then erroneously couples it with several regions instead of one, or when a moving person temporarily disappears behind another object, e.g., a pillar. If it is actually a situation of two humans entering the camera scope as a group and then parting, or a situation of a person who abandons an object, the distance between the blobs increases, making in effect the current tracker "follow" the object that changes its dimensions and motion vector to a lesser extent, while the other object is assigned a new tracker. The last and most intricate way of coupling objects with regions is a relation of many-to-many. It corresponds to the situation of several people who initially are moving separately, then form a group (thus inducing the relation of many trackers to one blob), and after some time some persons decide to leave the group. Ideally, in such a case, the tracker originally assigned to a leaving person (before the person entered the group) should follow the same person (proving that continuity of viewing is provided all the time the objects remain in the scene). For the algorithm, identical is the situation when two people pass by each other. First, we have to deal with 2 one-to-one relations. Next, a single two(trackers)-to-one(blob) relation is established, then (at the very moment of passing by) a many-to-many relation is formed, which finally transforms again into 2 one-to-one relations. All the time the trackers should be able to follow the adequate objects. In order to achieve this, trackers store descriptions of the objects they trace. At moments when many trackers are associated with many blobs (as in the case of a parting group of people), the degree of similarity is calculated in the same way for each object. It is computed based on the description held by the object's tracker and the description derived from the calculations performed for each of the regions. If the maximum degree of similarity (within the group of the many-to-many relationship) exceeds a specified threshold, a suitable tracker is updated with the measurements of the region that is most similar to it. The obtained tracker-blob pair is excluded from further analysis, and the same process of finding another matching pair repeats. The next pair is updated and excluded. If, finally, only degrees that do not exceed the threshold are left, all the remaining trackers (in the analysed group) are updated with the parameters of the rectangle covering the whole rest of the regions. If there are no blobs left to associate trackers with (which is possible if the many-to-many relation was formed by more trackers than regions), the remaining trackers are not updated. In our experiments, a two-dimensional colour histogram using the chromatic RcGc colour space was applied for each object as its description.
The degree of similarity between the appearance of the object (the RcGc histogram) stored by the tracker, and the appearance (the RcGc histogram) of the analysed region is determined through the measurement of correlation.
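The following sketch illustrates the two ingredients just described: the binary tracker-blob relations matrix M built from rectangle overlaps, and the correlation-based similarity of two RcGc histograms. The rectangle representation and the small stabilising constant are our assumptions.

import numpy as np

def overlaps(a, b):
    # True if axis-aligned rectangles a and b, given as (x, y, w, h), intersect
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def relation_matrix(tracker_rects, blob_rects):
    # binary i-by-j matrix M of tracker-blob overlap relations
    M = np.zeros((len(tracker_rects), len(blob_rects)), dtype=int)
    for i, t in enumerate(tracker_rects):
        for j, b in enumerate(blob_rects):
            M[i, j] = int(overlaps(t, b))
    return M

def histogram_similarity(h1, h2):
    # correlation of two flattened RcGc histograms (degree of similarity)
    a = h1.ravel() - h1.mean()
    b = h2.ravel() - h2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))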
5 Experiments and Results
The experiments show that the developed algorithm works correctly. In Fig. 1, sample results of vehicle detection and segmentation are presented. The implemented algorithm for moving object detection correctly determines the scene background and marks the locations of all objects (vehicles), both during day and night. The moving shadow detection and morphological processing turned out to be very useful in separating two vehicles originally labelled as one region (the first row in Fig. 1). The algorithm is also able to detect vehicles in night sequences (the second row in Fig. 1). There is one major drawback of the night environment: car headlights illuminate the road ahead of a car and nearby buildings, which causes the illuminated areas to be classified as foreground objects. A supplementary decision layer (possibly employing an intelligent decision algorithm) needs to be added to the algorithm to prevent such false detections and obtain exact vehicle shapes. Based on the results of moving object detection, the developed algorithm performs tracking of objects. The performance results are satisfactory, especially when the number of objects present in the analysed scene is not large. Fig. 2 demonstrates the effectiveness of the algorithm in a sample situation of a car overtaking another car. A similar situation is shown in Fig. 3 (an event of two people passing by each other). Both examples show that the trackers (marked by different colours) follow the right objects and confirm that the algorithm is able to continuously track objects despite their partial overlapping. The experiments did not clearly prove which of the two types of Kalman filters (8- or 6-element state vector) is better. The 8-element vector filter shows greater flexibility and better results in the case of more crowded scenes. The 6-element vector filter assures more precise measurements of the shape and location of objects (real, traced objects do not change their size all the time) and in some problematic situations performs better, i.e., when many-to-many relationships are engaged. The decision as to which type of vector to choose will depend on the application and on the characteristics of the analysed scene.
Fig. 1. Examples of vehicle detection during day (first row) and night (second row): a) original frames from recorded video sequences; b) raw results of background removal without any further processing; c) final results of vehicle segmentation
Fig. 2. Frames illustrating continuous tracking of two vehicles overtaking each other
Fig. 3. Fragments of frames 2266, 2270 and 2277 of the S1-T1-C3 recording from the PETS 2006 [14] set: two humans passing by each other. In the upper row, the 8-element Kalman filter was used, and in the lower row the 6-element one.
The presented solution is the first version of our moving object tracking algorithm, which even at its initial state of development lets us achieve good results. It is of course advisable to advance and improve it, mostly through using other, more distinctive parameters in object descriptions, and through specifying a more precise measure of similarity than correlation. In simple situations the described algorithm performs very well; however, its reliability decreases as the number of objects interacting with each other increases.
6 Conclusions The solution for tracking mobile objects applied in our hitherto experiments and considered for future work will constitute an important element of the prototype surveillance system consisting of a set of distributed monitoring units and a central server for data processing. Data regarding moving objects, obtained from trackers, can be directly used to detect unusual or prohibited events such as trespassing or luggage abandonment. In the area of moving object detection, future work will be focused on including spatial and temporal dependencies between pixels in the background model and dynamically adjusting the learning rate, depending on the current scene change rate. A possible area of improvements in the tracking part of the algorithm should address an implementation and examination of trackers that use more advanced algorithms to estimate state vectors of dynamic discrete processes, particularly those employing the Extended Kalman Filter and the Unscented Kalman Filter. These solutions utilize non-linear and/or non-Gaussian models of processes and therefore should estimate motion of real-world objects with greater accuracy.
Acknowledgements Research is subsidized by the Polish Ministry of Science and Higher Education within Grant No. R00-O0005/3 and by the European Commission within FP7 project “INDECT” (Intelligent Information System Supporting Observation, Searching and Detection for Security of Citizens in Urban Environment).
References 1. Czyzewski, A., Dalka, P.: Visual Traffic Noise Monitoring in Urban Areas. International Journal of Multimedia and Ubiquitous Engineering 2(2), 91–101 (2007) 2. Li, H., Ngan, K.: Automatic Video Segmentation and Tracking for Content-Based Applications. IEEE Communication Magazine 45(1), 27–33 (2007) 3. Liu, Y., Zheng, Y.: Video Object Segmentation and Tracking Using y-Learning Classification. IEEE Trans. Circuits and Syst. For Video Tech. 15(7), 885–899 (2005) 4. Konrad, J.: Videopsy: Dissecting Visual Data in Space Time. IEEE Communication Magazine 45(1), 34–42 (2007)
5. Yang, T., Li, S., Pan, Q., Li, J.: Real-Time and Accurate Segmentation of Moving Objects in Dynamic Scene. In: ACM Multimedia 2nd International Workshop on Video Surveillance and Sensor Networks, New York, October 10-16 (2004)
6. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. on Pattern Analysis and Machine Intell. 22(8), 747–757 (2000)
7. Elgammal, A., Harwood, D., Davis, L.: Non Parametric Model for Background Subtraction. In: ICCV Frame-rate Workshop (September 1999)
8. Dalka, P.: Detection and Segmentation of Moving Vehicles and Trains Using Gaussian Mixtures, Shadow Detection and Morphological Processing. Machine Graphics and Vision 15(3/4), 339–348 (2006)
9. Welch, G., Bishop, G.: An Introduction to the Kalman Filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill (1995)
10. Funk, N.: A Study of the Kalman Filter applied to Visual Tracking. University of Alberta, Project for CMPUT 652 (December 7, 2003)
11. Yu, H., Wang, Y., Kuang, F., Wan, Q.: Multi-moving Targets Detecting and Tracking in a Surveillance System. In: Proc. of the 5th World Congress on Intelligent Control and Automation, Hangzhou, China, June 15-19 (2004)
12. Martínez-del-Rincón, J., Herrero-Jaraba, J.E., Gómez, J.R., Orrite-Uruñuela, C.: Automatic left luggage detection and tracking using multi-camera UKF. In: Proc. 9th IEEE Internat. Workshop on Performance Evaluation in Tracking and Surveillance (PETS 2006), NY, USA, pp. 59–66 (2006)
13. Lv, F., Song, X., Wu, B., Kumar, V., Nevatia, S.: Left-Luggage Detection using Bayesian Inference. In: Proc. of 9th IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance, New York, USA, June 2006, pp. 83–90 (2006)
14. PETS 2006 – a collection of test recordings from the Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, New York, USA, June 18 (2006)
Image Similarity Search in Large Databases Using a Fast Machine Learning Approach Smiljan Šinjur and Damjan Zazula Faculty of Electrical Engineering and Computer Science University of Maribor Smetanova ulica 17 2000 Maribor Slovenia {smiljan.sinjur,zazula}@uni-mb.si
Abstract. Today's tendency to protect various copyrighted multimedia contents, such as text, images or video, has resulted in many algorithms for detecting duplicates. If the observed content is identical, then the task is easy. But if the content is even slightly changed, the task of identifying the duplicate can be difficult and time consuming. In this paper we develop a fast, two-step algorithm for detecting image duplicates. The algorithm also finds slightly changed images with added noise, translated or scaled content, or images that have been compressed and decompressed by various algorithms. The time needed to detect duplicates is kept low by implementing image feature-based searches. To detect all images similar to a given reference image, feature extraction based on convex layers is deployed. The correlation coefficient between two features gives a first hint of similarity to the user, who creates a learning set for support vector machines by simple on-screen selection. Keywords: Image similarity, Convex layer, Correlation coefficient, Machine learning, Support vector machine.
1 Introduction
While similarity algorithms for text are more or less known and good in terms of quality and speed, this is not true for images and video. To detect video duplicates, the process is easily transformed into an image problem: a complete reference video or a part of it is similar to the tested video or a part of it if a sequence of frames of the reference video is similar to a sequence of frames from the tested video. All frames of all tested videos have to be stored locally in a database. Since a video is usually composed of a lot of frames, the storage size of such a database can be large. In this case, it is crucial to store only a smaller amount of information which describes the images uniquely by their features. Although the database size is then smaller, the number of feature vectors equals the number of images. So, the algorithms that perform any manipulation on the feature vectors have to be very fast. Image similarity has been used in many applications, such as content-based image retrieval, feature tracking, stereo matching, and scene and object recognition. The first image matching algorithms were developed back in the fifties [1]. Cross-correlation in stereo image tracking was used in [2]. Very fast image similarity search can be
performed if an image is described by keywords and metadata [3]. The keywords for searching are generated by the user or by computer, for example from the image name. Metadata are used by Google Image Search [4] or Tilmoto search engine [5]. Grauman et al. [6] use local descriptors for object recognition. Similarity of the images is measured as a distance between local features, which leads to short computational times. The CBIR features, such as colour and texture, are used in [7]. The similarity of two images is defined by dynamic distance of those features that add value to the comparison. Given a reference image, we developed an algorithm that finds all similar images from a large database. In the first place, the database of image features is constructed from all tested images. To extract the features from an image, the convex layers based on monochromatic images are formed. The similarity measure is defined on correlation coefficient between the two feature vectors of images, and falls into the interval [-1, 1]. Correlation coefficients closer to 1 indicate similar images, while those closer to -1 indicate their dissimilarity. However, it is difficult to define a proper thresholding, because the feature vectors are not easily separable, in general. Therefore, we derived a two-step procedure: first, a coarse thresholding is done for the reference image using the correlation coefficient algorithm, which is then followed by user’s selection of an initial learning set of images. This selection is performed on-screen from three collections of images displayed according to their correlation coefficient: a group of most similar images, a group of border cases, and a group of least similar images with regard to the reference image. The obtained learning set is further used to train a support vector machine (SVM). This paper is organized in 5 sections. Section 2 describes convex layer construction on a grid of points. Also the correlation coefficient computed on the detrended image features is presented in Section 2. In Section 3, a selection of the initial learning set for the SVM-based algorithm is given. Section 4 interprets the experimental results, while Section 5 concludes the paper and gives hints for future work.
2 Image Similarity
The image similarity measure is defined by a distance between the corresponding pairs of objects from two images. In our case, all the objects from an image constitute one unique feature vector. The similarity of two images is, therefore, measured by a comparison of two feature vectors: one belonging to the reference and the other to the tested image. To extract a feature vector for an image, image background and foreground have to be separated. The foreground is a set of objects that constitute the point of the user's interest; it is usually placed in the middle of the image and changes dynamically along the subsequent images (e.g. through video frames), whereas the background is usually static and covered by the objects. Also, the foreground and background usually differ in hue. The background is ignored and the foreground determines a unique feature vector. Separation of image background and foreground can be facilitated by a transformation to the HSV colour model. The human eye is most sensitive to hue, H, so saturation S and value V are set equal to 0. As a result, a greyscale image is obtained. Now, the background and foreground differ only in hue. A thresholding is based on the
minimum value between two local maxima in the bimodal histogram of the greyscale image [8]. Foreground pixels are afterwards laid over a grid to compute convex layers and use them in building up the image feature vector.
2.1 Convex Layers as Features
An image feature vector is created from convex layers based on foreground points [12], [13]. The problem of convex layers is an extension of the convex hull problem. So, to generate convex layers, a set of convex hulls is generated as described by the following algorithm:

procedure ConvexLayers
    set S: all foreground pixels
    begin
        while S not empty
            CH = ConvexHull(S);
            S = S \ CH;
        end
    end

The above algorithm shows that convex layers are generated recursively, until the point set S is empty. To compute a convex hull on a point set, various known algorithms can be used [9], [10], [11], [12]. Those algorithms differ in their time complexity and implementation. Best results are achieved by Chazelle's algorithm [12], which attains a time complexity proportional to O(n log(n)).
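A compact Python equivalent of this procedure is sketched below, using SciPy's convex hull routine. Note that Qhull counts only hull vertices, so collinear boundary points are handled slightly differently than in the authors' grid-based implementation.

import numpy as np
from scipy.spatial import ConvexHull, QhullError

def convex_layers(points):
    # peel convex hulls off the foreground points until none are left;
    # returns the number of hull vertices per layer (the feature vector below)
    pts = np.asarray(points, dtype=float)
    sizes = []
    while len(pts) >= 3:
        try:
            hull = ConvexHull(pts)
        except QhullError:            # remaining points are degenerate (collinear)
            break
        sizes.append(len(hull.vertices))
        pts = np.delete(pts, hull.vertices, axis=0)
    if len(pts):
        sizes.append(len(pts))        # final degenerate layer
    return sizes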
Fig. 1. Example image
Fig. 3 shows that any point can take part in only one convex hull. Additionally, the upper and lower rows (resp. columns) of the current set S (represented as a matrix) lie on a convex hull. So, the maximum number of convex hulls in an image is ⌈max(p, q)/2⌉, where p and q are the image dimensions. Fig. 1 and Fig. 2 depict an example greyscale and monochromatic image from which the convex layers are generated and shown in Fig. 3. All the points in Fig. 3 lie on a grid. For better comprehension, Fig. 1 was resized to 32×32 pixels, because larger convex layers become hard to follow visually. The number of convex hulls in our example is 11.
Fig. 2. Monochromatic image for the example image from Fig. 1
Fig. 3. Convex layers for the example image from Fig. 2
2.2 Similarity Measure
Convex layers already define a feature vector that describes the content of an image. A direct comparison of such a two-dimensional feature is computationally complex, so a reduction of dimensions is performed first. We tested various measures to reduce feature dimensionality, such as the number of hulls or the length of a hull. We found that the most significant information is given by the number of vertices on individual convex hulls. Therefore, we introduced a feature vector whose elements correspond to the numbers of vertices on the consecutive convex layers. Fig. 4 shows the feature vector obtained from the convex layers in Fig. 3. An undesired decreasing tendency of the feature vectors is evident from Fig. 4. This decrease is not caused by the individual features of an image, but is intrinsic, because the inner layers are always "shorter" than the outer convex layers. It actually disturbs the comparison of images' individual characteristics and must be
removed. By constructing the regression line of the feature vector, a means is given to eliminate the effect of the intrinsically different sizes of convex layers, i.e. to detrend the feature vectors. The regression line t = [t_1, t_2, ..., t_L] is subtracted from the feature vector x = [x_1, x_2, ..., x_L]:

x' = x - t ,    (1)

giving a detrended version of features x'.
Fig. 4. Feature vector, regression line and detrended features for the example image in Fig. 1
A regression line component t_i is defined as

t_i = \frac{L + 1 - i}{L} \cdot b ,    (2)

where L stands for the length of x and b for the regression coefficient. This coefficient is calculated as follows:

b = \frac{\sum_{i=1}^{L} (x_i - \bar{x}) \cdot \left(i - \frac{L+1}{2}\right)}{\sum_{i=1}^{L} \left(i - \frac{L+1}{2}\right)^2} ,    (3)

where x_i stands for the i-th component of x and \bar{x} for the mean of x. The similarity of two images can be measured by the correlation coefficient of two detrended feature vectors x and y as:

d(x, y) = \frac{\sum_{i=1}^{L} (x_i - \bar{x}) \cdot (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{L} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{L} (y_i - \bar{y})^2}} .    (4)

In general, the lengths of vectors x and y are different. To obtain the same vector length, the shorter one is padded by zeros.
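The detrending and similarity computation of equations (1)-(4) can be sketched as follows. The text does not fully specify whether padding is applied before or after detrending, so the order chosen here (pad, then detrend) is an assumption.

import numpy as np

def detrend(x):
    # equations (2) and (3): regression coefficient b and regression line t,
    # subtracted from x as in equation (1)
    x = np.asarray(x, dtype=float)
    L = len(x)
    i = np.arange(1, L + 1)
    c = i - (L + 1) / 2
    b = np.sum((x - x.mean()) * c) / np.sum(c ** 2)
    t = (L + 1 - i) / L * b
    return x - t

def similarity(x, y):
    # zero-pad the shorter vector, then apply the correlation of equation (4)
    L = max(len(x), len(y))
    x = detrend(np.pad(np.asarray(x, dtype=float), (0, L - len(x))))
    y = detrend(np.pad(np.asarray(y, dtype=float), (0, L - len(y))))
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))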
3 Learning
So far, we explained how to test a pair of images for similarity by using correlation coefficients. The correlation coefficient always returns a value in the interval [-1, 1]. But, of course, the similarity threshold value is not completely general, and it also depends on the reference image contents. There are also drawbacks of convex layers, which cannot cope well with rotated and interpolated image contents.
Fig. 5. User interface for supervised learning
The correlation-coefficient-based thresholding is, therefore, not precise enough and can certainly be refined by more sophisticated approaches. We decided to use a machine learning algorithm, where a model of images similar to the reference one is learned and, afterwards, implemented in a refined similarity search. This means that we introduce a two-step procedure: first, the search in a large database is done using our fast convex-layer approach and the correlation coefficient thresholding, whereas the continuation is supervised by user as follows. A graphical user interface, as depicted in Fig. 5, offers three sets of most similar images, border cases, and most different images with respect to the reference one. The choice of sets is made automatically by our correlation-coefficient-based algorithm and the feature vectors. The user’s task now is to indicate all the displayed similar and different images and, thus, create a set of positive and negative learning examples. Of course, the user is displayed the original images, but his or her on-screen selection
enters the learning set as examples in the form of feature vectors. This learning set is further used for the SVM learning and classification [15], [16]. Fig. 5 may also be understood as an illustrative example of images that enter the three sets with different degree of similarity measured by the correlation coefficient. The group of images most similar to the reference one is formed according to the predefined number of highest correlation coefficients, while the group of the most different images exhibits the lowest values of correlation coefficients. The border set consists of a certain number of images with correlation coefficients just below or just above a preselected threshold correlation value. The number of images in any of the three sets and the threshold correlation value are defined by the user for every application run separately. The user has to go through all the displayed images and mark all similar and different images, respecting their own perception and decision. We have also studied the influence of the threshold correlation value as it is set for automated selection of border cases. It is obvious that the learning set can increase if new learning examples are added at different threshold levels. It is also expected that, by increasing the size of the learning set, the accuracy of the SVM learning and classification increases.
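A sketch of this learning step is given below, with randomly generated stand-ins for the feature vectors that the user would mark on the screen of Fig. 5. The class sizes and feature length are arbitrary assumptions; the linear kernel follows the choice reported in Section 4.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 0.3, (40, 20))    # stand-ins for images marked "similar"
X_neg = rng.normal(0.0, 0.3, (160, 20))   # stand-ins for images marked "different"

X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])

svm = SVC(kernel='linear')                # the linear kernel reported in Section 4
svm.fit(X, y)

database = rng.normal(0.5, 0.5, (1000, 20))        # stand-in feature database
similar_idx = np.flatnonzero(svm.predict(database) == 1)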
4 Results

The proposed method is suitable for searching large databases of images. A good example of a large set of similar and different images is a movie: the frames are expected to be similar within the same scene and different between scenes. However, even the frames within the same scene are not identical, because either the camera or the objects move, and some noise is added during image processing. The movement of an object or the camera results in translation or scaling of the object, while noise is introduced by lossy compression algorithms whenever they are applied. To create a database of images, the movie "Star Wreck: In the Pirkinning" [17], published under a Creative Commons licence [18], was chosen. The movie comprises 154892 frames with a resolution of 640 × 272. If the movie is extracted at the best JPEG quality, the total size of its images grows to 10.6 GB. The size of the extracted images is enormous, so a direct search in such a database would be very time consuming. It is therefore reasonable to create a database of features, as explained in the previous sections. We generated the convex-layer-based feature vectors for all images and stored them together with their labels and sizes. For this particular movie, the size of all feature vectors is 68 MB. Because of the learning process in the second step of our approach (see Section 3), the whole database of original images is also kept. The feature vector database has to be created only once. As the feature extraction for an image takes about 23 milliseconds on average, the complete process for the selected movie takes 3528 seconds. The times were measured on a Pentium Core 2 processor with a 2.18 GHz clock frequency and 2 GB of memory, running under Linux. All code was written in Matlab, except the convex layer routine, which was coded in C. First of all, we are interested in the sensitivity and specificity of the proposed algorithm. We found that the sensitivity was strongly dependent on the learning set
Fig. 6. Sensitivity and specificity of the proposed algorithm versus the learning set size
size. The larger the set, the larger the number of similar images recognized. Fig. 6 shows that an initial learning set of 40 feature vectors leads to only half of the similar images in the processed movie being recognized. When the learning set is increased to 320, the probability of proper recognition increases to 0.83. The learning set size does not affect the specificity, which is 0.99 on average. All experiments were made using the linear SVM kernel. The database searches are fast in both proposed steps, i.e. with the correlation coefficients and with the SVM classification. The average search time per image feature vector is 1.8 microseconds for the correlation and 15 microseconds for the SVM. This means that the tested database of Star Wreck movie features was scanned completely in 279 milliseconds by the correlation approach and in 2.3 seconds by the SVM.
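The whole two-step scan can be summarised in a few lines. The sketch below is illustrative only: it assumes the feature vectors are stored zero-padded to a common length in a single array, reuses the similarity() function sketched in Section 2, and swaps in scikit-learn's LinearSVC for the SVM implementation of [15], [16].

```python
import numpy as np
from sklearn.svm import LinearSVC

def two_stage_search(features, reference, X_labelled, y_labelled):
    # Step 1: fast correlation prescreen of every stored feature
    # vector against the reference image (Eq. (4)).
    scores = np.array([similarity(reference, f) for f in features])
    # Step 2: refined scan with a linear SVM trained on the
    # user-labelled examples (labels in {-1, +1}).
    svm = LinearSVC().fit(X_labelled, y_labelled)
    refined = np.where(svm.predict(features) == 1)[0]
    return np.argsort(-scores), refined   # correlation ranking, SVM hits
```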
5 Conclusion

In this paper, a novel method for searching for similar images in large databases is described. Every image is assigned a unique feature vector extracted from its convex layers, which are based on the image foreground pixels. The elements of the vectors represent the numbers of vertices on the corresponding layers. A simple algorithm for convex layers is presented, exploiting the fact that convex layers are an extension of the convex hull construction. The obtained feature vectors can be compared very quickly by computing their correlation coefficients, which enables fast automated searches for reference images in large image databases. However, the correlation-coefficient-based approach is not very accurate. We refined it with an additional intelligent step based on SVM. A fast correlation-based search extracts the most similar and most dissimilar images and shows them to the user, whose task is to mark the images that, according to his or her perception, can be considered similar. In this way, a learning set of positive and negative examples, i.e. feature vectors, is gathered. This set is used for SVM learning. The resulting optimal SVM weights are applied in the refined second database search: all the stored feature vectors are tested by the trained SVM in order to mine the images most similar to the reference one.
An application to create learning sets for SVMs was developed. It offers the user three sets of database images whose feature vectors produced the highest, border, and lowest correlation coefficients when compared to the features of a reference image. The learning set selected by the user can be extended in subsequent searches, which leads to better sensitivity. Considering the speeds of the feature database scans, the SVM learning times and the duration of the user's on-screen image selection, the whole procedure takes a few tens of seconds. Combined with the obtained sensitivity of 83%, and possibly more, the proposed image similarity approach proves worthwhile for further investigation and development.
References

1. Hobrough, G.L.: Automatic stereo plotting. Photogrammetric Engineering & Remote Sensing 25(5), 763–769 (1959)
2. Hannah, M.J.: A system for digital stereo image matching. Photogrammetric Engineering & Remote Sensing 55(12), 1765–1770 (1989)
3. Viitaniemi, V., Laaksonen, J.: Keyword-detection approach to automatic image annotation. In: Proceedings of the 2nd European Workshop on the Integration of Knowledge, London, UK, pp. 15–22 (2005)
4. Google Image Search, http://images.google.si
5. Content based Visual Image Search Engine, http://www.tiltomo.com
6. Grauman, K., Darrell, T.: Efficient Image Matching with Distributions of Local Invariant Features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 627–634. IEEE Press, Los Alamitos (2005)
7. Qamra, A., Meng, Y., Chang, E.Y.: Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3), 379–391 (2005)
8. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. Chapman & Hall, London (1994)
9. Graham, R.L.: An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters 1(4), 132–133 (1972)
10. Andrew, A.M.: Another Efficient Algorithm for Convex Hulls in Two Dimensions. Information Processing Letters 9(5), 216–219 (1979)
11. Jarvis, R.A.: On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters 2(1), 18–21 (1973)
12. Chazelle, B.: On the convex layers of a point set. IEEE Transactions on Information Theory 31(4), 509–517 (1985)
13. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer, New York (1985)
14. Bewick, V., Cheek, L., Ball, J.: Statistics review 7: Correlation and regression. Critical Care 7(6), 451–459 (2003)
15. Lenič, M., Cigale, B., Potočnik, B., Zazula, D.: Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets. In: IIMSS 2008 (submitted, 2008)
16. Berthold, M., Hand, D.J.: Intelligent Data Analysis. Springer, Berlin (2003)
17. The film Star Wreck: In the Pirkinning, http://www.starwreck.com/
18. Creative Commons license, http://creativecommons.org/
Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets Mitja Lenič, Boris Cigale, Božidar Potočnik, and Damjan Zazula University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia {mitja.lenic,boris.cigale,bozo.potocnik,damjan.zazula}@uni-mb.si
Abstract. Ovarian ultrasound imaging has recently drawn attention because of improved ultrasound-based diagnostic methods and because of its application to in-vitro fertilisation and the prediction of women's fertility. Modern ultrasound devices enable frequent examinations and sophisticated built-in image processing options. However, the precise detection of different ovarian structures, in particular follicles and their growth, still needs additional, mainly off-line processing with highly specialised algorithms. Manual annotation of a whole 3D ultrasound volume consisting of 100 or more slices, i.e. 2D ultrasound images, is a tedious task even when using handy, computer-assisted segmentation tools. Our paper reveals how an application of support vector machines (SVM) can ease follicle detection by speeding up the learning and annotation processes at the same time. An iterative SVM approach is introduced, using training on sparse learning sets only. The recognised follicles are compared to the referential expert readings and to the results obtained after learning on the entire annotated 3D ovarian volume. Keywords: Medical image segmentation, Ultrasound imaging, Ovarian follicles, Support vector machines (SVM), Iterative SVM, Fast learning, Sparse learning sets.
1 Introduction

Ovarian ultrasound imaging has recently drawn attention for several reasons. Its importance grows both because of the improved ultrasound-based diagnostic methods and because of its application to in-vitro fertilisation (IVF) and the prediction of women's fertility. However, successful computer-assisted approaches to ovarian follicle analysis are still rare (see [1, 2] and references therein). Most of those analyses focus on 2D ultrasound scans (also referred to as B-mode scans) and measure specific properties of ovarian follicles (e.g. follicle diameter and area). In some cases, simple computational intelligence accompanies the integrated prior knowledge about follicles and about the formation process of ovarian ultrasound images. Only a few recent developments utilize machine learning capabilities in conjunction with either artificial neural networks or other optimised classifiers. In the following introductory paragraphs, a short overview is given of the development of some of those approaches. Muzzolini et al. [1, 5] used a split-and-merge segmentation approach with a texture-based measure to segment 2D ultrasound images of ovarian follicles. The split and merge operations were controlled by means of a simulated annealing algorithm
(Metropolis). The algorithm's efficiency was assessed on several ovarian ultrasound images with single follicles. The mislabelling error (i.e. the percentage of misclassified pixels) was around 30% for the original images and around 1% if the pixels' grey-level values were stretched to a predefined interval before the images were processed. Sarty et al. [6] reported a semi-automated knowledge-based approach for detecting the inner and outer follicle walls. A cost function with integrated prior knowledge is minimised by using heuristic graph searching techniques to detect the walls. 31 ultrasound images were analysed. The Euclidean distance between computer-segmented and observer-annotated follicle boundaries was 1.47 mm on average, with a 0.83 mm standard deviation. A similar semi-automated approach was reported by Krivanek et al. [7]. An automatic approximation of the inner follicle walls using watershed segmentation on smoothed images is performed first. Then, binary mathematical morphology is employed to separate some merged small adjacent follicles. Finally, a knowledge-graph searching method is applied to detect the inner and outer follicle walls. This approach was applied to 36 ovarian ultrasound images. The Euclidean distance between computer-segmented and observer-annotated follicle boundaries was around 1.64 mm on average, with a 0.92 mm standard deviation. Our first attempt was based on optimal thresholding applied to a coarsely estimated ovary detected by observing the density and spatial distribution of edge pixels [8]. A test on 20 ovarian ultrasound images with 768×576 pixels yielded a recognition rate, defined as the ratio between the number of correctly identified follicles and the number of all follicles in the images, of around 62%. Considering only the dominant follicles, the recognition rate was 89%. The average misidentification rate, defined as the ratio between the number of correctly identified follicles and the number of all computer-segmented regions, was around 47%. Our most mature classical approach for automatic follicle segmentation in 2D ultrasound images is a three-step knowledge-based algorithm [1, 2]. Firstly, seed points (i.e. follicle centres) are found by a combination of watershed segmentation and several thresholding operations on pre-filtered images. Secondly, accurate follicle boundary detection is performed by region growing from the seed points. The final step uses prior knowledge in order to eliminate detected non-follicle regions. This algorithm was tested on 50 randomly selected cases from an ovarian ultrasound image database. The image dimensions were 640×480 pixels. The obtained recognition rate was around 78%, while the average misidentification rate was around 29%. The reported Euclidean distance between segmented and correct follicle boundaries was 1.1 mm on average, with a 0.4 mm standard deviation. All the abovementioned methods operate in 2D, which mimics clinical practice. A much greater level of computational intelligence can be brought into follicle detection algorithms if they deal with a sequence of 2D ovarian ultrasound images acquired with a classical B-mode ultrascanner during an examination, or even with 3D ovarian ultrasound data (e.g. see [1, 9, 10]). In both cases, the processing of a sequence of 2D cross-sections through the ovary is applicable.
Instead of focusing just on follicle segmentation in a single 2D image, it is much more advantageous to consider and integrate the information obtained from the analysis of vicinal images in a sequence. We proposed such solutions based on Kalman filter theory [3, 4]. 2D ovarian ultrasound recordings have also been processed by cellular automata in [11] and by cellular neural networks (CNN) in [14] and [15]. The CNN templates were
trained on a learning set of 4 images [15] randomly selected from a database of 1500 sampled images. A genetic learning algorithm (GA) and simulated annealing (SA) were applied. A testing set consisted of 28 images [15], again randomly chosen from the same database. To recognize both the dominant and the smaller follicles, a grid of 3 CNNs was proposed [15]: a rough follicle estimation was done first, the expressive estimates were then expanded by the second CNN and, finally, delineated by the area of the ovary as detected by the third CNN. Two learning sets were actually necessary for this reason: one with annotated follicles, and another with annotated ovaries. In the 28 images of the testing set, 168 follicles were annotated. The proposed detection algorithm recognized 81 regions, of which 63 belonged to follicles. The main disadvantage of the learning processes used is their very slow convergence, which takes at least a few hours. Slow learning with GA and SA makes the approach impractical if large learning sets must be applied, and this certainly is the case with 3D imaging. If learning is not accomplished with a representative set of examples, the obtained recognition rates are rather low. There is no way to speed up GA or SA significantly, so another learning approach is necessary. A variety of optimised classifiers exist that run fast even with large data sets and give optimum classification according to the selected criterion. One of them is the well-known SVM approach. Its computational structure is very similar to the CNN model, which inspired us to merge both principles [16]. This led to a new way of forming the CNN templates based on SVM optimisation. The learning procedure shortened drastically, as it does not take more than a few minutes. The recognition rate for the tested ultrasound recordings of ovarian follicles also increased slightly: the proposed detection algorithm recognized 113 regions, of which 97 belonged to follicles (168 follicles were annotated in the 28 images of the testing set). Although the SVM-based learning proved to be a few hundred times faster than GA or SA, an important drawback still remains: a representative and statistically large enough learning set of annotated follicles (and ovaries) must be provided first, which means a lot of routine work for an expert. If the training data are too few, a satisfactory recognition rate cannot be obtained. However, the idea of combining CNNs with SVM introduced an iterative application of SVM. This suggests that learning can also be done in several steps, so that only a few quick and sparse (limited) annotations are made by the user at the beginning, which serve for the first SVM optimisation and recognition. The user is presented the obtained results on-line to supervise the recognised regions and to mark the most evident false positives and negatives. The on-line processing becomes feasible because there is only a small, limited learning set, and the speed of learning is boosted by the SVM-based training. The marked false positives and negatives are taken as the new learning samples, and the next recognition iteration is performed. As the quality is supervised by the user, he or she can stop as soon as satisfactory outcomes are obtained.
Our novel approach to the recognition of ultrasound images is fast and user-friendly, incorporates a combination of user and machine intelligence, and places a much lighter burden on the user than the need to annotate the entire ultrasound recording (possibly a 3D volume with 100 or more images). We describe its implementation in the following sections.
2 Linear Classification Using Support Vector Machines

SVMs [13] solve classification problems by determining a decision function which separates samples from different classes with the highest sensitivity and specificity. The approach is known for its noise resistance [12]. A learning set must be given with positive and negative instances for each class. In the case of ovarian ultrasound follicles, we deal with two classes; one belongs to the pixels constituting the follicles, while the other stands for the background. Hence, we face a two-class case where the SVM training can be accomplished by giving some positive instances belonging to the follicle structures and some negative instances belonging to the background. In a two-class case, the hyperplane separating positive from negative instances can be defined as

$$ \mathbf{w}^{T} \mathbf{x}_i + b = 0 , \qquad (1) $$
where w describes the hyperplane (weights), x_i stands for the column vector of instance features, and b for the distance from the origin to the hyperplane. The hyperplane is fully described by the pair (w, b). Each instance x_i must be given a classification (target) value y_i ∈ {−1, 1}, where the value 1 designates positive and −1 negative examples. These values correspond to the annotations provided in the learning set. In linearly separable cases, a classification parameter γ can be calculated as follows [17]:

$$ \gamma_i = y_i \left( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right) , \qquad (2) $$
where ⟨·,·⟩ denotes the scalar product. If γ_i > 0, then the classification of x_i was correct. The learning phase of the SVM simply looks for the hyperplane with the largest possible margin, i.e. the hyperplane whose distance to both the positive and negative examples is maximal. The optimum can be obtained by using Lagrange multipliers α_i ≥ 0. Classification results may often be improved if the multipliers α_i are also limited by an upper bound c, so that 0 ≤ α_i ≤ c, ∀i. SVM learning results in the set of Lagrange multipliers α_i and the hyperplane offset b. The weights w are obtained in the following way:

$$ \mathbf{w} = \sum_{i=1}^{m} y_i \alpha_i \mathbf{x}_i . \qquad (3) $$
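The following scikit-learn sketch illustrates Eq. (3) on synthetic data: for a linear kernel, the fitted model exposes the products y_i α_i of its support vectors, from which w can be recovered directly. The data and the choice of library are ours; only the relation w = Σ y_i α_i x_i comes from the text.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative random data: one vectorized 5x5x5 neighbourhood per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 125))
y = np.where(X.mean(axis=1) > 0, 1, -1)        # target values y_i

svm = SVC(kernel="linear", C=0.001).fit(X, y)  # 0 <= alpha_i <= c

# Eq. (3): w = sum_i y_i * alpha_i * x_i. For the support vectors,
# scikit-learn stores the products y_i * alpha_i in dual_coef_.
w = svm.dual_coef_ @ svm.support_vectors_
b = svm.intercept_
assert np.allclose(w, svm.coef_)               # same hyperplane
```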
We limit the learning process to the linearly separable case, since the resulting weights can then be applied directly as an image operator on the 3D volume, usually utilizing hardware-accelerated operations, which makes classification very fast compared to the case where a nonlinear kernel is introduced.

2.1 SVM for Ovarian Ultrasound Imaging
Consider a 2D ultrasound image of ovarian follicles first. A number of pixels belong to the follicular regions, while the others mirror the background and different ovarian structures. Only the pixels belonging to the follicles are treated as positive instances
(y_i = 1) in the learning process, while the background pixels are negative instances (y_i = −1). Remaining at the single-pixel level, it is quite clear that some parts of the follicles may have grey-levels very similar to the background and vice versa. Such a real situation is usually far from linear separability, especially if other tissues and blood vessels appear in the ultrasound recording, which is, additionally, always corrupted by a high level of speckle noise. This is why an SVM would perform poorly using only the information from single pixels. Feature vectors x_i must, therefore, involve several pixels, most preferably a certain neighbourhood centred over every observed pixel to be classified. Pixel values from the selected neighbourhood are vectorized along the rows and columns in order to obtain the feature vectors x_i. The overlap of the joint distributions of the neighbourhoods belonging to the follicles and those from the background depends on the types of neighbourhoods chosen and the contents of the images, i.e. the distributions of single-pixel grey-levels. In general, separability improves when larger neighbourhoods are taken. However, the larger the neighbourhoods, the greater the risk of fusing different vicinal regions. A compromise is therefore usually sought empirically. After the learning phase has completed successfully, the SVM-generated weights w and b are ready to be employed in the classification process. For any unknown feature vector ξ_i, the expression

$$ \langle \mathbf{w}, \boldsymbol{\xi}_i \rangle + b > 1 \qquad (4) $$

indicates that it belongs to the class of positive examples, which, in our ultrasound image recognition, means the class of follicles. This is equivalent to the interpretation with the saturated SVM output function (compare to CNNs in [15]). The explained approach does not change much when 3D volumes are processed. Again, a learning set of positive and negative examples is constructed by annotating the voxels inside 3D ovarian follicles as positive (y_i = 1) and all others as negative examples (y_i = −1). 3D neighbourhoods are selected and vectorized into feature vectors x_i. Everything else is just the same as in the 2D case.

2.2 Simplification of Image Annotation for the Purpose of SVM Learning
Although some very handy software tools exist for manual image segmentation, such as ITK-SNAP [18], the annotation of even a single 2D recording may be very cumbersome. A keen eye perceives follicular boundaries quickly if the dominant follicles are in question, but smaller and less expressive ones would puzzle even experts. Finally, precisely encircling a follicle on the screen with the computer mouse is not always trivial. Our goal was to derive a procedure which would be fast enough to run in real time and would need as little user interaction as possible. Yet, the SVM learning process must be given a proper learning set, and this set must be contributed by an expert. Initially, this expert is expected to give just a hint of what he or she considers positive and negative instances. The SVM optimises the weights (w, b) and processes the entire input image or 3D volume. The recognised regions of interest, i.e. follicles, are displayed projected onto the original image or volume slices. The most evident discrepancies are outlined by the user again, and those instances enter the second SVM
iteration as an additional, refined learning set. The user can stop after any iteration and save the most appropriate weights. Denote the initial annotated examples by x_{0,i}, i = 0, …, N−1. They comprise the initial learning set S_0 = [x_{0,0}, x_{0,1}, …, x_{0,N−1}]. After the first learning phase, the weights (w_0, b_0) are obtained for the first recognition. Evident mismatches guide the annotator in building the second learning set S_1, which is the set S_0 appended with new examples x_{1,i}, i = 0, …, M−1, suggested by the outcomes of the first iteration. This yields the second version of the weights (w_1, b_1) and a second, improved recognition of follicles. The described procedure repeats in subsequent iterations.
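A minimal sketch of the iterative procedure follows, in Python with scikit-learn standing in for the SVM implementation actually used (libSVM under Matlab, see Section 3.1). The annotate callback is a placeholder for the expert's on-screen selection, not a real API; the neighbourhood size and the bound c follow the values reported later.

```python
import numpy as np
from sklearn.svm import SVC

def vectorize_neighbourhoods(volume, coords, r=2):
    # One feature vector x_i per voxel: the (2r+1)^3 neighbourhood
    # centred at (z, y, x), vectorized along rows and columns.
    return np.array([volume[z-r:z+r+1, y-r:y+r+1, x-r:x+r+1].ravel()
                     for z, y, x in coords])

def iterative_svm(volume, annotate, max_iters=8, c=0.001, r=2):
    # 'annotate' stands in for the expert: given the current
    # recognition (or None), it returns voxel coordinates and their
    # {-1, +1} labels (both classes must appear) as the next sparse
    # learning examples.
    X_parts, y_parts, recognised, svm = [], [], None, None
    for _ in range(max_iters):
        coords, labels = annotate(recognised)
        X_parts.append(vectorize_neighbourhoods(volume, coords, r))
        y_parts.append(labels)
        svm = SVC(kernel="linear", C=c).fit(np.vstack(X_parts),
                                            np.concatenate(y_parts))
        # Classify every interior voxel with the current weights
        # (in practice done as a fast linear filter over the volume).
        grid = np.mgrid[r:volume.shape[0]-r,
                        r:volume.shape[1]-r,
                        r:volume.shape[2]-r].reshape(3, -1).T
        preds = svm.predict(vectorize_neighbourhoods(volume, grid, r))
        recognised = preds.reshape(volume.shape[0] - 2*r,
                                   volume.shape[1] - 2*r,
                                   volume.shape[2] - 2*r)
    return svm, recognised
```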
3 Experiments and Results

The images used in our experiments were extracted from 3D ultrasound volumes of women's ovaries acquired at the University Clinical Centre of Maribor using a Voluson 730 ultrasound device. The images of 7 patients in different stages of the menstrual cycle were used. All the acquired volumes were first transformed from the polar to the Cartesian coordinate system. The volumes were sliced into separate 2D images along the X, Y and Z axes. Thus, 3117 2D images were obtained from the 7 volumes. All the images contain other structures with characteristics similar to follicles, such as veins and intestines, perceived as dark and monotone regions. Those regions were considered part of the background. No speckle noise or other disturbances were removed prior to image analysis either. For the purpose of the experiment described in this paper, we focused on one volume, labelled XX2. The spatial resolution of the volume was 149×103×121 voxels and its contrast (grey-level) resolution was 8 bits. Our main goal was to observe the SVM-based recognition accuracy and the complexity of the procedure in two different approaches. The experiment was, therefore, prepared in two variants:

• First experiment: The entire volume was considered for learning and testing. It was annotated by an expert in the sense that all the follicles perceived by the human eye were outlined. Then the volume was undersampled by a factor of 125, in order to keep the number of learning instances and the computational complexity of the SVM learning phase lower (12667 instances). The undersampled voxels were taken as centres of 5×5×5 (undersampled) neighbourhoods. All the neighbourhoods were vectorized and labelled as positive or negative examples, x_{0,i}, according to the central voxel position either inside or outside the follicles. The entire learning set was used for a one-step SVM learning procedure, as described at the beginning of Section 2.1. The obtained weights were employed for classification, as described in Section 2, and the recognition rate for the entire volume was verified.

• Second experiment: The same volume was taken, but no a priori annotation was taken into account. We followed the procedure proposed in Section 2.2. An expert was asked to browse the volume slices on the screen and to select the most significant follicular region and the most significant part of the background on any of the slices. A quick annotation of the two selected regions followed. The volume was presented in its full spatial resolution. The learning examples were formed from 5×5×5 neighbourhoods again, but the learning set was much smaller owing
to the fact that only one expressive follicle and one background area of approximately the same size were annotated. This set was used in the SVM training without undersampling. The obtained weights were then applied to the classification of the entire volume. The expert was shown individual slices displaying both the original ultrasound image and the recognized regions, presumably follicles, overlapped in a semi-transparent mode. Browsing through the slices, he located the most evident misrecognitions and annotated a region of positive and a region of negative instances. At the same time, the system reported the current classification rate on the voxel basis. This rate is expected to grow through successive iterations. When the growth stops, this is a clear sign that no further improvement can be expected. Of course, if the rate is satisfactory even sooner, there is no reason to go on with additional iterations. Finally, the recognition results for the entire volume were verified and also compared to the results obtained after learning on the entirely annotated ultrasound volume. Verification of the accuracy of the recognition results was performed using two different measures. The first one was based on the percentage of properly classified voxels. This is a measure of the algorithm's learning and classification capabilities, but in the case of image recognition it does not give relevant information on the recognition success for individual objects, such as the ovarian follicles in our experiments. Hence, we resorted to the region-based measures proposed in [2] and [3]. Two ratios were introduced: the percentage of the intersection area between the annotated and recognised regions with respect to the annotated area, ρ1 = |A ∩ R| / |A|, and the percentage of the intersection area with respect to the recognised area, ρ2 = |A ∩ R| / |R|, where A denotes the annotated and R the recognised region. The closer the values of ρ1 and ρ2 are to 1, the better the recognition accuracy. It has to be stressed that, when measuring the recognition rate with planar or spatial measures such as ρ1 and ρ2, even a small misalignment of the two compared areas reduces their mutual overlap quadratically. With volumes it is even more pronounced: misalignments of volumes cause a cubic decrease of the joint volume. A practical interpretation for follicles would be as follows: even if the values of ρ1 and ρ2 are as low as 0.5, or in particular cases even lower, visual inspection shows that the annotated and recognised regions mostly cover each other, so that there can be no doubt they indicate the same phenomenon.

3.1 Results of the First Experiment
As mentioned above, our intention in this experiment was to observe the SVM capabilities, its learning speed, and the effort to be put into the annotation and processing of an entire ultrasound volume with ovarian follicles and a resolution of 149×103×121 voxels. The volume was undersampled by a factor of 5 in each dimension, and an expert annotated and cross-checked all the slices in all three dimensions. It took him about 5 hours of very tedious and tough work. Then, positive and negative instances were generated automatically in 5×5×5 (undersampled) voxel neighbourhoods. Their number totalled 12666. This set was used in the SVM learning phase, where the parameter c was set to 0.001. It took 148.53 seconds to complete the learning on an average personal computer (AMD Athlon
3200+, 2 GB RAM). All our algorithms were implemented in Matlab, and the native implementation of the libSVM library was used for SVM learning. The SVM-based recognition results were as follows: the recognition accuracy on the voxel basis was 96.83%, while the follicle recognition rate was 50.00% when measured with the ρ1 and ρ2 thresholds both set to 0.2. A follicle was considered recognised when ρ1 and ρ2 were fulfilled for it at least in the slice with its central (largest) cross-section in any of the three spatial directions (verified manually). These results are shown in Table 1, where they can be compared with the results of the second experiment. Fig. 1 shows some typical examples of the annotated and correspondingly recognised follicles.
Fig. 1. Typical result on an ultrasound volume slice: (a) original ultrasound slice, (b) manual annotation and (c) recognized by SVM with the undersampled learning set
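The region-based measures are straightforward to state in code. A small sketch for binary masks (the function name is ours):

```python
import numpy as np

def region_overlap(annotated, recognised):
    # rho1: intersection relative to the annotated region A;
    # rho2: intersection relative to the recognised region R.
    inter = np.logical_and(annotated, recognised).sum()
    return inter / annotated.sum(), inter / recognised.sum()
```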
3.2 Results of the Second Experiment
This experiment followed the idea presented in Section 2.2. Our main interest was in finding out whether a supervised learning approach using SVM can run in real time and whether it can give results comparable to more exhaustive learning approaches. The same ultrasound volume was processed as in the first experiment, although no undersampling was needed here. The same sizes of voxel neighbourhoods were considered. The initial learning set, chosen by the expert on an expressive follicle and from a typical background region, contained 2000 instances on average. The learning set size increased through subsequent iterations by 60 instances on average. Browsing through the volume slices on the screen and selecting two new regions as refined positive and negative instances was not very time consuming: it took about 3 seconds in each step, on average. The SVM learning phase was extremely fast, and recognition improved slightly through subsequent iterations. The obtained voxel assignment accuracies in each iteration step are shown in Table 1. It can be observed that the voxel classification accuracy, as well as the true positive and true negative rates, is quite high and even improves in the first few steps, when additional positive and negative learning instances are added to the learning set. The voxel classification rates are even slightly higher than those of the SVM with the undersampled learning set. On the other hand, when the follicle classification rate is taken into account, it is not as high as the voxel classification rates would imply. The problem can be observed in Fig. 2, where some typical images of the annotated follicles and the follicles recognised by the undersampled SVM and the sparse-learning SVM are shown. Some follicles are merged at the top of the image, thus counting only as a
Fig. 2. Typical segmentation result on an ultrasound volume slice: (a) manual annotation, (b) recognized by SVM with the undersampled training set, (c) recognized by SVM with the sparse training set

Table 1. Overview of segmentation results: iteration (It), training time, voxel classification accuracy (ACC), voxel true positive rate (TPR), voxel true negative rate (TNR), follicle recognition rate (RR), follicle recognition rate with ρ1 and ρ2 set to 0.2 (RR ρ), misclassification rate (MR) and number of instances in the training set (INST)

It   Time (s)   ACC (%)   TPR (%)   TNR (%)   RR (%)   RR ρ (%)   MR (%)   INST
Undersampled full learning set:
/    148.53     96.83     82.04     98.59     52.94    50.00      14.29    12666
Sparse learning set:
1    0.34       85.92     98.70     84.40      8.82     5.88      62.50     2198
2    1.72       96.78     83.69     98.35     47.06    47.06      15.79     2280
3    2.18       96.81     84.18     98.31     55.88    52.94      13.64     2354
4    2.6        96.85     84.68     98.30     52.94    52.94      14.29     2407
5    3.16       96.88     85.89     98.19     47.06    47.06      15.79     2446
6    3.05       96.88     86.39     98.13     47.06    47.06      15.79     2474
7    3.46       96.88     86.80     98.09     47.06    47.06      11.11     2495
8    3.36       96.87     87.20     98.02     47.06    47.06      11.11     2512
single follicle recognition, although the region covering all the merged follicles is identified. The SVM was not able to distinguish between the background and follicles whose learning instances were practically identical but carried two different outcomes, one belonging to a follicle and the other to the background.
4 Conclusion

Recognition of ultrasound images still appears to be a delicate problem in general. Non-adaptable approaches cannot achieve satisfactory results. Adaptation, on the other hand, necessitates a certain level of intelligence. We experimented with 3D ovarian recordings and tried to recognise follicles by using SVMs. This implies learning and the problem of selecting and gathering learning sets, the latter being
Fig. 3. 3D segmentation of ultrasound: (a) manual annotation and (b) recognized by SVM with the sparse training set
dependent on experts' knowledge. As the annotation of 3D ultrasound recordings means dealing with 100 or more slices, it becomes hardly feasible in practice, in particular in applications in a clinical environment. Therefore, we introduced a method based on an iterative SVM and supervised learning with sparse learning sets. Building one of these sets amounts to a very simple selection of two image areas, one belonging to positive and another to negative examples. Any new iteration does not demand more effort than the initial one. We have shortened the recognition of ultrasound volumes significantly this way. An important speed-up was achieved, first, by introducing intelligent image processing based on SVM. Learning algorithms known from the field of neural networks, such as GA and SA, run for a few hours at least, and their convergence may still be questionable. SVM-based learning is completed within minutes even with very extensive learning sets, as we have shown in our experiments. The speed-up factor for this improvement exceeds a few hundred. Yet another acceleration comes with the iterative SVM procedure introduced in this paper. Considering the experts' effort to be spent on annotations, a reduction from several hours to a few minutes is encountered again. The follicle recognition rate was not very high, but, as can be observed from Fig. 3, the resulting segmentation after only 5 short iterations is very similar to the expert segmentation and can serve as a good starting point for full annotation, reducing the annotation time by hours. Our recognition system runs practically in real time. The next possible step for improvement is the introduction of ensembles, which would probably improve the classification rate, but might increase the learning time.

Acknowledgments. This research was supported by the Slovenian Ministry of Higher Education, Science and Technology through the programme ''Computer Systems, Methodologies, and Intelligent Services'' (P2-0041). The authors also gratefully acknowledge the indispensable contribution of Prof. Dr. Veljko Vlaisavljević from the University Clinical Centre, Maribor, Slovenia, whose suggestions and imaging material enabled the verification of this research.
References

1. Noble, J.A., Boukerroui, D.: Ultrasound image segmentation: A survey. IEEE Transactions on Medical Imaging 25(8), 987–1010 (2006)
2. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part I: Segmentation of single 2D images. Image and Vision Computing 20(3), 217–225 (2002)
3. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part II: Prediction-based object recognition from a sequence of images. Image and Vision Computing 20(3), 227–235 (2002)
4. Potočnik, B., Zazula, D.: Improved prediction-based ovarian follicle detection from a sequence of ultrasound images. Computer Methods and Programs in Biomedicine 70, 199–213 (2003)
5. Muzzolini, R., Yang, Y.-H., Pierson, R.: Multiresolution texture segmentation with application to diagnostic ultrasound images. IEEE Transactions on Medical Imaging 12(1), 108–123 (1993)
6. Sarty, G.E., Liang, W., Sonka, M., Pierson, R.E.: Semiautomated segmentation of ovarian follicular ultrasound images using a knowledge-based algorithm. Ultrasound in Medicine and Biology 24(1), 27–42 (1998)
7. Krivanek, A., Sonka, M.: Ovarian ultrasound image analysis: follicle segmentation. IEEE Transactions on Medical Imaging 17(6), 935–944 (1998)
8. Potočnik, B., Zazula, D., Korže, D.: Automated computer-assisted detection of follicles in ultrasound images of ovary. Journal of Medical Systems 21(6), 445–457 (1997)
9. Gooding, M.J., Kennedy, S., Noble, J.A.: Volume reconstruction from sparse 3D ultrasonography. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2879, pp. 416–423. Springer, Heidelberg (2003)
10. Romeny, B.M.H., Tirtulaer, B., Kalitzin, S., Scheffer, G., Broekmans, F., Staal, J., Velde, E.: Computer assisted human follicle analysis for fertility prospects with 3D ultrasound. In: Kuba, A., Sámal, M., Todd-Pokropek, A. (eds.) IPMI 1999. LNCS, vol. 1613, pp. 56–69. Springer, Heidelberg (1999)
11. Viher, B., Dobnikar, A., Zazula, D.: Cellular automata and follicle recognition problem and possibilities of using cellular automata for image recognition purposes. International Journal of Medical Informatics 49(2), 231–241 (1998)
12. Pankajakshan, P., Kumar, V.: Detail-preserving image information restoration guided by SVM based noise mapping. Digital Signal Processing 17(3), 561–577 (2007)
13. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
14. Cigale, B., Zazula, D.: Segmentation of ovarian ultrasound images using cellular neural networks. International Journal of Pattern Recognition and Artificial Intelligence 18(4), 563–581 (2004)
15. Zazula, D., Cigale, B.: Intelligent segmentation of ultrasound images using cellular neural networks. In: Artificial Intelligence in Recognition and Classification of Astrophysical and Medical Images, pp. 247–302. Springer, Heidelberg (2007)
16. Cigale, B., Lenič, M., Zazula, D.: Segmentation of ovarian ultrasound images using cellular neural networks trained by support vector machines. Lecture Notes in Computer Science (part 3), pp. 515–522
17. Berthold, M., Hand, D.J.: Intelligent Data Analysis. Springer, Berlin (2003)
18. Yushkevich, P.A., Piven, J., Hazlett, H.C., Smith, R.G., Ho, S., Gee, J.C., Gerig, G.: User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31(3), 1116–1128 (2006)
Fast and Intelligent Determination of Image Segmentation Method Parameters Božidar Potočnik and Mitja Lenič University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
[email protected],
[email protected]
Abstract. An advanced digital image segmentation framework implemented using a service-oriented architecture is presented. The intelligence is incorporated not just in the segmentation method, which is controlled by 11 parameters, but mostly in a routine for easier determination of the parameters' values. Three different approaches are implemented: 1) manual parameter value selection, 2) interactive step-by-step parameter value selection based on visual image content, and 3) fast and intelligent parameter value determination based on machine learning. The intelligence of the second and third approach is introduced by end-users through repeated interaction with our prototype in attempts to correctly segment the structures of interest out of an image. The fast and intelligent parameter determination predicts a new set of parameter values for the image currently being processed, based on knowledge models constructed from previous successful (positive samples) and unsuccessful (negative samples) parameter selections. This approach proved to be very efficient and fast, especially when many positive and negative samples are available in the learning set.
1 Introduction

To design and implement flexible and intelligent machine vision and interactive multimedia systems, the capability of performing the demanding tasks of image processing and pattern recognition becomes of crucial importance. Intelligent image recognition consists of processing algorithms that incorporate a considerable level of adaptability and the ability of learning and inference. Segmentation is one of the most important tasks in digital image processing. With its main aim to divide images into regions of interest and spurious regions like background, it plays a major role in object detection [1, 10]. Segmentation methods mostly depend on the image type and the characteristics of the searched objects [1, 5, 6, 10]. Usually, they are very well tuned for a specific problem domain. However, without appropriate parameter tuning they are less applicable to other problem domains. Parameter tuning may be a very complex task even for an expert, because in certain cases it is not possible to accurately determine the influence of specific parameters on the final segmentation result. The mentioned problem is intensified if parameter tuning is left to an end-user with just shallow knowledge about segmentation methods and image processing. From all this
we conclude that the majority of segmentation methods are applicable as an efficient utility only by a narrow group of very well skilled users. To overcome the above delineated shortcoming, we propose in this paper a prototype for the segmentation of arbitrary digital images, with an integrated module for interactive and, to some extent, intelligent determination of the segmentation routine parameters. This prototype is implemented in the form of a web service and, consequently, exploits all the advantages of service-oriented architectures (SOA). The SOA combines the ability to invoke remote objects and functions with tools based on dynamic service discovery. It means that an end-user does not have to be concerned with the installation, upgrading or customization of software, nor with the provision of sufficient computer resources (e.g. adequate processor power and storage size), which are essential for the efficient and prompt execution of the commonly very demanding segmentation methods. Under such a paradigm, the end-users just forward their imaging material to the appropriate services and, at the end, collect the results. The service-oriented principle combined with an efficient and robust segmentation method is the future of the image processing field. The intelligence of our prototype is not gathered just in the segmentation method, which is controlled by 11 parameters, but mostly in a routine for determining the segmentation method parameters' values. We implemented three different methods for parameter value determination: 1) manual parameter value selection, 2) interactive step-by-step parameter value selection, and 3) fast parameter value determination based on machine learning. The intelligence of the second and third approach is introduced by end-users through repeated interaction with our prototype in attempts to correctly segment the structures out of an image (method 2), understand the meaning and execution of image segmentation (method 2), and assist the machine learning process (methods 2 and 3). This paper is organized as follows. A novel paradigm for the segmentation of digital images is described in Section 2. Section 3 presents the fast and intelligent determination of segmentation method parameters' values by using machine learning, followed by some implementation details covered in Section 4. The paper concludes with some future work directions.
2 Paradigm for Digital Image Segmentation

To avoid the demanding and complex task of tuning segmentation method parameters, and, moreover, to make the method more universal, a two-part structure of the segmentation method is proposed. The first module is a procedure for the determination of the segmentation parameters' values with respect to the imaging material (subsection 2.2), and the second is a module for image segmentation with a known parameter set (subsection 2.3). The same segmentation method is used in both modules.

2.1 Segmentation Framework

The proposed segmentation framework follows, with some modifications, the segmentation algorithm for follicle detection described in [5]. Our eight-step method
carries results between steps, i.e. the result of the previous step determines the input for the current step. The steps are briefly presented in the sequel (see also [2, 5] for details).

1. Filtering. The original grey-level image is filtered by a noise reduction filter. If a colour image is processed, it is first transformed into grey-levels (i.e. the luminance information of the image presented in the HSL colour model is retained).
2. Binarization. A global thresholding is implemented. Two different inputs can be selected: a) the filtered grey-level image from step 1, or b) an image obtained by calculating the standard deviation of grey-levels for each pixel in its k × k mask.
3. Removal of noisy regions. This optional step removes from the binary image all regions smaller than a preset threshold (i.e. spurious regions).
4. Removal of regions connected to the image border. All regions with at least one pixel in a zone of m pixels from the image border may be removed.
5. Filling region holes. Eventual region holes are filled by using morphological operators. This step is optional as well.
6. Image negation. If bright objects are searched for, the image must be negated by replacing each pixel with 255 minus its current grey-level (optional step).
7. Region growing. The obtained initial homogeneous regions are grown by using the method from [5] (optional step). Region growing is controlled by two parameters: a) a parameter controlling region compactness and growing strength, and b) a parameter limiting the number of iterations.
8. Post-processing. Post-processing combines the methods from steps 3 to 5. The proposed framework offers this step as an option.

Each step is controlled by a single parameter, with the exception of steps 2 and 7, which are controlled by two and three parameters, respectively. Finally, let us list all eleven parameters used in this method, sorted by step number (see also Fig. 2): 1) filter; 2) segmentation method and its threshold; 3) threshold for small region removal; 4) removal of regions at the image border (boolean); 5) gap filling (boolean); 6) image type (boolean); 7) region growing (boolean), number of iterations, and alpha; and 8) post-processing (boolean). A skeletal sketch of this pipeline is given below.
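As a rough illustration, the first five steps of the framework might look as follows in Python, with scipy/scikit-image standing in for the actual Matlab routines (see Section 4); the parameter names are ours, and steps 6–8 are omitted for brevity.

```python
import numpy as np
from scipy import ndimage
from skimage import morphology, segmentation

def segment(image, p):
    # 'p' is a dict keyed by the parameters listed above (names ours).
    img = ndimage.median_filter(image, size=3)              # 1. filtering
    if p["seg_input"] == "stddev":                          # 2. binarization
        img = ndimage.generic_filter(img.astype(float), np.std,
                                     size=p["k"])
    binary = img > p["threshold"]
    binary = morphology.remove_small_objects(               # 3. noisy regions
        binary, min_size=p["min_region_size"])
    if p["remove_border"]:                                  # 4. border regions
        binary = segmentation.clear_border(binary)
    if p["fill_holes"]:                                     # 5. region holes
        binary = ndimage.binary_fill_holes(binary)
    return binary
```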
2.2 Determination of Segmentation Parameters
A special routine has been developed to facilitate the determination of the parameter values of the proposed segmentation framework. We implemented three parameter value determination methods: 1) interactive step-by-step selection, 2) manual selection, and 3) fast selection based on machine learning. Methods 1 and 2 are presented in this sequel, whereas method 3 is introduced in Section 3.

Interactive step-by-step selection. This routine offers the end-user several partial (intermediate) results in the form of visual information or images. Partial results are constructed by fixing the parameters' values for all steps j, except for step i. Combinations of parameter values for this step i are formed by dynamically varying each parameter over some predefined set or interval of values (for details see [2]). The original image segmented with fixed parameters for steps
Fig. 1. Partial results of interactive step-by-step parameter value selection
j and one combination of parameters for step i results in a partial result (image). The number of partial results equals the number of all parameter combinations in step i (always between 2 and 6). All partial results are presented to the end-user. Afterwards, the user selects the result best suiting his expectations (i.e. the visually optimal result) among all offered partial results (see Fig. 1). By choosing a partial result, the user determines the parameter values for step i and afterwards proceeds with the next step. The determined parameter values for step i remain unchanged until the end of this selection method. The user commences parameter value determination with step 1 and proceeds in ascending order to step 8. Through interactive step-by-step selection, the end-user visually selects among 25 partial results (while the number of all alternatives is 4608) and simultaneously determines the values of all 11 segmentation parameters. In the most extreme case, the user determines the parameter values without knowing their meaning, numerical values or their influence on the final segmentation result. Parameter value determination in the above-described manner belongs to the class of sub-optimal optimization methods [8]; a sketch of this greedy selection loop is given after Fig. 2 below.

Manual parameter value selection. This option was designed for advanced users. After all 11 parameters have been established, the user has the option to alter them manually. The image should be re-segmented after such parameter tuning. Advanced users can very efficiently fine-tune the segmentation method by using this functionality. Fig. 2 depicts the GUI for manual parameter tuning.
Fig. 2. GUI with a list of all parameters (11) for our segmentation framework
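The interactive selection reduces to a greedy loop over the eight steps. A Python sketch under stated assumptions: segment_upto and show_and_pick are hypothetical callbacks standing in for the partial segmentation and the GUI of Fig. 1.

```python
def step_by_step_selection(image, param_grid, segment_upto, show_and_pick):
    # param_grid: ordered {step: [candidate parameter combinations]};
    # segment_upto(image, chosen, step) runs the pipeline up to 'step';
    # show_and_pick(results) displays the partial results and returns
    # the index the user picked. Both callbacks are placeholders.
    chosen = {}
    for step, candidates in param_grid.items():    # steps 1..8 in order
        results = [segment_upto(image, {**chosen, step: combo}, step)
                   for combo in candidates]        # 2-6 partial results
        chosen[step] = candidates[show_and_pick(results)]
    return chosen
```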
2.3 Ordinary Segmentation Execution
With a known set of parameters, the proposed segmentation method can be applied to an arbitrary single grey-level image or image sequence. To obtain quality segmentation results, the similarity of the image(s) being processed to the image on which the parameter set was established should be ensured. Several indexes exist for measuring the similarity between images [1]. If the image being processed is too dissimilar, the parameter value determination module must be executed once again for this new type of images.
3 Fast and Intelligent Segmentation Method Parameter Value Selection by Using Machine Learning

A step forward to intelligent segmentation and intelligent parameter value selection was the introduction of machine learning into the proposed framework. The fundamental idea of machine learning is to draw conclusions and make predictions from positive and negative samples of the observed process [11]. If an analogy with image segmentation is made, then positive samples are correctly and accurately segmented images, while negative samples are falsely segmented images. Accordingly, the idea behind our intelligent segmentation process is as follows: first, an image is statistically described by a set of features; then the parameters of the segmentation framework are predicted by using machine learning; the obtained parameters are eventually refined; and, finally, the segmentation result is evaluated and stored in the database. We detail this idea in the sequel.

3.1 Image Description
Despite the fact that our segmentation method (see Section 2.1) is designed for segmenting grey-level images, we describe the visual image content by using low-level features such as colours, textures, and edges. In this way, more visual image information is retained and, consequently, machine learning is more effective. Every image is described by four histograms: a) a colour histogram, b) a colour autocorrelogram, c) a histogram of Tamura's contrast feature, and d) a colour histogram of maximum edge pixels. We use such an image description because it is simple and can be calculated quickly and easily. Another reason is that histograms have proved to be invariant to some geometric transformations [9]. To speed up the image description process, we divide the RGB space into Υ³ different colours (Υ = 6). Afterwards, we determine four histograms for the colour image I. First, we calculate a normalized colour histogram h̄_B as

$$ \bar{h}_B = \frac{h_B}{\sum_{i=1}^{\Upsilon^3} h_{B,i}} , \qquad (1) $$
where h_B is the colour histogram and h_{B,i} is the number of pixels having value (colour) i. Afterwards, we determine a normalized 1D colour autocorrelogram h_{A,i}. Each autocorrelogram element is determined as
$$ h_{A,i} = P\bigl( I(p_2) = i \mid I(p_1) = i \wedge |p_2 - p_1| = 1 \bigr) , \qquad (2) $$
where p_1 and p_2 are image pixels, P(·) is the probability function and |·| is a spatial distance measure. Finally, the autocorrelogram is normalized by using equation (1). The third normalized histogram h_C — also called the texture histogram — was constructed from Tamura's contrast feature [4]. The contrast is calculated for each image pixel in its 13 × 13 neighborhood as

$$ F_{con} = \frac{\sigma}{\alpha_4^{1/4}} , \qquad (3) $$

where α_4 is the kurtosis defined as α_4 = μ_4/σ⁴, μ_4 is the fourth moment, and σ is the standard deviation in the image (region). The final visual information used for the image description is the "strong" edges. First, we calculate the image gradient by using the Sobel operator. Then, we determine the κ (κ = 4096) pixels with the highest gradient values. For these pixels we identify their colours (quantized values) and calculate the normalized histogram h_E. Each image I is thus described by a feature vector x of 4Υ³ features (i.e. 864 in our case) constructed from the introduced normalized histograms as

$$ \mathbf{x} = [h_B, h_A, h_C, h_E] . \qquad (4) $$
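Two of the four histograms are sketched below in Python to make the construction concrete; the quantization scheme and the neighbour handling are our simplifications, not the exact implementation.

```python
import numpy as np

LEVELS = 6                       # Upsilon
N = LEVELS ** 3                  # 216 bins; 4 histograms -> 864 features

def quantize(rgb):
    # Map each pixel of a uint8 RGB image to one of N quantized colours.
    b = (rgb.astype(int) * LEVELS) // 256
    return b[..., 0] * LEVELS ** 2 + b[..., 1] * LEVELS + b[..., 2]

def normalized_histogram(q):
    # Normalization of Eq. (1).
    h = np.bincount(q.ravel(), minlength=N).astype(float)
    return h / h.sum()

def autocorrelogram(q):
    # Empirical estimate of Eq. (2) over the four axis-aligned
    # neighbours (image borders wrap around, a simplification).
    match, total = np.zeros(N), np.zeros(N)
    for axis in (0, 1):
        for step in (1, -1):
            nb = np.roll(q, step, axis=axis)
            np.add.at(total, q.ravel(), 1)
            np.add.at(match, q[q == nb].ravel(), 1)
    h = match / np.maximum(total, 1)
    return h / max(h.sum(), 1e-12)   # normalized as in Eq. (1)

# h_C (Tamura contrast, Eq. (3)) and h_E (colours of the kappa
# strongest Sobel-gradient pixels) are built analogously; the final
# descriptor of Eq. (4) is x = np.concatenate([h_B, h_A, h_C, h_E]).
```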
3.2 Machine Learning Method and Segmentation Parameter Value Determination
Each usage of the segmentation service results in a single learning example with all 864 features, the 11 segmentation parameters and the user's feedback on the final segmentation result. If the user is authenticated, this information is also stored in the learning sample, which enables service personalization. The learning set is then used to automatically produce 11 classification/regression models, which are calculated offline and stored for the next usage of the segmentation service. Support vector machines (SVMs) [12] are used for classification and regression, since the learning set is large and has a large number of features. Classification/regression is based on all the calculated image features, with the user feedback set to "TRUE" so as to acquire segmentation parameters associated only with positive feedback. For every parameter, classification/regression is executed to obtain the segmentation parameter value, which is very fast, since the decision/regression models are calculated offline. Until the offline calculation of a new model is completed, the old model is used for classification/regression. The segmentation result based on the predicted parameters is offered as the first choice to the user in the interactive selection. When the segmentation result offered by machine learning is selected, the example in the learning set is duplicated to increase the weight of the correct solution. A sketch of this per-parameter prediction is given below.
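A minimal sketch of the per-parameter prediction, with scikit-learn's SVR standing in for the Weka models actually used (see Section 4); all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

def train_parameter_models(X, params, feedback):
    # X: (n, 864) image descriptors; params: (n, 11) parameter values
    # used; feedback: boolean user satisfaction. The real service uses
    # classification for discrete parameters and regression for
    # continuous ones; SVR is used uniformly here for brevity.
    pos = np.asarray(feedback, bool)              # positive feedback only
    return [SVR().fit(X[pos], params[pos, j])
            for j in range(params.shape[1])]

def predict_parameters(models, x):
    # Predict all 11 parameter values for a new image descriptor x.
    return [m.predict(x.reshape(1, -1))[0] for m in models]
```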
3.3 Evaluation of Segmentation Results
In our framework, segmentation results are evaluated based on simple TRUE or FALSE end-user feedback, i.e., "TRUE" if the result is approximately correct,
"FALSE" otherwise. If an end-user has successfully determined the segmentation method parameter values and afterwards launches an ordinary segmentation on some image or image sequence, then we store the original image and the corresponding parameter set among the positive samples of the observed process; otherwise we treat them as negative samples. This type of result evaluation is easy to implement; however, it is not very precise.
4 Implementation

Our segmentation framework is implemented as a set of Matlab routines. A module for interactive parameter value determination is called "Selection", a module for segmenting a single grey-level image is "AnalyzeOne", while the module "AnalyzeAll" segments grey-level image sequences. We convert these routines, written for the Matlab environment, into executable modules (.exe files) by using the Matlab automatic executable code generator for .m files. By transforming this code, we eliminated one additional server, namely the Matlab interpreter, and, simultaneously, simplified the SOA of our segmentation framework. The front-end segmentation service, which utilizes the Matlab segmentation routines, integrates the machine learning service and the logging service. The logging service stores user interactions and extracted image features, and serves usage data to the machine learning service, which constructs all models and performs prediction of all segmentation parameters. Both the machine learning service and the logging service are also used as independent services in other applications within the institute. The logging and segmentation services are implemented in the .NET framework. The machine learning service is implemented in Java and uses the Weka machine learning tools [11] for classification and regression. The segmentation service submits usage information to the logging service and stores extracted features, user information and segmentation parameters as a separate data set, which is then periodically used by the machine learning service to produce the classification/regression models. Segmentation parameter values are calculated by the machine learning service, which is invoked right after image features are calculated by the segmentation service. The segmentation service can operate on large images and is invoked in multiple steps with different operations. A stateful service model is used to reduce network traffic. The segmentation service can also operate asynchronously, to enable prompt interaction and status feedback when processing large sequences of images. To enable quick and easy access to the segmentation service, we developed a web user interface that makes the segmentation service usable with only a web browser. Fig. 3 depicts the GUI of our segmentation framework after the segmentation parameter value selection process. The left image presents the final segmentation result obtained by using the determined parameter set (shown at the bottom of the figure), while the right image presents the original image. A prototype of the presented segmentation framework can be tested using a web tool, available at [3].
Fig. 3. GUI of the proposed segmentation framework
5 Conclusion

An advanced image segmentation framework implemented using SOA was presented. The main advantage of the proposed approach is the possibility (i.e., a routine) to easily determine and/or fine-tune the parameters used in our framework. We designed three different routines for parameter value determination, ranging from manual selection, through a simple interactive step-by-step parameter value selection based on visual information, to fast and intelligent parameter selection based on machine learning. The intelligence of the second and third approaches is introduced by end-users through repeated interaction with our prototype in their attempts to correctly segment structures from images. Fast and intelligent parameter value selection using machine learning is a step forward towards more intelligent segmentation. We proposed the following idea: the web service calculates a feature vector for each representative image. A set of feature vectors forms a learning set (positive and negative samples), which is used for the construction of different knowledge-based models. These models serve for predicting a new segmentation method parameter set for the image being processed. Such parameter value selection proved to be very efficient and fast, especially if there are many positive and negative samples in the learning set.
The main drawback of the proposed framework is the simple evaluation of segmentation results, which also influences the machine learning, because positive and negative samples of the observed process are not determined very precisely. In the future we would like to integrate into our framework a more sophisticated evaluation method, which will measure the dissimilarity between calculated segmentation results and correct results (i.e., a gold standard) provided by an end-user.
References

1. Forsyth, D.A., Ponce, J.: Computer vision, a modern approach. Prentice-Hall, Englewood Cliffs (2003)
2. Granier, P., Potočnik, B.: Interactive parameter determination for grey-level images segmentation method. In: Proceedings of the 13th Elect. and Comp. Conf., vol. B, pp. 175–178 (2004)
3. Lenič, M., Potočnik, B., Zazula, D.: Prototype of intelligent web service for digital images segmentation, http://www.cs.feri.uni-mb.si/podrocje.aspx?id=30
4. Long, F., Zhang, H., Feng, D.D.: Fundamentals of content-based image retrieval. In: Feng, D., Siu, W.C., Zhang, H.J. (eds.) Multimedia information retrieval and management - Technological fundamentals and applications. Springer, Berlin (2005)
5. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part I: Segmentation of single 2D images. Image and Vision Computing 20(3), 217–225 (2002)
6. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part II: Prediction-based object recognition from a sequence of images. Image and Vision Computing 20(3), 227–235 (2002)
7. Potočnik, B., Lenič, M., Zazula, D.: Inteligentna spletna storitev za segmentiranje digitalnih slik (Intelligent web service for digital images segmentation). In: Proceedings of the 14th Elect. and Comp. Conf., vol. A, pp. 193–196 (2005)
8. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C, The art of scientific computing. Cambridge University Press, Cambridge (1992)
9. Saykol, E., Güdükbay, U., Ulusoy, Ö.: A histogram-based approach for object-based query-by-shape-and-color in image and video databases. Image and Vision Computing 23, 1170–1180 (2005)
10. Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis and machine vision. Chapman and Hall, Boca Raton (1994)
11. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco (2005)
12. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
Fast Image Segmentation Algorithm Using Wavelet Transform Tomaž Romih and Peter Planinšič University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia {Tomaz.Romih,Peter.Planinsic}@uni-mb.si
Abstract. A fast image segmentation algorithm is discussed, in which significant points for segmentation are determined first. The reduced set of image points is then used in a K-means clustering algorithm for image segmentation. Our method reduces segmentation of the whole image to segmentation of significant points. The reduction of points of interest is achieved by introducing some intelligence in a decision step before the clustering algorithm. The method is numerically less complex and suitable for implementation in low-speed computing devices, such as smart cameras for traffic surveillance systems. Multiscale edge detection and segmentation are discussed in detail in the paper. Keywords: Wavelet transform, multiscale edge detection, image segmentation.
1 Introduction

Computer vision tasks are known for their complexity. To achieve the final goal, which is some sort of description of image content, several tasks at different complexity levels are required. One of the most important descriptors of image content are edges, as they define the shapes of objects. In this article, our method, first proposed in [1], is described, in which the set of points used in the task of image segmentation is significantly reduced. This is done by introducing a smart decision algorithm in this step; that is, edge points are used as guide points for the definition of the significant points used in the image segmentation procedure. To achieve compactness of the whole procedure, the wavelet transform is used in both steps of the described procedure. In the first step, finding the edges in the image, an optimal multiscale edge detection method is used, and in the second step, segmentation of the image, wavelet coefficients through the scales are used as input to the clustering algorithm. An improved edge detector was first proposed by Canny [2], who formulated an optimal edge detector with a mathematical background. The Canny edge detector is fast, reliable, robust and generic. Mallat later extended the Canny edge detector to the time-scale plane [3, 4] and introduced the multiscale edge detector that is incorporated in our method. Using a reduced set of points for image description has already been proposed, for example, by Ma and Grimson [6], where modified SIFT (Scale Invariant Feature Transform)
keypoints [7] are used for vehicle classification. We extended this in such a way that it is generally applicable to image segmentation. Image segmentation using a clustering algorithm often has the drawback that it does not always follow the actual shape of the object: borders of the segments are not aligned with the actual edges in the image. Our method preserves the edges found in the image and reduces the number of points considered for image segmentation. It is therefore of lower numerical complexity and, as such, suitable for use in low-speed computing devices. In this paper we focus on image segmentation using smart selection of significant points. Significant points form a reduced set of image points used for image segmentation. The whole procedure is based on the wavelet transform. The proposed method is evaluated using the K-means clustering algorithm, but can be used with other clustering algorithms too. This paper is organized as follows. Section 2 explains our method and its steps. The multiscale edge detection with significant point selection and the building of multiscale feature vectors for segmentation using the wavelet transform are described in detail. Section 3 discusses the experimental results and the conclusions are presented in section 4.
2 Defining Significant Points for Image Segmentation

2.1 The Principle of Our Method

Our algorithm has two stages. In the first stage, the edges are detected using the wavelet-based multiscale edge detection approach proposed by Mallat [3, 4]. In the second stage, significant points for image segmentation are selected and described using the local energy values through the scales of the wavelet decomposition. Edges are used as guidelines for the selection of significant points, as shown in Fig. 1. Significant points are selected from each side of the edge. In order to evaluate the properties of the homogeneous area which the edge surrounds, significant points lie a few pixels away from the edge, perpendicular to the edge direction. As we will see in section 2.3, the local energy in some neighborhood is used. The distance of a significant point from the edge is therefore chosen as half of the window size that defines the neighborhood.
Fig. 1. Selection of significant points. (a) Test image with added Gaussian noise, (b) edges detected using the multiscale approach, (c) magnified section of an edge with marked positions of the selected significant points on both sides of the edge.
Fig. 2. Steps and data flow of our algorithm. Rows from top to bottom: 1) the original image and the detected edges; 2) the edges define the significant points; 3) the result of clustering, showing three clusters: the dark gray area, the white area and the light gray area; 4) initialization of the region-filling process of the empty areas between the edges; 5) the resulting segmented image with three different clusters.
For the experiments, a window of size 3×3 pixels has been used, so the significant points lie at a distance of 1 pixel from the edge. Assuming that the edge surrounds an object, one can say that the area inside the edge does not show large variations, which is true for most objects. The surface of
the object does not introduce significant variations at different points. Experimental results show that using only points around the edges for segmentation gives satisfactory results. Moreover, the results are better than those of segmentation with all image points but without considering edges; the improvement is in the sense that our method keeps the borders of the segments aligned with edges, where possible. After assigning the significant points to clusters, the segment regions are defined by filling in the empty space between the edges with the respective cluster values. A simple flooding algorithm is used for this task. Each region is limited by other regions or by collided edges. Fig. 2 figuratively shows the corresponding steps of our method. The number of clusters is not strictly defined; we found five clusters sufficient for testing purposes.

2.2 Multiscale Edge Detection

According to Mallat et al. [3, 4], edges are more accurately and reliably found by using the wavelet transform and analyzing scale planes to detect singularities in the image by estimating the Lipschitz exponent. The approach proposed by them is used, where the image is first transformed using the wavelet transform and modulus maxima points are calculated and connected in scale space. The wavelet transform is used without decimation, so all scales have full resolution. The wavelet transform is defined as follows. For any real smoothing function θ(x), θ_s(x) = (1/s) θ(x/s) is the smoothing function at scale s. For a real
function f(x) in L²(R) and a wavelet function ψ(x) defined as

ψ(x) = dθ(x)/dx,  (1)
the wavelet transform is defined as

Wf(s, x) = (f ∗ ψ_s)(x).  (2)

Further,

Wf(s, x) = f ∗ (s dθ_s/dx)(x) = s (d/dx)(f ∗ θ_s)(x),  (3)

and for a specific orientation,

Wf(s, i, x) = f ∗ (s dθ_s/dx_i)(x) = s (d/dx_i)(f ∗ θ_s)(x),  (4)

where i specifies the orientation of the wavelet transform (horizontal or vertical).
Modulus maximum points (s₀, x₀) are points in the scale planes which have the following properties:
|Wf(s₀, x_m)| < |Wf(s₀, x₀)|,
|Wf(s₀, x_n)| ≤ |Wf(s₀, x₀)|,  (5)
where x_m and x_n belong to the opposite neighborhoods (left and right) of x₀. A maxima line is a connected curve of modulus maxima points in the scale space (s, x). The decay of the maxima line over scales estimates the Lipschitz exponent. By following the maxima line to the finest scale, one can localize the point of singularity. The decay of the maxima line shows whether the point is a regular edge or noise: edge points have a smaller decay through the scales than noise, for which the maxima line drops to zero after a few scales. If noise is present in the image, then edges are difficult to detect at the finer scales. By using information from the coarser scales one can still detect edges; the drawback is poorer localization of the edge.
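As an illustration only (a simplified sketch under our own assumptions, not the LastWave routines used in Section 3), the multiscale wavelet gradient of Eqs. (3)-(4) can be approximated with Gaussian smoothing and differentiation at dyadic scales, and Eq. (5) then reduces to a non-maximum test along the gradient direction:

```python
import numpy as np
from scipy import ndimage

def multiscale_gradient(img, n_scales=3):
    # Wf(s, i, x) ~ s * d/dx_i (f * theta_s): smooth at scale s, then
    # differentiate; all scales keep full resolution (no decimation).
    out = []
    for j in range(n_scales):
        s = 2.0 ** j                          # dyadic scales
        wx = s * ndimage.gaussian_filter(img, s, order=(0, 1))
        wy = s * ndimage.gaussian_filter(img, s, order=(1, 0))
        out.append((np.hypot(wx, wy), np.arctan2(wy, wx)))
    return out

def modulus_maxima(mod, ang):
    # Eq. (5), crudely: keep points larger than both neighbours along
    # the dominant (horizontal or vertical) gradient direction.
    m = np.zeros(mod.shape, dtype=bool)
    horiz = np.abs(np.cos(ang)) >= np.abs(np.sin(ang))
    m[1:-1, 1:-1] = np.where(
        horiz[1:-1, 1:-1],
        (mod[1:-1, 1:-1] > mod[1:-1, :-2]) & (mod[1:-1, 1:-1] >= mod[1:-1, 2:]),
        (mod[1:-1, 1:-1] > mod[:-2, 1:-1]) & (mod[1:-1, 1:-1] >= mod[2:, 1:-1]),
    )
    return m
```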
2.3 Segmentation Using the Wavelet Coefficients

Once the edges are found, their locations are used for the selection of significant points. Significant points are chosen on both sides of the edges, as Fig. 1 suggests. Since the edge direction is one of the outputs of the edge detection step, we use it to define the two sides: the front of the edge is in the same direction as the edge direction, while the opposite direction looks behind the edge. One significant point lies in front of the edge and one behind the edge. Therefore, if the number of all edge points is P, then the total number of significant points is 2P. We already have the image decomposed using the wavelet transform, so we use this decomposition for feature extraction. For each significant point we extract a feature from each subimage separately and build a feature vector that reflects scale-dependent properties [10]. Coefficients of the wavelet transform are not directly applicable as texture features, because they exhibit great variability within the same texture. As stated in [8, 10], where several local energy measures were proposed, using local energy is more appropriate. We use a square flat window to calculate energy values in a small neighborhood of the point of interest through the subimages at various scales and orientations:
eng(s, i, x_j, y_j) = (1/(M×N)) Σ_{m=1}^{M} Σ_{n=1}^{N} [Wf(s, i, x_j + m, y_j + n)]²,  (6)
where M×N is the size of the square window, s and i are the scale and orientation, respectively, and (x_j, y_j) are the image coordinates of the j-th significant point, where j is the index of the significant point, 0 ≤ j < 2P. The feature vector v_j for the j-th significant point is constructed from elements whose values are the local energy, measured at a specific scale and orientation for the respective significant point:
v_j = [ f(x_j, y_j), eng(s₀, i₀, x_j, y_j), eng(s₀, i₁, x_j, y_j), eng(s₁, i₀, x_j, y_j), eng(s₁, i₁, x_j, y_j), …, eng(s_l, i₀, x_j, y_j), eng(s_l, i₁, x_j, y_j) ]ᵀ,  (7)
where f(·) is the image gray-level value, (x_j, y_j) are the coordinates of the j-th significant point, {s₀, s₁, …, s_l} are the scales up to the l-th scale, and i₀, i₁ are the horizontal and vertical orientations, respectively. Now each feature vector v_j can be assigned to a specific cluster by some clustering algorithm. We have chosen the K-means clustering algorithm for its simplicity and wide use, although the proposed method is not limited to it and other clustering algorithms can be used too. Using the K-means algorithm, we need to specify the expected number of clusters K. For n array elements, the K-means algorithm then iteratively tries to minimize the measure [9]:
J = Σ_{m=1}^{n} Σ_{k=1}^{K} u_{km} ‖x_m − z_k‖²,  (8)
where u_{km} is the membership of pattern x_m in cluster C_k, forming the partition matrix U(X) = [u_{km}]_{K×n} for the data, and z_k is the center of the k-th cluster.
Once more, we emphasize that a reduced number of image points is used in the clustering process: not all points of the image are used, but only the selected significant points, as described at the beginning of this section. The wavelet transform is performed on the whole image. The numerical complexity of the fast discrete wavelet transform is O(N² log(N)) [3], where N is the number of all points of the image, and the numerical complexity of the K-means algorithm is O(KtM) [11], where K is the number of clusters, t is the number of iterations of the K-means algorithm and M is the number of points considered in the clustering process.
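A minimal sketch of Eqs. (6)-(8) (our own illustration; scikit-learn's K-means stands in for the authors' implementation, and the helper names are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def local_energy(wf, x, y, win=3):
    # Mean squared wavelet coefficient in a win x win window, Eq. (6).
    h = win // 2
    patch = wf[max(y - h, 0):y + h + 1, max(x - h, 0):x + h + 1]
    return float(np.mean(patch ** 2))

def feature_vectors(img, subbands, points):
    # subbands: [(Wf_horizontal, Wf_vertical), ...] per scale;
    # points: the 2P significant points beside the edges; Eq. (7).
    feats = []
    for x, y in points:
        v = [float(img[y, x])]                # gray value f(x_j, y_j)
        for wx, wy in subbands:
            v += [local_energy(wx, x, y), local_energy(wy, x, y)]
        feats.append(v)
    return np.asarray(feats)

# Only the 2P significant points are clustered, not all image pixels:
# labels = KMeans(n_clusters=5, n_init=10).fit_predict(
#     feature_vectors(img, subbands, points))
```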
3 Experimental Results

To evaluate our method, a computer program in the C language has been written that performs the wavelet decomposition of the image into the subimages, calculates the wavelet modulus maxima points and connects them into chains. For these procedures,
the program calls the routines written by Emmanuel Bacry [5]. The program then calculates the local energy of the surrounding points and uses the K-means clustering algorithm to segment them. The flooding algorithm is then performed to define the surface of the segments. Experiments were made using various images; the results for three test images, Lena, Airplane and Baboon, are shown in Figs. 3, 4 and 5. The edges found using the multiscale edge detection approach are shown in Figs. 3b, 4b and 5b. Figs. 3c, 4c and 5c show the final results of the segmentation using our method. Figs. 3d, 4d and 5d show the results of segmentation using the K-means algorithm with all points considered, but without considering edges, referred to here as the common method. A comparison of the results of the segmentation using our method with the results of the common segmentation procedure shows that with our method the edges of the details are better defined and no new clusters are introduced around the edges. The obtained segments follow the edge lines, where possible.
Fig. 3. a) Test image “Lena”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 4. a) Test image “Airplane”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 5. a) Test image “Baboon”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 6. Evaluation of the speed-up of the proposed method. a) Test image "Water in hands", b) segmentation using all 65536 pixels as input to the K-means algorithm, c) segmentation with the proposed method using 20278 pixels as input to the K-means algorithm.
A comparison of the numerical complexity of our method and that of the common method shows that with our method about 1/3 of the points are involved in the K-means clustering procedure compared to the usual approach, as shown in the example in Fig. 6. Execution time measurements confirm our expectations. We measured the function execution times using the MS Visual C debugger utility and a high-resolution timer on a PC with a 3.0 GHz Pentium IV processor. The code has not been optimized. Using the proposed method with the reduced set of points for clustering brings time savings of over 20 ms for a 256×256-pixel image (18 ms compared to 41 ms of execution time of the K-means function, using the image from the example in Fig. 6) and over 1 s for a 512×512-pixel image (24 ms compared to 1.05 s of execution time of the K-means function, using the test image Lena). On the other hand, the multiscale edge detector code is not optimized either and is, as such, very time consuming; other (faster) edge detection methods could be considered instead. A drawback of our method can be the large number of segments introduced in the hairy parts of images, as can be seen in Lena's hair and the Baboon's fur, if this is not desired.
4 Conclusion

The experiments show that our method speeds up the clustering algorithm and gives satisfactory results. It improves the common segmentation method, speeds up the whole process and better describes the objects by considering their edges. Even for structurally complex images, our method uses fewer points for the calculation of the segments than the common method. It outperforms the common method in the definition of object shapes, because it follows the true object edges and does not introduce new segments around edges. This was achieved by smart selection of the significant points of the image considered for the segmentation. Because of its lower numerical complexity it is suitable for use in low-speed computing devices such as smart cameras for consumer electronics. The method can be extended by using bandelets, contourlets and other second-generation wavelet transforms. Edge detection can be extended by involving parameters for the adjustment of edge intensity and edge fragment length. The flooding algorithm can be improved by introducing a smarter expansion algorithm; false segment regions can be reduced that way.
References

1. Romih, T., Čučej, Ž., Planinšič, P.: Wavelet based edge preserving segmentation algorithm for object recognition and object tracking. In: Proceedings of the IEEE International Conference on Consumer Electronics 2008 (2008)
2. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intell. 8, 679–698 (1986)
3. Mallat, S., Zhong, S.: Characterization of signals from multiscale edges. IEEE Trans. on Pattern Anal. Machine Intell. 14, 710–732 (1992)
4. Mallat, S., Hwang, W.L.: Singularity detection and processing with wavelets. IEEE Trans. on Information Theory 38, 617–643 (1992)
5. Bacry, E.: LastWave, http://www.cmap.polytechnique.fr/~bacry/LastWave
6. Ma, X., Grimson, W.E.L.: Edge-based rich representation for vehicle classification. In: Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1185–1192 (2005)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
8. Petrou, M., Sevilla, P.G.: Image processing - Dealing with texture. John Wiley & Sons Ltd., Chichester (2006)
9. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. on Pattern Anal. Machine Intell. 24, 1650–1654 (2002)
10. Livens, S., Scheunders, P., van de Wouwer, G., Van Dyck, D.: Wavelets for texture analysis, an overview. In: Sixth International Conference on Image Processing and Its Applications, vol. 2, pp. 581–585 (July 1997)
11. Hruschka, E.R., Hruschka Jr., E.R., Covoes, T.F., Ebecken, N.F.F.: Feature selection for clustering problems: a hybrid algorithm that iterates between k-means and a Bayesian filter. In: Fifth International Conference on Hybrid Intelligent Systems (November 2005)
Musical Instrument Category Discrimination Using Wavelet-Based Source Separation P.S. Lampropoulou, A.S. Lampropoulos, and G.A. Tsihrintzis Department of Informatics University of Piraeus Piraeus 185 34, Greece {vlamp,arislamp,geoatsi}@unipi.gr
Abstract. In this paper, we present and evaluate a new method for the quantitative estimation of the musical instrument categories which compose a music piece. The method uses a wavelet-based music source (i.e., musical instrument) separation algorithm and consists of two steps. In the first step, a source separation technique based on wavelet packets is applied to separate the musical instruments which compose a music piece. In the second step, a classification algorithm based on support vector machines is applied to estimate the musical category of each of the musical instruments identified in the first step. The method is evaluated on the publicly available Iowa Musical Instrument Database and found to perform quite successfully.
1 Introduction

Instrumentation is an important high-level descriptor of music, which may provide useful information for many music information retrieval (MIR) related tasks. Furthermore, instrument identification is useful for automatic musical genre recognition, as certain instruments may be more characteristic of specific genres [1]. For example, the electric guitar is quite a dominant instrument in rock music, but is hardly ever used in classical music. Additionally, human listeners may be able to determine the genre of a music signal and, at the same time, identify a number of different musical instruments from a complex sound structure. There are several difficulties in developing automated musical instrument identification procedures, which reside in the fact that some audio signal features depend on pitch and on individual instruments. Specifically, the timbre of a musical instrument is obviously affected by the wide pitch range of the instrument. For example, the pitch range of the piano covers over seven octaves. To achieve high performance in musical instrument identification, it is indispensable to cope with the pitch dependence of timbre. Most studies on musical instrument identification, however, have not dealt with the timbre dependence on pitch [2]. For the identification of the musical instruments which compose an audio signal, many approaches have been proposed. One of the most popular relies on recognition of the musical instruments from the direct signal. Another approach
attempts to separate the sources of an audio signal with source separation techniques borrowed from the signal processing literature and then identify the participating instruments in each separated source [3]. In the current work, we follow the second approach and attempt to identify the instrumentation of an audio signal as a preprocessing step for a genre classification system. Our method uses a wavelet-based music source (i.e., musical instrument) separation algorithm and consists of two steps. In the first step, a source separation technique based on wavelet packets is applied to separate the musical instruments which compose a music piece. In the second step, a classification algorithm based on support vector machines is applied to estimate the musical category of each of the musical instruments identified in the first step. The method is evaluated on the publicly available Iowa Musical Instrument Database and found to perform quite successfully. Specifically, this paper is organized as follows: Section 2 reviews previous related work, while Section 3 presents the instrument separation step in detail. Section 4 presents the instrument category classification in detail, illustrates the results and evaluates our method. Conclusions are drawn and future research directions are illustrated in Section 5.
2 Previous Related Work

Early work on musical instrument recognition includes the development of a pitch-independent, isolated-tone musical instrument recognition system that was tested using the full pitch range of thirty orchestral instruments from the string, brass and woodwind categories, played with different articulations [4]. Later, this work was extended into a classification system and several features were compared with regard to musical instrument recognition performance [5]. Moreover, a new system was developed to recognize musical instruments from isolated notes [6]. In other works, a number of classification techniques were evaluated to determine the one that provided the lowest error rate when classifying monophonic sounds from 27 different musical instruments [7, 8], or to reliably identify up to twelve instruments played under a diverse range of articulations [9]. Other approaches include the classification of musical instrument sounds based on decision tables and knowledge discovery in databases (KDD) for training data analysis [10, 11], and the development of a musical instrument classification system based on a multivariate normal distribution whose mean was represented as a function of the fundamental frequency (F0) [2, 12]. All the above systems are instrument recognition systems which use the signal of one instrument as input. Even in systems fed with input signals coming from a mixture of instruments, each instrument plays an isolated note. In a previous work of ours, we proposed a new approach to musical genre classification based on features extracted from signals that correspond to musical instrument sources [13]. Contrary to previous works, this approach first used a sound source separation method to decompose the audio signal into a number of component signals, each of which corresponded to a different musical instrument source. In this way, timbral, rhythmic and pitch features are extracted from the separated
sources and used to classify a music clip, detect its various musical instrument sources and classify them into a musical dictionary of instrument sources or instrument teams. The source separation algorithm we used was Convolutive Sparse Coding (CSC) [14]. In that approach, we applied the CSC algorithm to the entire input signal corresponding to a music piece, assuming that all the instruments are active throughout the entire piece. This assumption, however, does not hold in realistic music scenarios, in which different instruments may participate only in a part of the music piece. For example, during the introduction of a music piece, only two instruments may participate, while a third instrument may be added later on, and so on. In order to address such cases, we propose a new approach to source separation [15].
3 Wavelet-Based Musical Instrument Identification Method

A human listener is able not only to determine the genre of a music signal, but also, at the same time, to distinguish a number of different musical instruments from a complex sound structure.
Fig. 1. Two Steps of the instrument category identification approach
In order to mimic this process, we propose a new approach for instrument category identification that is going to serve as a preprocessing module in a genre classification system. This approach implements a wavelet-based source separation method, followed by feature extraction from each separated signal. Classifiers are then built to identify the instrument category in each separated source. The procedure is illustrated in Fig. 1.
3.1 Source-Separation Method Based on Wavelet Packets
The problem of separating the component signals that correspond to the musical instruments that generated an audio signal is ill-defined, as there is no prior knowledge about the instrumental sources. It is common for audio signals coming from different sources to exist simultaneously in the form of a mixture. Many existing separation methods exploit multiple observations of such mixtures to extract the required signals without using any other form of information. This is called the Blind Audio Source Separation (BASS) problem. An extreme case of this problem is when there is only one available observation of the signal mixture (i.e., one channel). In the case of audio signals, the most obvious choice for the observation matrix is a time-frequency representation, so that the basis functions are magnitude spectra of sources; this basic approach has already been used in some ICA, ISA, and sparse coding systems [16, 17, 14]. In our previous work, for the source separation method we used a data-adaptive algorithm, similar to ICA, called Convolutive Sparse Coding (CSC) [14], based on the Independent Subspace Analysis (ISA) method, which can separate individual sources from a single-channel mixture by using sound spectra. Signal independence is the main assumption of both the ICA and ISA methods. In musical signals, however, there exist dependencies in both the time and frequency domains. In this algorithm, the number of sources N was set by hand and should be equal to the number of clearly distinguishable instruments. For the separation of the different sources in this work, we use an approach of sub-band decomposition independent component analysis (SDICA) [18] which is based on decomposition using wavelet packets (WPs) [19]. In order to adaptively select the sub-band with the least dependent sub-components of the source signals, a criterion based on a small-cumulant approximation of the Mutual Information (MI) was introduced. The problem is known as blind source separation (BSS) and is formally described as

x = As,  (1)

where x represents the vector of measured signals, A represents an unknown mixing matrix and s represents an unknown vector of the source signals. In our source separation method, we use an approach similar to SDICA, which is based on decomposition using wavelet packets (WPs) implemented as iterative filter
banks [19]. In order to enable the filter bank to adaptively select the sub-band with the least dependent sub-components of the source signal, we have introduced a criterion based on a small-cumulant approximation of MI. In order to obtain a sub-band representation of the original wideband BSS problem (Eq. 1), we can use any linear operator T_k which extracts a set k of sub-components:

s_k = T_k[s],  (2)

where T_k can, for example, represent a linear time-invariant bandpass filter. Using Eq. 2 and the sub-band representation of the sources, application of the operator T_k to the wideband BSS model (Eq. 1) yields

x_k = T_k[As] = A T_k[s] = A s_k.  (3)
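A small numerical check of Eq. (3) (our illustration; a simple FIR bandpass filter stands in for T_k): since T_k is linear and applied per channel, filtering the mixtures is the same as mixing the filtered sources.

```python
import numpy as np
from scipy.signal import firwin, lfilter

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 4096))           # two unknown sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
x = A @ s                                    # Eq. (1): x = A s

taps = firwin(65, [0.1, 0.3], pass_zero=False)   # T_k: bandpass filter
xk = lfilter(taps, 1.0, x, axis=1)           # T_k[A s]
sk = lfilter(taps, 1.0, s, axis=1)           # T_k[s]
assert np.allclose(xk, A @ sk)               # Eq. (3): x_k = A s_k
```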
In this algorithm, a WP transform was used for T_k in order to obtain a sub-band representation of the wideband BSS problem (Eq. 1). The main reason was the availability of the WP transform in the form of an iterative filter bank, which allows isolation of the fine details within each decomposition level and enables adaptive sub-band decomposition [20, 21].
Fig. 2. Wavelet-based source separation
Fig. 3. Example of the resulting separation
In the context of SDICA, this means that an independent sub-band that is arbitrarily narrow can be isolated by progressing to the higher decomposition levels. In this implementation of the described SDICA algorithm, a 1D WP transform is used for the separation of audio signals. Let f_l and c_l be constructed from the l-th coefficients of the mixtures and sources, respectively. For each component x_n of the signal x, the WP transform creates a tree whose nodes correspond to the sub-bands of the appropriate scale. In order to select the sub-band with the least dependent components s_k, the MI [19] is measured between the same nodes in the WP trees. Once the sub-band with the least dependent components is selected, either an estimate of the inverse of the basis matrix, W, or an estimate of the basis matrix, A, is obtained by applying standard ICA algorithms to the model [22]. Mixed signals can be reconstructed through the synthesis part of the WP transform, where sub-bands with a high level of MI are removed from the reconstruction. A diagram of the abstract steps of this separation is illustrated in Fig. 2. The algorithm can be summarized in the following four steps: 1. Perform a multi-scale WP decomposition of each component of the input data x; a wavelet tree will be associated with each component of x. 2. Select the sub-band with the least dependent components by estimating the MI between the same nodes (sub-bands) in the wavelet trees.
3. Learn the basis matrix A or its inverse W by executing a standard ICA algorithm for the linear instantaneous problem on the selected sub-band. 4. Obtain the recovered sources y by applying W to the data x. An example of the separation process of an audio signal by means of the 1D wavelet transform with three decomposition levels, together with the separated sources, is presented in Fig. 3.
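The four steps might be sketched as follows for a two-channel mixture (a hedged outline of ours, assuming PyWavelets and scikit-learn's FastICA rather than the authors' implementation, and with the small-cumulant MI criterion replaced by a crude envelope-dependence score):

```python
import numpy as np
import pywt
from sklearn.decomposition import FastICA

def dependence(u, v):
    # Crude stand-in for the MI criterion: squared correlation of the
    # squared (envelope-like) sub-band components.
    a, b = u**2 - np.mean(u**2), v**2 - np.mean(v**2)
    return (np.mean(a * b) / (np.std(a) * np.std(b) + 1e-12)) ** 2

def wp_sdica(x, wavelet="db4", level=3):
    # Step 1: WP decomposition of each mixture channel.
    trees = [pywt.WaveletPacket(ch, wavelet, maxlevel=level) for ch in x]
    paths = [node.path for node in trees[0].get_level(level)]
    # Step 2: select the sub-band with the least dependent components.
    best = min(paths, key=lambda p: dependence(trees[0][p].data,
                                               trees[1][p].data))
    sub = np.vstack([t[best].data for t in trees])
    # Step 3: learn the unmixing matrix W on the selected sub-band.
    ica = FastICA(n_components=len(x), random_state=0).fit(sub.T)
    # Step 4: apply W to the wideband data x to recover the sources y.
    return ica.components_ @ (x - x.mean(axis=1, keepdims=True))
```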
4 Instrument Class Classification Results

The samples were obtained from the University of Iowa Musical Instrument Samples Database [23] and were all, apart from the piano, recorded in an anechoic chamber and sampled at 44.1 kHz. The instruments we used from this database are categorized into three categories [24], the winds, the brass and the strings, as shown in Table 1.

Table 1. Categories of instruments

Wind: Alto Saxophone, Alto Flute, Bass Flute, Bass Clarinet, Oboe, Eb Clarinet, Bb Clarinet, Bassoon, Flute, Soprano Saxophone
Brass: French Horn, Bass Trombone, Trumpet, Tuba, Tenor Trombone
String: Violin, Viola, Cello, Double Bass
In this work, we utilize an instrument-class classifier based on Gaussian Mixture Models (GMMs). We have 40 audio signals from each instrument class (40 wind, 40 brass, 40 string). We then produced 40 mixtures of audio signals using WavePad Master's Edition, which is a sound editor program. Each mixture signal was separated by the wavelet-based algorithm. The source separation process produced three component signals which correspond to the three instrument teams, namely strings, winds, and brass. The component signals from every mixture were labeled by a user into the three instrument classes. Therefore, the dataset (Table 2) consists of 120 initial audio signals and 120 component signals, distributed in a balanced way across the three classes (80 wind, 80 brass, 80 string). From each signal, we extract a specific set of 30 objective features [25]. It is worth mentioning that these features not only provide a low-level representation of the statistical properties of the music signal, but also include high-level information, extracted by psychoacoustic algorithms, in order to represent rhythmic content (rhythm, beat and tempo information) and pitch content describing the melody and harmony of a music signal.
Table 2. Dataset

         Initial signals   Signals from separation process   Total
Wind          40                        40                     80
Brass         40                        40                     80
String        40                        40                     80
Total        120                       120                    240
Table 3. Confusion matrix: Gaussian Mixture Models, K=5, accuracy 70.66%

          Wind   Brass   String
Wind       72     23       5
Brass      34     63       3
String     13     10      77
In the Gaussian Mixture Model (GMM) classifier, the Probability Distribution Function (PDF) for each instrument class is assumed to consist of a mixture of a specific number K of multidimensional Gaussian distributions, herein K=5. The GMM classifier is initialized using the K-means algorithm with multiple random starting points; the models are then refined using the Expectation-Maximization (EM) algorithm for 200 cycles. Assuming equal prior likelihoods for each class, the decision rule is that data points (and the corresponding signals) in feature space are classified as belonging to the class for which the PDF is largest. The NetLab toolbox was utilized to construct the GMM classifier. The classification result was calculated using 10-fold cross-validation, where the dataset to be evaluated was iteratively partitioned so that 90% was used for training and 10% for testing for each class. This process was iterated with different disjoint partitions and the results were averaged, which ensured that the calculated accuracy was not biased by a particular partitioning into training and testing sets. The achieved GMM classification accuracy is 70.66%. In a confusion matrix, the columns correspond to the actual instrument category, while the rows correspond to the predicted instrument category. In Table 3, the cell in row 1 of column 1 has value 72, which means that 72 signals (out of a total of 100 signals) from the "Wind" category were accurately predicted as "Wind". Similarly, 63 and 77 signals from the "Brass" and "String" categories, correspondingly, were predicted accurately. Therefore, the classifier accuracy is computed to equal (72+63+77)×100/300 = 70.66% for this classifier.
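As a hedged sketch of this classifier (scikit-learn's GaussianMixture standing in for the NetLab toolbox actually used; the function names and defaults are our assumptions), one K = 5 component GMM is fitted per class and test signals are assigned to the class with the largest likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X, y, classes=("wind", "brass", "string"), K=5):
    # One K-component GMM per class; k-means initialization, EM refinement.
    return {c: GaussianMixture(n_components=K, init_params="kmeans",
                               max_iter=200, random_state=0).fit(X[y == c])
            for c in classes}

def classify(gmms, X):
    # Equal priors: assign each 30-dimensional feature vector to the
    # class whose PDF (log-likelihood) is largest.
    classes = list(gmms)
    scores = np.column_stack([gmms[c].score_samples(X) for c in classes])
    return [classes[i] for i in scores.argmax(axis=1)]
```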
5 Conclusions

We presented a new method for the quantitative estimation of musical instrument sources with the use of a wavelet-based source separation algorithm.
The method consists of two steps: in the first, a source separation technique based on the wavelet transform is applied for the separation of the musical sources, and in the second, a classification process is applied in order to estimate the participation of each musical instrument team in the separated sources. The performance of the method was evaluated.
References 1. Kostek, B.: Musical instrument classification and duet analysis employing music information retrieval techniques 2. Kitahara, T., Goto, M., Okuno, H.G.: Pitch-dependent musical instrument identification and its application to musical sound ontology. In: Chung, P.W.H., Hinde, C.J., Ali, M. (eds.) IEA/AIE 2003. LNCS, vol. 2718, pp. 112–122. Springer, Heidelberg (2003) 3. Martin, K.: Sound-source recognition: A theory and computational mode. PhD thesis, MIT (1999) 4. Eronen, A., Klapuri, A.: Musical instrument recognition using cepstral coefficients and temporal features 2, II753–II756 (2000) 5. Eronen, A.: Automatic musical instrument recognition (2001) 6. Tzanetakis, G.: Musescape: A tool for changing music collections into libraries. In: Proc. Seventh International Symposium on Signal Processing and Its Applications, 2003, vol. 2, pp. 133–136 (2003) 7. Agostini, G., Longari, M., Pollastri, E.: Musical instrument timbres classification with spectral features 8. Agostini, G., Longari, M., Pollastri, E.: Musical instrument timbres classification with spectral features. EURASIP J. Appl. Signal Process 2003(1), 5–14 (2003) 9. Czyzewski, A., Szczerba, M., Kostek, B.: Musical phrase representation and recognition by means of neural networks and rough sets 3100, 254–278 (2004) 10. Slezak, D., Synak, P., Wieczorkowska, A., Wroblewski, J.: Kdd-based approach to musical instrument sound recognition. In: ISMIR 2002: Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, pp. 28–36. Springer, London (2002) 11. Wieczorkowska, A., Wroblewski, J., Synak, P., Slezak, D.: Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems 21(1), 71–93 (2003) 12. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Musical instrument recognizer “instrogram” and its application to music retrieval based on instrumentation similarity. In: ISM, pp. 265–274. IEEE Computer Society Press, Los Alamitos (2006) 13. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification of audio data using source separation techniques. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, The Slovak Republic (2005) 14. Virtanen, T.: Separation of sound sources by convolutive sparse coding. In: Proc. of Workshop on Statistical and Perceptual Audio Processing (SAPA) (2004) 15. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification enhanced by source separation techniques. In: Proc. 6th International Conference on Music Information Retrieval, London, UK, pp. 576–581 (2005)
16. Casey, M.A., Westner, A.: Separation of mixed audio sources by independent subspace analysis. In: International Computer Music Conference (ICMC) (2000) 17. Smaragdis, P., Brown, J.: Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003 (2003) 18. Zhang, K., Chan, L.W.: An adaptive method for subband decomposition ica. Neural Comput. 18(1), 191–223 (2006) 19. Kopriva, I., Sersic, D.: Wavelet packets approach to blind separation of statistically dependent sources. Neurocomputing (2007) doi:10.1016/j.neucom.2007.04.002 20. Wickerhauser, M.V.: Adapted wavelet analysis from theory to software. A. K. Peters, Ltd., Natick (1994) 21. Mallat, S.: A Wavelet Tour of Signal Processing (Wavelet Analysis & Its Applications), 2nd edn. Academic Press, London (1999) 22. Marchini, J.L., Heaton, C., et al.: The fastica package - fastica algorithms to perform ica and projection pursuit 23. University of Iowa Musical Instrument Samples Database, http://theremin.music.uiowa.edu/ 24. Martin, K.: Musical instrument identification: A pattern-recognition approach (1998) 25. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5) (2002)
Music Perception as Reflected in Bispectral EEG Analysis under a Mirror Neurons-Based Approach Panagiotis Doulgeris, Stelios Hadjidimitriou, Konstantinos Panoulas, Leontios Hadjileontiadis, and Stavros Panas Aristotle University of Thessaloniki, Faculty of Technology, Department of Electrical & Computer Engineering, GR – 541 24, Thessaloniki, Greece
[email protected]
Abstract. One important goal of many intelligent interactive systems is dynamic personalization and adaptivity to users. The 'motion' and intention involved in the individual perception of musical structure, combined with mirror neuron (MN) system activation, are studied in this article. The MN mechanism involved in the perception of musical structures is seen as a means for cueing the learner on 'known' factors that can be used for his/her knowledge scaffolding. To explore such relationships, EEG recordings, and especially the Mu-rhythm in the premotor cortex that relates to the activation of MNs, were acquired and explored. Three experiments were designed to provide auditory and visual stimuli to a group of subjects, including both musicians and non-musicians. The acquired signals, after appropriate averaging in the time domain, were analysed in the frequency and bifrequency domains, using spectral and bispectral analysis, respectively. Experimental results have shown that intention-related activity observed in musicians could be associated with Mu-rhythm suppression. Moreover, an underlying ongoing function appearing in the transition from heard sound to imagined sound could be revealed in the bispectrum domain, and a Mu-rhythm modulation provoked by the MNs could cause bispectral fluctuations, especially when visual stimulation is combined with an auditory one in the case of musicians. These results pave the way for transferring the research to the area of blind or visually impaired people, where hearing is the main information sensing tool. Keywords: music, motion, intention, mirror neurons, EEG, Mu-rhythm, spectrum, bispectrum.
1 Introduction

Multimedia services based on multimedia systems arise in various areas including art, business, education, entertainment, engineering, medicine, mathematics, scientific research and spatio-temporal applications. Dynamic personalization and adaptivity to users set an important goal for many intelligent interactive systems. To achieve this, some indicative characteristics can be used to identify underlying mechanisms in many sensory systems that could be taken into account in the modelling procedures of intelligent systems. In this work, the focus is placed upon the mechanisms involved in music perception and knowledge scaffolding. The underlying features of musical sounds play a major role in music perception. Music consists of sounds, but not all sounds are music. It has been suggested that music, like
language, involves an intimate coupling between the perception and production of hierarchically organized sequential information, the structure of which has the ability to communicate meaning [1]. Musical structure consists of basic elements that are combined in patterns whose performance and perception are governed by combinatorial rules or a sort of musical grammar [1]. Auditory features of the musical signal are primarily processed in the superior temporal gyrus (auditory cortex). However, the processing of structural features activates regions of the human brain that have been related to the perception of semantic and motor tasks. A mechanism involved in the structural analysis of communicative signals, like music, is the mirror neuron (MN) system. In 2001, a magnetoencephalogram-based study was presented suggesting that musical syntax is processed in Broca's area, a region of the human brain involved in language perception [2]. Moreover, in 2006, a positron emission tomography-based study suggested that there are common regions in the human brain for language and music processing, such as Brodmann area 44 [3], whereas two studies that used functional magnetic resonance imaging analysis revealed shared networks for auditory and motor processing [4, 5]. Furthermore, in 2006, a model of the integration of motor commands with their associated perceived sounds in vocal production was proposed. This model was based on neurobiologically plausible principles, i.e., receptive fields responding to a subset of all possible stimuli, population-coded representations of auditory stimuli and motor commands, and a simple Hebbian-based weight update [6]. The activation of the MN system in response to music stimulation targeted to 'motion' and intention is evaluated in this work. These qualities of music are inherent in musical performance and creation and could be seen as basic elements for music knowledge scaffolding. 'Motion' is conveyed by the augmenting or diminishing pitch, while intention underlies every communicative signal with hierarchical structure. The activation of the MN system by music tasks targeted to specific structural features of musical creation and performance has not yet been monitored using electroencephalogram (EEG) analysis. Here, fluctuations of the Mu-rhythm, related to the activation of MNs, were explored using EEG recordings and bi/spectral analysis.
2 Background 2.1 Mirror Neuron System The MN system consists of a whole network of neurons and was initially discovered in macaque monkeys, in the ventral premotor cortex (probably the equivalent of the inferior frontal gyrus in humans) and in the anterior inferior parietal lobule [7]. These neurons are active when the monkeys perform certain tasks, but they also fire when the monkeys watch or hear someone else perform the same specific task. In humans, brain activity consistent with MNs has been found in the premotor cortex (inferior frontal cortex) and the inferior parietal cortex [8]. Cognitive functions, like imitation and the understanding of intentions, have been linked to the activation of the MN system [9-11]. As far as music is concerned, a number of studies suggest the implication of Brodmann area 44 in music cognition [2, 3]. Brodmann area 44 is situated just anterior to premotor cortex (mirror neuron system) [8]. Together with Brodmann area 45 it comprises Broca’s area, a region involved in the processing of hierarchical structures that are inherent in communicative signals, like language and action [12]. Moreover, auditory features of the
musical signal, which are processed primarily in the primary auditory cortex (superior temporal gyrus), are combined with structural features of the 'motion' information conveyed by the musical signal in the posterior inferior frontal gyrus and adjacent premotor cortex (mirror neuron system) [1, 8]. Certain neuroimaging evidence suggests that frontoparietal motor-related regions, including, prominently, the posterior inferior frontal gyrus (Broca's region), as well as the posterior middle premotor cortex, were active during passive listening to music pieces by professional musicians [4, 5]. Figure 1 displays the regions of the brain related to music perception. The activation of MNs can be detected using EEG analysis. The Mu-rhythm could reflect visuomotor integrative processes and would 'translate seeing and hearing into doing'. The Mu-rhythm is alpha-range activity (8–12 Hz) that is seen over the sensorimotor cortex. The fluctuation of the Mu-rhythm during the observation of a motor action (suppression of the Mu-rhythm) is highly similar to the one seen during the direct performance of the action by the individual (greater suppression of the Mu-rhythm) [13]. The suppression is due to the desynchronization of the underlying cell assemblies, reflecting an increased load in the related group of neurons [13]. This motor resonance mechanism, witnessed by a Mu-rhythm modulation, is provoked by the MNs.
Fig. 1. Regions of the human brain related to the mirror neuron system (premotor cortex) and semantic tasks (Broca’s area). Position of the electrodes (C3 and its homologue C4) over the sensorimotor cortex and an example of the acquired signals (Mu–rhythm) across ten trials during the Experiment 1, along with the resulting averaged signal.
2.2 Music Structure

Hearing provides us with sensory information arising from our environment. As with any other form of information, the human brain tends to focus on the pieces of greater interest and extract messages. In this way, a communication port is created for the human brain, with these 'messages' inherent in different qualities of sound. Pitch underlies sound perception, and it is the structure of the human auditory system (cochlea) that allows us to have a clear view of the frequencies present in a sound. However, this is not the only quality of music that we can perceive. Seven basic elements of music can be defined: pitch, rhythm, tempo, timbre, loudness, tone, and the spatial qualities of sounds. While pitch is obviously defined as the dominant frequency of a tone, another element of music, timbre, is also frequency related: the timbre of a sound depends on the different harmonics present in the tone. The same holds in the time domain, where rhythm and tempo are closely related. Furthermore, the harmonic progression of music, i.e., the sequence of chords (tones heard simultaneously), can give meaning (communicate a message) to the listener. Finally, style, which is defined partly by the instruments (the timbre) participating and partly by the rhythm and patterns followed during the performance of a musical piece, could also affect music perception. Nevertheless, apart from these elements, several combinations of them seem to affect different groups of people, or people in different ways, thus leading to the definition of a variety of styles.
3 Material and Methods

3.1 Material

3.1.1 Subjects
Seven subjects participated in the experiments. They were divided into musicians and non-musicians. Table 1 presents the age, sex and music skills of each subject. Subjects with more than 5 years of musical training are described as intermediates, whereas the rest are described as beginners. The last column refers to the experiments in which each subject participated.

Table 1. Anthropometric characteristics of the subjects, their musical skills, and the related experiments

Subject No.  Sex     Age  Musical Skills           Experiment No.
1            Male    24   Musician (Intermediate)  1,2,3
2            Male    23   Musician (Intermediate)  1,2
3            Male    24   Musician (Intermediate)  1,2,3
4            Female  23   Musician (Beginner)      1,2,3
5            Male    24   Musician (Beginner)      2
6            Female  22   Non-Musician             1,2,3
7            Male    23   Non-Musician             1,2,3
3.1.2 EEG Acquisition Device
EEG recordings were conducted using the g.MOBIlab portable biosignal acquisition system (4 bipolar EEG channels, Filters: 0.5–30 Hz, Sensitivity: 100 μV, Data acquisition:
A/D converter with 16-bit resolution and sampling frequency of 256 Hz, Data transfer: wireless, Bluetooth 'Class I' technology, meets IEC 60601-1, for research application, no medical use) (g.tec medical & electrical engineering, Guger Technologies). An example of the acquired EEG signal is shown in Fig. 1.

3.1.3 Interface
The experiments were designed and conducted with the Max/MSP 4.5 software (Cycling '74) on a PC (Pentium 3.2 GHz, RAM 1 GB). In order to precisely synchronize the g.MOBIlab device with the Max/MSP software, an external object for Max/MSP was created in C++, using the g.MOBIlab API. Thus, we were able to open and close the device, start the acquisition and store the acquired data in text files (.txt) through Max/MSP. No visualization of the acquired signal was available during the experiments. Separate interfaces were designed for each experiment, providing the essential auditory and visual stimulation. Figure 2(A) shows the designed interface, whereas Fig. 2(B) depicts the system configuration.
Fig. 2. (A) Experiment interface in Max/MSP. (B) System configuration.
3.2 Methods

3.2.1 Experimental Design
Three experiments were designed. In particular:

Experiment 1 (Intention): This experiment consisted of seven consecutive instances. During the first two instances no tone was heard (relax state). At the beginning of the third, fourth, fifth and seventh instances, a tone (A4–440 Hz) was provided to the subjects; the tone was missing in the sixth instance. The importance of the last tone lies in the fact that the subjects should conceive the true ending of the sequence of tones. The time interval between the instances was set at 2 sec and the tone duration was 400 msec. The subjects had their eyes closed. We expected the subjects to conceive the tempo and imagine the missing tone at the beginning of the sixth instance.

Experiment 2 (Intention-Harmonic Analysis): This experiment consisted of four acoustic blocks. Each block analyzed the same 4-voice major chord into its four notes, starting
with the sound of the chord itself. During the chord analysis, each block (apart from the first) omitted one note, chosen randomly and different for each block, but the same for each trial. The time interval between the instances was set at 1.5 sec and the tone duration was 500 msec. The major chord chosen was G+, starting with the G4 (392 Hz) note. The subjects had their eyes closed. The subjects, musicians especially, were expected to conceive the sequence of notes and the underlying intention in order to imagine the omitted ones.

Experiment 3 ('Motion'): This experiment consisted of six consecutive instances. During the first two instances no tone was heard (relax state). At the beginning of each of the following four instances, a tone of ascending pitch ('motion') was heard (E4–329.6 Hz, F4–349.2 Hz, G4–392 Hz and A4–440 Hz, respectively) (auditory stimulus). The subjects could also see the corresponding notes on a music notation scheme on a computer screen (visual stimulus). The time interval between the instances was set at 2 sec and the tone duration was 400 msec. A series of trials was conducted by providing the subjects with auditory and visual stimuli, and another series was conducted by providing the subjects with the auditory stimulus only. The subjects had their eyes open during all trials. Subjects were expected to conceive the 'motion' in both series.

3.2.2 EEG Recordings
The EEG recordings were conducted according to the 10/20 international system of electrode placement. The subjects were wearing the EEG cap provided with the g.MOBIlab device. One bipolar channel was used and the electrodes were placed at the C4 position (if the subject was left handed) or at the C3 position (if the subject was right handed) [4], where sensorimotor activity would most likely be present [14] (see Fig. 1). The subjects sat still during all trials. The acoustic stimulus was provided to the subjects by headphones (Firstline Hi-Fi Stereo Headphones FH91), and the visual stimulation in Experiment 3 could be viewed on the computer screen. A series of trials (see below) per subject and per experiment was conducted (see Fig. 1).

3.2.3 Data Analysis
Off-line data analyses were carried out with MATLAB 7 (MathWorks). A bandpass filter (Butterworth IIR, 6th order, lower cut-off frequency 8 Hz and upper cut-off frequency 12 Hz) was designed in order to isolate the alpha range and, at the same time, the Mu-rhythm. The acquired EEG signals per subject and per experiment were synchronized and averaged across all trials in the time domain, in order to discard random phenomena or other artefacts and produce the underlying evoked potential for each case. The latter was then analyzed using spectral and higher-order spectral analyses, as described below.

3.2.3.1 Bi/Spectral analysis. The filtered signals were scaled and then segmented, using a time window, in several parts according to the experiment's design. The process of segmentation of the signals and the discrimination of the different states, for each experiment, is described below.

Experiment 1: Ten trials were conducted for each subject. The filtered signal from each trial was segmented using a 2 sec time window in seven parts corresponding to each of the seven instances of the experiment. Three states were distinguished: the relax state (first & second instances), the state of auditory stimulation (third, fourth, fifth & seventh instances) and the state during which the subject imagined the missing tone (sixth instance).
Experiment 2: Five trials were conducted for each subject. The filtered signal from each trial was segmented using a 1500 msec time window in 20 parts corresponding to each of the 20 instances. Three states were distinguished: the state in which the subjects listened to the chord, the state in which the subjects listened to the notes, and the state during which the subject imagined the missing notes.

Experiment 3: Five trials with acoustic and visual stimulation and five trials with acoustic and no visual stimulation were conducted for each subject. The filtered signal from each trial was segmented using a 2 sec time window in six parts corresponding to each of the six instances. Two states were distinguished in both cases (acoustic and visual stimulation, acoustic stimulation without visual stimulation): the relax state (first & second instances) and the state of auditory stimulation (third, fourth, fifth & sixth instances).

The power spectral density (based on the Fast Fourier Transform) and the power of the signal were estimated for each part. The mean value and the standard deviation of the Mu-rhythm power corresponding to the different states of each experiment for all trials were estimated for each subject. The averaged and filtered EEG data were segmented using a time window, as described in the spectral analysis section. Third-order statistical analysis was conducted and the bispectrum corresponding to each part was estimated. Definitions of third-order statistics and the conventional bispectrum are described in detail in [16].
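The preprocessing and analysis chain described above can be sketched compactly. The following Python fragment is a minimal illustration, not the authors' MATLAB code; the array layout, the Welch-based band power, and the direct-FFT bispectrum estimator are our assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, welch

FS = 256  # g.MOBIlab sampling frequency (Hz)

# 6th-order Butterworth bandpass isolating the Mu-rhythm (8-12 Hz);
# filtfilt applies it forward and backward for zero phase distortion.
B, A = butter(6, [8 / (FS / 2), 12 / (FS / 2)], btype="band")

def evoked_potential(trials):
    """Filter every trial and average across trials (trials: n_trials x n_samples)."""
    filtered = np.array([filtfilt(B, A, t) for t in trials])
    return filtered.mean(axis=0)

def mu_power(segment):
    """Average spectral power of a segment within the 8-12 Hz band."""
    f, pxx = welch(segment, fs=FS, nperseg=min(256, len(segment)))
    band = (f >= 8) & (f <= 12)
    return pxx[band].mean()

def bispectrum(segment, nfft=256):
    """Direct (FFT-based) bispectrum estimate B(f1, f2) = X(f1) X(f2) X*(f1 + f2)."""
    x = segment - segment.mean()
    X = np.fft.fft(x, nfft)
    f = np.arange(nfft // 2)
    return np.abs(X[f[:, None]] * X[f[None, :]] * np.conj(X[f[:, None] + f[None, :]]))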
4 Results and Discussion

Figure 3 displays, in the form of box plots, the average spectral power of the Mu-rhythm corresponding to the relax state, the state of auditory stimulation, and the state during which the subject imagined the missing tone, for all subjects participating in Experiment 1. These results indicate a Mu-rhythm suppression both during the second and the third state. Between these two states, no significant modulation of the Mu-rhythm was observed among all subjects. Figure 4 shows an example from the bispectrum analysis of the data from the subjects participating in Experiment 1, for instances (steps) 5 and 6. According to Fig. 4, there is (i) a transition of high bispectral coefficients from frequency pairs located on the two axes towards frequency pairs located on the diagonal during the three heard tones, and (ii) a clear 'echoing' effect that appears in the transition from the third tone towards the missing one. As shown in Fig. 4, bispectral coefficients of the missing tone appear at the same frequency pairs as those of the last tone. Moreover, experimental results have shown that the average spectral power of the Mu-rhythm corresponding to the chord state, the notes state and the state during which the subject imagined the missing tone, for all subjects participating in Experiment 2, appears to be stable. Furthermore, analysis of the average spectral power of the Mu-rhythm corresponding to the relax state and the notes state, for all subjects participating in Experiment 3, has indicated a Mu-rhythm suppression during the second state. It is noteworthy that results between musicians and non-musicians showed no statistical difference whatsoever. In the case of the relax state versus the imagined tone during Experiment 1, Mu-rhythm suppression has been observed for all musicians. The non-musicians' response was contradictory, as subject six showed no suppression, whereas subject seven did. Such suppression can be linked directly to the mirror neuron system activation, proposing an underlying procedure of musical intention.

Fig. 3. Experiment 1: Box plots of the estimated average spectral power at each state: Relax state (RS), Auditory stimulation (AS), state in which the subjects imagined the missing tone (Imagined tone)

Fig. 4. Analysis of the EEG signals from Experiment 1; only the primary area of the bispectrum is shown: Step 5 (Fifth instance) – Third heard tone; Step 6 (Sixth instance) – Missing tone; Transition of the bispectrum corresponding to the two aforementioned instances and 'echoing' effect, for all subjects

As far as bispectral analysis is concerned, the transition of the high bispectral coefficients from frequency pairs on the two axes towards frequency pairs located on the diagonal during the three heard tones shows a shifting to non-linearity that appears at higher frequencies (i.e., from 10 to 20 Hz). This non-linearity reveals a self-quadratic phase coupling between these harmonic pairs, implying the generation of harmonics at the sum and difference of the frequencies of the pair. However, this pattern of bispectral coefficients was very similar between the third heard tone and the missing one for all subjects, thus providing us with an 'echoing' effect during this transition. This suggests a continuous function, i.e., the perception of intention. Data analysis from subjects one, two and four, who belong to the musicians' group, from Experiment 2 showed no fluctuation of the Mu-rhythm average spectral power at any
state. On the contrary, in the case of subjects three and five (also belonging to the same group), the Mu-rhythm average spectral power was higher for the third state (imagined notes). Non-musicians also displayed contradictory results: in the case of subject six the Mu-rhythm power was attenuated for the third state, whereas in the case of subject seven the equivalent spectral power was higher. Consequently, no safe conclusion can be drawn concerning the response of MNs from Experiment 2. This was also evident in the bispectrum domain. Furthermore, during Experiment 3, musicians displayed Mu-rhythm attenuation in the case of the relax state versus auditory and visual stimulation. During the no-visual-stimulation trials, subjects one and two displayed a similar response as in the previous trials, whereas subject four did not. The non-musicians' response was once again contradictory in both cases. The aforementioned results suggest that the visual stimulus (visualized by an ascending note on a notation scheme) boosted the perception of motion conveyed by the ascending pitch of the note. In the bispectrum domain, an increase in the bispectrum values is noticed for the case of visual stimulation compared to that of no visual stimulation. Moreover, no visual stimulation has resulted in diffused bispectral values and in a gradual attenuation across the six states (steps) of Experiment 3, implying a gradual transition from non-Gaussianity and non-linearity towards the Gaussian and linear assumption. On the contrary, when the visual stimulation was employed, less diffused bispectral values were noticed and the attenuation was limited between the second and fourth states (steps), exhibiting a gradual increase in the bispectral values when moving from state (step) five to six; hence, implying a shifting towards a non-linear and non-Gaussian behaviour. These results, also observed in the cases of all musicians, justify the Mu-rhythm modulation provoked by the MNs. Nevertheless, for the cases of non-musicians the degree of fluctuation in the intensity of the bispectrum was smaller, both without and with visual stimulation, indicating a smaller sensitivity in the Mu-rhythm modulation provoked by the MNs. This might be explained by the lack of musical experience in distinguishing pitch differences and, hence, in perceiving an ascending motion across the six states of Experiment 3. From the above experimental analysis it is clear that music perception can be viewed as a pattern recognition process by the brain, in analogy to the more familiar pattern recognition processes in the visual system. Our brain carries (or builds through experience) 'templates' against which the signals of incoming sounds are compared; if there is a match with the template that corresponds to a harmonic tone, a musical tone sensation with definite pitch is evoked. In addition, 'motion' and intention in music facilitate this pattern recognition and, in analogy with the optical system, if part of the acoustical stimulus is missing, or if the stimulus is somewhat distorted, the brain is still capable of providing the correct sensation. The MNs, as seen through the Mu-rhythm modulation, seem to support such pitch prediction (correction) and probably could even foster a kind of 'acoustical illusion'. The role of MNs towards this direction seems to be central, and understanding their relation with the music qualities could really expand the way we see knowledge scaffolding in music perception.
5 Conclusions

The response of MN cells to intention and 'motion' involved in musical structures was studied in this paper. EEG recordings from three experiments were conducted on seven subjects, musicians and non-musicians, and spectral and bispectral analyses were
implemented on the data acquired. Experimental results showed that Mu-rhythm suppression reflects an intention-related activity seen in musicians, whereas the bispectral 'echoing' effect supports the idea of an underlying ongoing function, appearing in the transition from heard sound to imagined sound. Moreover, a Mu-rhythm modulation provoked by the MNs was linked to bispectral fluctuations, especially when visual stimulation was combined with an auditory one. Further experiments towards the exploration of the role of MNs in music perception for other sonic qualities, such as timbre, spatial motion, spectromorphology, and harmonic style, are already under way.
References
1. Molnar-Szakacs, I., Overy, K.: Music and mirror neurons: from motion to 'e'motion. SCAN 1, 234–241 (2006)
2. Maess, B., Koelsch, S., Gunter, T., Friederici, A.: Musical syntax is processed in Broca's area: a MEG study. Nature Neuroscience 4, 540–545 (2001)
3. Brown, S., Martinez, M., Parsons, L.: Music and language side by side in the brain: a PET study of the generation of melodies and sentences. European Journal of Neuroscience 23(10), 2791–2803 (2006)
4. Lahav, A., Saltzman, E., Schlaug, G.: Action representation of sound: audiomotor recognition network while listening to newly acquired sounds. The Journal of Neuroscience 27(2), 308–314 (2007)
5. Bangert, M., Peschel, T., Schlaug, G., Rotte, M., Drescher, D., Hinrichs, H., Heinze, H.J., Altenmuller, E.: Shared networks for auditory and motor processing in professional pianists: Evidence from fMRI conjunction. Neuroimage 30, 917–926 (2006)
6. Westerman, G., Miranda, E.R.: Modeling the development of mirror neurons for auditory motor integration. The Journal of New Music Research 31(4), 367–375 (2002)
7. Rizzolatti, G., Craighero, L.: The mirror neuron system. Annual Review of Neuroscience 27, 169–192 (2004)
8. Logothetis, I., Milonas, I.: Logotheti Neurology. University Studio Press, Thessaloniki (2004)
9. Calvo-Merino, B., Grezes, J., Glaser, D., Passingham, R., Haggard, P.: Seeing or doing? Influence of visual and motor familiarity in action observation. Current Biology 16, 1–6 (2006)
10. Rizzolatti, G., Fogassi, L., Gallese, V.: Mirrors in the mind, pp. 30–37. Scientific American (2006)
11. Grezes, J., Costes, N., Decety, J.: The effects of learning and intention on the neural network involved in the perception of meaningless actions. Brain 122, 1875–1887 (1999)
12. Grossman, M.: A central processor for hierarchically-structured material: evidence from Broca's aphasia. Neuropsychologia 18, 299–308 (1980)
13. Pineda, J., Oberman, L., Hubbard, E., McCleery, J., Altschuler, E., Ramachandran, V.: EEG evidence for mirror neuron dysfunction in autism spectrum disorders. Cognitive Brain Research 24(2), 190–198 (2005)
14. Hadjileontiadis, L.J., Panas, S.M.: Higher-order statistics: a robust vehicle for diagnostic assessment and characterization of lung sounds. Technology and Healthcare 5, 359–374 (1997)
Automatic Recognition of Urban Soundscenes

Stavros Ntalampiras¹, Ilyas Potamitis², and Nikos Fakotakis¹

¹ Wire Communications Laboratory, University of Patras
{dallas,fakotaki}@wcl.ee.upatras.gr
² Department of Music Technology and Acoustics, Technological Educational Institute of Crete
[email protected]
Abstract. In this paper we propose a novel architecture for environmental sound classification. In the first section we introduce the reader to the current work in this research field. Subsequently, we explore the usage of Mel frequency cepstral coefficients (MFCCs) and MPEG-7 audio features in combination with a classification method based on Gaussian mixture models (GMMs). We provide details concerning the feature extraction process as well as the recognition stage of the proposed methodology. The performance of this implementation is evaluated by setting up experimental tests on six different categories of environmental sounds (aircraft, motorcycle, car, crowd, thunder, train). The proposed method is fast because it does not require high computational resources, and it therefore covers the needs of a real-time application.

Keywords: Computer Audition, Automatic audio recognition, MPEG-7 audio, MFCC, Gaussian mixture model (GMM).
1 Introduction

Due to the exponential increase in the amount of data used in computer science over the last decades, the need for a robust and user-friendly way to access data has emerged. Nowadays a great amount of information is produced and, therefore, searching algorithms for large metadata databases become a necessity. Another important fact that we have to take into consideration is the spread of the internet. The global network is becoming faster and larger, eliminating the limitations that existed some years ago concerning data transfers. The result of this situation is the increased need to search for and retrieve desired data as fast as possible. Our best allies in these kinds of situations are classification and similarity. Machine learning techniques are widely used in these types of problems, making life easier by proposing automatic methods for the annotation of data collections, which is a time-consuming process. In this way huge collections of databases become searchable without human interference. This work is considered to be a step forward in this direction and improves the automatic recognition of environmental sounds. The scope of our system is to understand the surrounding environment exploiting only the acoustic information, just like humans do unconsciously and constantly. Consider, as an example, the situation where one is sitting on a bench near a harbour. Using only the perceived acoustic information, one is able to understand that a boat is leaving, a car is passing by and a dog is barking. This is the exact human property we are trying to capture. A system possessing this human property can be of great
importance for monitoring and understanding the environment, or even for helping humans decide whether or not they should perform an action in a specific environment. The main difficulty is that it is not possible to create a perfect database of environmental sounds, unlike in speech processing. Nowadays a great deal of research is being conducted in the field of environmental sound classification, resulting in many different implementations. A method based on three MPEG-7 audio low-level descriptors (spectrum centroid, spectrum spread and spectrum flatness) is presented in [1]. As for the classification scheme, a fusion of support vector machines and the k-nearest-neighbour rule is adopted in order to assign a sound to one of several predefined classes of common home environmental sounds. Wold et al. [2] built a system for sound retrieval based on statistical measures of pitch, harmonicity, loudness, brightness and bandwidth. These features are input to a k-nearest-neighbour classifier, which decides the sound class. A different approach is adopted in [3], taking advantage of a multilayer perceptron neural network. The feature taken into account is the one-dimensional combination of the instantaneous spectrum at the power peak and the power pattern in the time domain. Wang et al. [4] show the usage of an MPEG-7 based feature set which is processed by an HMM scheme in order to assign the sound to a defined class. In this work we are trying to find the feature set which includes information that can effectively distinguish among a large variety of different environmental sound categories. As a first step, a comparison is made between the MPEG-7 feature set and the MFCCs, which are typically used in speech recognition. The organization of the paper is as follows. In section 2 we describe the overall architecture of our methodology and the feature extraction processes. Section 3 contains the details concerning the recognition engine, and the last section presents recognition results as well as possible extensions of this work.
2 System Architecture

The overall architecture of the system is rather simple and is described in Fig. 1. First, the sound recording passes through a preprocessing step to prepare it for the feature extraction. After that, Gaussian mixture models are used to compute the probability of each sound class. In the operational phase, the class with the highest probability is assigned to the unknown sound to be recognized.
Fig. 1. System Architecture
2.1 The Feature Extraction Process

As mentioned before, we are going to use two different feature sets in order to evaluate their performance on the task of environmental sound classification. The same parameters are used for both feature sets. The signal is cut into frames of 30 ms with 10 ms time shifts, following the MPEG-7 standard recommendations. It should be noted that the Hamming window is used.

MFCCs
The first feature set consists of the total energy of the frame and the first 12 Mel frequency cepstral coefficients. The MFCC feature extraction is composed of several steps. First, the time domain signal is segmented into overlapping frames. We derive the power of the STFT of these frames, which is then passed through a triangular Mel-scale filterbank that emphasizes the portions of the signal proven to play an important role in human perception. Subsequently the log operator is applied and, for the decorrelation of the features, the discrete cosine transform (DCT) is used. The result of the MFCC feature extraction process is a feature vector consisting of 13 features in total. In Fig. 2 the whole process of the MFCC extraction can be seen.
Fig. 2. MFCC feature extraction process
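The steps above map directly onto a few lines of code. The sketch below is our own minimal Python rendering of the described pipeline, not the authors' implementation; the number of Mel filters and the numerical floor constants are assumptions.

import numpy as np
from scipy.fft import rfft, dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced on the Mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, fs=16000, frame=0.030, hop=0.010, n_filters=24, n_ceps=12):
    """Frame log-energy plus first 12 MFCCs per frame, following the text."""
    n, h = int(frame * fs), int(hop * fs)
    win = np.hamming(n)
    fb = mel_filterbank(n_filters, n, fs)
    feats = []
    for start in range(0, len(signal) - n + 1, h):
        x = signal[start:start + n] * win
        power = np.abs(rfft(x)) ** 2                    # power spectrum of the frame
        logmel = np.log(fb @ power + 1e-10)             # log Mel filterbank energies
        ceps = dct(logmel, type=2, norm="ortho")[1:n_ceps + 1]  # DCT decorrelation
        feats.append(np.r_[np.log(np.sum(x ** 2) + 1e-10), ceps])
    return np.array(feats)                              # shape (n_frames, 13)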
MPEG-7 feature set
The second feature set includes descriptors that are part of the MPEG-7 standard. In order to be fair in the comparison, and to keep the balance between the amounts of data contained in the feature sets, we use the following MPEG-7 descriptors:
- Audio Spectrum Centroid
- Audio Spectrum Spread
- Audio Spectrum Flatness
which have been demonstrated to produce good results [1, 4].

Feature extraction methodology
All of these descriptors are part of the basic spectral descriptors provided by the MPEG-7 standard, and the method for their computation is described in the following section.

I. Audio Spectrum Centroid (ASC)
For the calculation of this feature, the log-frequency spectrum is computed first; the descriptor gives its centre of gravity. To avoid the effect of a non-zero DC component and/or very low frequency components, all of the power
coefficients below 62.5 Hz are summed and represented by a single coefficient. This process gives as output a modified power spectrum p_i, as well as a new representation of the corresponding frequencies f_i. For a given frame, the ASC is defined from the modified power coefficients and their frequencies as:

ASC = \frac{\sum_i \log_2(f_i/1000)\, p_i}{\sum_i p_i}    (1)
The derivation of this descriptor provides information about the dominant frequencies (high or low), as well as perceptual information on timbre (i.e., sharpness).

II. Audio Spectrum Spread (ASS)
This feature corresponds to another simple measure of the signal's spectral shape. ASS (also called instantaneous bandwidth) is defined as the second central moment of the log-frequency spectrum. For a given frame, it can be extracted by taking the root-mean-square (RMS) deviation of the spectrum from its centroid ASC:

ASS = \sqrt{\frac{\sum_i \left(\log_2(f_i/1000) - ASC\right)^2 p_i}{\sum_i p_i}}    (2)
where p_i and f_i represent the same quantities as before. This descriptor indicates the way the spectrum of the signal is distributed around its centroid frequency. If its value is low, the spectrum may be concentrated around the centroid, while a high value shows the opposite. Its purpose is to differentiate tone-like and noise-like sounds.

III. Audio Spectrum Flatness (ASF)
The ASF descriptor is designed to expose how flat a particular portion of the signal is. For a given frame, it consists of a series of values, each one expressing the deviation of the signal's power spectrum from a flat shape. In order to obtain the values of the descriptor, the power coefficients are computed from non-overlapping frames (window length = time shift). Subsequently the spectrum is divided into 1/4-octave-resolution, overlapping frequency bands which are logarithmically spaced. The ASF of a band is calculated as the ratio of the geometric mean and the arithmetic mean of the spectral power coefficients within the band:

ASF = \sqrt[N]{\prod_{n=1}^{N} c_n} \Bigg/ \left(\frac{1}{N}\sum_{n=1}^{N} c_n\right)    (3)
where N is the number of coefficients within a subband and c_n is the n-th spectral power coefficient of the subband. This feature makes it possible to effectively separate noise-like (or impulsive) sounds from harmonic sounds; psychoacoustics tells us that a large deviation from a flat shape generally indicates tonal sounds. The computation of the MPEG-7 audio features (alternatively called Low Level Descriptors) results in a 21-dimensional feature vector, since the ASF descriptor alone has 19 coefficients. In Fig. 3 we depict the log Mel filterbank and the MPEG-7 descriptors for a part of the same file, belonging to the category Motorcycle, so as to visualize the differences between the two feature sets.
Fig. 3. Mel filterbank and feature values against the sound's frames (panels: signal waveform, spectrogram, log Mel filterbank, and MPEG-7 features)
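For concreteness, the three descriptors defined in equations (1)-(3) can be computed from a frame's power spectrum as sketched below. This is a simplified Python rendering, not the MPEG-7 reference implementation; the representative frequency assigned to the summed low-frequency bin and the ASF band edges are assumptions.

import numpy as np

def asc_ass(power, freqs):
    """Audio Spectrum Centroid and Spread, Eqs. (1)-(2). Power coefficients
    below 62.5 Hz are summed into a single low-frequency term."""
    low = freqs < 62.5
    p = np.r_[power[low].sum(), power[~low]]
    f = np.r_[31.25, freqs[~low]]   # representative frequency for the summed bin (assumption)
    logf = np.log2(f / 1000.0)
    asc = np.sum(logf * p) / np.sum(p)
    ass = np.sqrt(np.sum((logf - asc) ** 2 * p) / np.sum(p))
    return asc, ass

def asf(power, freqs, lo=250.0, n_bands=19):
    """Audio Spectrum Flatness, Eq. (3): geometric over arithmetic mean of the
    power coefficients within 1/4-octave bands (band edges are an assumption)."""
    out = []
    for k in range(n_bands):
        f1, f2 = lo * 2 ** (k / 4.0), lo * 2 ** ((k + 1) / 4.0)
        c = power[(freqs >= f1) & (freqs < f2)]
        if len(c) == 0:
            out.append(1.0)
            continue
        gmean = np.exp(np.mean(np.log(c + 1e-12)))
        out.append(gmean / (np.mean(c) + 1e-12))
    return np.array(out)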
3 Recognition Engine

The recognition process, which is based on Gaussian mixture models, is described in Fig. 4. A linear combination of Gaussians represents the probability density function of a mixture model. Concerning the experiments, a standard version of the Expectation-Maximization algorithm was used, with k-means initialization, for the training of the models. The number of Gaussian mixtures for all the experiments was 4, while each density is described by a covariance matrix of diagonal form.

Fig. 4. GMM classification process (the signal is scored by the GMM of each sound class i, producing P(s|M_i); the decision is Class = argmax_i P(s|M_i), 1 ≤ i ≤ n)

At the stage of the experimental set-up, the feature streams of each sound are passed to the trained Gaussian mixture models. The probabilities produced by all models are compared, and the class with the highest probability represents the system decision. In the test phase we applied the 10-fold cross-validation method and the results were averaged for each category. The procedure that was followed during the whole process was identical for both feature sets (MFCC and MPEG-7 features), so that the comparison of their performance is reliable.
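The training and decision rule just described can be sketched as follows; this is a minimal Python sketch using scikit-learn rather than the authors' toolchain, and all function and variable names are ours.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_per_class):
    """One 4-component, diagonal-covariance GMM per sound class, trained with
    EM; init_params="kmeans" gives the k-means initialization described above."""
    models = {}
    for label, feats in features_per_class.items():   # feats: (n_frames, n_features)
        gmm = GaussianMixture(n_components=4, covariance_type="diag",
                              init_params="kmeans", max_iter=200, random_state=0)
        models[label] = gmm.fit(feats)
    return models

def classify(models, feats):
    """Sum the per-frame log-likelihoods under each class model and
    pick the class with the maximum total probability."""
    scores = {label: m.score_samples(feats).sum() for label, m in models.items()}
    return max(scores, key=scores.get)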
4 Results

The data were obtained from recordings found on the internet, due to the unavailability of such a sound corpus. There are six classes of sounds, consisting of aircraft (110), motorcycle (46), car (81), crowd (60), thunder (60) and train (82), and there is a lot of variability within each of them. All the sounds were downsampled to 16 kHz, 16 bit, and their average length is 25.2 seconds. In Tables 1 and 2 the confusion matrices of the proposed methodology are provided. At this point we must stress that the evaluation was made with a frame-based decision of the class, in order to obtain reliable results. The MFCC feature set achieves better recognition rates, as shown in Table 1, indicating that it is able to capture more discriminative information than the MPEG-7 feature set. Furthermore, it can be observed that both feature sets tend to confuse the same classes. We conclude that the MPEG-7 descriptors achieve an overall accuracy of 65%, having their best performance in the thunder category, while the MFCCs result in 78.3% overall accuracy, with their best performance in the train category.
Table 1. Confusion Matrix (MFCC); rows: presented class, columns: responded class (%)

             Aircraft  Motorcycle  Car    Crowd  Thunder  Train
Aircraft       64.1       2.2      11.2    6.4    11       5.1
Motorcycle      2.7      81.6       9.6    4.6     1.4     0.1
Car            11.6       5.2      62      9.6     8.5     3.1
Crowd           3.2       1.7      11     82.4     0.8     0.9
Thunder         6.5       0.6       2.1    0.3    88.4     2.1
Train           4.3       1.6       1.2    1.4     0.1    91.4
Table 2. Confusion Matrix (MPEG-7)
5 Conclusions and Future Work

In this work we evaluated the performance of two well-known feature sets on the task of environmental sound classification, and it was shown that the MFCCs outperform the MPEG-7 descriptors. Our future work includes the incorporation of more sound classes and signal separation, as well as the usage of several techniques, like silence detection, that may improve the recognition performance.
References
1. Wang, J.-C., Wang, J.-F., Kuok, W.-H., Hsu, C.-S.: Environmental Sound Classification Using Hybrid SVM/KNN Classifier and MPEG-7 Audio Low-Level Descriptor. In: International Joint Conference on Neural Networks (2006)
2. Wold, E., Blum, T., Keislar, D., Wheaton, J.: Content based classification search and retrieval of audio. IEEE Multimedia Magazine 3, 27–36 (1996)
3. Toyoda, Y., Huang, J., Ding, S., Liu, Y.: Environmental sound recognition by multilayered neural networks. In: Proceedings of the Fourth International Conference on Computer and Information Technology, pp. 123–127 (2004)
4. Wang, J.-F., Wang, J.-C., Huang, T.-H., Hsu, C.-S.: Home environmental sound recognition based on MPEG-7 features. In: Circuits and Systems, MWSCAS 2003, vol. 2, pp. 682–685 (2003)
5. Casey, M.A.: MPEG-7 sound recognition tools. IEEE Transactions on Circuits and Systems for Video Technology 11(6), 737–747 (2001)
6. Kim, H.-G., Moreau, N., Sikora, T.: MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley, Chichester (2005)
7. Nabney, I.: Netlab: Algorithms for Pattern Recognition. Springer, London (2002)
Low Bitrate Coding of Spot Audio Signals for Interactive and Immersive Audio Applications

Athanasios Mouchtaris, Christos Tzagkarakis, and Panagiotis Tsakalides

Department of Computer Science, University of Crete, and Institute of Computer Science (FORTH-ICS), Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
{mouchtar,tzagarak,tsakalid}@ics.forth.gr

Abstract. In the last few years, a revolution has occurred in the area of consumer audio. Similarly to the transition from analog to digital sound that took place during the 80s, we have been experiencing the transition from 2-channel stereophonic sound to multichannel sound (e.g., 5.1 systems). Future audiovisual systems will not make distinctions regarding whether the user will be watching a movie or listening to a music recording; they are envisioned to offer a realistic experience to the user, who will be immersed into the content, implying that the user will be able to interact with the content according to his will. In this paper, an encoding procedure is proposed, focusing on spot microphone signals, which are necessary for providing interactivity between the user and the environment. A model is proposed which achieves high-quality audio reproduction with side information for each spot microphone signal in the order of 19 kbps.
1 Introduction

Similarly to the transition from analog to digital sound that took place during the 80s, these last years we have been experiencing the transition from 2-channel stereophonic sound to multichannel sound. This transition has shown the potential of multichannel audio to surround the listener with sound and offer a more realistic acoustic scene compared to 2-channel stereo. Current multichannel audio systems place 5 or 7 loudspeakers around the listener in pre-defined positions, plus a loudspeaker for low-frequency sounds (5.1 [1] and 7.1 multichannel systems), and are utilised not only for film but also for audio-only content. Multichannel audio offers the advantage of improved realism compared to 2-channel stereo sound, at the expense of increased information for the storage and transmission of this medium. This is important in many network-based applications, such as Digital Radio and Internet audio. At a point where MPEG Surround (explained in the following paragraph) achieves coding rates for 5.1 multichannel audio that are similar to MP3 coding rates for 2-channel stereo, it might seem that research in audio coding has no future. However,
This work has been funded by the Marie Curie TOK “ASPIRE” grant within the 6th European Community Framework Program.
this is far from the truth. Current multichannel audio formats will eventually be substituted by more advanced formats. Future audiovisual systems will not distinguish whether the user is watching a movie or listening to a music recording; audiovisual systems of the future are envisioned to offer a realistic experience to the user, who will be immersed into the content. As opposed to listening and watching, the passive voice of immersed implies that the user's environment will be seamlessly transformed into the environment of his/her desire, the user being able to interact with the content according to his/her will. Using a large number of loudspeakers is useless if there is no increase in the content information. Immersive audio is largely based on enhanced audio content, which translates into using a large number of microphones (known as spot recordings) for obtaining a recording containing as many sound sources as possible. These sources offer increased sound directions around the listener, but are also useful for providing interactivity between the user and the audio environment. The increase in audio content, combined with the strict requirements regarding processing, network delays, and losses in the coding and transmission of immersive audio content, raises issues that can be addressed based on the proposed methodology.

The proposed approach in this paper is an extension of multichannel audio coding. For 2-channel stereo sound, the importance of decreasing the bitrate in a music recording has been made apparent within the Internet audio domain with the proliferation of MP3 audio coding (MPEG-1 Layer III [2, 3]). MP3 audio coding allows for coding of stereo audio with rates as low as 128 Kbit/sec for high-quality audio (CD-like or transparent quality). Multichannel sound, as the successor of 2-channel stereo, has been in the focus of all audio coding methods since the early 1990s. MPEG-2 Advanced Audio Coding (AAC) [4, 5] and Dolby AC-3 [6] were proposed among others and truly revolutionised the delivery of multichannel sound, allowing for bitrates as low as 320 Kbit/sec for 5.1 audio (transparent quality). These methods were soon adopted by all audio-related applications, such as newer versions of Internet music files (Apple's iTunes) and Digital Television (DTV).

In the audio coding methods mentioned in the previous paragraph, the concept of perceptual audio coding has been of central importance. Perceptual audio coding refers to colouring the coding noise in the frequency domain, so that it will be inaudible to the human auditory system. However, early on it was apparent that coding methods that exploit interchannel redundancy (for 2-channel or multichannel audio) were necessary for achieving the best coding results. In MPEG-1 and MPEG-2 audio coding, Mid/Side [7] and Intensity Stereo Coding [8] were employed. The former operated on the audio channels in an approximate Karhunen-Loeve-type approach for decorrelation of the channel samples, while the latter was applied to higher frequency bands by exploiting the fact that the auditory image in these bands can be retained by only using the energy envelope of each channel at each short-time audio segment. In early 2007, a new standard for very low bitrate coding of multichannel audio became an International Standard under the name MPEG Surround [9]. MPEG Surround allows for coding of multichannel audio content with rates as low
as 64 Kbit/sec for transparent quality. It is based on Binaural Cue Coding (BCC) [10] and Parametric Stereo (PS) [11]. Both methods operate on the same philosophy, which is to capture (at the encoder) and re-synthesise (at the decoder) the cues needed for sound localisation by the human auditory system. In this manner, it is possible to recreate the original spatial image of the multichannel recording by encoding only one monophonic audio downmix signal (the sum of the various audio channels of a particular recording), as well as the binaural cues, which constitute only a small amount of additional (side) information. MPEG Surround and (related) AAC+ are expected to replace the current MP3 and AAC formats for Internet audio, and to dominate in broadcasting applications.

Immersive audio, as opposed to multichannel audio, is based on providing the listener the option to interact with the sound environment. This translates, as explained later in this paper, into different objectives in the content to be encoded and transmitted, which cannot be fulfilled by current multichannel audio coding approaches. Our goal is to introduce mathematical models specifically directed towards immersive audio, for compressing the content and allowing model-based reconstruction of lost or delayed information. Our aspirations are towards finally implementing long-proposed ideas in the audio community, such as (network-based) telepresence of a user in a concert hall performance in real time, implying interaction with the environment, e.g., being able to move around in the hall and appreciate the hall acoustics; virtual music performances, where the musicians are located all around the world; collaborative environments for the production of music; and so forth.

In this paper, the sinusoids plus noise model (henceforth denoted as SNM for brevity), which has been used extensively for monophonic audio signals, is introduced in the context of low-bitrate coding for immersive audio. As in the SAC (Spatial Audio Coding) method for low bitrate multichannel audio coding, our approach is to encode one audio channel only (which can be one of the spot signals or a downmix), while for the remaining spot signals we retain only the parameters that allow for resynthesis of the content at the decoder. These parameters are the sinusoidal parameters (harmonic part) of each spot signal, as well as the short-time spectral envelope (estimated using Linear Predictive (LP) analysis) of the sinusoidal noise component of each spot signal. These parameters are not as demanding in coding rate as the true noise part of the SNM model. For this reason, the noise part of only the reference signal is retained; during the resynthesis of each spot signal, its harmonic part is added to the noise part, which is recreated by using the corresponding noise envelope with the noise residual obtained from the reference channel. This procedure has been described in our recent work as noise transplantation [12], and is based on the observation that the noise components of the spot signals of the same multichannel recording are very similar when the harmonic part has been captured with an appropriate number of sinusoids. In this paper, we focus on describing the coding stage of the model parameters, and on defining the lower limits in terms of bitrate that our proposed system can achieve. The coding of the sinusoidal parameters is based
on the high-rate quantization scheme of [13], while the encoding process of the noise envelope is based on the vector quantization method described in [14].
2 Modeling Methodology

Initially, we briefly explain how interactivity can be achieved using the multiple microphone recordings (spot microphone signals) of a particular multichannel recording. The number of these multiple microphone signals is usually higher than the available loudspeakers; thus a mixing process is needed when producing a multichannel audio recording. We place emphasis on the mixing of the multimicrophone audio recordings on the decoder side. Remote mixing is imperative for immersive audio applications, since it offers the amount of freedom for the creation of the content that is needed for interactivity. Thus, in immersive audio applications, current multichannel audio coding methods are not sufficient. This is due to the fact that, for audio mixing (remote or not), not only the spatial image but the content of each microphone recording must be encoded, so that the audio engineer will have full control of the content. We note that remote mixing, when the user is not an experienced audio engineer, can be accomplished in practice by storing at the decoder a number of predefined mixing 'files' that have been created by experts for each specific recording. The limitations of transmitting the microphone recordings through a low-bandwidth medium (e.g., the Internet or wireless channels) are due to: (i) the increase in the audio channels, which translates into the need for high transmission rates which are not available, and (ii) network delays and losses, which are unacceptable in high-quality real-time audio applications. In order to address these problems, we propose using the source/filter and sinusoidal models. The source/filter model [15] segments the signal into short (around 30 ms) segments, and the spectral envelope of each segment is modelled (e.g., by linear prediction) using a small number of coefficients (filter part). The remaining modelling error has the same number of samples as the initial segment (source part), and contains important spectral information. For speech signals, the source part theoretically contains the integer multiples of the pitch, so it can be modelled using a small number of coefficients. Many speech compression methods are based on this concept. However, for audio signals, methods for reducing the dimensionality of the source signal while retaining high quality have not yet been derived. We have recently found that multiresolution estimation of the filter parameters can greatly improve the modelling performance of the filter model. We were then able to show that the source/filter model can separate the spot microphone signals of a multimicrophone recording into a part that is specific to each microphone (filter) and a part which can be considered common to all signals (source) [16]. Thus, for each spot recording we can encode only its filter part (using around 10 Kbit/sec), while one reference audio signal (which can be a downmix) must be fully encoded, e.g., using MP3. Our aforementioned method introduces an amount of correlation between the recordings (crosstalk), and is not suitable for some audio signals (e.g., transients).
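As a concrete illustration of the source/filter split described above, the following Python fragment is a minimal sketch under our own naming; the autocorrelation method is a standard choice, not a statement of the authors' exact implementation.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_analysis(segment, order=10):
    """Autocorrelation-method linear prediction for one ~30 ms segment.
    Returns the whitening filter a = (1, -b1, ..., -bp), i.e. the filter part
    (spectral envelope), and the residual obtained by inverse filtering,
    i.e. the source part."""
    r = np.correlate(segment, segment, mode="full")[len(segment) - 1:]
    coeffs = solve_toeplitz(r[:order], r[1:order + 1])  # solve the normal equations
    a = np.r_[1.0, -coeffs]
    residual = lfilter(a, [1.0], segment)
    return a, residual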
These problems can be overcome by additional use of the sinusoidal model [17, 18, 19]. It has been applied to speech and audio signals and is based on retaining (for each segment) only the prominent spectral peaks. The sinusoidal parameters alone cannot model audio signals with enough accuracy. Representing the modelling error is an important problem for enhancing the low audio quality of the sinusoids-only model. It has been proposed that the error signal can be modelled by only retaining its spectral envelope (e.g., [19, 20, 21]). The sinusoids plus noise model (SNM) represents a signal s(n), with harmonic nature, as the sum of a predefined number of sinusoids (harmonic part) and a noise term (stochastic part) e(n) (for each short-time analysis frame):

s(n) = \sum_{l=1}^{L} \alpha_l \cos(\omega_l n + \phi_l) + e(n), \quad n = 0, \ldots, N-1,    (1)
where L denotes the number of sinusoids, \{\alpha_l, \omega_l, \phi_l\}_{l=1}^{L} are the constant amplitudes, frequencies and phases, respectively, and N is the length (in samples) of the analysis short-time frame of the signal. The noise component is also needed for representing the noise-like part of audio signals, which is audible and is necessary for high-quality resynthesis. The noise component can be computed by subtracting the harmonic component from the original signal. Modeling the noise component is a challenging task. We follow the popular approach of modeling e(n) as the result of filtering a residual noise component with an autoregressive (AR) filter that models the noise spectral envelope, i.e.,

e(n) = \sum_{i=1}^{p} b(i)\, e(n-i) + r_e(n),    (2)
where r_e(n) is the residual of the noise and p is the AR filter order, while the vector b = (1, −b(1), −b(2), ..., −b(p))^T represents the spectral envelope of the noise component e(n) and can be obtained by LP analysis. In the remainder of the paper, we refer to e(n) as the (sinusoidal) noise signal, and to r_e(n) as the residual (noise) of e(n). Fully parametric models under the SNM degrade audio quality, since the residual of the original audio signals is discarded and replaced by (filtered) white noise or a parametrically generated signal. Thus, so far the sinusoidal model (like the source/filter model) has been considered useful (for audio) only in low-bitrate, low-quality applications (e.g., scalable audio coding in MPEG-4). The idea in our research is to apply our findings on the source/filter model not to the actual audio signal but to the sinusoidal error signal. Our preliminary efforts have shown that this 'noise transplantation' procedure is indeed valid and can overcome the problems of crosstalk and transient sounds, since even only a few sinusoidal coefficients can capture the significant components of an audio signal. In fact, by using our approach, the number of sinusoidal coefficients can be greatly decreased compared to current sinusoidal models, due to the improved accuracy in the noise modelling of the proposed multiresolution source/filter model.
In more detail, consider a collection of M microphone signals that correspond to the same multichannel recording and thus have similar acoustical content. We model and encode as a full audio channel only one of the signals (alternatively it can be a downmix, e.g., a sum signal), which is the reference signal. The remaining (side) signals are modeled by the SNM, retaining their sinusoidal components and the noise spectral envelope (filter b in (2)). In order to reconstruct the side signals, we obtain the LP residual of the reference channel's noise signal. Each side microphone signal is reconstructed using its sinusoidal (harmonic) component and its noise LP filter. Specifically, its harmonic component is added to the noise component that is obtained by filtering, with the signal's LP noise shaping filter, the LP residual of the sinusoidal noise from the reference signal. In this manner, we avoid encoding the residual of each of the side signals. This is important, since this signal is of highly stochastic nature and cannot be adequately represented using a small number of parameters (thus, it is highly demanding in bitrate for accurate encoding). We note that modeling this signal with parametric models results in low-quality audio resynthesis; in our previous work [12] we have shown that our noise transplantation method can result in significantly better quality audio modeling compared to parametric models for the residual signal. We obtained subjective scores around 4.0 using as few as 10 sinusoids, which is very important for low bitrate coding. For decoding, the proposed model operates as follows. The reference signal (Signal 1) is fully encoded (e.g., using an MP3 encoder at 64 kbps), while the remaining microphone signals are reconstructed using the quantized sinusoidal and LP parameters, using the LP residual obtained from the reference channel.
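The per-frame reconstruction at the decoder can be summarized in a few lines. The sketch below is our own minimal Python rendering of the noise transplantation step (function and variable names are assumptions; a_ref and b_side denote the (1, −b(1), ..., −b(p)) LP vectors of the reference and side channels, per equation (2)).

import numpy as np
from scipy.signal import lfilter

def harmonic_part(amps, freqs, phases, n_samples):
    """Resynthesize the harmonic component from the quantized sinusoidal
    parameters of one frame, following equation (1)."""
    n = np.arange(n_samples)
    return sum(a * np.cos(w * n + p) for a, w, p in zip(amps, freqs, phases))

def transplant_noise(noise_ref, a_ref, b_side):
    """Whiten the reference channel's sinusoidal noise with its own LP filter,
    then re-shape the residual with the side channel's LP noise envelope."""
    residual_ref = lfilter(a_ref, [1.0], noise_ref)   # LP residual of the reference noise
    return lfilter([1.0], b_side, residual_ref)       # side-channel noise component

def reconstruct_side_frame(sin_params, b_side, noise_ref, a_ref, n_samples):
    amps, freqs, phases = sin_params
    return harmonic_part(amps, freqs, phases, n_samples) + \
           transplant_noise(noise_ref, a_ref, b_side)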
3 Coding Methodology

The second part of our method is the coding procedure. It can be divided into two tasks: the quantization of the sinusoidal parameters and the quantization of the noise spectral envelopes for each side signal (for each short-time frame).

3.1 Coding of the Sinusoidal Parameters
We adopt the coding scheme of [13], developed for jointly optimal quantization of sinusoidal frequencies, amplitudes and phases. Due to space limitations, and since the details of the coding can be found in [13], we only provide here the final equations. More specifically, the quantization point densities g_A(\alpha), g_\Omega(\omega) and g_\Phi(\phi) (corresponding to amplitude, frequency, and phase, respectively) are given by the following equations:

g_A(\alpha) = g_A = \frac{w_\alpha^{1/6}\, 2^{\frac{1}{3}\tilde{H} - \frac{2}{3} b(A)}}{w_g^{1/6}\, (N^2/12)^{1/6}},    (3)

g_\Omega(\omega, \alpha) = g_\Omega(\alpha) = \frac{\alpha\, w_\alpha^{1/6}\, (N^2/12)^{1/3}\, 2^{\frac{1}{3}\tilde{H} + \frac{1}{3} b(A)}}{w_g^{1/6}},    (4)

g_\Phi(\phi, \alpha, w_l) = g_\Phi(\alpha, w_l) = \frac{\alpha\, w_l^{1/2}\, 2^{\frac{1}{3}\tilde{H} + \frac{1}{3} b(A)}}{w_\alpha^{1/3}\, w_g^{1/6}\, (N^2/12)^{1/6}},    (5)
where w_\alpha and w_g are the arithmetic and geometric means of the perceptual weights of the L sinusoids, respectively, \tilde{H} = H - h(A) - h(\Omega) - h(\Phi), and b(A) = \int f_A(\alpha) \log_2(\alpha)\, d\alpha. The quantities h(A), h(\Omega) and h(\Phi) are the differential entropies of the amplitude, frequency and phase variables, respectively, while f_A(\alpha) denotes the marginal pdf of the amplitude variable.
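As a numerical illustration of the b(A) term (our own toy example, not taken from [13]): for amplitudes uniformly distributed on (0, 1], f_A(\alpha) = 1, and the integral evaluates in closed form,

b(A) = \int_0^1 \log_2(\alpha)\, d\alpha = \frac{1}{\ln 2}\int_0^1 \ln(\alpha)\, d\alpha = \frac{-1}{\ln 2} \approx -1.44 \text{ bits},

which simply shifts rate among the three quantizers through the exponents in (3)-(5).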
3.2 Coding of the Spectral Envelopes
The second group of parameters that need to be encoded for each spot signal are the spectral envelopes of the sinusoidal noise. We follow the quantization scheme of [14]. The LP coefficients of each spot signal that model the noise spectral envelope are transformed to LSFs (Line Spectral Frequencies), which are modeled by means of a Gaussian Mixture Model (GMM). Then, the Karhunen-Loève Transform (KLT) decorrelates each LSF vector for each time segment. The decorrelated components can be independently quantized by a non-uniform quantizer (compressor, uniform quantizer and expander). Each LSF vector is classified to only one of the GMM clusters. This classification is performed in an analysis-by-synthesis manner: for each LSF vector, the Log Spectral Distortion (LSD) is computed for each GMM class (the distortion between the spectral envelopes obtained from the original and the quantized LSF vectors), and the vector is classified to the cluster associated with the minimal LSD.
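The analysis-by-synthesis class selection can be sketched as follows (a minimal Python sketch; the per-class quantize step and the lsf_to_lpc conversion are hypothetical helpers standing in for the per-class KLT plus non-uniform scalar quantizer of [14]).

import numpy as np
from scipy.signal import freqz

def log_spectral_distortion(a, a_q, nfft=512):
    """LSD (in dB) between the LP envelopes 1/|A| of the original
    and the quantized LSF vectors."""
    _, h = freqz([1.0], a, worN=nfft)
    _, h_q = freqz([1.0], a_q, worN=nfft)
    d = 20 * np.log10(np.abs(h) + 1e-12) - 20 * np.log10(np.abs(h_q) + 1e-12)
    return np.sqrt(np.mean(d ** 2))

def encode_lsf(lsf, class_quantizers, lsf_to_lpc):
    """Quantize with every GMM class's quantizer and keep the class whose
    reconstruction yields the minimal LSD (analysis-by-synthesis)."""
    a = lsf_to_lpc(lsf)
    candidates = {k: q.quantize(lsf) for k, q in class_quantizers.items()}
    best = min(candidates,
               key=lambda k: log_spectral_distortion(a, lsf_to_lpc(candidates[k])))
    return best, candidates[best]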
4 Results

In this section, we examine the coding performance of our proposed system with respect to the resulting audio quality. For this purpose we performed subjective (listening) tests. We employed the Degradation Category Rating (DCR) test, in which listeners grade the coded versus the original waveform using a 5-scale grading system (from 1, 'very annoying' audio quality compared to the original, to 5, 'not perceived' difference in quality). For our listening tests, we used three signals, referred to as Signals 1-3. These signals are parts of a multichannel recording of a concert hall performance. We used the recordings from two different microphones, one of which captured mainly the female voices of the orchestra chorus, while the second one captured mainly the male voices. The former was used in our experiments as the side channel, and the latter as the reference signal. Thus, the objective is to test whether the side signal can be accurately reproduced when using the residual from the reference signal. We note that in our previous work [12], we showed that the proposed noise transplantation approach results in very good quality (around a 4.0 grade in DCR tests in most cases) for various music signals, with the number of sinusoids per frame as low as 10. Thus, in this section our objective is to examine the lower
limit in bitrates which can be achieved by our system without loss of audio quality below the grade achieved by modeling alone (i.e., a 4.0 grade for the three signals tested here).

Regarding the parameters used for deriving the waveforms used in the tests, the sampling rate for the audio data was 44.1 kHz and the LP order for the AR noise shaping filters was 10. The analysis/synthesis frame for the implementation of the sinusoidal model is 30 msec, with 50% overlapping between successive frames. The coding efficiency for the sinusoidal parameters was tested for a given (target) entropy of 28 and 20 bits per sinusoid (amplitudes, frequencies and phases in total), which gives a bitrate of 19.6 kbps and 14.2 kbps, respectively. Regarding the coding of the LP parameters (noise spectral envelope), 28 bits were used per LSF vector. With a 23 msec frame and 75% overlapping, this corresponds to 4.8 kbps for the noise envelopes. Thus, the resulting bitrates that were tested are 24.4 kbps and 19 kbps (adding the bitrate of the sinusoidal parameters and the noise envelopes). A training audio dataset of about 100,000 LSF vectors (approximately 9.5 min of audio) was used to estimate the parameters of a 16-class GMM. The training database consisted of recordings of the classical music performance (corresponding to the recording from which Signals 1-3 originated, but a different part of the recording than the one used for testing). Details about the implementation of the coding procedure for the LP parameters can be found in our earlier work [16].

Fig. 1. Results from the quality rating DCR listening tests for Signals 1-3, corresponding to coding with (a) 24.4 kbps (dotted) and (b) 19 kbps (solid); the vertical axis spans the DCR grades from 'very annoying' to 'not perceived'. Each frame is modeled with 10 sinusoids and 10 LP parameters.

Eleven volunteers participated in the DCR tests, using high-quality headphones. The results of the DCR tests are depicted in Fig. 1, where the 95% confidence intervals are shown (the vertical lines indicate the confidence limits). The solid line shows the results for the case of coding with a bitrate of 19 kbps, while the dotted line shows the results for the 24.4 kbps case. The results of the
figure verify that the quality of the coded audio signals is good, that the proposed algorithm offers encouraging performance, and that this quality can be maintained at rates as low as 19 kbps per side signal. We note that the reference signal was PCM coded with 16 bits per sample; however, similar results were obtained for the side signals when the reference signal was MP3 coded at 64 kbps (monophonic case).
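As a sanity check on the bitrate accounting above, the per-stream rates can be re-derived from the frame parameters. The sketch below is our own illustration, not the authors' code; the frame-rate and rounding conventions in it are assumptions, and the small gaps to the reported figures presumably reflect entropy-coding details not stated here.

```python
# Illustrative bitrate accounting for the side-signal coder (assumed conventions).
def stream_kbps(bits_per_unit, units_per_frame, frame_ms, overlap):
    hop_s = frame_ms / 1000.0 * (1.0 - overlap)  # hop between successive frames
    return bits_per_unit * units_per_frame / hop_s / 1000.0

# Sinusoidal parameters: 10 sinusoids/frame at 28 or 20 bits each,
# 30 msec frames with 50% overlap (15 msec hop).
sin_hi = stream_kbps(28, 10, 30.0, 0.50)  # ~18.7 kbps (paper: 19.6 kbps)
sin_lo = stream_kbps(20, 10, 30.0, 0.50)  # ~13.3 kbps (paper: 14.2 kbps)

# Noise envelope: one 28-bit LSF vector per 23 msec frame with 75% overlap.
lsf = stream_kbps(28, 1, 23.0, 0.75)      # ~4.9 kbps (paper: 4.8 kbps)

print(sin_hi + lsf, sin_lo + lsf)         # close to the tested 24.4 and 19 kbps
```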
5 Conclusions

In this paper a novel modeling approach, namely noise transplantation, was proposed for enabling interactive and immersive applications of high-quality audio at low bitrates. The approach is based on applying the sinusoidal model to spot microphone signals, i.e., the multiple audio recordings obtained before the mixing process which produces the final multichannel mix. It was shown that these signals can be encoded collectively using a bitrate as low as 19 kbps per spot signal. Further research efforts are necessary in order to achieve even lower bitrates while preserving high audio quality, and a more detailed testing procedure using subjective methods is currently underway.
Acknowledgments

The authors wish to thank Prof. Y. Stylianou for his insightful suggestions and for his help with the implementation of the sinusoidal model algorithm, Prof. C. Kyriakakis for providing the audio recordings used in the experiments, as well as the listening test volunteers.
References

1. ITU-R BS.1116: Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. International Telecommunications Union, Geneva, Switzerland (1994)
2. ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard ISO/IEC 11172-3: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s (1992)
3. Brandenburg, K.: MP3 and AAC explained. In: Proc. 17th International Conference on High Quality Audio Coding of the Audio Engineering Society (AES) (September 1999)
4. ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard ISO/IEC 13818-7: Generic coding of moving pictures and associated audio: Advanced audio coding (1997)
5. Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., Dietz, M., Herre, J., Davidson, G., Oikawa, Y.: ISO/IEC MPEG-2 advanced audio coding. In: Proc. 101st Convention of the Audio Engineering Society (AES), preprint No. 4382, Los Angeles, CA (November 1996)
6. Davis, M.: The AC-3 multichannel coder. In: Proc. 95th Convention of the Audio Engineering Society (AES), preprint No. 3774, New York, NY (October 1993)
7. Johnston, J.D., Ferreira, A.J.: Sum-difference stereo transform coding. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 569–572 (1992)
8. Herre, J., Brandenburg, K., Lederer, D.: Intensity stereo coding. In: Proc. 96th Convention of the Audio Engineering Society (AES), preprint No. 3799 (February 1994)
9. Breebaart, J., Herre, J., Faller, C., Roden, J., Myburg, F., Disch, S., Purnhagen, H., Hotho, G., Neusinger, M., Kjorling, K., Oomen, W.: MPEG Spatial Audio Coding / MPEG Surround: Overview and current status. In: Proc. AES 119th Convention, Paper 6599, New York, NY (October 2005)
10. Baumgarte, F., Faller, C.: Binaural Cue Coding - Part I: Psychoacoustic Fundamentals and Design Principles. IEEE Trans. on Speech and Audio Proc. 11(6), 509–519 (2003)
11. Breebaart, J., van de Par, S., Kohlrausch, A., Schuijers, E.: Parametric coding of stereo audio. EURASIP Journal on Applied Signal Processing 9, 1305–1322 (2005)
12. Tzagkarakis, C., Mouchtaris, A., Tsakalides, P.: Modeling spot microphone signals using the sinusoidal plus noise approach. In: Proc. Workshop on Appl. of Signal Proc. to Audio and Acoust. (October 2007)
13. Vafin, R., Prakash, D., Kleijn, W.B.: On Frequency Quantization in Sinusoidal Audio Coding. IEEE Signal Proc. Letters 12(3), 210–213 (2005)
14. Subramaniam, A.D., Rao, B.D.: PDF optimized parametric vector quantization of speech line spectral frequencies. IEEE Trans. on Speech and Audio Proc. 11, 365–380 (2003)
15. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs (1993)
16. Karadimou, K., Mouchtaris, A., Tsakalides, P.: Multichannel Audio Modeling and Coding Using a Multiband Source/Filter Model. In: Conf. Record of the Thirty-Ninth Asilomar Conf. on Signals, Systems and Computers, pp. 907–911 (2005)
17. McAulay, R.J., Quatieri, T.F.: Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, and Signal Process. 34(4), 744–754 (1986)
18. Stylianou, Y.: Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Trans. Speech and Audio Process. 9(1), 21–29 (2001)
19. Serra, X., Smith, J.O.: Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal 14(4), 12–24 (1990)
20. Goodwin, M.: Residual modeling in music analysis-synthesis. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 1005–1008 (May 1996)
21. Hendriks, R.C., Heusdens, R., Jensen, J.: Perceptual linear predictive noise modeling for sinusoid-plus-noise audio coding. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 189–192 (May 2004)
Extracting Input Features and Fuzzy Rules for Forecasting Exchange Rate Using NEWFM

Sang-Hong Lee, Hyoung J. Jang, and Joon S. Lim*

College of IT, Kyungwon University, Korea
{shleedosa,hjjang,jslim}@kyungwon.ac.kr

* Corresponding author.
Abstract. Fuzzy neural networks have been successfully applied to generate predictive rules for exchange rate forecasting. This paper presents a methodology to forecast the daily and weekly GBP/USD exchange rate by extracting fuzzy rules based on the neural network with weighted fuzzy membership functions (NEWFM) and a minimized number of input features selected using the distributed non-overlap area measurement method. NEWFM classifies upward and downward cases of the next day's and next week's GBP/USD exchange rate using the most recent 32 days and 32 weeks of CPPn,m (Current Price Position of day n and week n: a percentage of the difference between the price of day n and week n and the moving average of the past m days and m weeks from day n-1 and week n-1) of the daily and weekly GBP/USD exchange rate, respectively. In this paper, the Haar wavelet function is used as a mother wavelet. The most important five and four input features, among CPPn,m and the 38 wavelet-transformed coefficients produced from the most recent 32 days and 32 weeks of CPPn,m, are selected by the non-overlap area distribution measurement method. The data sets cover a period of approximately ten years starting from 2 January 1990. The proposed method achieves accuracy rates of 55.19% for the daily data and 72.58% for the weekly data.

Keywords: fuzzy neural networks, wavelet transform, exchange rate, forecasting.
1 Introduction

Fuzzy neural networks (FNNs) combine neural networks with fuzzy set theory and provide interpretation capability for hidden layers using knowledge based on fuzzy set theory [14-17]. Various FNN models with different algorithms for learning, adaptation, and rule extraction have been proposed as adaptive decision support tools in the fields of pattern recognition, classification, and forecasting [4-6][18]. Chai proposed economic turning point forecasting using a fuzzy neural network [13], and Gestel proposed financial time series prediction using least squares support vector machines within the evidence framework [11]. Kim proposed support vector machines (SVMs) to predict a financial time series and compared SVMs with back-propagation neural networks [2], as well as a genetic algorithm approach to instance selection in artificial neural networks [12]. Exchange rate forecasting has also been studied using AI (artificial intelligence) approaches such as artificial neural networks and rule-based systems. Artificial neural networks are used to learn from exchange rate training data, while rule-based systems
support decision-making on whether the daily and weekly rates will move higher or lower. Panda [10] compared the weekly Indian rupee/USD exchange rate forecasting performance of a neural network with the performances of linear autoregressive (LAR) and random walk (RW) models. In this paper, a new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM) [3] is implemented for forecasting the GBP/USD exchange rate using the Haar wavelet transform (WT). The five and four extracted input features are presented to forecast the daily and weekly GBP/USD exchange rate, respectively, using the Haar WT, NEWFM, and the non-overlap area distribution measurement method [3]. The method extracts a minimum number of input features, each of which constructs an interpretable fuzzy membership function. All features are interpretably formed as weighted fuzzy membership functions preserving the disjunctive fuzzy information and characteristics. All features are extracted by the non-overlap area measurement method, validated on the wine benchmark data from the University of California, Irvine (UCI) Machine Learning Repository [7]. This study forecasts whether the daily and weekly GBP/USD exchange rate will change higher or lower. The cases are classified as "1" or "2" in the GBP/USD exchange rate data: "1" means that the next day's (next week's) value is lower than today's (this week's) value, and "2" means that it is higher. In this paper, the total numbers of samples are 2800 days for the daily GBP/USD exchange rate and 560 weeks for the weekly GBP/USD exchange rate, as used in Sfetsos [1], covering approximately ten years starting from 2 January 1990. Sfetsos divided the samples into three subsets, namely the training, evaluation, and unknown prediction sets, formed using approximately 70%, 19%, and 11% of the data, respectively. The performance and forecasting ability are measured on the totally unknown prediction sets. Sfetsos compared linear regression (LR) with a feedforward artificial neural network (ANN) for forecasting the daily and weekly GBP/USD exchange rate; the accuracy rates of LR and ANN are 48.86% and 50.62% for the daily data and 63.93% and 65.57% for the weekly data, respectively. In this paper, the most important five and four input features are selected by the non-overlap area measurement method [7]. The five and four generalized features are used to generate the fuzzy rules to forecast the next day's and the next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. NEWFM achieves accuracy rates of 55.19% for the daily data and 72.58% for the weekly data.
2 Wavelet Transforms

The wavelet transform (WT) is a transformation to basis functions that are localized in both scale and time. The WT decomposes the original signal into a set of coefficients that describe the frequency content at given times. The continuous wavelet transform (CWT) of a continuous-time signal x(t) is defined as:
T(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} \psi\!\left(\frac{t-b}{a}\right) x(t)\, dt \qquad (1)
where ψ((t−b)/a) is the analyzing wavelet function. The transform coefficients T(a,b) are found both for specific locations on the signal, t = b, and for specific wavelet periods (which are a scale function of a). The CWT is called the dyadic wavelet transform (DWT) if a is discretized along the dyadic sequence 2^i, where i = 1, 2, … . The DWT can be defined as [8]:
S_{2^i} x(n) = \sum_{k \in Z} h_k\, S_{2^{i-1}} x(n - 2^{i-1}k)
W_{2^i} x(n) = \sum_{k \in Z} g_k\, S_{2^{i-1}} x(n - 2^{i-1}k) \qquad (2)
where S_{2^i} is a smoothing operator, W_{2^i} x(n) is the DWT of the digital signal x(n), i ∈ Z (Z is the set of integers), and h_k and g_k are the coefficients of the corresponding low-pass and high-pass filters. A filtered signal at level i is down-sampled, reducing the length of the signal at level i−1 by a factor of two and generating approximation (a_i) and detail (d_i) coefficients at level i. This paper proposes CPP_{n,m} (Current Price Position) as a new technical indicator to forecast the next day's and the next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. CPP_{n,m} is the current price position of day n (week n), expressed as a percentage of the difference between the price of day n (week n) and the moving average of the past m days (weeks) from day n−1 (week n−1). CPP_{n,m} is calculated by
CPP_{n,m} = \frac{C_n - MA_{n-1,n-m}}{MA_{n-1,n-m}} \times 100 \qquad (3)
where C_n is the closing price of day n (week n) and MA_{n−1,n−m} is the moving average of the past m days (weeks) from day n−1 (week n−1). In this paper, the Haar wavelet function is used as the mother wavelet. The Haar wavelet function produces 38 approximation and detail coefficients from CPP_{n,5} to CPP_{n−31,5}, from which input features are extracted. The 38 approximation and detail coefficients consist of 16 detail coefficients at level 1, 8 detail coefficients at level 2, 4 detail coefficients and 4 approximations at level 3, 2 detail coefficients and 2 approximations at level 4, and 1 detail coefficient and 1 approximation at level 5. The neural network with weighted membership functions (NEWFM) and the non-overlap area distribution measurement method [3] are used to extract the minimum number of input features among these 39 features. Table 1 shows the extracted minimum input features.

Table 1. Comparisons of Input Features Used for Forecasting the Daily Changes with the Weekly Changes in NEWFM

Input features for forecasting the daily changes (5 features):
1) d12 among the 16 detail coefficients at level 1
2) d13 among the 16 detail coefficients at level 1
3) d4 among the 8 coefficients at level 2
4) d1 among the 4 coefficients at level 3
5) d2 among the 2 coefficients at level 4

Input features for forecasting the weekly changes (4 features):
1) d1 among the 8 detail coefficients at level 2
2) a1 among the 4 approximations at level 3
3) a1 among the 1 approximation at level 5
4) CPPn
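To make the feature construction concrete, the sketch below (our own illustration, not the authors' code; the synthetic price series and the unnormalized Haar convention are assumptions) computes CPP_{n,m} per equation (3) and decomposes the most recent 32 CPP values into the 16/8/4/2/1 detail and approximation coefficients enumerated above.

```python
import numpy as np

def cpp(prices, n, m=5):
    """CPP_{n,m} of equation (3): percentage distance of day n's close
    from the moving average of the previous m closes."""
    ma = np.mean(prices[n - m:n])          # moving average of days n-m .. n-1
    return (prices[n] - ma) / ma * 100.0

def haar_dwt(x):
    """One Haar step: pairwise averages (approximation) and differences
    (detail). Normalization conventions vary; this is an illustrative choice."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0

prices = np.cumsum(np.random.randn(100)) + 100.0            # synthetic closes
series = np.array([cpp(prices, n) for n in range(37, 69)])  # 32 CPP values

coeffs, a = {}, series
for level in range(1, 6):                  # levels 1..5: 16, 8, 4, 2, 1 details
    a, d = haar_dwt(a)
    coeffs[f"d{level}"], coeffs[f"a{level}"] = d, a
# Keeping d at levels 1-2 and both d and a at levels 3-5 gives the paper's
# 38 coefficients; adding CPP_n itself yields the 39 candidate features.
```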
3 Neural Network with Weighted Fuzzy Membership Function (NEWFM)

3.1 The Structure of NEWFM

The neural network with weighted fuzzy membership functions (NEWFM) is a supervised classification neuro-fuzzy system using the bounded sum of weighted fuzzy membership functions (BSWFM in Fig. 2) [3][9]. The structure of NEWFM, illustrated in Fig. 1, comprises three layers, namely the input, hyperbox, and class layers. The input layer contains n input nodes for an n-featured input pattern. The hyperbox layer consists of m hyperbox nodes. Each hyperbox node B_l connected to a class node contains n BSWFMs for the n input nodes. The output layer is composed of p class nodes, each connected to one or more hyperbox nodes. The hth input pattern can be recorded as I_h = {A_h = (a_1, a_2, …, a_n), class}, where class is the result of classification and A_h contains the n features of the input pattern. The connection weight between a hyperbox node B_l and a class node C_i is represented by w_li, which is initially set to 0. From the first input pattern I_h, w_li is set to 1 for the winner hyperbox node B_l and class i in I_h. C_i may have one or more connections to hyperbox nodes, whereas B_l is restricted to a single connection to a corresponding class node. B_l can be learned only when B_l is the winner for an input I_h with class i and w_li = 1.
Fig. 1. Structure of NEWFM
3.2 Learning Scheme

A hyperbox node B_l consists of n fuzzy sets. The ith fuzzy set of B_l, represented by B_l^i, has three weighted fuzzy membership functions (WFMs; the grey triangles ω_l^{i1}, ω_l^{i2}, and ω_l^{i3} in Fig. 2), which are randomly constructed before learning. Each ω_l^{ij} originates from the original membership function μ_l^{ij} with its weight W_l^{ij} in Fig. 2. The bounded sum of the three weighted fuzzy membership functions (BSWFM, bold line in Fig. 2) of B_l^i combines the fuzzy characteristics of the three WFMs. The BSWFM value of B_l^i, denoted BS_l^i(·), is calculated by formula (4), where a_i is the ith feature value of an input pattern A_h for B_l^i:

BS_l^i(a_i) = \sum_{j=1}^{3} \omega_l^{ij}(a_i) \qquad (4)
The winner hyperbox node B_l is selected by the Output(B_l) operator. Only the B_l that has the maximum value of Output(B_l) for an input I_h with class i and w_li = 1 among the hyperbox nodes can be learned. For the hth input A_h = (a_1, a_2, …, a_n) with n features to the hyperbox B_l, the output of B_l is obtained by formula (5):

Output(B_l) = \frac{1}{n} \sum_{i=1}^{n} BS_l^i(a_i) \qquad (5)
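The following sketch illustrates formulas (4) and (5) for one hyperbox. It is our own illustration: the dictionary layout of a fuzzy set (vertex list "v", weight list "W") is an assumed convention, and the cap at 1 reflects the usual bounded-sum reading of (4), which as printed is a plain sum.

```python
import numpy as np

def tri(x, left, center, right):
    """Triangular membership function with vertices (left, center, right)."""
    if x <= left or x >= right:
        return 0.0
    return (x - left) / (center - left) if x <= center else (right - x) / (right - center)

def bswfm(x, v, W):
    """Formula (4): bounded sum of three weighted triangles; v = [v0..v4],
    and triangle j spans (v_{j-1}, v_j, v_{j+1}) for j = 1, 2, 3."""
    return min(1.0, sum(W[j] * tri(x, v[j], v[j + 1], v[j + 2]) for j in range(3)))

def output(hyperbox, pattern):
    """Formula (5): mean BSWFM response of a hyperbox over the n features;
    each entry of `hyperbox` is a dict with keys "v" and "W"."""
    return np.mean([bswfm(a, fs["v"], fs["W"]) for a, fs in zip(pattern, hyperbox)])
```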
Fig. 2. An Example of the Bounded Sum of Weighted Fuzzy Membership Functions (BSWFM, Bold Line) of B_l^i and BS_l^i(a_i)
Then, the selected winner hyperbox node B_l is learned by the Adjust(B_l) operation. This operation adjusts all B_l^i according to the input a_i, where i = 1, 2, …, n. The membership function weight W_l^{ij} (where 0 ≤ W_l^{ij} ≤ 1 and j = 1, 2, 3) represents the strength of ω_l^{ij}. A WFM ω_l^{ij} can thus be formed by (v_l^{i,j−1}, W_l^{ij}, v_l^{i,j+1}). As a result
of the Adjust(B_l) operation, the vertices v_l^{ij} and weights W_l^{ij} in Fig. 3 are adjusted by the following expressions (6):

v_l^{ij} = v_l^{ij} + s \times \alpha \times E_l^{ij} \times \omega_l^{ij}(a_i) = v_l^{ij} + s \times \alpha \times E_l^{ij} \times \mu_l^{ij}(a_i) \times W_l^{ij},

where
  s = -1,\; E_l^{ij} = \min(|v_l^{ij} - a_i|, |v_l^{i,j-1} - a_i|), \text{ if } v_l^{i,j-1} \le a_i < v_l^{ij}
  s = 1,\; E_l^{ij} = \min(|v_l^{ij} - a_i|, |v_l^{i,j+1} - a_i|), \text{ if } v_l^{ij} \le a_i < v_l^{i,j+1}
  E_l^{ij} = 0, \text{ otherwise}

W_l^{ij} = W_l^{ij} + \beta \times (\mu_l^{ij}(a_i) - W_l^{ij}) \qquad (6)
where α and β are the learning rates for v_l^{ij} and W_l^{ij}, respectively, in the range from 0 to 1, and j = 1, 2, 3. Fig. 3 shows the BSWFMs before and after the Adjust(B_l) operation for B_l^i with an input a_i. The weights and the centers of the membership functions are adjusted by the Adjust(B_l) operation; e.g., W_l^{i1}, W_l^{i2}, and W_l^{i3} are moved down, v_l^{i1} and v_l^{i2} are moved toward a_i, and v_l^{i3} remains in the same location. The Adjust(B_l) operations are executed over a set of training data. If the classification rate for a set of test data does not reach a goal rate, the learning scheme with the Adjust(B_l) operation is repeated from the beginning by randomly reconstructing all WFMs in the B_l s and resetting all connection weights to 0 (w_li = 0) until the goal rate is reached.
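A direct transcription of expressions (6) into code may help; this is our own sketch (reusing the tri() helper and the fuzzy-set layout from the previous sketch), not the authors' implementation.

```python
def adjust(fuzzy_set, a_i, alpha=0.1, beta=0.1):
    """One Adjust(B_l) update of expressions (6) for a single feature: pull
    the vertex bracketing a_i toward it and relax each weight toward the
    membership value at a_i."""
    v, W = fuzzy_set["v"], fuzzy_set["W"]          # v = [v0..v4], W = [W1..W3]
    for j in (1, 2, 3):                            # the three triangle centers
        mu = tri(a_i, v[j - 1], v[j], v[j + 1])
        if v[j - 1] <= a_i < v[j]:
            s, E = -1.0, min(abs(v[j] - a_i), abs(v[j - 1] - a_i))
        elif v[j] <= a_i < v[j + 1]:
            s, E = 1.0, min(abs(v[j] - a_i), abs(v[j + 1] - a_i))
        else:
            s, E = 0.0, 0.0
        v[j] += s * alpha * E * mu * W[j - 1]      # vertex update of (6)
        W[j - 1] += beta * (mu - W[j - 1])         # weight update of (6)
```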
Fig. 3. An Example of Before and After the Adjust(B_l) Operation for B_l^i
4 Experimental Results

In this section, the total numbers of samples are 2800 days for the daily GBP/USD exchange rate and 560 weeks for the weekly GBP/USD exchange rate, as used in Sfetsos [1], covering approximately ten years starting from 2 January 1990. Sfetsos divided the samples into three subsets, namely the training, evaluation, and unknown prediction sets. Table 2 shows that these were formed using approximately 70%, 19%, and 11% of the data, respectively. The performance and forecasting ability are measured on the totally unknown prediction sets.

Table 2. Number of instances used in Sfetsos

                                 Training sets   Evaluation sets   Unknown prediction sets   Total sets
Forecasting the Daily Changes         1960             532                  308                 2800
Forecasting the Weekly Changes         392             106                   62                  560
Sfetsos compared linear regression (LR) with a feedforward artificial neural network (ANN) for forecasting the daily and weekly GBP/USD exchange rate. The accuracy of NEWFM is evaluated on the same totally unknown prediction sets used in Sfetsos. Table 3 displays the comparison of performance results for Sfetsos with NEWFM, i.e., the accuracy rates on the totally unknown prediction sets.

Table 3. Comparisons of Performance Results for Sfetsos with NEWFM

              Sfetsos's LR    Sfetsos's ANN    NEWFM
              Accuracy (%)    Accuracy (%)     Accuracy (%)
Daily data        48.86           50.62         55.19
Weekly data       63.93           65.57         72.58
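For clarity, the accuracy rates in Table 3 are plain directional hit rates on the unknown prediction set; a minimal sketch of the metric (our illustration):

```python
import numpy as np

def directional_accuracy(pred, actual):
    """Percentage of instances whose predicted direction label ('1' for a
    downward, '2' for an upward move) matches the realized one; this is
    the figure reported in Table 3."""
    return 100.0 * np.mean(np.asarray(pred) == np.asarray(actual))
```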
In this paper, the most important five and four input features are selected by the non-overlap area measurement method [7]. The five and four generalized features are used to generate the fuzzy rules to forecast the next day's and the next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. These five and four generalized features, extracted from the 39 input features, are selected by the non-overlap area distribution measurement method [3]. The method measures the degree of salience of the ith feature by non-overlapped areas with the area distribution, according to the following equation:

f(i) = \frac{Area_U^i + Area_L^i}{2\,\max(Area_U^i, Area_L^i)} \qquad (7)

where Area_U and Area_L are the upper-phase superior area and the lower-phase superior area, respectively.
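Assuming the reconstruction of equation (7) above, the salience measure is a one-liner; the function name is ours.

```python
def salience(area_u, area_l):
    """Equation (7) as reconstructed above: the mean of the two class-superior
    areas over their maximum. Balanced areas give f(i) = 1; a feature whose
    non-overlap area is dominated by one class scores near 0.5."""
    return (area_u + area_l) / (2.0 * max(area_u, area_l))
```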
Fig. 4. Trained BSWFM of the Generalized Five Features for Lower Phase and Upper Phase Classification of the daily GBP/USD exchange rate
Fig. 5. Trained BSWFM of the Generalized Four Features for Lower Phase and Upper Phase Classification of the weekly GBP/USD exchange rate
Fig. 6. Area_U (white) and Area_L (black) for the d1 feature among the 8 detail coefficients at level 2
As an example, for the d1 feature among the 8 detail coefficients at level 2, the Area_U and Area_L are shown in Fig. 6. The larger the value of f(i), the more characteristic the feature. In this experiment, two hyperboxes are created for classification. While the hyperbox which contains one set of lines (BSWFM) in Fig. 4 and Fig. 5 is a rule for class 1 (lower
phase), the other hyperbox, which contains the other set of lines (BSWFM), is another rule for class 2 (upper phase). The graphs in Fig. 4 and Fig. 5 are obtained from the training process of the NEWFM program and graphically show the difference between the lower phase and the upper phase for each input feature. Lower phase means that the next day's (next week's) value is lower than today's (this week's) value; upper phase means that it is higher.
5 Concluding Remarks

This paper proposes a new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM). NEWFM is a new neural network model that improves forecasting accuracy rates by using self-adaptive weighted fuzzy membership functions. The degree of classification intensity is obtained by the bounded sum of weighted fuzzy membership functions extracted by NEWFM. In this paper, the Haar wavelet function is used as the mother wavelet to extract input features. The five and four input features extracted by the non-overlap area distribution measurement method [3] are presented to forecast the daily and weekly GBP/USD exchange rate, respectively, using the Haar WT. The accuracy rates are 55.19% for the daily data and 72.58% for the weekly data. To improve the accuracy of the exchange rate forecasting capability, statistical tools such as probability density functions and normal distributions will need to be studied.
References

1. Sfetsos, A., Siriopoulos, C.: Time Series Forecasting of Averaged Data With Efficient Use of Information. IEEE Trans. on Systems, Man, and Cybernetics—Part A: Systems and Humans 35(5) (September 2005)
2. Kim, K.-j.: Financial time series forecasting using support vector machines. Neurocomputing 55, 307–309 (2003)
3. Lim, J.S., Ryu, T.-W., Kim, H.-J., Gupta, S.: Feature Selection for Specific Antibody Deficiency Syndrome by Neural Network with Weighted Fuzzy Membership Functions. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 811–820. Springer, Heidelberg (2005)
4. Ishibuchi, H., Nakashima, T.: Voting in Fuzzy Rule-Based Systems for Pattern Classification Problems. Fuzzy Sets and Systems 103, 223–238 (1999)
5. Nauk, D., Kruse, R.: A Neuro-Fuzzy Method to Learn Fuzzy Classification Rules from Data. Fuzzy Sets and Systems 89, 277–288 (1997)
6. Setnes, M., Roubos, H.: GA-Fuzzy Modeling and Classification: Complexity and Performance. IEEE Trans. Fuzzy Systems 8(5), 509–522 (2000)
7. Lim, J.S., Gupta, S.: Feature Selection Using Weighted Neuro-Fuzzy Membership Functions. In: The 2004 International Conference on Artificial Intelligence (ICAI 2004), Las Vegas, Nevada, USA, June 21-24, vol. 1, pp. 261–266 (2004)
8. Mallat, S.: Zero Crossings of a Wavelet Transform. IEEE Trans. on Information Theory 37, 1019–1033 (1991)
9. Lim, J.S., Wang, D., Kim, Y.-S., Gupta, S.: A neuro-fuzzy approach for diagnosis of antibody deficiency syndrome. Neurocomputing 69(7-9), 969–974 (2006)
10. Panda, C., Narasimhan, V.: Forecasting exchange rate better with artificial neural network. Journal of Policy Modeling 29, 227–236 (2007)
11. Gestel, T.V., et al.: Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework. IEEE Trans. Neural Networks 12(4), 809–821 (2001)
12. Kim, K.-j.: Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications 30, 519–526 (2006)
13. Chai, S.H., Lim, J.S.: Economic Turning Point Forecasting Using Fuzzy Neural Network and Non-Overlap Area Distribution Measurement Method. The Korean Economic Association 23(1), 111–130 (2007)
14. Carpenter, G.A., Grossberg, S., Reynolds, J.: ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991)
15. Jang, R.: ANFIS: Adaptive network-based fuzzy inference system. IEEE Trans. Syst., Man, Cybern. 23, 665–685 (1993)
16. Wang, J.S., Lee, C.S.G.: Self-Adaptive Neuro-Fuzzy Inference System for Classification Applications. IEEE Trans. Fuzzy Systems 10(6), 790–802 (2002)
17. Simpson, P.: Fuzzy min-max neural networks—Part 1: Classification. IEEE Trans. Neural Networks 3, 776–786 (1992)
18. Lim, J.S.: Finding Fuzzy Rules by Neural Network with Weighted Fuzzy Membership Function. International Journal of Fuzzy Logic and Intelligent Systems 4(2), 211–216 (2004)
Forecasting Short-Term KOSPI Time Series Based on NEWFM

Sang-Hong Lee, Hyoung J. Jang, and Joon S. Lim*

College of IT, Kyungwon University, Korea
{shleedosa,hjjang,jslim}@kyungwon.ac.kr

* Corresponding author.
Abstract. Fuzzy neural networks have been successfully applied to generate predictive rules for stock forecasting. This paper presents a methodology to forecast the daily Korea composite stock price index (KOSPI) by extracting fuzzy rules based on the neural network with weighted fuzzy membership functions (NEWFM) and a minimized number of input features selected using the distributed non-overlap area measurement method. NEWFM supports KOSPI time series analysis based on weighted-average defuzzification, i.e., the fuzzy model suggested by Takagi and Sugeno. NEWFM classifies upper and lower cases of the next day's KOSPI using the most recent 32 days of CPPn,m (Current Price Position of day n: a percentage of the difference between the price of day n and the moving average of the past m days from day n-1) of the KOSPI. In this paper, the Haar wavelet function is used as a mother wavelet. The most important four input features, among CPPn,m and the 38 wavelet-transformed coefficients produced from the most recent 32 days of CPPn,m, are selected by the non-overlap area distribution measurement method. The total number of samples is 2928 trading days, from January 1989 to December 1998. About 80% of the data is used for training and 20% for testing. The resulting classification rate is 59.0361%.

Keywords: fuzzy neural networks, weighted average defuzzification, wavelet transform, KOSPI, nonlinear time series.
1 Introduction

Fuzzy neural networks (FNNs) combine neural networks with fuzzy set theory and provide interpretation capability for hidden layers using knowledge based on fuzzy set theory [14-17]. Various FNN models with different algorithms for learning, adaptation, and rule extraction have been proposed as adaptive decision support tools in the fields of pattern recognition, classification, and forecasting [4-6][12]. Chai proposed economic turning point forecasting using a fuzzy neural network [7], and Gestel proposed financial time series prediction using least squares support vector machines within the evidence framework [11]. Stock forecasting has been studied using AI (artificial intelligence) approaches such as artificial neural networks and rule-based systems. Artificial neural networks are used to learn from stock training data, and rule-based systems are used to support decision-making on whether the daily change will be higher or lower. Bergerson and Wunsch [10] combined a neural network and a rule-based system in the S&P 500 index futures
market. Xiaohua Wang [1] proposed the time delay neural network (TDNN), which explored the usefulness of volume information in explaining the predictability of stock index returns. Kim proposed support vector machines (SVMs) to predict a financial time series and compared SVMs with back-propagation neural networks [2]. In this paper, the four extracted input features are presented to forecast the daily Korea composite stock price index (KOSPI) using the Haar WT, the neural network with weighted membership functions (NEWFM), and the non-overlap area distribution measurement method [3]. The method extracts a minimum number of input features, each of which constructs an interpretable fuzzy membership function. The four features are interpretably formed as weighted fuzzy membership functions preserving the disjunctive fuzzy information and characteristics, locally related to the time signal, i.e., the patterns of the KOSPI. This study forecasts whether the daily change of the KOSPI will be higher or lower. The cases are classified as "1" or "2" in the KOSPI data: "1" means that the next day's index is lower than today's index, and "2" means that the next day's index is higher than today's index. In this paper, the total number of samples is 2928 trading days, as used in Kim [2], from January 1989 to December 1998. About 80% of the trading data are used for training and 20% for testing. Kim used support vector machines (SVMs) to predict a financial time series and obtained an accuracy rate of 57.8313% [2]. In this paper, the most important four input features are selected by the non-overlap area distribution measurement method [3]. The four generalized features are used to generate the fuzzy rules to forecast the next day's direction of the daily changes of the KOSPI. NEWFM achieves an accuracy rate of 59.0361%. The fuzzy model suggested by Takagi and Sugeno in 1985 can represent nonlinear systems such as stock time series [13] and business cycles [7].
2 Wavelet Transforms

The wavelet transform (WT) is a transformation to basis functions that are localized in both scale and time. The WT decomposes the original signal into a set of coefficients that describe the frequency content at given times. The continuous wavelet transform (CWT) of a continuous-time signal x(t) is defined as:
T(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} \psi\!\left(\frac{t-b}{a}\right) x(t)\, dt \qquad (1)
where ψ((t−b)/a) is the analyzing wavelet function. The transform coefficients T(a,b) are found both for specific locations on the signal, t = b, and for specific wavelet periods (which are a scale function of a). The CWT is called the dyadic wavelet transform (DWT) if a is discretized along the dyadic sequence 2^i, where i = 1, 2, … . The DWT can be defined as [8]:

S_{2^i} x(n) = \sum_{k \in Z} h_k\, S_{2^{i-1}} x(n - 2^{i-1}k)
W_{2^i} x(n) = \sum_{k \in Z} g_k\, S_{2^{i-1}} x(n - 2^{i-1}k) \qquad (2)
where S_{2^i} is a smoothing operator, W_{2^i} x(n) is the DWT of the digital signal x(n), i ∈ Z (Z is the set of integers), and h_k and g_k are the coefficients of the corresponding low-pass and high-pass filters. A filtered signal at level i is down-sampled, reducing the length of the signal at level i−1 by a factor of two and generating approximation (a_i) and detail (d_i) coefficients at level i. This paper proposes CPP_{n,m} (Current Price Position) as a new technical indicator to forecast the next day's direction of the daily changes of the KOSPI. CPP_{n,m} is the current price position of day n, expressed as a percentage of the difference between the price of day n and the moving average of the past m days from day n−1. CPP_{n,m} is calculated by
CPP_{n,m} = \frac{C_n - MA_{n-1,n-m}}{MA_{n-1,n-m}} \times 100 \qquad (3)
where C_n is the closing price of day n and MA_{n−1,n−m} is the moving average of the past m days from day n−1. In this paper, the Haar wavelet function is used as the mother wavelet. The Haar wavelet function produces 38 approximation and detail coefficients from CPP_{n,5} to CPP_{n−31,5}, from which input features are extracted. The 38 approximation and detail coefficients consist of 16 detail coefficients at level 1, 8 detail coefficients at level 2, 4 detail coefficients and 4 approximations at level 3, 2 detail coefficients and 2 approximations at level 4, and 1 detail coefficient and 1 approximation at level 5. The neural network with weighted membership functions (NEWFM) and the non-overlap area distribution measurement method [3] are used to extract the minimum number of input features among these 39 features. The following four minimum input features are extracted (a small data-preparation sketch follows the list):
1) d1 among the 16 detail coefficients at level 1
2) a1 among the 4 approximations at level 3
3) a1 among the 1 approximation at level 5
4) CPP_{n,5}
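Before the classifier is described, the target construction and split used in the experiments can be sketched as follows; this is our own illustration, with a synthetic series standing in for the KOSPI closes.

```python
import numpy as np

def direction_labels(closes):
    """Class '1' if the next day's index is lower than today's, '2' if
    higher (ties are not discussed in the paper; we drop them here)."""
    diff = np.diff(closes)
    diff = diff[diff != 0]
    return np.where(diff < 0, 1, 2)

closes = np.cumsum(np.random.randn(2928)) + 500.0   # stand-in for 2928 KOSPI days
y = direction_labels(closes)
split = int(0.8 * len(y))                           # ~80% train / 20% test
y_train, y_test = y[:split], y[split:]
```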
3 Neural Network with Weighted Fuzzy Membership Function (NEWFM)

3.1 The Structure of NEWFM

The neural network with weighted fuzzy membership functions (NEWFM) is a supervised classification neuro-fuzzy system using the bounded sum of weighted fuzzy membership functions (BSWFM in Fig. 2) [3][9]. The structure of NEWFM, illustrated in Fig. 1, comprises three layers, namely the input, hyperbox, and class layers. The input layer contains n input nodes for an n-featured input pattern. The hyperbox layer consists of m hyperbox nodes. Each hyperbox node B_l connected to a class node contains n BSWFMs for the n input nodes. The output layer is composed of p class nodes, each connected to one or more hyperbox nodes. The hth input pattern can be recorded as I_h = {A_h = (a_1, a_2, …, a_n), class}, where class is the result of classification and A_h contains the n features of the input pattern. The connection weight between a hyperbox node B_l and a class node C_i is represented by w_li, which is initially set to 0. From the first input pattern I_h, w_li is
Fig. 1. Structure of NEWFM
set to 1 for the winner hyperbox node B_l and class i in I_h. C_i may have one or more connections to hyperbox nodes, whereas B_l is restricted to a single connection to a corresponding class node. B_l can be learned only when B_l is the winner for an input I_h with class i and w_li = 1.

3.2 Learning Scheme

A hyperbox node B_l consists of n fuzzy sets. The ith fuzzy set of B_l, represented by B_l^i, has three weighted fuzzy membership functions (WFMs; the grey triangles ω_l^{i1}, ω_l^{i2}, and ω_l^{i3} in Fig. 2), which are randomly constructed before learning. Each ω_l^{ij} originates from the original membership function μ_l^{ij} with its weight W_l^{ij} in Fig. 2. The bounded sum of the three weighted fuzzy membership functions (BSWFM, bold line in Fig. 2) of B_l^i combines the fuzzy characteristics of the three WFMs. The BSWFM value of B_l^i, denoted BS_l^i(·), is calculated by formula (4), where a_i is the ith feature value of an input pattern A_h for B_l^i:

BS_l^i(a_i) = \sum_{j=1}^{3} \omega_l^{ij}(a_i) \qquad (4)
The winner hyperbox node B_l is selected by the Output(B_l) operator. Only the B_l that has the maximum value of Output(B_l) for an input I_h with class i and w_li = 1 among the hyperbox nodes can be learned. For the hth input A_h = (a_1, a_2, …, a_n) with n features to the hyperbox B_l, the output of B_l is obtained by formula (5):

Output(B_l) = \frac{1}{n} \sum_{i=1}^{n} BS_l^i(a_i) \qquad (5)
Fig. 2. An Example of the Bounded Sum of Weighted Fuzzy Membership Functions (BSWFM, Bold Line) of B_l^i and BS_l^i(a_i)
Then, the selected winner hyperbox node B_l is learned by the Adjust(B_l) operation. This operation adjusts all B_l^i according to the input a_i, where i = 1, 2, …, n. The membership function weight W_l^{ij} (where 0 ≤ W_l^{ij} ≤ 1 and j = 1, 2, 3) represents the strength of ω_l^{ij}. A WFM ω_l^{ij} can thus be formed by (v_l^{i,j−1}, W_l^{ij}, v_l^{i,j+1}). As a result of the Adjust(B_l) operation, the vertices v_l^{ij} and weights W_l^{ij} in Fig. 3 are adjusted by the following expressions (6):

v_l^{ij} = v_l^{ij} + s \times \alpha \times E_l^{ij} \times \omega_l^{ij}(a_i) = v_l^{ij} + s \times \alpha \times E_l^{ij} \times \mu_l^{ij}(a_i) \times W_l^{ij},

where
  s = -1,\; E_l^{ij} = \min(|v_l^{ij} - a_i|, |v_l^{i,j-1} - a_i|), \text{ if } v_l^{i,j-1} \le a_i < v_l^{ij}
  s = 1,\; E_l^{ij} = \min(|v_l^{ij} - a_i|, |v_l^{i,j+1} - a_i|), \text{ if } v_l^{ij} \le a_i < v_l^{i,j+1}
  E_l^{ij} = 0, \text{ otherwise}

W_l^{ij} = W_l^{ij} + \beta \times (\mu_l^{ij}(a_i) - W_l^{ij}) \qquad (6)

where α and β are the learning rates for v_l^{ij} and W_l^{ij}, respectively, in the range from 0 to 1, and j = 1, 2, 3. Fig. 3 shows the BSWFMs before and after the Adjust(B_l) operation for B_l^i with an input a_i. The weights and the centers of the membership functions are adjusted by the Adjust(B_l) operation; e.g., W_l^{i1}, W_l^{i2}, and W_l^{i3} are moved down, v_l^{i1} and v_l^{i2} are moved toward a_i, and v_l^{i3} remains in the same location. The Adjust(B_l) operations are executed over a set of training data. If the classification rate for a set of test data does not reach a goal rate, the learning scheme with the Adjust(B_l) operation is repeated from the beginning by randomly reconstructing all WFMs in the B_l s and resetting all connection weights to 0 (w_li = 0) until the goal rate is reached.
Fig. 3. An Example of Before and After the Adjust(B_l) Operation for B_l^i
4 Experimental Results

In this section, the KOSPI data for 10 years, from January 1989 to December 1998, were used: about 80% of the data for training and about 20% for testing. The four generalized features, extracted from the 39 input features, are selected by the non-overlap area distribution measurement method [3]. The method measures the degree of salience of the ith feature by non-overlapped areas with the area distribution, according to the following equation:

f(i) = \frac{Area_U^i + Area_L^i}{2\,\max(Area_U^i, Area_L^i)} \qquad (7)
where Area_U and Area_L are the upper-phase superior area and the lower-phase superior area, respectively. As an example, for the CPP_{n,5} feature, the Area_U and Area_L are shown in Fig. 4. The larger the value of f(i), the more characteristic the feature. Table 1 compares the features used by Kim and by NEWFM: Kim used 12 features such as CCI, RSI, Stochastic, etc., whereas NEWFM uses 4 features, consisting of CPP_{n,5} and 3 approximation and detail coefficients made by the Haar wavelet function. The four generalized features are used to generate the fuzzy rules (BSWFM) to forecast the time series of the KOSPI.
Fig. 4. Area_U (white) and Area_L (black) for CPP_{n,5}

Table 1. Comparisons of Features of Kim with NEWFM

Kim:     12 features such as CCI, RSI, Stochastic, etc.
NEWFM:   4 features, namely CPP_{n,5} and 3 approximation and detail coefficients from CPP_{n,5} to CPP_{n−31,5}
The accuracy of NEWFM is evaluated on the same data sets used by Kim. Table 2 displays the accuracy rates for the roughly 20% of the data held out for testing, from January 1989 to December 1998. Kim proposed support vector machines (SVMs) to predict a financial time series and compared SVMs with back-propagation (BP) neural networks [2].

Table 2. Comparisons of Performance Results for Kim with NEWFM

               NEWFM      SVM        BP
Accuracy rate  59.0361%   57.8313%   54.7332%
In this experiment, two hyperboxes are created for classification. While the hyperbox which contains one set of lines (BSWFM) in Fig. 5 is a rule for class 1 (lower phase), the other hyperbox, which contains the other set of lines (BSWFM), is another rule for class 2 (upper phase). The graph in Fig. 5 is obtained from the training process of the NEWFM program and graphically shows the difference between the lower phase and the upper phase for each input feature. Lower phase means that the next day's index is lower than today's index; upper phase means that the next day's index is higher than today's index. The forecasting result of NEWFM can be represented as a trend line using weighted-average defuzzification (the fuzzy model suggested by Takagi and Sugeno in 1985 [13]). Fig. 6 shows the trend line of the forecasting result from January 1989 to December 1998 against the KOSPI. This result generally exhibits fluctuations similar to those of the KOSPI.
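A minimal sketch of the weighted-average (Takagi-Sugeno style) defuzzification used for the trend line follows; the consequent values z_lower and z_upper are our illustrative assumptions, as the paper does not state them.

```python
def trend_value(out_lower, out_upper, z_lower=-1.0, z_upper=1.0):
    """Weighted-average defuzzification over the two hyperbox outputs of
    formula (5): each rule's consequent is weighted by its firing strength.
    The epsilon guards against both outputs being zero."""
    return (out_lower * z_lower + out_upper * z_upper) / (out_lower + out_upper + 1e-12)
```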
Fig. 5. Trained BSWFM of the Generalized Four Features for Lower Phase and Upper Phase Classification of the KOSPI

Fig. 6. Comparison of the Original KOSPI and the Fuzzy Model Suggested by Takagi and Sugeno
5 Concluding Remarks

This paper proposes a new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM) and forecasts the KOSPI time series based on weighted-average defuzzification, i.e., the fuzzy model suggested by Takagi and Sugeno [13]. NEWFM is a new neural network model that improves
forecasting accuracy rates by using self-adaptive weighted fuzzy membership functions. The degree of classification intensity is obtained by the bounded sum of weighted fuzzy membership functions extracted by NEWFM, and weighted-average defuzzification is then used for forecasting the KOSPI time series. In this paper, the Haar wavelet function is used as the mother wavelet to extract input features. The four input features extracted by the non-overlap area distribution measurement method [3] are presented to forecast the KOSPI using the Haar WT. The total number of samples is 2928 trading days, from January 1989 to December 1998; about 80% of the data is used for training and 20% for testing. The resulting classification rate is 59.0361%. As shown in Table 2, NEWFM outperforms SVMs by 1.2048% on the holdout data. Although further study will be necessary to improve the accuracy of the stock forecasting capability, a buy-and-hold investment strategy can be planned using the trend line of the KOSPI. To improve the accuracy of the stock forecasting capability, indicators and statistics such as the CCI and the normal distribution will need to be studied.
References

1. Wang, X., Phua, P.K.H., Lin, W.: Stock market prediction using neural networks: Does trading volume help in short-term prediction? In: Proceedings of the International Joint Conference on Neural Networks 2003, July 20-24, vol. 4, pp. 2438–2442 (2003)
2. Kim, K.-j.: Financial time series forecasting using support vector machines. Neurocomputing 55, 307–309 (2003)
3. Lim, J.S., Ryu, T.-W., Kim, H.-J., Gupta, S.: Feature Selection for Specific Antibody Deficiency Syndrome by Neural Network with Weighted Fuzzy Membership Functions. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 811–820. Springer, Heidelberg (2005)
4. Ishibuchi, H., Nakashima, T.: Voting in Fuzzy Rule-Based Systems for Pattern Classification Problems. Fuzzy Sets and Systems 103, 223–238 (1999)
5. Nauk, D., Kruse, R.: A Neuro-Fuzzy Method to Learn Fuzzy Classification Rules from Data. Fuzzy Sets and Systems 89, 277–288 (1997)
6. Setnes, M., Roubos, H.: GA-Fuzzy Modeling and Classification: Complexity and Performance. IEEE Trans. Fuzzy Systems 8(5), 509–522 (2000)
7. Chai, S.H., Lim, J.S.: Economic Turning Point Forecasting Using Fuzzy Neural Network and Non-Overlap Area Distribution Measurement Method. The Korean Economic Association 23(1), 111–130 (2007)
8. Mallat, S.: Zero Crossings of a Wavelet Transform. IEEE Trans. on Information Theory 37, 1019–1033 (1991)
9. Lim, J.S., Wang, D., Kim, Y.-S., Gupta, S.: A neuro-fuzzy approach for diagnosis of antibody deficiency syndrome. Neurocomputing 69(7-9), 969–974 (2006)
10. Bergerson, K., Wunsch, D.C.: A commodity trading model based on a neural network-expert system hybrid. In: Proceedings of the IEEE International Conference on Neural Networks, pp. I289–I293 (1991)
11. Gestel, T.V., et al.: Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework. IEEE Trans. Neural Networks 12(4), 809–821 (2001)
12. Lim, J.S.: Finding Fuzzy Rules by Neural Network with Weighted Fuzzy Membership Function. International Journal of Fuzzy Logic and Intelligent Systems 4(2), 211–216 (2004)
13. Takagi, T., Sugeno, M.: Fuzzy Identification of Systems and Its Applications to Modeling and Control. IEEE Trans. SMC 15, 116–132 (1985)
14. Carpenter, G.A., Grossberg, S., Reynolds, J.: ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991)
15. Jang, R.: ANFIS: Adaptive network-based fuzzy inference system. IEEE Trans. Syst., Man, Cybern. 23, 665–685 (1993)
16. Wang, J.S., Lee, C.S.G.: Self-Adaptive Neuro-Fuzzy Inference System for Classification Applications. IEEE Trans. Fuzzy Systems 10(6), 790–802 (2002)
17. Simpson, P.: Fuzzy min-max neural networks—Part 1: Classification. IEEE Trans. Neural Networks 3, 776–786 (1992)
The Convergence Analysis of an Improved Artificial Immune Algorithm for Clustering*

Jianhua Tong, Hong-Zhou Tan, and Leiyong Guo

School of Information Science & Technology, Sun Yat-Sen University, Guangzhou, Guangdong, China, 510275
[email protected], [email protected]

* This work was supported in part by the National Natural Science Foundation of China under grant No. 60575006.
Abstract. Immune algorithms have been used widely and successfully in many computational intelligence areas, including clustering. Given the large number of variants of each operator of this class of algorithms, this paper presents a study of the convergence properties of an improved artificial immune algorithm for clustering (the DCAAIN algorithm), which offers better clustering quality and a higher data compression rate than several current clustering algorithms. It is proved, using Markov chains, that DCAAIN is completely convergent. Simulation results verify the steady convergence of DCAAIN by comparison with similar algorithms.

Keywords: immune algorithm, complete convergence, Markov chain.
1 Introduction

From the information-processing perspective, the immune system is a massively parallel and self-adaptive system which can defend effectively against invading antigens and allow various antigens to coexist [1]. It has become a valuable research area because it exhibits diversity, distributivity, dynamics, self-adaptability, robustness, and so on [2]. Recently, researchers have put forward numerous models and algorithms that solve problems in engineering and science, such as clustering, by emulating the information-processing ability of the immune system. De Castro proposed a clonal selection algorithm (aiNet) [2] based on the clonal selection principle and the affinity maturation process; the algorithm is shown to be capable of solving the clustering task. Na Tang improved the aiNet algorithm [3] by combining it with the k-means and HAC algorithms. Compared with the above algorithms, the DCAAIN algorithm [4] makes great improvements in incremental clustering ability, self-adaptability, and diversity. However, research on the theoretical side, such as convergence analysis, is rather scarce. In fact, such analysis is very helpful in pointing out directions for improving the performance of immune algorithms and in providing insights for applications of immunity-based systems. In this paper, we adopt DCAAIN as the basis for the mathematical model and analyze its complete convergence. The proof is based on the use of Markov chains and related results.
The remainder of the paper is organized as follows. Section 2 describes the proposed algorithm, DCAAIN. In Section 3, we analyze the complete convergence of DCAAIN based on Markov chains. Typical tests are used to validate the proposed algorithm and to verify the correctness of the theoretical analysis in Section 4. Finally, we present some conclusions.
2 The Proposed Algorithm

The algorithm is population based, like any typical evolutionary algorithm. Each individual of the population is a candidate solution belonging to the fitness landscape of a given computational problem. We expand the single population to multiple populations by performing antibody clustering. In each subpopulation, a parallel subspace search is realized by performing competitive cloning and selection. We introduce hypermutation, antibody elimination, and supplement operators in each subpopulation in order to improve mature progenies and suppress similar antibodies other than the one with maximum affinity. Thus the remaining individuals have better fitness than the initial population. We also introduce the barycenter of each cluster in order to obtain a high data compression rate and incremental clustering ability. Finally, we introduce newcomers, which expand the search space so as to find globally precise solutions. The main process of the DCAAIN algorithm is described as follows (a runnable sketch of these steps, under stated assumptions, is given after Step 7).

Step 1: Initialization: Randomly create an initial population of N antibodies Ab ∈ S^{N×L}.
Step 2: Clustering: Cluster the antibody population into M antibody clusters.
Step 3: Competition selection: Perform competitive selection in each cluster and put the currently best antibody (the barycenter of the cluster), i.e., the one that has the maximum fitness or represents the cluster center, into the elite set, composing the T (T = 2M) subpopulation Ab_c (Ab_c ∈ S^{T×L}). The restricted population ultimately helps find a local optimum solution for each elite cluster member.
Step 4: Clonal proliferation: Reproduce the population formed by Ab_c n times, obtaining the N_c population Ab_c (Ab_c ∈ S^{N_c×L}).
Step 5: Hypermutation: Apply hypermutation to some of the expanded individuals, obtaining the N_c population Ab_m (Ab_m ∈ S^{N_c×L}).
Step 6: Suppression and supplement: Among antibodies whose mutual distance is less than the suppression threshold σ_s, eliminate all but the one with maximum fitness, obtaining Ab_d (Ab_d ∈ S^{N_d×L}, N_d ≤ N_c). Randomly create N_r newcomers and choose the N_s (N_s < N_r, with 5% < N_s/N < 10%) individuals with better fitness to constitute the next population together with Ab_d.
Step 7: Convergence check: Repeat Steps 3-6 until most solutions are no longer improved.
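The skeleton below condenses Steps 1-7 into runnable form. It is a sketch under our own assumptions: the placeholder fitness, the k-means stand-in for the clustering step, the Gaussian hypermutation, and all parameter values are illustrative rather than the operators of [4].

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    return -float(np.sum(x ** 2))           # placeholder objective (max at 0)

def kmeans(pop, M, iters=5):                # crude stand-in for Step 2
    M = min(M, len(pop))
    centers = pop[rng.choice(len(pop), M, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((pop[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for m in range(M):
            if np.any(lab == m):
                centers[m] = pop[lab == m].mean(0)
    return lab, M

def dcaain(N=100, M=10, n_clones=5, sigma_s=0.1, N_r=20, N_s=8, iters=50):
    pop = rng.uniform(-2, 2, (N, 2))                       # Step 1
    for _ in range(iters):
        lab, M_eff = kmeans(pop, M)                        # Step 2
        elites = np.array([pop[lab == m][np.argmax([fitness(x) for x in pop[lab == m]])]
                           for m in range(M_eff) if np.any(lab == m)])  # Step 3
        clones = np.repeat(elites, n_clones, axis=0)       # Step 4
        clones += rng.normal(0.0, 0.05, clones.shape)      # Step 5: hypermutation
        cand = np.vstack([elites, np.clip(clones, -2, 2)])
        keep = []                                          # Step 6: suppression
        for x in sorted(cand, key=fitness, reverse=True):
            if all(np.linalg.norm(x - k) >= sigma_s for k in keep):
                keep.append(x)
        newcomers = rng.uniform(-2, 2, (N_r, 2))           # Step 6: supplement
        best_new = sorted(newcomers, key=fitness, reverse=True)[:N_s]
        pop = np.vstack([np.array(keep), np.array(best_new)])
    return pop                                             # Step 7: fixed budget here
```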
3 Convergence Analysis

The transformation of states in the algorithm can be described by the following stochastic process:

T : Ab(t) \xrightarrow{\text{cluster}} Ab'(t) \xrightarrow{\text{clone}} Ab_c(t) \xrightarrow{\text{mutation}} Ab_m(t) \xrightarrow{\text{selection}} Ab_d(t) \cup N_s \xrightarrow{\text{suppress/supplement}} Ab(t+1) \qquad (1)
where N_s denotes the new individuals which are added randomly. A Markov chain offers an appropriate model for analyzing probability convergence properties. Obviously, the transformation from state Ab(t) to Ab(t+1) constitutes a Markov chain: the state Ab(t+1) does not depend on earlier states but only on Ab_d(t) ∪ N_s, so the stochastic process {A(n), n ≥ 1} is still a Markov chain. The population sequence {A(n), n ≥ 0} of this algorithm is a finite-state Markov chain. In the algorithm, the initial population size is n and the antibodies are clustered into m subpopulations; s_i ∈ S, where s_i denotes a state in S. f is the fitness function of the variable X; namely, s' = {x ∈ X | f(x) = max f(x)}. We can thus define complete convergence of this algorithm with probability one as:

\lim_{t \to \infty} \sum_{s_i \cap s' \neq \emptyset} p\{A_t^i\} = 1 \qquad (2)
Proof: Let p_{ij}(t), (i, j ∈ I), denote the transition probabilities of the stochastic process {A(t)}, with I = {i | s_i ∩ s' = ∅} (the states that do not contain an optimum), where p_{ij}(t) = p{A_{t+1}^j | A_t^i} ≥ 0. Writing p_i(t) for p{A_t^i} and p_t = \sum_{i \in I} p_i(t), the Markov property gives

p_{t+1} = \sum_{s_i \in S} \sum_{j \in I} p_i(t)\, p_{ij}(t) = \sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t),

since the elitist retention of the best antibodies implies p_{ij}(t) = 0 for i ∉ I and j ∈ I. Because

\sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t) + \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t) = \sum_{i \in I} p_i(t) = p_t,

we have

\sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t) = p_t - \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t),

and therefore

0 \le p_{t+1} = p_t - \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t) \le p_t \le 1.

Since p_{t+1} ≤ p_t (and the random newcomers and hypermutation give every state in I a positive probability of transitioning out of I, so the decrease does not stall), we conclude that \lim_{t \to \infty} p_t = 0. Therefore

1 \ge \lim_{t \to \infty} \sum_{s_i \cap s' \neq \emptyset} p_i(t) = 1 - \lim_{t \to \infty} p_t = 1.
So we can say that this algorithm is completely convergent with probability one. Because of the antibody clustering operators, the N antibodies of the population are divided into M subpopulations, and within each local cluster the clonal selection, mutation, and suppression operations can be performed in parallel. The time complexity of the algorithm during each cycle is O((N + X) · M · K²), in which the clonal selection and mutation part is O(K²), the clustering part is O(M), and the similarity suppression part is O(N + X), where N and X are the numbers of antibodies during one cycle and may differ from cycle to cycle.
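As a numerical illustration of the proof's key step (not part of the original paper), one can simulate any finite chain in which the optimal state is absorbing, as elitism ensures here, and observe that the non-optimal mass p_t never increases:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 6                                   # states; state 0 is the optimal one
P = rng.dirichlet(np.ones(k), size=k)   # random row-stochastic transitions
P[0] = np.eye(k)[0]                     # elitism: the optimum is absorbing
p = np.full(k, 1.0 / k)                 # uniform initial distribution
for t in range(20):
    p = p @ P
    print(t, p[1:].sum())               # p_t = mass on non-optimal states: non-increasing
```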
4 Simulation and Discussion

In order to validate the algorithm on a multimodal optimization problem and to verify the correctness of the theoretical analysis, DCAAIN is executed on a typical multimodal function and compared with aiNet. The function is described as follows:

\max f(x, y) = x \sin(4\pi x) - y \sin(4\pi y + \pi) + 1, \quad x, y \in [-2, 2] \qquad (3)
f is the Multi-function with several local optima solution and a single global optimum all distributed non-uniformly. To be convenient for comparing, we choose the same number of initial population N=100 for DCAAIN and aiNet, see[4] for a detail description of the other parameters. Table 1. The results of two algorithms
Algorithm
ItCav
ItCmin
ItC max
aiNet
63.5
48
77
0.547
0.121
DCAAIN
32.5
23
38
0.852
0.184
F-measure
Where ItC av is average iterations to local global optimum; iterations to local global optimum;
Entropy
ItCmin is the least
ItC max is the most iterations to local global
optimum; F-measure and Entropy describes the precision of clustering. From table 1, we can learn that DCAAIN locates all the peaks in each experiment and the iteration of convergence is relatively steady. Note that DCAAIN on average,the least and the most, requires less number of iterations to local the global optimum than aiNet. It means that the time of convergence of DCAAIN is less than that of aiNet.. The
A higher F-measure and a lower Entropy indicate a higher quality of convergence; by these measures, DCAAIN outperforms aiNet.
5 Conclusion

In this paper, we analyze the complete convergence of DCAAIN using Markov chains and related probability theory, and prove that DCAAIN converges completely with probability one. Both the theoretical analysis and the simulation results show that DCAAIN can reach a diverse set of locally optimal solutions thanks to its special mutation and selection methods. All the subpopulations approach the peaks gradually, following the lead of their dynamic centers. The best solution in every subpopulation is maintained by keeping the original individual unmutated, which ensures the convergence of the algorithm. Since Markov chains have progressively been used in the analysis of Evolutionary Algorithms on combinatorial optimization problems with practical applications, a similar strategy for analysing other Immune Algorithms is worth considering.
References

1. Frank, S.A.: The design of natural and artificial adaptive systems. In: Rose, M.R., Lauder, G.V. (eds.). Academic Press, New York (1996)
2. De Castro, L.N., Von Zuben, F.J.: Artificial immune systems: Part II – A survey of applications. Technical Report, p. 65 (2000)
3. Tang, N., Rao Vemuri, V.: An Artificial Immune System Approach to Document Clustering. In: ACM Symposium on Applied Computing, pp. 918–922 (2005)
4. Tong, J., Tan, H.-Z.: A Document Clustering Algorithm Based on Artificial Immune Network. Computer Engineering and Science 29(10), 17–19 (2007)
Artificial Immune System-Based Music Genre Classification D.N. Sotiropoulos, A.S. Lampropoulos, and G.A. Tsihrintzis University of Piraeus, Department of Informatics, 80 Karaoli and Dimitriou St, Piraeus 18534, Greece {dsotirop,arislamp,geoatsi}@unipi.gr
Abstract. We present a novel approach to the problem of automated music genre classification, which utilizes an Artificial Immune System (AIS)-based classifier. Our inspiration lies in the observation that the natural immune system has the intrinsic property of self/non-self cell discrimination, especially when the non-self (complementary) space of cells is significantly larger than the class of self cells. The AIS-based classifier that we have built is compared with KNN-, RBF- and SVM-based classifiers in various experiments involving music data. We find that the performance of our classifier is similar to that of the other classifiers when tested on multi-class (e.g., four-class) problems. On the other hand, it exceeds the performance of the other classifiers by a significant margin when tested on two-class problems.
1 Introduction

Recent advances in digital storage technology and the rapid increase in the amount of digital music files have led to the creation of large music collections for use by broad classes of computer users. In turn, this fact gives rise to a need for systems that have the ability to manage and organize large collections of stored music files efficiently. Many currently available music search engines and peer-to-peer systems (e.g. Kazaa, emule, Torrent) rely on textual meta-information such as file names and ID3 tags as the retrieval mechanism. This textual description of audio information is subjective and does not make use of the musical content; moreover, the relevant meta-data have to be entered and updated manually, which implies significant effort in both creating and maintaining the music database. Therefore, an automated process that extracts information from the actual music data and, thus, organizes the data automatically could overcome some of the problems that arise in current Music Information Retrieval (MIR) systems. An important and difficult task in MIR systems is musical genre classification. The boundaries between genres are fuzzy, which makes the problem of automatic classification highly non-trivial. For the purpose of automatic music genre classification, we have developed a procedure which relies on Artificial Immune Systems (AIS). In general, AIS provide metaphors for the development of high-level abstractions of functions or mechanisms that can be utilized to solve real world (pattern recognition) problems. AIS-based clustering [1] and
classification algorithms are characterized by an ability to adapt their behavior so as to cope efficiently with extremely complex and continuously changing environments. Their main advantage over other classifier systems is their intrinsic property of self/non-self discrimination, especially when the class of non-self (complementary) patterns is significantly larger than the class of self patterns. In this paper, we develop an Artificial Immune Network (AIN) for classifying a set of labeled multidimensional music feature vectors extracted from a music database and assess its classification performance. Specifically, the paper is organized as follows: Section 2 is devoted to a review of related work on music genre classification, while Section 3 presents a review of the basic concepts of (natural and artificial) immune systems and relevant learning algorithms. Section 4 describes our experimental results on music data on testing the performance of our AIN vs. the performance of KNN-, RBF- and SVM-based classifiers. Finally, we draw conclusions and point to future related work in Section 5 of the paper.
2 Related Work on Music Genre Classification

There have been several works on automatic musical genre classification. These systems usually consist of two modules, namely a feature extraction module and a classifier module. Existing works have dealt both with the extraction of features from the audio signal and with the performance of classification schemes trained on the extracted features. In [2], three different feature sets are evaluated for music genre classification using 10 genres. The 30-dimensional feature vector represents timbral texture, rhythmic content and pitch content. Experiments are made with a Gaussian classifier, a Gaussian mixture model and a K-nearest neighbor classifier. The best combination of features and classifier achieved a correct classification rate of 61%. On the other hand, Li et al. [3] propose a new feature extraction method, which relies on Daubechies Wavelet Coefficient Histograms (DWCH). The effectiveness of this feature is evaluated using machine learning algorithms such as Support Vector Machines, K-Nearest Neighbour (KNN), Gaussian Mixture Models (GMMs) and Linear Discriminant Analysis (LDA). It is shown that DWCHs improve the accuracy of music genre classification significantly: on the dataset provided by [3], the classification accuracy increased from 61% to almost 80%. In [4], short-time features are compared to two novel psychoacoustic feature sets for the classification of five general audio classes as well as seven music genres. It is found that the psychoacoustic features outperform the power spectrum features, and that the temporal evolution of the short-time features improves performance. Support Vector Machines have been used in the context of genre classification in [5] and [6]. In [5], SVMs are used for genre classification with a Kullback–Leibler divergence-based kernel to measure the distance between songs. In [6], genre classification is done with a mixture of SVM experts. A mixture of experts solves a classification problem by using a number of classifiers to decompose it into a series of sub-problems. Not only does it reduce the complexity of each
single task, but it also improves the global accuracy by combining the results of the different classifiers (SVM experts). Individual songs are modelled as GMMs, trained using the k-means instead of the Expectation Maximization (EM) algorithm [7]. They approximate the KL divergence between GMMs as the earth mover's distance based on the KL divergences of the individual Gaussians in each mixture. Since their system is described as a distance measure, there is no mention of an explicit classifier; finally, they generate playlists with the nearest neighbors of a seed song. The performance of genre classification can be improved by combining spectral similarity with complementary information, as in [8]. In particular, they combine spectral similarity with fluctuation patterns and derive two new descriptors named "Focus" and "Gravity". The authors state that fluctuation patterns describe loudness fluctuations in frequency bands, as well as characteristics which are not described by spectral similarity measures. For classification, the nearest neighbour classifier is used, and they obtained an average classification performance increase of 14%. Artificial Neural Networks have been used for musical genre classification in [9, 10]. In [9], a musical genre classification system was presented which processed audio features extracted from signals corresponding to distinct musical sources. An important difference from previous related works is that a sound source separation method was applied first to decompose the signal into a number of component signals. Then, timbral, rhythmic and pitch features were extracted from distinct instrument sources and used to classify a music excerpt. The genre classifiers were built as multilayer perceptrons. Results showed that this approach presented an improvement of 2%–2.5% in correct music genre classification. Finally, Turnbull and Elkan [10] explore radial basis function (RBF) networks for musical genre classification by using a combination of unsupervised and supervised initialization methods. These initialization methods yield classifiers that are as accurate as RBF networks trained with gradient descent (which is hundreds of times slower). The experiments in their paper show that RBF networks initialized with a combination of methods can yield good classification performance without relying on gradient descent. In the present paper, we propose a new approach for musical genre classification based on the construction of an AIS-based classifier. Our AIS-based classifier utilizes the supervised learning algorithm proposed by Watkins and Timmis [11]. More specifically, we aim at exploiting the inherent information processing capabilities of the natural immune system through the implementation of an Artificial Immune Recognition System. The essence of our work is to demonstrate the classification efficiency of the constructed classifier when required to assign multi-dimensional music feature vectors to corresponding music categories. The classification accuracy measurements presented in this paper justify the use of the AIS-based classifier over other classifiers, such as KNN, RBF, or SVM.
3 AIS-Based Classification

AIS-based classification relies on a computational imitation of the biological process of self/non-self discrimination, that is, the capability of the adaptive biological immune system to classify a cell as a "self" or "non-self" cell. Any cell or even individual molecule recognized and classified by the self/non-self discrimination process is called an antigen. A non-self antigen is called a pathogen and, when identified, an immune response (specific to that kind of antigen) is elicited by the adaptive immune system in the form of antibody secretion. The essence of the antigen recognition process is the affinity (molecular complementarity level) between the antigen and antibody molecules. The strength of the antigen-antibody interaction (stimulation level) is measured by the complementarity of their match and, thus, pathogens are not fully recognized, which makes the adaptive immune system tolerant to molecular noise.

Learning in the immune system is established by the clonal selection principle [12], which suggests that only those antibodies exhibiting the highest level of affinity with a given antigen will be selected to proliferate and grow in concentration. Moreover, the selected antibodies also undergo a somatic hypermutation process [12], that is, a genetic modification of their molecular receptors which allows them to learn to recognize a given antigen more efficiently. This hypermutation process is termed affinity maturation [12] and results in the development of long-lasting memory cells which guarantee a faster and more accurate immune response when presented with antigenic patterns similar to those they were originally exposed to. This evolutionary procedure of developing memory antibodies lies at the core of the training process of our AIS-based classifier, applied to each class of antigenic patterns. The evolved memory cells provide an alternative problem domain representation, since they constitute points in the original feature space that do not coincide with the original training instances. However, the validity of this alternative representation follows from the fact that the memory antibodies produced recognize the corresponding set of training patterns in each class, in the sense that their average affinity to them is above a predefined threshold.

To quantify immune recognition, we consider all immune events as taking place in a shape-space S, constituting a multi-dimensional metric space in which each axis stands for a physico-chemical measure characterizing molecular shape [12]. Specifically, we utilized a real-valued shape-space in which each element of the AIS-based classifier is represented by a real-valued vector of 30 elements, thus S = ℝ^30. The affinity/complementarity level of the interaction between two elements of the constructed immune-inspired classifier was computed on the basis of the Euclidean distance between the corresponding vectors in ℝ^30. The antigenic pattern set to be recognized by the memory antibodies produced during the training phase of the AIS-based classifier is composed of the set of representative antibodies, which maintain the spatial structure of the set of all data in the music database, yet form a minimum representation of them. The AIS-based classifier [11] was as follows:
1. Initialization Phase
   a) Normalization
   b) Affinity Threshold Estimation
   c) Seeding
2. Training Phase: For each class of training patterns do:
   For each antigenic pattern do:
   a) Matching Memory Cell Identification
   b) Antibodies Generation
   c) while (StoppingCriterion == False)
      • Resource Allocation
      • Removal of Less Stimulated Antibodies
      • Generate Antibodies Mutated Offsprings
   d) Candidate Memory Cell Identification
   e) Memory Cell Introduction
3. Classification

The initialization phase of the algorithm constitutes a preprocessing stage which is combined with a parameter discovery stage. All available data items pertaining to the training set are normalized so that the Euclidean distance between any two feature vectors lies within the [0,1] interval. The affinity threshold computation step consists in estimating the average affinity value over all the training data. The Affinity Threshold, multiplied by an auxiliary control parameter, namely the Affinity Threshold Scalar (with a value between 0 and 1), provides a cut-off value for the replacement of memory cells during the training phase. The final step of the initialization procedure involves the seeding of the memory cells and of the pool of available antibodies by randomly choosing 0 or more antigenic patterns from the training set.

Once the initialization phase is completed, training proceeds as a one-shot incremental learning algorithm where each antigenic pattern from each class is presented to the algorithm only once. This portion of the algorithm focuses on developing a candidate memory antibody, for the antigen currently being processed, from the pool of available antibodies. This is realized through three mechanisms: 1) competition for resources, 2) mutation, and 3) adopting an average stimulation threshold as a stopping criterion in order to determine when the training on a given antigen is completed. Resources are allocated to a given antibody based on its stimulation level for the current antigen, which may be thought of as an indicator of its efficiency as a recognizer. Mutation enforces diversification and shape-space exploration.

Matching Memory Cell Identification involves determining the memory cell having the strongest bond to the current training data item, as quantified by its stimulation level. The matching memory cell is subsequently used to generate new mutated versions of the original cell that are placed into the pool of available antibodies; this process constitutes the Antibodies Generation step. The number of mutated clones a given memory cell is allowed to inject into the cell population is controlled by the hypermutation rate. This
number is proportional to the product of its stimulation level with the value of the hypermutation rate. Each antibody in the pool of available antibodies is allocated a finite number of resources during the resource allocation step, based on its stimulation value and the clonal rate. The clonal rate is an additional control parameter of the algorithm which serves as a resource allocation factor when multiplied by the stimulation level of a given antibody. The total number of system-wide resources is set to a certain limit specified by the number of resources allowed. If more resources than this are consumed by the pool of available antibodies, then the least stimulated cells are removed from the system during the removal of less stimulated antibodies step. Regardless of whether the stopping criterion has been met, each antibody is given a chance to produce a set of mutated offsprings, whose number is determined by multiplying the clonal rate with the stimulation value; this is conducted during the generate antibodies mutated offsprings step. Once the training on a specific antigenic pattern is completed, the learning algorithm proceeds by identifying the candidate memory cell among the maturated antibodies. The candidate memory cell is the feature vector with the maximum stimulation level for the current antigenic pattern. The memory cell identification stage also involves incorporating the candidate memory cell into the pool of the available memory cells. Finally, the memory cell introduction step determines whether the matching memory cell is replaced by the candidate memory cell. After the training process has completed, the evolved memory cells are available for the classification of unseen music feature vectors in a k-nearest neighbour approach: the system classifies new data items by a majority vote of the outputs of the k most stimulated memory antibodies.
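To make the training flow above concrete, the following heavily simplified sketch captures the skeleton of an AIRS-style classifier: affinity as Euclidean distance in the normalized shape-space, stimulation-driven cloning and mutation, and k-nearest-neighbour classification over the evolved memory cells. It is an illustrative approximation of the algorithm of [11], not the authors' MatLab implementation; the maturation loop is collapsed into a single step and all parameter names are ours.

```python
import numpy as np

def stimulation(a, b):
    # Affinity is the Euclidean distance in the normalized shape-space,
    # so stimulation = 1 - distance grows with affinity.
    return 1.0 - np.linalg.norm(a - b)

def train_airs(X, y, clonal_rate=10, hyper_rate=2.0, sigma=0.1,
               rng=np.random):
    """Drastically simplified one-shot incremental AIRS-style training:
       returns the evolved pool of (memory cell, label) pairs."""
    memory = []
    for ag, label in zip(X, y):
        same_class = [mc for mc, l in memory if l == label]
        if not same_class:                       # seeding
            memory.append((ag, label))
            continue
        # Matching Memory Cell Identification.
        match = max(same_class, key=lambda mc: stimulation(mc, ag))
        # Antibodies Generation: the number of mutated clones is
        # proportional to stimulation times the hypermutation rate.
        n = max(1, int(stimulation(match, ag) * hyper_rate * clonal_rate))
        pool = match + rng.normal(0.0, sigma, (n, len(ag)))
        # The maturation loop is collapsed: keep the most stimulated
        # antibody as the candidate memory cell.
        cand = max(pool, key=lambda ab: stimulation(ab, ag))
        if stimulation(cand, ag) > stimulation(match, ag):
            memory.append((cand, label))         # Memory Cell Introduction
    return memory

def classify(memory, x, k=3):
    """Majority vote among the k most stimulated memory cells."""
    ranked = sorted(memory, key=lambda m: stimulation(m[0], x), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```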
4 Experimental Results on Music Data

An audio signal may be represented in multiple ways according to the specific features utilized to capture certain of its aspects. More specifically, there has been a significant amount of work on extracting features that are appropriate for describing and modeling music signals. In this paper, we have utilized a specific set of 30 objective features that were originally proposed by Tzanetakis and Cook [2, 13] and have dominated the literature in subsequent approaches in this research area. It is worth mentioning that these features not only provide a low-level representation of the statistical properties of the music signal, but also include high-level information extracted by psychoacoustic algorithms. In summary, these features represent timbral texture, rhythmic content (rhythm, beat and tempo information), as well as pitch content describing the melody and harmony of a music signal. The collection we have utilized in our experiments contains one thousand (1000) pieces from 10 classes of western music. This collection has been used as a test bed for assessing the relative performance of various musical genre
classification algorithms [2, 3]. Specifically, the collection contains one hundred (100) pieces, each of thirty-second duration, from each of the following ten (10) classes of western music:

Table 1. Classes of western music

Class ID   Label
1          Blues
2          Classical
3          Country
4          Disco
5          Hip-Hop
6          Jazz
7          Metal
8          Pop
9          Reggae
10         Rock
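Each piece in this collection is represented by the 30-dimensional feature vector discussed above. As a purely illustrative sketch of the kind of short-time timbral statistics involved (and not the actual feature set of [2, 13]), two common descriptors can be computed as follows, assuming a 1-D audio signal array.

```python
import numpy as np

def timbral_features(signal, sr=22050, frame=1024, hop=512):
    """Per-frame spectral centroid and rolloff, summarized by mean/std.
       An illustrative subset of timbral-texture statistics, not the full
       30-dimensional vector of Tzanetakis and Cook."""
    centroids, rolloffs = [], []
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    for start in range(0, len(signal) - frame, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame]))
        if mag.sum() == 0:
            continue
        centroids.append((freqs * mag).sum() / mag.sum())
        # Rolloff: frequency below which 85% of the magnitude lies.
        cum = np.cumsum(mag)
        rolloffs.append(freqs[np.searchsorted(cum, 0.85 * cum[-1])])
    return np.array([np.mean(centroids), np.std(centroids),
                     np.mean(rolloffs), np.std(rolloffs)])
```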
In order to evaluate our AIS-based classifier on music genre classification, we compared its classification performance against 1) Radial Basis Function (RBF) neural networks, 2) K-Nearest Neighbour (KNN) classifiers and 3) Support Vector Machines (SVM). The NetLab toolbox was utilized to construct the RBF network and KNN classifiers, while the SVM classifier was implemented with the OSU-SVM toolbox. The AIS-based classifier was implemented in the MatLab programming environment. The RBF network consisted of fifty (50) neurons in the hidden layer; the number of neurons in the output layer is determined by the number of audio classes we want to classify in each experiment. The network was trained with the Expectation Maximization algorithm for two hundred (200) cycles, and its output estimates the degree of membership of the input feature vector in each class; thus, the value at each output necessarily remains between 0 and 1. The KNN classifier was based on the class label prediction of the 10 nearest neighbours. The SVM classifier was based on a Gaussian kernel with the default parameters provided by the OSU-SVM toolbox. Classification results were calculated using 10-fold cross-validation, where the dataset to be evaluated was iteratively partitioned so that, for each class, 90% was used for training and 10% for testing. This process was iterated with different disjoint partitions and the results were averaged, ensuring that the calculated accuracy was not biased by a particular partitioning into training and testing sets. We conducted three experiments in order to measure the classification accuracy of each of the different classifiers. In the first experiment, a four-class classification problem was considered. The results presented in Table 2 illustrate the competitiveness of the AIS-based classifier, as it ranks high and second only to the SVM classifier.
Table 2. Experiment 1: Four Class Classification

Trial   Four Classes   AIRS    KNN 10   RBF     SVM
1       1, 2, 3, 4     71.5    66.5     70.25   75.5
2       1, 2, 7, 10    70.75   66.5     65      71.25
3       4, 7, 8, 10    60      54.25    54      61.25
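For reference, the 10-fold cross-validation protocol used in all three experiments can be sketched generically as follows; this is an illustrative, library-free outline assuming a feature matrix X, a label vector y and arbitrary train/predict functions, not the authors' toolbox code.

```python
import numpy as np

def cross_val_accuracy(X, y, train, predict, folds=10, rng=np.random):
    """Stratified k-fold: in each fold every class contributes ~90% of its
       items for training and ~10% for testing; fold accuracies are averaged."""
    y = np.asarray(y)
    parts = {c: rng.permutation(np.where(y == c)[0]) for c in np.unique(y)}
    accs = []
    for f in range(folds):
        test_idx = np.concatenate(
            [idx[f * len(idx) // folds:(f + 1) * len(idx) // folds]
             for idx in parts.values()])
        mask = np.ones(len(y), bool)
        mask[test_idx] = False
        model = train(X[mask], y[mask])
        preds = np.array([predict(model, x) for x in X[test_idx]])
        accs.append(np.mean(preds == y[test_idx]))
    return np.mean(accs)
```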
In the second experiment, we addressed a two-class classification problem, in which we considered a "self" class (classic music) and a "non-self" class (everything else in the music database). As presented in Table 3, the number of complementary classes was gradually increased in each trial; in this way, we increased the diversity of the training data points belonging to the non-self class. As stated earlier, this experiment was driven by the fact that self/non-self discrimination constitutes an intrinsic property of the natural immune system, especially when the non-self (complementary) space of patterns is significantly larger than the set of patterns belonging to the self class. In this experiment, the AIS-based classifier outperforms the other classifiers in every trial.

Table 3. Experiment 2: Two Class Classification (Self Class = Classic)

Trial   Self Class   Non-Self Class        AIRS   KNN 10   RBF    SVM
1       2            1                     92     91       89.5   91.5
2       2            1,3                   93.5   91.5     91     92.5
3       2            1,3,4                 92.5   91       88     92
4       2            1,3,4,5               94     89       87.5   90.5
5       2            1,3,4,5,6             92     88.5     88     90
6       2            1,3,4,5,6,7           93.5   89       88.5   90.5
7       2            1,3,4,5,6,7,8         93     88.5     89.5   89.5
8       2            1,3,4,5,6,7,8,9       90.5   89.5     88.5   90
9       2            1,3,4,5,6,7,8,9,10    93.5   90.5     85     91
In the third experiment, we explored the two-class classification problem further, considering only two specific classes in each trial, as follows: in trials 1 and 2, the music data come from the class pairs classic–rock and classic–metal, respectively, which have a low degree of overlap in feature space, so a higher classification accuracy is expected. In contrast, in trials 3 and 4, the music data come from the pairs disco–pop and rock–metal, respectively, which have a high degree of overlap in feature space, so a lower classification accuracy is expected. The results of the third experiment are presented in Table 4, in which the AIS-based classifier is seen to outperform all other classifiers in every trial.
Table 4. Experiment 3: Two Class Classification

Trial              Self Class   Non-Self Class   AIRS   KNN 10   RBF    SVM
1 (Low Overlap)    2            10               93     89.5     91     90
2 (Low Overlap)    2            7                97     95.5     95.5   95
3 (High Overlap)   4            8                79.5   74.5     75     75.6
4 (High Overlap)   10           7                79     74.5     74     77
The experiments described in this section show clearly that the classification accuracy of the AIS-based classifier increases and outperforms the other classifiers when the number of classes in the classification problem is reduced to two. Specifically, the results of the second experiment demonstrate that the mean classification accuracy over the 9 trials is higher for the AIS-based classifier (92.72%) than for the SVM (90.83%), while they have the same standard deviation (1.0).
5 Conclusions

In this paper, we suggest a new approach to the problem of music genre classification based on the construction of an Artificial Immune Recognition System which incorporates the inherent self/non-self discrimination ability of the natural immune system. The AIS-based classifier that we have built is compared with KNN-, RBF- and SVM-based classifiers in various experiments involving music data. We find that the performance of our classifier is similar to that of the other classifiers when tested on multi-class (e.g., four-class) problems. On the other hand, it exceeds the performance of the other classifiers by a significant margin when tested on two-class problems. In the future, we will improve the AIS-based classifier further and evaluate its performance in other classification scenarios and on various types of data sets. We will also investigate the appropriateness of its use in recommender systems. This and other related work is currently under way and will be reported shortly.
References

1. Sotiropoulos, D.N., Lampropoulos, A.S., Tsihrintzis, G.A.: Artificial immune system-based music piece similarity measures and database organization. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic (June 2005)
2. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5) (July 2002)
3. Li, T., Ogihara, M., Li, Q.: A comparative study on content-based music genre classification. In: Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada (August 2003)
4. McKinney, M.F., Breebaart, J.: Features for audio and music classification. In: Proc. 4th International Conference on Music Information Retrieval, Washington, D.C., USA (October 2003)
5. Mandel, M., Ellis, D.: Song-level features and support vector machines for music classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
6. Lidy, T., Rauber, A.: Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
7. Logan, B., Salomon, A.: A music similarity function based on signal analysis. In: Proc. International Conference on Multimedia and Expo, Tokyo, Japan (2003)
8. Pampalk, E., Flexer, A., Widmer, G.: Improvements of audio-based music similarity and genre classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
9. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification enhanced by source separation techniques. In: Proc. 6th International Conference on Music Information Retrieval, London, UK, September 2005, pp. 576–581 (2005)
10. Turnbull, D., Elkan, C.: Fast recognition of musical genres using RBF networks. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005)
11. Watkins, A., Timmis, J.: Artificial Immune Recognition System (AIRS): An immune-inspired supervised learning algorithm. Genetic Programming and Evolvable Machines 5, 291–317 (2004)
12. Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer, Heidelberg (2002)
13. Tzanetakis, G., Cook, P.: Marsyas: A framework for audio analysis. Organised Sound 4(3) (2000)
Semantic Information Retrieval Dedicated to Multimedia Systems: A Platform Based on Conceptual Graphs Xavier Aimé and Francky Trichet LINA, Laboratoire d’Informatique de Nantes Atlantique (UMR-CNRS 6241) University of Nantes - Team Knowledge and Decision (KOD) 2, rue de la Houssinière - BP 92208 - 44322 Nantes cedex 03, France {xavier.aime,francky.trichet}@univ-nantes.fr
Abstract. OSIRIS is a web platform dedicated to the development of Ontology-based Systems for Semantic Information Retrieval and Indexation of multimedia resources which are shared within communautary and open web Spaces. Based on the use of both heavyweight ontologies and thesauri, OSIRIS allows the end-user (1) to describe the semantic content of his/her resources by using an intuitive natural-language-based annotation model founded on the triple (Subject, Verb, Object), and (2) to formally represent these annotations by using Conceptual Graphs. Moreover, each resource can be described from multiple points of view, which usually correspond to different end-users. These different points of view can be defined by using multiple ontologies, which can be related to connected (or non-connected) domains. Developed from the integration of Semantic Web and Web 2.0 technologies, OSIRIS aims at facilitating the deployment of semantic, collaborative, communautary and open web spaces.

Keywords: ontology, heavyweight ontology, thesaurus, semantic annotation, semantic information retrieval, conceptual graphs, semantic web, intelligent multimedia system, collaborative annotation, social tagging, semantic web 2.0.
1 Introduction

Currently, the collective and interactive dimension of Web 2.0, coupled with the lightness of its tools, facilitates the rise of many platforms dedicated to the sharing of multimedia resources, such as Flickr (http://www.flickr.com/) for images or YouTube (http://www.youtube.com) for videos. However, the success of these platforms (in terms of the number of listed resources and the number of federated users) must be tempered by the poverty of the approach used for Information Retrieval (IR). Indeed, the search engines integrated in such systems are only based on the use of tags which are usually defined manually by the end-users of the communities (i.e. the social tagging which leads to the creation of folksonomies). In addition to the traditional limits of keyword-based IR systems, in particular the poverty of the semantic description provided by a set of tags and consequently the impossibility of implementing a semantic search engine, these systems suffer from a lack of openness because the tags provided by the end-users remain useful and efficient only inside the platforms: they cannot be exported when the resources are duplicated from one platform to another.
OSIRIS (Ontology-based Systems for Semantic Information Retrieval and Indexation dedicated to communautary and open web Spaces) is a platform dedicated to the development of communautary web spaces which aim at facilitating both the semantic annotating process and the searching process for multimedia resources. Such a communautary space corresponds to an Internet-mediated social and semantic environment, in the sense that the shared resources are not only tagged by the users (who thus construct a folksonomy in a collaborative way) but are also formally described by using one (or several) ontology(ies) shared by all the members of the community. The result is an immediate and rewarding gain in the user's capacity to semantically describe and find related content. Based on the use of heavyweight ontologies [6] coupled with thesauri1, OSIRIS allows the end-users to semantically describe the content of a resource (for instance, this photography of Doisneau represents "A woman who kisses a man in a famous French place located in Paris") and then to formally represent this content by using Conceptual Graphs [13]. Each resource can be described according to multiple points of view (i.e. the representation of several contents), which can also be defined according to multiple ontologies covering connected domains or not. Thus, during the annotating process, OSIRIS allows managing several ontologies, which are used jointly and in a transparent way during the searching process, thanks to the possibility of defining equivalence links between the concepts and/or relations of two ontologies. Moreover, OSIRIS is based on heavyweight ontologies, i.e. ontologies which, in addition to including the concepts and relations (structured within hierarchies based on the Specialisation/Generalisation relation) characterizing the considered domain, also include the axioms (rules and constraints) that govern this domain. This gives OSIRIS the possibility to automatically enrich the annotations (manually associated with a resource) by applying the axioms, which generally correspond to inferential knowledge of the domain. From a technical point of view, OSIRIS is based on the integration of technologies currently developed in the Web 2.0 and Semantic Web areas: it aims at dealing with the Semantic Web 2.0 [11]. In its current version, OSIRIS allows implementing Semantic Web Spaces dedicated to the sharing of images, videos, music files and office documents, respectively in the JPEG, MP3, OpenOffice and Office 2007 formats. The choice of these formats is justified by the fact that it is possible to store the semantic annotations (represented in terms of conceptual graphs) within the files via the use of standards such as IPTC (http://www.iptc.org) for JPEG, ID3 (http://www.id3.org) for MP3 and ODF (http://www.oasis-open.org/) for OpenOffice and Office 2007. These standards make it possible, first, to associate meta-data with images and sounds and, second, to store these meta-data within the files. They are currently used by the majority of the well-known tools dedicated to the management of personal images and sounds, such as Picasa (http://picasa.google.com) or Winamp (http://www.winamp.com). But this use is limited to the association of keywords, which does not make it possible to represent the semantic content of the resources.
1 A thesaurus is a special kind of controlled vocabulary where the terms (which correspond to the entries of the thesaurus) are structured by using linguistic relationships such as synonymy, antonymy, hyponymy or hypernymy. A thesaurus (like WordNet) is not an ontology, because it only deals with terms (i.e. the linguistic level), without considering the concepts and the relations of the domain (i.e. the conceptual or knowledge level).
OSIRIS aims at addressing this shortcoming by integrating the use of domain ontologies (coupled with thesauri) into the indexing and searching processes. In addition, preserving the annotations directly within the files makes our system much more open than current Web 2.0 systems, where the tags cannot be exported from one platform to another. The rest of this paper is structured as follows. Section 2 introduces the basic foundations of our work: heavyweight ontologies coupled with thesauri and represented within the Conceptual Graphs model. Section 3 presents (i) the annotation model we have adopted, (ii) the annotating process (manual and automatic) and the searching process we advocate. These different functionalities are illustrated with examples extracted from an application (developed in French) dedicated to the History of Art.
2 Context of the Work

2.1 Heavyweight Ontologies

Currently, ontologies are at the heart of many applications, in particular the Semantic Web, because they facilitate interoperability between human and/or artificial agents [9]. However, most of the current work concerned with ontological engineering is limited to the construction of lightweight ontologies, i.e. ontologies simply composed of a hierarchy of concepts (possibly enriched by properties such as exclusion or abstraction) which is sometimes associated with a hierarchy of relations (possibly enriched by algebraic properties). Reduced in semantics, these ontologies do not make it possible to take all the knowledge of a given domain into account, in particular the rules and the constraints governing this domain and thus fixing the interpretation of the concepts and relations characterising it. This deficit of semantics, which is prejudicial at various levels, is due to the low level of expressivity of the language OWL (Web Ontology Language, http://www.w3.org/2004/OWL/). Indeed, since 2004 this standard, used to represent and to share domain ontologies, has indirectly influenced most of the work related to ontological engineering, in the sense that the majority has focused on lightweight ontologies, completely forsaking knowledge related to inference (mainly rules and constraints), both from a representation point of view (what kind of primitives can be used to represent this kind of reasoning knowledge?) and from an implementation point of view (how to use this type of knowledge effectively within a Knowledge-Based System?). In our work, we are more particularly interested in heavyweight ontologies (semantically speaking), i.e. ontologies which, in addition to including the concepts and relations (structured within hierarchies) of a domain D, also include the axioms (rules and constraints) which govern D. The use of a heavyweight ontology (which represents all the semantic richness of a domain via the axioms) coupled with a thesaurus (which represents all the linguistic richness of a domain) characterises the originality of the OSIRIS platform. This feature proves promising within a keyword-based Information Retrieval system, because the interpretation of the sense of a request expressed by a set of terms becomes more precise.
2.2 The Conceptual Graphs Model and the Language OCGL

The Conceptual Graphs model (CGs), first introduced by J. Sowa [13], is an operational knowledge representation model which belongs to the field of semantic networks. This model is mathematically founded both on logics and on graph theory [3]. To reason with CGs, two approaches can be distinguished: (1) considering CGs as a graphic interface for logics and thus reasoning with logics, or (2) considering CGs as a graph-based knowledge representation and reasoning formalism with its own reasoning capabilities. In the context of our work, we adopt the second approach by using projection (a graph-theoretic operation corresponding to homomorphism) as the main reasoning operator; projection is sound and complete w.r.t. deduction in First Order Logic [3].

OCGL (Ontology Conceptual Graphs Language) [8] is a modeling language based on CGs and dedicated to the representation of heavyweight ontologies. Representing an ontology in OCGL mainly consists in (1) specifying the conceptual vocabulary of the domain under consideration and (2) specifying the semantics of this vocabulary using axioms. The conceptual vocabulary is composed of a set of Concepts and a set of Relations. These two sets can be structured by using either well-known conceptual properties called Schemata Axioms (covering the current expressivity of OWL-DL, such as, for example, the algebraic properties of relations or the disjunction of two concepts), or Domain Axioms used to represent rules and constraints. Domain Axioms correspond to all the inferential knowledge of the domain which cannot be represented by using the Schemata Axioms, and thus which does not correspond to traditional properties attested on the concepts or the relations. A Domain Axiom is composed of an Antecedent graph and a Consequent graph; the formal semantics of such a construction can be intuitively expressed as follows: "if the Antecedent graph is true, then the Consequent graph is true". Figure 1 presents two Domain Axioms expressed in OCGL, dedicated to the representation of the following knowledge related to an ontology called OntoArt, dedicated to the History of Art2: (i) "A cubist is an artist who has created at least one work of art illustrating the artistic movement called cubism" and (ii) "All the works of art created by Claude Monet illustrate the impressionist movement of the 20th century". Note that these axioms are not at the same level (and that OCGL makes it possible to take these different levels of representation into account): the first one expresses generic knowledge, while the second one expresses more specific knowledge which involves an instance of the domain: Claude Monet. OCGL is implemented in TooCoM3 (A Tool to Operationalize an Ontology with the Conceptual Graph Model), a tool dedicated to the representation and operationalisation of heavyweight ontologies. TooCoM is based on CoGITaNT [10], a
2 The OntoArt ontology has been created in the context of a French project, so all the concepts and relations are expressed in French. This justifies why all the figures of this paper are in French. But as the domain of Art is general and (for most people) well-known, we think that this situation will not interfere with the understanding of the ideas.
3 TooCoM is available under the GNU GPL license: http://sourceforge.net/projects/toocom/
Fig. 1. Examples of Domain Axioms represented in OCGL (edited with TooCoM)
C++ library for developing conceptual graphs applications. An ontology expressed in OCGL is stored in a CGXML4 file.
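To give a flavour of how a Domain Axiom (an Antecedent/Consequent pair of graphs) can drive inference, the following sketch applies a Monet-style rule to a set of annotations. It is a deliberately simplified illustration working on (subject, verb, object) triples rather than on full conceptual graphs, and all identifiers are our own.

```python
def apply_axiom(antecedent, consequent, triples):
    """Forward-chain a one-pattern Domain Axiom over (subject, verb, object)
       triples: every annotation matching the antecedent adds one
       instantiated consequent. Variables start with '?'."""
    s, v, o = antecedent
    derived = set(triples)
    for (ts, tv, to) in triples:
        if tv != v:
            continue
        if not s.startswith("?") and ts != s:
            continue
        if not o.startswith("?") and to != o:
            continue
        binding = {s: ts, o: to}
        cs, cv, co = consequent
        derived.add((binding.get(cs, cs), cv, binding.get(co, co)))
    return derived

annotations = {("artist:claude_monet", "to_create", "work:water_lilies")}
enriched = apply_axiom(("artist:claude_monet", "to_create", "?w"),
                       ("?w", "to_illustrate", "impressionism"),
                       annotations)
# enriched now also contains ("work:water_lilies", "to_illustrate",
# "impressionism"), mirroring the Monet axiom of Figure 1.
```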
3 OSIRIS Framework

3.1 The Annotation Model

The annotation model advocated in OSIRIS is based on the triple {Subject/Verb/Object}. This model allows the end-user to represent the content of simple sentences (expressed in natural language) such as "A man who kisses a woman". In the context of this triple, Subject and Object correspond to concepts and Verb corresponds to a relation of the ontology under consideration (cf. Figure 2). Thus, each resource can be described semantically by a set of triples, where each triple is defined according to a particular ontology. Note that it is possible to use multiple ontologies (covering overlapping domains or not) in the same OSIRIS application. Each triple can be associated with a member of the community. Figure 2 illustrates the application of this model in the context of the well-known work of art of the French photographer R. Doisneau: « Baiser de l'hôtel de ville ». In this example, the first user u1 annotates the photography by using the ontology Onto1 and describes the following contents: (1) "A man who kisses a woman" (where man is
4 CGXML is the format used to represent CGs in XML. This format is integrated into CoGITaNT: http://sourceforge.net/projects/cogitant. Note that TooCoM enables importing and exporting lightweight ontologies in OWL by using the OWL API [1]. Thanks to a specific transformational model from OCGL to OWL [7], most of the properties of classes and relations expressed in OWL are translated into Schemata Axioms in OCGL. However, because of the difference in expressivity of the two languages, the following properties are not yet translated: allValuesFrom, someValuesFrom and hasValue. Inversely, the Domain Axioms of OCGL cannot be translated into OWL as long as OWL does not offer rule-like axiom representation capabilities.
- (man To-Kiss woman)Onto1 / u1
- (man To-Wear beret)Onto1 / u1
- (woman To-Walk)Onto1 / u2
- (building To-Locate town:paris)Onto2 / u2
- (photographer:Doisneau To-Create work_of_art)Onto3 / u3
…
Fig. 2. Annotating process applied to the work of art of Doisneau: “Baiser de l’hôtel de ville”
the concept corresponding to the Subject, To-Kiss the relation corresponding to the Verb, and woman the concept corresponding to the Object) and (2) "A man who wears a beret". The second user u2 annotates the photography by using two different ontologies, Onto1 and Onto2; he describes the following situations: (1) "A woman who walks", without defining explicitly where (i.e. there is no concept corresponding to the Object), and (2) "A building which is located in a town called Paris" (i.e. the concept town corresponding to the Object is instantiated by Paris). Finally, the last end-user u3 does not annotate the content of the photography but the photography as a work of art, in the sense that he states "A work of art created by the photographer Doisneau". This last point clearly illustrates that our model can be used both to describe the content of a resource and to describe the resource as such, which makes it flexible and open.

As shown by this example, our model is intuitive, easily comprehensible, and has a strong correspondence with the Conceptual Graphs model. It allows the end-users to describe the content of their resources from several angles, possibly by using several ontologies related to the same domain (which can be developed by different communities) or to different and not necessarily overlapping domains. In this way, OSIRIS is a tool which enables a multi-user, multi-point-of-view and multi-ontology annotation process.

3.2 The Annotating Process

The manual approach. Annotating a resource which has been imported into OSIRIS starts with the selection of an ontology, which must have been imported beforehand by the administrator of the platform. When an ontology O is selected, the annotating process mainly consists in identifying a set of triples (Subject, Verb, Object) where Subject and Object correspond to concepts of O and Verb corresponds to a relation of O5. To state a triple, two approaches can be distinguished: (1) the end-user directly navigates within the hierarchies of concepts and relations of O, or (2) the end-user freely expresses a list of terms which are then compared with the entries of the thesauri associated with O. In the first case, the end-user is guided by the interface because, when he identifies the concept C1 corresponding to the Subject or the Object of its current annotation,
5 Of course, for the same resource, the end-user can repeat this process by using another ontology, in order to give another point of view on the same resource.
then only the relations having the selected concept C1 (or more specific concepts of C1) in their signature are proposed by the interface. The same holds when the end-user starts with the identification of the relation associated with the Verb: only the compatible concepts (defined by the signature of the relation) are then accessible. In the second case, which aims at offering more freedom and openness from a linguistic point of view, OSIRIS uses the thesauri to find the concepts and relations underlying the set of terms expressed by the end-user. When OSIRIS finds correspondences between the terms of the end-user and the entries of the ontology extended with the thesauri, it proposes the possible triples (Subject, Verb, Object), which are validated (or rejected) by the end-user.
Fig. 3. Illustration of the annotating process
Figure 3 illustrates this process in the context of the resource "Baiser de l'hôtel de ville" by Doisneau. Before applying the axioms, the annotations of this resource are as follows: "A woman who kisses a man" (in French, [femme:* embrasser homme:*]); "A photography of the artist Robert Doisneau" ([photographie:* photographier artiste:doisneau_robert]); "A man who wears a beret" ([homme:* porter beret:*]). After the automatic application of the axioms, two new annotations are produced: "A photography which is dated from the 20th century" ([photographie:* dater 20_siecle:*]) and "A man who kisses a woman" ([homme:* embrasser femme:*]). It is important to underline that it is also possible to link annotations in order to specify, for instance, that two instances of the same concept are different (or, respectively, identical). In Figure 3, this allows the end-user to specify that the man who wears the beret is different from the man who kisses the woman. Each annotation is recorded (in CGXML) within the files via the standards IPTC, ID3 and ODF. OSIRIS also permits the automatic extraction of keywords (stored as IPTC, ID3 or ODF meta-data) from the annotations: for each triple (Subject, Verb, Object), OSIRIS computes a set of terms which corresponds to the union of all the synonyms of the concepts Subject and Object (and their sub-concepts) and all the synonyms of the relation Verb (and its possible sub-relations).
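A minimal sketch of such a triple-based annotation record and of the keyword-expansion step might look as follows; the data structures and the toy thesaurus below are our own illustrations, not the internal representation of OSIRIS.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class Triple:
    subject: str                 # concept, optionally "concept:instance"
    verb: Optional[str] = None   # relation; partial triples are allowed
    obj: Optional[str] = None
    ontology: str = "Onto1"
    user: str = "u1"

# Toy thesaurus: concept/relation -> synonyms (a stand-in for real thesauri).
SYNONYMS = {
    "man": ["man", "gentleman"],
    "woman": ["woman", "lady"],
    "to-kiss": ["kiss", "embrace"],
}

def keywords(triple: Triple) -> Set[str]:
    """Union of the synonyms of Subject, Verb and Object, as would be
       stored in the IPTC/ID3/ODF meta-data fields."""
    terms = set()
    for part in (triple.subject, triple.verb, triple.obj):
        if part:
            concept = part.split(":")[0]
            terms.update(SYNONYMS.get(concept, [concept]))
    return terms

ann = Triple("man", "to-kiss", "woman")
print(keywords(ann))  # e.g. {'man', 'gentleman', 'kiss', 'embrace', 'woman', 'lady'}
```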
The automatic approach. The heavyweight ontologies manipulated by OSIRIS intrinsically include axioms. Applying these axioms to the annotations previously defined by the end-users allows performing an automatic enrichment of the annotations. Figure 3 illustrates the results of this process after applying the axioms of two ontologies: OntoArt, dedicated to the History of Art and including the axiom "Any work of Doisneau is dated from the 20th century", and OntoCourant, an ontology covering phenomena of everyday life and including (for example) the To_Kiss relation, which is defined between two Human (where the Human concept can be specialized into Man and Woman). The application of the Symmetry algebraic property of the To_Kiss relation (which is represented by a Schemata Axiom in OCGL) produces the new annotation "Man:* To_Kiss Woman:*" (from the original annotation "Woman:* To_Kiss Man:*"), and the application of the axiom "Any work of Doisneau is dated from the 20th century" (represented by a Domain Axiom in OCGL) produces the new annotation "Photography:* To_Date 20th_century:*".

In addition, when the end-user imports a new resource, OSIRIS checks whether it already includes keywords (which may have been associated with it by other platforms such as YouTube or Flickr) via the standards IPTC, ID3 or ODF. If so, OSIRIS starts an analysis of these keywords in order to automatically find relevant annotations. This is done by comparing the keywords with the entries of the thesauri coupled with the ontologies. This analysis leads to a set of potential annotations that the end-user must validate.

3.3 The Searching Process

The searching process starts with the expression of a query in terms of (Subject, Verb, Object), or with a set of queries connected by the logical operators AND/OR. To formulate queries, the end-user can either navigate in the hierarchies of the ontologies (cf. Section 3.2) or freely (and directly) express terms which are then compared with the entries of the thesauri in order to find the subjacent concepts and relations. Note that it is possible to formulate partial queries, i.e. queries which do not include all the elements of the triple but only part of it, such as (Subject), (Object), (Subject, Verb) or (Verb, Object). Each query C corresponds to one (or several) conceptual graph(s). The search for the resources which satisfy the criterion defined by the query is performed by using the projection operator of the CG model: a resource Ri satisfies a query C if there exists (at least) one projection from the conceptual graph representing C to the graphs representing the annotations associated with Ri. Figure 4 illustrates an example of a query whose criteria are: (1) "a contemporary work of art", represented by the conceptual graph (Work_of_art:* To_Date Contemporary:*) (oeuvre:* dater contemporain:* in French), where Work_of_art and Contemporary are concepts and To_Date a relation, AND (2) "whose content incarnates a woman", represented by the conceptual graph including only one concept (Woman:*) (femme:* in French). A resource R is considered relevant for the query when there exists a projection from each one of these two graphs into (at least) one of the graphs representing the annotations of R.
Fig. 4. Illustration of the searching process: "Contemporary works of art representing women". This query corresponds to the two triples (Work_of_art:* To_Date Contemporary:*) and (Woman:*). The works of art which are proposed are photographs, sculptures or paintings.
OSIRIS also makes it possible to perform searches on instances of concepts. For example, "What are the paintings created by the artist Picasso?" (represented by the graph "Painting:* To_Create Artist:Picasso") specifies that the searching process must focus on the works of Picasso that are paintings, and not on his other works such as sculptures. OSIRIS also makes it possible to express partial queries which involve only one relation, without further precision on the concepts (for example, "To_Kiss").
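The projection test can be approximated at the triple level as follows: a query element matches an annotation element when it is equal to it or is a generalization of it in the concept (or relation) hierarchy. The sketch below, with a toy specialization hierarchy of our own, is a simplified stand-in for the graph homomorphism computed by CoGITaNT, not the platform's actual code.

```python
# Toy specialization hierarchies: child -> parent.
IS_A = {
    "photography": "work_of_art",
    "sculpture": "work_of_art",
    "painting": "work_of_art",
    "man": "human",
    "woman": "human",
}

def specializes(label, target):
    """True if label equals target or is a (transitive) specialization."""
    while label is not None:
        if label == target:
            return True
        label = IS_A.get(label)
    return False

def matches(query, annotation):
    """Triple-level stand-in for CG projection: every non-None element of
       the query must subsume the corresponding annotation element."""
    return all(q is None or specializes(a, q)
               for q, a in zip(query, annotation))

def relevant(query, resource_annotations):
    """A resource satisfies the query if some annotation is subsumed by it."""
    return any(matches(query, ann) for ann in resource_annotations)

anns = [("photography", "to_date", "contemporary"), ("woman", None, None)]
print(relevant(("work_of_art", "to_date", "contemporary"), anns))  # True
print(relevant(("woman", None, None), anns))                       # True
```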
4 Conclusion

OSIRIS is a platform that enables the development of collaborative web spaces dedicated to the sharing of multimedia resources. Semantic annotation based on the use of conceptual graphs is not a new approach: several works have already adopted it [2, 4]. Thus, the originality of our work lies not in the adopted approach, but in the context in which this approach is considered. Indeed, contrary to similar works, OSIRIS makes it possible to have several ontologies cohabit within the same communautary space, and these ontologies can be refined by the members of the community. This work is currently in progress towards a thorough study of the tags associated with resources by way of Web 2.0 platforms like YouTube or Flickr, in order to automatically enrich the thesauri and to discover possible gaps in the ontologies under consideration. Our assumption is that the tags defined and shared by the communities (semantic and social tagging) are good vectors of the evolution of the underlying fields of knowledge (we are currently testing this idea in the context of a French project related to Cultural Heritage Preservation through the collaborative collection development of old and popular postcards). In this context, it appears relevant to use this type of material to tackle the problem of ontology evolution, which is one of the key factors in the popularisation of semantic and participative platforms such as OSIRIS.
References

1. Bechhofer, S., Volz, R., Lord, P.: Cooking the Semantic Web with the OWL API. In: Fensel, D., Sycara, K.P., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 659–675. Springer, Heidelberg (2003)
2. Bocconi, S., Nack, F., Hardman, L.: Supporting the Generation of Argument Structure within Video Sequences. In: Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, pp. 75–84 (2005)
3. Chein, M., Mugnier, M.L.: Conceptual Graphs: fundamental notions. Revue d'Intelligence Artificielle (RIA) 6(4), 365–406 (1992), Hermès
4. Crampes, M., Ranwez, S.: Ontology-Supported and Ontology-Driven Conceptual Navigation on the World Wide Web. In: Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia, pp. 191–199 (2000)
5. Euzenat, J., Shvaiko, P.: Ontology Matching, p. 341. Springer, Heidelberg (2007)
6. Fürst, F., Trichet, F.: Heavyweight Ontology Engineering. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM 2006 Workshops. LNCS, vol. 4277, pp. 38–39. Springer, Heidelberg (2006a)
7. Fürst, F., Trichet, F.: Reasoning on the Semantic Web needs to reason both on ontology-based assertions and on ontologies themselves. In: Proceedings of the International Workshop on Reasoning on the Web (RoW 2006), co-located with the 15th International World Wide Web Conference (WWW 2006), Edinburgh (2006b)
8. Fürst, F., Leclère, M., Trichet, F.: Operationalizing domain ontologies: a method and a tool. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), pp. 318–322. IOS Press, Amsterdam (2004)
9. Gomez-Perez, A., Fernandez-Lopez, M.: Ontological Engineering. Advanced Information and Knowledge Processing (2003)
10. Genest, D., Salvat, E.: A platform allowing typed nested graphs: how CoGITo became CoGITaNT. In: Mugnier, M.-L., Chein, M. (eds.) ICCS 1998. LNCS (LNAI), vol. 1453, pp. 154–161. Springer, Heidelberg (1998)
11. Greaves, M.: Semantic Web 2.0. IEEE Intelligent Systems 22(2), 94–96 (2007)
12. Shvaiko, P., Euzenat, J.: A Survey of Schema-based Matching Approaches. In: Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Heidelberg (2005)
13. Sowa, J.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading (1984)
Interactive Cluster-Based Personalized Retrieval on Large Document Collections

Petros Belsis¹, Charalampos Konstantopoulos², Basilis Mamalis¹, Grammati Pantziou¹, and Christos Skourlas¹

¹ Department of Informatics, TEI of Athens
² Department of Informatics, University of Piraeus
[email protected],
[email protected], {pantziou,vmamalis,cskourlas}@teiath.gr
Abstract. Lately, many systems and websites have added personalization functionalities to their provided services. However, for large document collections it is difficult for the user to formulate effective queries from the beginning of his/her search, since accurate query terms may not be known in advance. In this paper we describe a system that applies a hybrid approach to assist a user in identifying the most relevant documents: at the beginning it applies dynamic personalization techniques based on user modeling to initiate the search on a large document and multimedia content collection; next, the query is further refined using a clustering-based approach which, after processing a sub-collection of documents, presents the user with new categories and a list of new keywords to select from. We analyze the most prominent implementation choices for the modular components of the proposed architecture: a machine learning approach for personalized services, a clustering-based approach towards user-directed query refinement, and a parallel processing module that supports document clustering in order to decrease the system's response times.
1 Introduction The continuous growth of data stored in different types of systems, such as information portals and digital libraries, has created an overwhelming amount of information that a user has to deal with. Many query-refinement approaches have emerged to give the user a more efficient retrieval process with respect to his/her personal interests. A wide variety of systems also integrate personalization features that aim to assist the user in identifying knowledge items that match the user's preferences. Among others, digital libraries, document management systems and multimedia data warehouses focused on scientific data storage grow significantly in size as new scientific content is gathered daily in different areas of research. Considering that each user has specific areas of expertise or interest, a digital library constitutes a good test-bed domain where personalization techniques may prove to be beneficial. Still, in large document collections it is hard to build an efficient user model that contains adequate sub-categories to support the user's preferences, for two reasons: first, it would be difficult to identify appropriate sub-categories with respect to the number of existing users; second, classification of incoming documents would impose a significant overhead on the system.
In this paper, we describe a hybrid approach that utilizes personalization techniques at the initiation of the user's interaction with the system and then proceeds towards a more user-oriented interaction, where the user participates in the dynamic clustering process by selecting the sub-categories that arise dynamically after processing subsets of the documents. In order to keep the system's response times low, we also apply parallel processing techniques when processing the document sub-clusters selected by the user. The rest of the paper is organized as follows. Section 2 presents related work in context; Section 3 presents the main principles that drive the design of the proposed architecture and discusses the structure of its modular components, while Section 4 concludes the paper.
2 Related Work A wide variety of research prototypes as well as commercial solutions offering personalized services to their users have emerged lately. Many of the successful deployments use machine learning methods, which aim at integrating among the system's features the ability to adapt to the user's needs and to perform many of the necessary tasks in an automated way [7].

2.1 User Models, Stereotypes, Communities and Personalization Systems

Personalization technology aims to adapt information systems, information retrieval and filtering systems, etc. to the needs of the individual user. A user model may contain personal details about the user, such as occupation, interests, etc., and information gathered through the interaction of the user with the system. User community models are generic models that apply to the needs of groups of users and usually do not use explicitly provided personal information. If personal information is given, the community models are called stereotypes. Machine learning techniques have been applied to construct all these types of models, which are used in digital library services, personalized news services, etc. For example: The MyView system [1] collects bibliographic data and assists the user in browsing digital libraries. MyView supports direct on-line reorganization, browsing and selection as specified by the user. Among its strong features is support for browsing in heterogeneous distributed repositories. It does not store the actual data sources but metadata pointing to them. It also supports user-directed browsing. The PNS [4] is a generic system that offers its users personalized news services. Its architecture consists of sub-modules that collect user-related data, either explicitly inserted by the user or implicitly gathered by monitoring the user's behavior. A personalization module builds the user's model and makes recommendations on topics that fall within the user's interests. The PNS also contains a content management module that collects information about the actual content sources and indexes them, storing not the actual sources but the indexing information collected by special-purpose wrappers.
2.2 Document Clustering and Parallel Techniques

A large number of document clustering algorithms exist. They are usually classified into two main categories: hierarchical algorithms and partitional algorithms. Partitioning assigns every document to a single cluster iteratively [17] in an attempt to determine k partitions that optimize a certain criterion function [18]. Partitional clustering algorithms usually have better time complexity than hierarchical algorithms. The K-means algorithm [21] is a popular clustering method of this category. A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. Hierarchical clusterings generally fall into two categories: splitting and agglomerative methods. Splitting methods work top-down, splitting clusters until a certain threshold is reached. The more popular agglomerative clustering algorithms use a bottom-up approach to merge documents into a hierarchy of clusters [19]. Agglomerative algorithms typically use a stored-matrix or stored-data approach [20]. There also exist several algorithms that combine the accuracy of the hierarchical approach with the lower time complexity of the partitioning approach to form a hybrid approach. A popular such algorithm is the Buckshot algorithm [8] (see also Section 3.2). A detailed overview of sequential document clustering algorithms can be found in [9] and [16].

Many authors have also examined parallel algorithms for both hierarchical and partitional clustering [22]. In [23], Olson provides a comprehensive review of parallel hierarchical clustering algorithms. Two versions of parallel K-means algorithms are discussed in the recent literature. In [21], Dhillon and Modha proposed a parallel K-means algorithm for distributed memory multiprocessors. Xu and Zhang [24] designed a parallel K-means algorithm with low communication overhead for clustering high-dimensional document datasets. Besides K-means, some other classical clustering algorithms also have corresponding parallel versions, such as the parallel PDDP algorithm [24] and the parallel Buckshot algorithm (given earlier in [15] and most recently in [9]).

2.3 The Scatter/Gather Approach

Scatter/Gather was first proposed by Cutting et al. [8] as a cluster-based method for browsing large document collections. The method works as follows: in the beginning, the system scatters the initial document collection into a small set of clusters (i.e., document groups) and presents to the user short descriptive summaries of these clusters. The summaries may include text that characterizes the cluster in general, as well as terms that sample the contents of the cluster. Based on these summaries, the user can select one or more of the clusters for further examination. The clusters selected by the user are gathered together into a subcollection. Subsequently, online clustering is applied again to scatter the subcollection into a new small set of clusters, whose summaries are presented to the user. The above process may be repeated; after each iteration the clusters become smaller and more detailed. With the Scatter/Gather method the user is not forced to provide query terms; from the beginning he/she is presented with a set of clusters. The successive iterations of the method help the user to find the desired information in a large document collection.
Therefore, the Scatter/Gather approach is very useful when the user cannot or does not want to express a query formally. In addition, as Hearst and Pedersen showed in [13, 14], the Scatter/Gather method can also significantly improve retrieval results over a very large document collection. Since each iteration of the Scatter/Gather method requires online clustering of a large document collection, fast clustering algorithms should be employed. Cutting et al. [8] proposed and applied to Scatter/Gather two clustering procedures: Buckshot (which is also used in our hybrid approach) and Fractionation. In [12], a scheme is proposed that, after near-linear-time pre-processing (O(kN log N)), requires constant time in the online phase for arbitrarily large document collections. The method involves the construction of a cluster hierarchy. Liu et al. in [16] also proposed a new algorithm for Scatter/Gather browsing which achieves constant response time for each Scatter/Gather iteration. Their algorithm requires (as does the algorithm in [12]) the construction of a cluster hierarchy.
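To make the interaction pattern concrete, the following minimal sketch (in Python) outlines the Scatter/Gather loop just described. It is an illustration only, not an implementation from [8]; the clustering, summarization and selection routines are passed in as hypothetical callables.

def scatter_gather(collection, cluster, summarize, select, k=5, min_size=10):
    """Iterate the Scatter/Gather loop: scatter the collection into k
    clusters, show summaries, gather the user's chosen clusters into a
    new subcollection, and repeat until it is small enough to enumerate."""
    while len(collection) > min_size:
        clusters = cluster(collection, k)                 # scatter step
        summaries = [summarize(c) for c in clusters]
        chosen = select(clusters, summaries)              # user picks one or more
        collection = [doc for c in chosen for doc in c]   # gather step
    return collection  # small enough to list individual documents

With each pass, k stays fixed while the subcollection shrinks, so the clusters become smaller and more detailed, exactly as described above.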
3 System Architecture The proposed architecture consists of three sub-modules: i) the personalization sub-module, which collects user-related data and initially recommends categories containing documents related to the user's interests, ii) the content repository, which is responsible for storing the documents and facilitates a user-directed search by performing a Scatter/Gather approach, and iii) the parallel processing module, which is responsible for speeding up online clustering procedures as well as for preprocessing documents in real time. In the following sections we explain the main concepts behind the functionality of each sub-module. Fig. 1 shows a generic overview of the proposed architecture.
Fig. 1. Generic overview of the system’s architecture
3.1 Personalization Module In order to build an accurate and effective user model, there are two main tasks that the system should support: either i) the user, at the time of registration, should be able to provide details about his/her personal preferences so as to easily create his/her
model/stereotype, or ii) the sequence of topics that he/she usually selects is monitored and can be used to create a model which will direct the system to classify him/her into one community. In general, the personalization module should provide support for the following operations:

• Provide support for new user registration
• Keep track of users' preferences with respect to the topics of interest that affect their interaction with the system
• Present personalized information to users that have similar interests or, in general, can be classified into a common behavior stereotype
With respect to the classification model adopted, the system must support the creation of user models/communities using a feature-based representation. Towards this, a user model/community is created, using a machine learning approach, from the list of generic sub-categories that the user usually explores. Typical algorithms which have been successfully applied in this direction are the COBWEB algorithm and the Cluster Mining algorithm [3] and its variations [4][6]. Paliouras et al. [5] studied a free-text-query-based information retrieval service and constructed models of user communities using these two algorithms. They compared the two approaches using two evaluation criteria: 1) coverage, the proportion of features covered by the models, and 2) distinctiveness, the number of distinct features that appear in at least one model divided by the sum of the sizes of all models. Eventually, they concluded that "the cluster mining method is doing consistently better than COBWEB, in terms of coverage, and distinctiveness". The main principle of the Cluster Mining algorithm is to create, from a graph that contains all the possible features, a weighted sub-graph containing all the features associated with a given user model. In other words, the algorithm constructs a weighted graph G(A, E, wA, wE), where the set of vertices A contains all the features and the set of edges E corresponds to the coexistence of two features in the corresponding model. Then, weights are assigned to both the edges E and the vertices A as aggregate usage statistics. In order to lower the complexity of the graph, a threshold can be imposed, which results in rejecting the edges with an assigned value below that threshold. In Figure 2, assuming a threshold of 0.09, the edge between the categories hardware and databases, which has a lower value, is rejected (meaning that there is no strong evidence that the users in this specific stereotype are interested in both categories). The remaining subset of the graph results in the construction of the feature group.
Fig. 2. The feature-based graph that allows creation of the personalization model
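The threshold-based pruning step can be illustrated with a short sketch. This is a simplified reading of the Cluster Mining algorithm, assuming (as one plausible weighting) that edge weights are relative co-occurrence frequencies over user sessions; the paper itself only states that weights are aggregate usage statistics.

from collections import defaultdict
from itertools import combinations

def mine_feature_groups(sessions, threshold=0.09):
    """Build the feature co-occurrence graph, drop edges below the
    threshold, and return the connected components (feature groups)."""
    counts = defaultdict(int)
    for features in sessions:            # one session = set of selected features
        for a, b in combinations(sorted(features), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n / len(sessions) >= threshold:   # keep sufficiently heavy edges
            graph[a].add(b)
            graph[b].add(a)
    groups, seen = [], set()
    for node in graph:                   # connected components by DFS
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        groups.append(comp)
    return groups

With a threshold of 0.09, a weak hardware–databases edge such as the one in Fig. 2 would be rejected, and the remaining subgraph would define the feature group.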
3.2 Clustering-Based Browsing for Large Document Collections

In addition to personalization, our system provides effective automatic browsing using the known Scatter/Gather approach (which is mainly based on the iterative application of document clustering procedures; see Section 2) in order to further facilitate the user's search procedure. Moreover, we apply parallelism over a distributed memory environment in order to obtain better (and acceptable) total performance for very large document collections. Specifically, we first follow the typical Scatter/Gather approach proposed in [8], slightly changed due to the fact that in our system personalized document categorization for each user has already been performed by the personalization module. The predefined categories for each specific user (e.g., based on the user model/stereotype) can serve here as the basic initial clusters for the Scatter/Gather procedure. Thus, initially, the documents belonging to the specific user-profile categories (in other words, the set of initial clusters assigned to the specific user) are gathered together to form a dynamic (for the specific user) subcollection. An appropriate re-clustering procedure is then applied to scatter the user subcollection into a number of document groups, and short summaries of them are presented to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together again to form a new (smaller) subcollection. The system then applies clustering again (re-clustering via the same procedure as above) to scatter the new subcollection into a small number of document groups, whose summaries are again presented to the user. The user selects again, and so on. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.

Note that, since an initial document recommendation and selection has already been performed via the personalization module (assignment of specific categories to each user), an initial heavy (and more accurate) clustering step (such as the Fractionation procedure proposed in [8]) is not necessary. This initial personalized filtering can serve as the basic initial clustering step (performed via a different and more accurate procedure) proposed in the above reference. Thus, in our hybrid approach, only fast online re-clustering procedures have to be considered. In this direction, we use a customized version of the Buckshot algorithm (see [8] for a general description), a typical fast clustering algorithm suitable for the online re-clustering essential to Scatter/Gather. The Buckshot algorithm is a combination of hierarchical and partitioning algorithms, designed to take advantage of the accuracy of hierarchical clustering as well as the low computational complexity of partitioning algorithms. Specifically, it assumes the existence of some (e.g., hierarchical) algorithm which clusters well but may run slowly. This procedure is usually called 'the cluster subroutine'. In our system, we use the single-link hierarchical agglomerative clustering method for this subroutine (instead of group-average or complete-link), in order to obtain initial clusters that are not very tight.
Hierarchical conceptual clustering plays an important role in our work because we plan, in the future, to combine knowledge acquisition with machine learning to extract semantics from resources found on the Web [2]. The algorithm takes a random sample of s = √(kn) documents from the collection and uses the specific 'cluster subroutine' (hierarchical single-link) as the high-precision clustering routine to find initial centers in this random sample. The initial centers generated by the hierarchical agglomerative clustering subroutine can then be used as the basis for clustering the entire collection in a high-performance manner, by assigning the remaining documents in the collection to the most appropriate initial center. The original Buckshot algorithm gives no specifics on how best to assign the remaining documents to appropriate centers, although various techniques are mentioned. In our work we use an iterated assign-to-nearest algorithm with two iterations, similar to the one proposed in [9]. The Buckshot algorithm typically requires linear time (since s = √(kn), the total time is O(kn), where k is much smaller than n), which is very satisfactory. This establishes the feasibility of the Scatter/Gather method for browsing moderately large document collections. For very large document collections, however, the linear time requirement of the online phase makes the Scatter/Gather browsing method not very efficient. On the other hand, in our system the Buckshot procedure is usually expected to run over controlled-size subcollections (since the user subcollections are the results of personalized filtering procedures). Nevertheless, in order to address this inefficiency in either case, we apply parallelism over a distributed memory parallel environment, aiming at acceptable performance even for very large document collections.
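A compact sketch of the sequential Buckshot procedure just described is given below. It is a sketch under stated assumptions rather than the authors' code: SciPy's single-link agglomerative clustering stands in for the 'cluster subroutine', and Euclidean distance stands in for the cosine similarity over term vectors that an IR system would normally use.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot(X, k, iters=2, seed=0):
    """Buckshot sketch: single-link clustering of a random sample of
    size sqrt(k*n) yields the initial centers; the whole collection is
    then assigned to the nearest center, iterated twice."""
    n = X.shape[0]
    s = int(np.sqrt(k * n))
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(n, size=s, replace=False)]
    # high-precision phase: cut the single-link hierarchy into k clusters
    labels = fcluster(linkage(sample, method="single"), k, criterion="maxclust")
    centers = np.vstack([sample[labels == c].mean(axis=0)
                         for c in np.unique(labels)])
    # assign-to-nearest phase, iterated to refine the centers
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        centers = np.vstack([X[assign == c].mean(axis=0) if np.any(assign == c)
                             else centers[c] for c in range(len(centers))])
    return assign, centers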
3.3 Parallel Processing Module

As mentioned above, even the Buckshot algorithm in sequential execution tends to be quite slow for today's very large collections, and even the most simplistic modern clustering techniques tend to be quite slow. A promising remedy is parallel processing. In our proposed system, we use efficient parallel techniques in order to achieve acceptable performance even for very large document collections. Moreover, using a distributed memory architecture we can reduce the time and memory complexity of the sequential algorithms by a factor of p, where p is the number of nodes used. Specifically, towards an efficient design and implementation of the Scatter/Gather clustering techniques proposed in the previous section, we follow the parallel approach presented in [9]. First, an efficient implementation of the underlying hierarchical agglomerative clustering subroutine is constructed (initially based on the parallel calculation of the pairwise document similarity matrix, distributed over the multiple processors, and then iterating to build the cluster hierarchy using the single-link criterion). On top of this parallel clustering subroutine, we build an efficient parallel implementation of the Buckshot algorithm (again similar to the one proposed in [9]). The first phase of the parallel Buckshot algorithm uses the parallel hierarchical clustering subroutine to cluster s random documents. The second phase groups the remaining documents in parallel. After the clustering subroutine has finished, k initial clusters have been created from the random sample of s = √(kn) documents. From the total collection, n − s documents remain that have not yet been assigned to any cluster. The second phase of the Buckshot algorithm assigns these documents according to their similarity to the centroids of the initial clusters. This phase of the algorithm is trivially parallelized via data partitioning. First, the initial cluster centroids are calculated on every node (with appropriate collective parallel functions, aiming at properly reducing the total communication cost). After the centroid calculation is complete, each node is assigned approximately (n − s)/p documents to process. Each node iterates through these documents in place (comparing each document's term vector to each centroid and making the assignment) until all its documents are assigned. The second phase is iterated twice: the second iteration recalculates the centroids and reassigns all the documents to one of the k clusters.

Moreover, we also apply parallelism during the document preprocessing phase, based on our previous work (see [10, 11]). As part of these techniques, a more accurate off-line clustering algorithm is also provided (partitional clustering based on the iterative calculation of connected components of the document similarity matrix, as a specialization of the single-link hierarchical agglomerative clustering algorithm). This global initial clustering method is quite useful if the user wishes to perform global searches from the beginning (entering natural-language keywords, etc.) without using any personalization-based categorization feature. The document indexing process used (essential as part of the off-line setup/preprocessing phase of the system, in order to be able to apply the similarity-based clustering techniques effectively) follows the basics of the Vector Space Model (construction of weighted document vectors, based on the statistical extraction of word stems, phrases and thesaurus classes). To speed up the similarity calculations, we also extract and extensively use a global inverted index (as well as partial local inverted lists when needed for parallel clustering procedures). Some of our parallel processing methods (see [10, 11]) have been extensively tested over a real distributed memory environment, yielding very good performance. As the underlying distributed memory platform we use a Beowulf-class Linux cluster with the MPI-2 (MPICH implementation) message-passing library. Specifically, our cluster consists of 8 Pentium-4 based processors with 1 GB RAM and a dedicated Myrinet network interface which provides 2 Gbps communication speed. The corresponding experiments were conducted over a part of the well-known TIPSTER/TREC standard document collections.
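The data-partitioned second phase described above can be sketched with mpi4py as follows. This is an illustrative sketch, not the authors' MPI code; it assumes each node already holds its local block of roughly (n − s)/p document vectors and that the k initial centroids are known to every node.

import numpy as np
from mpi4py import MPI

def parallel_assign(local_docs, centroids, iters=2, comm=MPI.COMM_WORLD):
    """Each node assigns its local documents to the nearest centroid;
    global centroids are then rebuilt with collective reductions."""
    k, dim = centroids.shape
    assign = np.zeros(len(local_docs), dtype=int)
    for _ in range(iters):
        dist = ((local_docs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        sums = np.zeros((k, dim))
        counts = np.zeros(k)
        for c in range(k):               # per-cluster partial sums on this node
            members = local_docs[assign == c]
            sums[c] = members.sum(axis=0)
            counts[c] = len(members)
        # collective reduction keeps communication cost low: every node
        # obtains the global sums/counts and recomputes the centroids
        sums = comm.allreduce(sums, op=MPI.SUM)
        counts = comm.allreduce(counts, op=MPI.SUM)
        centroids = sums / np.maximum(counts, 1)[:, None]
    return assign, centroids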
4 Conclusion Adding personalization features to websites and other systems has become a very popular approach, and machine learning algorithms have proved to be an effective solution in this direction. Personalization models base their operation on a limited set of features. In large document collections, though, it is not sufficient to direct the user's queries based only on the generic categories that help build the personalization model. We have presented a hybrid approach that initiates the user-system interaction by making propositions to the user based on a user model created either from user feedback or by a machine learning approach based on tracking his/her previous interaction with the system; accordingly, the selected sub-clusters of documents are
processed and new keywords arise, which help build a new set of sub-clusters from the remaining documents. This process proceeds iteratively, with the user's participation, until an adequately small number of documents has been reached through the user-directed queries. The benefit of our approach is that it proceeds in a highly dynamic manner, not limiting the number of features that arise in each step of the query process. Thus, the feature set associated with the resources is updated frequently, resulting in an effective and dynamic re-clustering of documents. In order to keep response times low, parallel processing techniques are employed. We have described the modular components of a proof-of-concept architecture that encompasses the basic principles of our approach, and we have described good selection choices for the system's implementation, which is still under continuous development; still, based on previous experimentation with some of its sub-modules [10][11], we provide adequate evidence of the validity of our approach. Acknowledgments. We are grateful to George Paliouras for his helpful comments on an early version of this article.
References
1. Wolff, E., Cremers, A.: The MyVIEW Project: A Data Warehousing Approach to Personalized Digital Libraries. In: Next Generation Information Technologies and Systems, pp. 277–294 (1999)
2. Godoy, D., Amandi, A.: Modeling user interests by conceptual clustering. Information Systems 31, 247–265 (2006)
3. Perkowitz, M., Etzioni, O.: Learning and revising user profiles: The identification of interesting Web sites. Machine Learning 27, 313–331 (1998)
4. Paliouras, G., Papatheodorou, C., Karkaletsis, V., Spyropoulos, C.D.: Clustering the Users of Large Web Sites into Communities. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 719–726 (2000)
5. Paliouras, G., Papatheodorou, C., Karkaletsis, V., Spyropoulos, C.D.: Discovering User Communities on the Internet Using Unsupervised Machine Learning Techniques. Interacting with Computers 14(6), 761–791 (2002)
6. Paliouras, G., Mouzakidis, A., Ntoutsis, C., Alexopoulos, A., Skourlas, C.: PNS: Personalized Multi-source News Delivery. In: Proceedings of the 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES), Bournemouth, UK, October 2006, pp. 1152–1161 (2006)
7. Langley, P.: User modeling in adaptive interfaces. In: Proceedings of the 7th International Conference on User Modeling, pp. 357–370. Springer, Heidelberg (1999)
8. Cutting, D.R., Pedersen, J.O., Karger, D., Tukey, J.W.: Scatter/Gather: A cluster-based approach to browsing large document collections. In: Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329. ACM, New York (1992)
9. Cathey, R., Jensen, E., Beitzel, S., Frieder, O., Grossman, D.: Exploiting parallelism to support scalable hierarchical clustering. JASIST 58(8), 1207–1221 (2007)
10. Kehagias, D., Mamalis, B., Pantziou, G.: Efficient VSM-based Parallel Text Retrieval on a PC-Cluster Environment using MPI. In: Proceedings of the ISCA 18th Intl. Conf. on Parallel and Distributed Computing Systems (PDCS 2005), Las Vegas, Nevada, USA, September 12–14, pp. 334–341 (2005)
11. Gavalas, D., Konstantopoulos, C., Mamalis, B., Pantziou, G.: Efficient BSP/CGM Algorithms for Text Retrieval. In: Proceedings of the 17th IASTED Intl. Conf. on Parallel and Distributed Computing and Systems (PDCS 2005), Phoenix, Arizona, USA, November 14–16, pp. 301–306 (2005)
12. Cutting, D.R., Karger, D.R., Pedersen, J.O.: Constant interaction-time Scatter/Gather browsing of very large document collections. In: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 126–134. ACM, New York (1993)
13. Hearst, M.A., Karger, D., Pedersen, J.O.: Scatter/Gather as a tool for the navigation of retrieval results. In: Burke, R. (ed.) Working Notes of the AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, Cambridge, MA. AAAI, Menlo Park (1995)
14. Hearst, M.A., Pedersen, J.O.: Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 76–84. ACM Press, New York (1996)
15. Jensen, E.C., Beitzel, S.M., Pilotto, A.J., Goharian, N., Frieder, O.: Parallelizing the Buckshot algorithm for efficient document clustering. In: CIKM 2002: 11th Intl. Conf. on Information and Knowledge Management, pp. 684–686. ACM Press, New York (2002)
16. Liu, Y., Mostafa, J., Ke, W.: A Fast Online Clustering Algorithm for Scatter/Gather Browsing (2006)
17. Hartigan, J.A.: Clustering Algorithms. Wiley, Chichester (1975)
18. Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proceedings of the 1998 ACM-SIGMOD, pp. 73–84 (1998)
19. Jardine, N., van Rijsbergen, C.J.: The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval (1971)
20. Dash, M., Petrutiu, S., Scheuermann, P.: Efficient Parallel Hierarchical Clustering. In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds.) Euro-Par 2004. LNCS, vol. 3149. Springer, Heidelberg (2004)
21. Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Zaki, M.J., Ho, C.-T. (eds.) KDD 1999. LNCS (LNAI), vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
22. Heckel, B., Hamann, B.: Divisive parallel clustering for multiresolution analysis. In: Geometric Modeling for Scientific Visualization, Germany, pp. 345–358 (2004)
23. Olson, C.: Parallel algorithms for hierarchical clustering. Parallel Computing 21 (1995)
24. Xu, S., Zhang, J.: A hybrid parallel web document clustering algorithm and its performance study (366-03) (2003)
Decision Support Services Facilitating Uncertainty Management

Sylvia Encheva

Stord/Haugesund University College, Bjørnsonsg. 45, 5528 Haugesund, Norway
[email protected]
Abstract. This work focuses on taking actions with respect to managing uncertainty situations in system architectures by employing non-Boolean logic. Particular attention is paid to solving problems arising in situations where recognition of all correct answers is required and some responses contain both correct and incorrect options. Keywords: decision support services, uncertainty management.
1 Introduction The importance of uncertainty increases when a long-term view is considered. In order to manage and explore unexpected developments, the introduction of designs that are susceptible to modification and adjustable to changing environments is highly desirable. The need for proper management of uncertainty requires flexibility that can be obtained through system architectures based on non-Boolean logic. While Boolean logic appears to be sufficient for most everyday reasoning, it is certainly unable to provide meaningful conclusions in the presence of inconsistent and/or incomplete input [9]. Many-valued logic, however, offers a solution to this problem. Knowledge assessment is strongly related to the kind of answer alternatives that should be included when establishing a student's level of mastery of a particular concept. Introducing various types of answer alternatives helps to attain a higher level of certainty in the process of decision making. However, providing meaningful responses to all answer combinations is a difficult task. A possible solution to this problem is to first arrange all answer combinations into meaningful sets. The application of many-valued logic for drawing conclusions and providing recommendations is then suggested. This allows the introduction of decision strategies where multi-valued inferences support the comparison of degrees of specificity among contexts. In addition, the involvement of intermediate truth values adds a valuable contribution to the process of comparing degrees of certainty among contexts. The rest of the paper is organized as follows. Related work, basic terms and concepts are presented in Section 2. The management model is described in Section 3. The paper ends with a description of the system in Section 4 and a conclusion in Section 5.
2 Background Let P be a non-empty ordered set. If sup{x, y} and inf{x, y} exist for all x, y ∈ P, then P is called a lattice [2]. In a lattice illustrating a partial ordering of knowledge values, logical conjunction is identified with the meet operation and logical disjunction with the join operation. A lattice L is said to be modular [2] if it satisfies the modular law:

(∀a, b, c ∈ L) a ≥ c =⇒ a ∧ (b ∨ c) = (a ∧ b) ∨ c

Nested line diagrams are used for visualizing large concept lattices, emphasizing sub-structures and regularities, and combining conceptual scales [14]. A nested line diagram consists of an outer line diagram which contains inner diagrams in each node. Both five-valued and seven-valued logics can be obtained from the generalized Łukasiewicz logic [12]. The set of truth values with cardinality n corresponds to the equidistant rational numbers {0, 1/(n−1), 2/(n−1), ..., (n−2)/(n−1), 1}. Seven-valued logic has been employed in reliability measure theory [10], for verification of switch-level designs in [5], and for verifying circuit connectivity of MOS/LSI mask artwork in [13]. The seven-valued logic presented in [10], known also as seven-valued relevance logic, has the following truth values: truth (i.e. valid), false (i.e. invalid), true by default, false by default, unknown, contradiction, and contradiction by default. The authors of [5] define a seven-valued propositional logic called switch-level logic. The truth values are E, D0, D1, DU, S0, S1, SU and are called switch-level values, where S0 and S1 are 'strong' values associated with the supply voltage 'vdd' and ground 'gnd', D0 and D1 are obtained as a result of a degradation effect, SU and DU are undefined values corresponding to a certain strength, and E is the value of all nodes not connected to a source node via a path through the network. These truth values are ordered in the switch-level value lattice of Fig. 1.
Fig. 1. A lattice with switch-level values (nodes: SU, S0, S1, DU, D0, D1, E)
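The equidistant truth-value sets of the generalized Łukasiewicz logic mentioned above are easy to generate; the small sketch below also lists the standard textbook Łukasiewicz connectives for completeness — these connectives are not taken from the papers cited here.

from fractions import Fraction

def lukasiewicz_values(n):
    """Equidistant truth values {0, 1/(n-1), ..., (n-2)/(n-1), 1}
    of the n-valued generalized Lukasiewicz logic."""
    return [Fraction(i, n - 1) for i in range(n)]

# Standard Lukasiewicz connectives over [0, 1] (textbook definitions):
def neg(x): return 1 - x
def implies(x, y): return min(1, 1 - x + y)
def strong_conj(x, y): return max(0, x + y - 1)
def strong_disj(x, y): return min(1, x + y)

print(lukasiewicz_values(5))  # five equidistant values: 0, 1/4, 1/2, 3/4, 1
print(lukasiewicz_values(7))  # seven equidistant values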
We choose to apply the five-valued and the seven-valued logics developed in [3] and [4], because the latter contains all the truth values of the former and thus simplifies the use of both in combination. The five-valued logic in Fig. 2, introduced in [3], is based on the following truth values:

– uu - unknown or undefined,
– kk - possibly known but consistent,
– ff - false,
– tt - true,
– ww - inconsistent.
Fig. 2. Five-valued logic lattice (nodes: tt, ww, kk, ff, uu)
Fig. 3. Lattice of the seven-valued logic (nodes: ii, ff, kk, it, tt, fi, uu)
The seven-valued logic presented in [4] is based on the following truth values:

– uu - unknown or undefined,
– kk - possibly known but consistent,
– ff - false,
– tt - true,
– ii - inconsistent,
– it - non-false, and
– fi - non-true.

Fig. 4. The M3 × 2 lattice
The lattice in Fig. 3 is modular since it is isomorphic to the shaded sublattice of the modular lattice M3 × 2 in Fig. 4, [2].
3 The Test The main idea is to develop a framework for the automated evaluation of knowledge and/or skills learned by a student. In real-life situations a person is often presented with several alternatives and has to choose one of them as a solution to a given problem. To prepare future experts for dealing with such situations, we propose the application of multiple-choice tests where a student should mark all correct answers. Such a test immediately opens possibilities for inconsistent and incomplete input.
Fig. 5. The truth value tt (answer patterns qqq, qqi, qqe, qqu, qqp)
Fig. 6. The truth value ff (answer patterns ppp, ppv, ppe, ppu, ppq)
Fig. 7. The truth value ww (answer patterns eee, eev, eeq, eeu, eep)
Fig. 8. The truth value kk (answer patterns vvv, vvq, vve, vvu, vvp)
The inconsistency occurs if a student's response contains both a correct answer and a wrong answer, and incompleteness occurs when a student does not provide any answer. Such situations cannot be resolved with Boolean logic, because systems based on Boolean logic operate with only two outputs, 'correct' or 'incorrect'. Therefore we suggest the application of many-valued logic.
Fig. 9. Truth value uu (answer patterns uuu, uuv, uue, uuq, uup)

Fig. 10. Truth value fi (answer patterns qpi, qpu, qpv, qve, pvu)

Fig. 11. Truth value it (answer patterns qvu, que, pve, vue, pue)
A test consists of three questions addressing understanding of a concept or mastery of a skill. Stem responses can be:

– true (q),
– false (p),
– answer is missing (u),
– incomplete answer but true (v), and
– both true and false (e).
Fig. 12. Possible outcomes from a single test (the 35 answer patterns of Figs. 5–11 arranged in the lattice of Fig. 3)
The meaning of the last two notations is as follows:

– (v) denotes responses where a part of the answer is missing but whatever is stated is true, and
– (e) denotes responses where one part of the answer is true and another one is false.

A student can take the test several times, depending on the size of the pool of questions and answer alternatives.
Fig. 13. Relations among truth values
Aiming at a simplification of the visualization process, we first group the answer alternatives into sets of five elements. The lattices in Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 10, and Fig. 11 relate the answer alternatives to the seven truth values. These sets are then placed in the nodes of the lattice in Fig. 3; the outcome can be seen in Fig. 12. In Fig. 13 we show in detail how the results of two tests are related. The seven-valued logic is both associative and commutative. This allows combining the results of tests based on truth Table 1, as well as drawing only half of the truth-value dependencies.
Table 1. The ontological operation ∨ in [4]

 ∨  | uu  kk  fi  ff  ii  tt  it
 uu | uu  uu  fi  ff  fi  uu  uu
 kk | uu  kk  fi  ff  fi  kk  uu
 fi | fi  fi  fi  ff  fi  fi  fi
 ff | ff  ff  ff  ff  ff  ff  ff
 ii | fi  fi  fi  ff  ii  ii  ii
 tt | uu  kk  fi  ff  ii  tt  it
 it | uu  uu  fi  ff  ii  it  it
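Since the operation in Table 1 is commutative and associative, test outcomes can be combined pairwise in any order. The following sketch, in Python (also the language of the system's logic layer, see Section 4), encodes Table 1 directly; it is an illustration, not code from the described system.

V = ["uu", "kk", "fi", "ff", "ii", "tt", "it"]
TABLE = {  # rows of Table 1, in the column order of V
    "uu": ["uu", "uu", "fi", "ff", "fi", "uu", "uu"],
    "kk": ["uu", "kk", "fi", "ff", "fi", "kk", "uu"],
    "fi": ["fi", "fi", "fi", "ff", "fi", "fi", "fi"],
    "ff": ["ff", "ff", "ff", "ff", "ff", "ff", "ff"],
    "ii": ["fi", "fi", "fi", "ff", "ii", "ii", "ii"],
    "tt": ["uu", "kk", "fi", "ff", "ii", "tt", "it"],
    "it": ["uu", "uu", "fi", "ff", "ii", "it", "it"],
}

def join(a, b):
    """The ontological operation of Table 1 applied to two truth values."""
    return TABLE[a][V.index(b)]

# Commutativity lets the results of several tests be folded in any order:
assert all(join(a, b) == join(b, a) for a in V for b in V)
outcome = join(join("tt", "uu"), "ii")   # combine three test outcomes
print(outcome)  # fi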
4 Brief System Description A Web application server architecture is proposed for the system implementation, where the Apache Web server deals with the presentation layer, the logic layer is written in Python, and the SQLite database engine is used for implementing the data layer. The back-end SQLite databases are used to store both static and dynamic data.
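As an illustration of the data layer, a minimal sketch using Python's built-in sqlite3 module is shown below. The schema is entirely hypothetical; the paper does not specify the database layout.

import sqlite3

conn = sqlite3.connect("assessment.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS test_result (
    student_id INTEGER REFERENCES student(id),
    test_no    INTEGER,
    pattern    TEXT,     -- e.g. 'qqe': one symbol per question
    truth      TEXT      -- one of uu, kk, fi, ff, ii, tt, it
);
""")
conn.execute("INSERT INTO test_result VALUES (?, ?, ?, ?)", (1, 1, "qqe", "tt"))
conn.commit()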
5 Conclusion This work discusses the assessment of students' understanding of knowledge. The presented framework facilitates the automation of an evaluation process where a student is asked to find all correct answers. Since some of the options presented to the student are incorrect, the system is challenged to provide decisions for cases with incomplete and/or contradictory input. This is resolved by applying many-valued logic.
References
1. Belnap, N.J.: A useful four-valued logic. In: Dunn, J.M., Epstein, G. (eds.) Modern Uses of Multiple-Valued Logic, pp. 8–37. D. Reidel Publishing Co., Dordrecht (1977)
2. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order. Cambridge University Press, Cambridge (2005)
3. Ferreira, U.: A Five-valued Logic and a System. Journal of Computer Science and Technology 4(3), 134–140 (2004)
4. Ferreira, U.: Uncertainty and a 7-Valued Logic. In: Proceedings of the 2nd International Conference on Computer Science and its Applications (ICCSA 2004), San Diego, CA, USA (June 2004)
5. Hähnle, R., Kernig, W.: Verification of switch-level designs with many-valued logic. In: Voronkov, A. (ed.) LPAR 1993. LNCS, vol. 698, pp. 158–169. Springer, Heidelberg (1993)
6. http://httpd.apache.org/
7. http://www.python.org/
8. http://www.sqlite.org/
9. Immerman, N., Rabinovich, A., Reps, T., Sagiv, M., Yorsh, G.: The boundary between decidability and undecidability of transitive closure logics. In: Marcinkowski, J., Tarlecki, A. (eds.) CSL 2004. LNCS, vol. 3210. Springer, Heidelberg (2004)
10. Kim, M., Maida, A.S.: Reliability measure theory: a nonmonotonic semantics. IEEE Transactions on Knowledge and Data Engineering 5(1), 41–51 (1993)
11. Kleene, S.: Introduction to Metamathematics. D. Van Nostrand Co., Inc., New York (1952)
12. Łukasiewicz, J.: On Three-Valued Logic. Ruch Filozoficzny 5, 170–171 (1920); English translation in: Borkowski, L. (ed.) Jan Łukasiewicz: Selected Works. North-Holland, Amsterdam (1970)
13. Takashima, M., Mitsuhashi, T., Chiba, T., Yoshida, K.: Programs for Verifying Circuit Connectivity of MOS/LSI Mask Artwork. In: 19th Conference on Design Automation, pp. 544–550 (1982)
14. Wille, R.: Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications 23(6-9), 493–515 (1992)
Efficient Knowledge Transfer by Hearing a Conversation While Doing Something

Eiko Yamamoto and Hitoshi Isahara

Graduate School of Engineering, Kobe University, 1-1 Rokodai-cho, Nada-ku, Kobe, Hyogo 657-8501, Japan
National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan
[email protected], {eiko,isahara}@nict.go.jp

Abstract. We introduce an approach for developing a conversation synthesizer in order to support knowledge acquisition. The system can transfer knowledge to humans even while the person is doing something else or is not concentrating on listening to the voice. Our approach does not create a summary of the key points of what is being read out, but focuses on the knowledge transfer method for supporting knowledge acquisition. Specifically, to provide knowledge efficiently by computer, that is, to transfer knowledge from computers to humans, we consider what kinds of conversation are naturally retained in the brain, as such conversations may enable people to obtain knowledge more easily. We aim to construct an intelligent system which can create such conversations by applying natural language processing technologies. Keywords: Intelligent narrative environments, Knowledge acquisition support, Natural language processing, Learning by listening.
1 Introduction One of the most common means of acquiring useful knowledge is reading suitable documents and websites. However, this is time-consuming and cannot be done in parallel with other tasks. Is there a way to acquire knowledge when we cannot read written texts, such as while driving a car, walking around or doing housework? It is not easy to remember the contents of a document simply by listening to it being read aloud from the top, even if we concentrate while listening. In contrast, it is sometimes easier to remember words heard on the radio or television even if we are not concentrating on them. While we are doing something, listening to a conversation is better than listening to a precise reading of a draft or summary for memorizing the contents and turning them into knowledge. We are therefore trying to improve the efficiency of knowledge transfer¹ by "hearing a conversation while doing something."
¹ In this paper, "(knowledge) transfer" is a movement of knowledge/information from a knowledge source, including a human, to a human recipient. That is to say, the term "knowledge transfer" means not only transferring knowledge between people but also transferring knowledge from computers to humans. "Acquisition" is the process of understanding/memorizing knowledge by the human recipient. We focus on the process of synthesizing the conversation being uttered for knowledge transfer, which relates to "externalization" in the SECI model [1], in order to realize efficient knowledge acquisition by the recipient, which relates to "combination" in the model.
In order to support knowledge acquisition by humans, we aim to develop a system which provides people with useful knowledge while they are doing something else or not concentrating on listening. We did not try to edit notes to be read out, or to summarize documents; rather, we aimed to develop a way of transferring knowledge. Specifically, in order to provide knowledge efficiently with computers, we consider how to turn content into a dialogue that is easily remembered, and develop a system to produce dialogue through which one can easily acquire knowledge.
2 Sophisticated Eliza Recently, thanks to the improvement of natural language processing (NLP) technology, the development of high-performance computers and the availability of huge amounts of stored linguistic data, useful NLP-based systems now exist. There are also practical speech synthesis tools for reading out documents and tools for summarizing documents. These tools do not necessarily use state-of-the-art technologies to achieve deep and accurate language understanding, but are based on huge amounts of linguistic resources that were previously unavailable. Although current computer systems can collect huge amounts of knowledge from real examples, it is not obvious how to transfer knowledge naturally between such powerful computer systems and humans. We need to develop a novel way to transfer knowledge from computers to humans. We believe that, based on large amounts of text data, it is possible to devise a system which can generate dialogue by a simple mechanism and give people the impression that two intelligent persons are talking. We verified this approach by implementing a system named Sophisticated Eliza [2] which can simulate conversation between two persons on a computer. Sophisticated Eliza is not a human-computer interaction system; instead, it simulates conversation between two people, and users acquire information by listening to the conversation generated by the system. Concretely, using an encyclopedia in Japanese [3] as a knowledge base, we developed rules to extract information from the knowledge base and create fragments of conversation. We extract rules with syntactic patterns to make a conversation, for example, "What is A?" "It's B." from "A is B." The system extracts candidate fragments of conversation using these simple scripts, and two voices then read the conversation aloud. This system cannot generate long conversations on one topic as humans do, but it can simulate short conversations from stored linguistic resources and continue conversations while changing topics. Figure 1 shows a screenshot of Sophisticated Eliza and Figure 2 shows its system flow. Figure 3 shows examples of conversation generated by the system.

Example 1:
Original text in knowledge base
Osaka Castle was a castle in Osaka prefecture from 15th century to 17th century.
Extracted fragment of conversation
A: What is Osaka Castle?
B: It is a castle in Osaka prefecture from 15th century to 17th century.
Fig. 1. Screenshot of Sophisticated Eliza
Fig. 2. System Flow of Sophisticated Eliza (components: huge text database, Japanese parser, manual compilation of rules, templates, conversation extraction, conversation fragment database, conversation generator (fragment selection), synthesizer, Speaker 1 and Speaker 2)
Example 2: Japanese government reinforces bi-relation with African countries and appeals Japanese policy of foreign affairs, aiming to establish environment to solve problems at United Nations.
What activities are done under the supporting program for Africa?
Fig. 3. Examples of Generated Conversation
The encyclopedia utilized here covers all aspects of Japan, e.g., history, culture, economy and politics. All sentences in the encyclopedia are analyzed syntactically using a Japanese parser, and we use rules to extract the fragments of conversation from the information in the encyclopedia. As for the manual compilation of rules, we carefully analyzed the output of the Japanese parser and found useful patterns for extracting knowledge from the encyclopedia. The terms extracted during the syntactic analysis are stored in the keyword table and are used for the selection of topics and words during conversation synthesis. Note that our current system uses Japanese documents as input; because we use only the syntactic information output by the Japanese parser, our mechanism is also applicable to other languages such as English. We use a rather simple mechanism to generate the actual conversations in the system, which includes rules to select fragments containing similar words and rules to change topics. The contents of the encyclopedia are divided into seven categories: geography, history, politics, economy, society, culture and life. When the topic in a conversation moves from one category to another, the system generates an utterance signalling the move. As for the speech synthesis part, we use the synthesizer developed by Oki Electric Industry Co. Ltd., Japan. The two authors of this paper, one male and one female, recorded 400 sentences each, and the two characters in the system talk to each other by impersonating our voices. The images of the two characters are also based on the authors. Because this system uses simple template-like knowledge, it cannot generate semantically deep conversation on a topic by considering context or by compiling highly precise rules to extract script-like information from text. Thus, the mechanism used in this system has room for improvement in creating conversations for knowledge transfer.
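As an illustration of the template mechanism, the toy sketch below applies the "A is B." pattern quoted above to English text. It is only a stand-in: the actual system applies manually compiled rules to the output of a Japanese parser, not a regular expression to raw English sentences.

import re

# Hypothetical English stand-in for one extraction template.
PATTERN = re.compile(r"^(?P<A>.+?) (?:is|was) (?P<B>.+)\.$")

def extract_fragment(sentence):
    """Turn 'A is B.' into the fragment ('What is A?', 'It is B.')."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return ("What is {}?".format(m.group("A")),
            "It is {}.".format(m.group("B")))

q, a = extract_fragment("Osaka Castle was a castle in Osaka prefecture "
                        "from 15th century to 17th century.")
print(q)  # What is Osaka Castle?
print(a)  # It is a castle in Osaka prefecture from 15th century to 17th century.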
3 Efficient Knowledge Transfer by Hearing a Conversation While Doing Something In the daily transfer of knowledge, such as in a cooking program on TV, there is not only the reading aloud of recipes by the presenter but also conversation between the cook and an assistant. Through such conversations, information which viewers want to know and which they should memorize is transferred to them naturally. We started to develop a mechanism to achieve natural knowledge acquisition for humans by turning information that is written in documents into conversational text. Efficient methods of acquiring knowledge include not only "reading documents" and "listening to passages read aloud," but also "hearing a conversation while doing something," provided that information is appropriately embedded into the conversation. We believe that we can verify that such "conversation hearing" can assist knowledge acquisition by developing a system for synthesizing conversations from collected fragments of conversation and conducting experiments with the system. As a means of transferring information, contents conveyed by an interpretive reading with pronounced intonation are better retained in memory than contents read monotonously from a document or summary. Furthermore, by turning contents into conversation style, even someone who is not concentrating on listening may become interested in the topic and acquire the contents naturally. This suggests that several factors in conversations, such as interjected words of agreement, pauses and questions, which may appear to decrease the density of information, are actually effective means of transferring information, matching humans' ability to acquire knowledge with limited concentration. Based on this idea, we propose a novel mechanism for an information transfer system by considering the way knowledge is transferred from computers to humans. Various dialogue systems have already been developed as communication tools between humans and computers [4, 5]. However, in our novel approach, the dialogue system regards the user as an outsider: it presents a conversation between two speakers in the computer which is of interest to the outside user, and thus provides the user with useful knowledge. There are dialogue systems [6, 7, 8] which can join in a conversation between a human and a computer, but they simply create fragments of conversation and so do not sound like intelligent human speakers. One reason is that they do not aim to provide knowledge or transfer information to humans, and few theoretical evaluations have been done in this field. In this research, we consider a way to transfer knowledge and develop a conversation system which generates dialogue by which humans can acquire knowledge from a dialogue conducted by two speakers in the computer. We analyze the way knowledge is transferred to humans with this system. This kind of research is beneficial not only from an engineering viewpoint but also for cognitive science and cognitive linguistics. Furthermore, speech synthesis systems in which two participants conduct spoken conversation automatically are rare. In this research, we develop an original information-providing system by assigning conversation to two speakers in the computer in order to transfer knowledge to humans.
4 System Implementation The principle behind Sophisticated Eliza is that, because a large amount of text data is available, even if the recall of information extraction is low, we can obtain sufficient information to generate short conversations. However, the rules still need to be improved by careful analysis of input texts. As for the information transfer system, although our final target is to handle topics which are practically useful, such as knowledge from newspapers, encyclopedias and Wikipedia, as a first step we are trying to compile rules for small procedural domains such as cooking recipes. Concretely, we are developing the new system by iterating over the following five steps.

1) Enlargement of the conversational script and templates in order to generate sentences in natural conversation. We have already compiled simple templates for extracting fragments of conversation as a part of Sophisticated Eliza. We are now enlarging the set of templates to handle wider contexts, domain-specific knowledge and the insertion of words. This enlargement is basically being done manually. Here, domain-specific knowledge includes domain documents in a specific format, such as recipes. Insertion of words includes words of agreement and encouragement for the other speaker, part of which was already introduced in Sophisticated Eliza. An example of a synthesized conversation is shown in Figure 4.

A: Let's make boiled scallop with lettuce and cream.
B: It is 244 Kcal for one person.
A: What kinds of materials are needed?
B: Lettuce and scallop. For four persons, four pieces of tomatoes and ……
……………
A: How will we cook lettuce?
B: Pour off the hot water after boiling it. Then cool it.
A: How about tomatoes?
B: Remove seeds and dice them.

Fig. 4. Example conversation
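Step 1 can be illustrated by a small sketch that fills recipe-domain templates from a structured recipe, producing turns like those in Fig. 4. The recipe fields and templates here are hypothetical simplifications of the manually compiled ones.

def recipe_dialogue(recipe):
    """Fill simple conversational templates from a structured recipe."""
    turns = [("A", "Let's make {}.".format(recipe["title"])),
             ("B", "It is {} Kcal for one person.".format(recipe["kcal"])),
             ("A", "What kinds of materials are needed?"),
             ("B", ", ".join(recipe["ingredients"]) + ".")]
    for ingredient, step in recipe["steps"]:
        turns.append(("A", "How will we cook {}?".format(ingredient)))
        turns.append(("B", step))
    return turns

recipe = {"title": "boiled scallop with lettuce and cream", "kcal": 244,
          "ingredients": ["lettuce", "scallop", "tomatoes"],
          "steps": [("lettuce", "Pour off the hot water after boiling it. Then cool it."),
                    ("tomatoes", "Remove seeds and dice them.")]}
for speaker, line in recipe_dialogue(recipe):
    print("{}: {}".format(speaker, line))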
2) Implementation of a system in which two speakers (agents/characters) converse in a computer, considering dialogue and document contexts. Using the conversational templates extracted based on the contexts, the system continues the conversation between the two speakers. Fundamental functions of this kind have already been developed for Sophisticated Eliza. Here, there are two types of "context." One is the context in the documents, i.e., the knowledge base. For the recipe example, cooking heavily depends on the order of each process and on the result of each process. The
other type is the context in the conversation. If all subevents included in an event were explicitly uttered in conversation, it would be very dull and would obstruct understanding. For example, “Make hot water in a pan. Peel potatoes and boil them” is enough, and it is not necessary to say “boil peeled potatoes in the hot water in a pan.” Appropriate use of ellipsis and anaphoric representation based on the context in the conversation are useful tools for easy understanding. Though speech synthesis itself is out of the scope of our research, pauses in utterances are also important in natural communication.

3) Mechanism to extract (fragments of) knowledge from text.
Sophisticated Eliza outputs informative short conversations, but the content of the conversation is not consistent as a whole. In this research, we are developing a system to provide people with some useful knowledge. We have to recognize the useful part of the knowledge base and to place great importance on the extracted useful part of the text. We previously reported how to extract an informative set of words using a measure of inclusive relations [9], and will apply a similar method to this conversation system; a toy sketch of such an inclusion measure is given after this list.

4) Improvement of conversation script and template considering “fragments of knowledge.”
By considering the useful part of the information written in the knowledge base, we modify the templates used to extract conversational text. Contextual information such as ellipsis and anaphora is also treated in this part. As a first step, we will handle anaphora resolution in a specific domain, such as cooking, considering the factors described in 2). We will use domain knowledge about cooking such as cookware, cookery and ingredients.

5) Evaluation.
We will conduct tests with participants to evaluate our methodology and verify the effectiveness of our method for transferring knowledge. So far, a small number of participants have reported that it is rather easy to listen to the voice of the system; however, objective evaluation is still future work.
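For step 3), the following toy sketch shows the kind of inclusion measure alluded to above. The exact measure of [9] differs, so this is only a stand-in that scores how far the occurrence contexts of one word are included in those of another.

    def inclusion(contexts_a, contexts_b):
        """Fraction of word b's contexts in which word a also occurs.
        A value near 1.0 suggests that a 'includes' b (a is the broader term).
        This is a simplification, not the measure actually used in [9]."""
        contexts_a, contexts_b = set(contexts_a), set(contexts_b)
        if not contexts_b:
            return 0.0
        return len(contexts_a & contexts_b) / len(contexts_b)

    # Toy co-occurrence data: each word maps to the sentences it occurs in.
    occurs = {"vegetable": {1, 2, 3, 4}, "tomato": {2, 3}, "scallop": {5}}
    print(inclusion(occurs["vegetable"], occurs["tomato"]))   # 1.0
    print(inclusion(occurs["vegetable"], occurs["scallop"]))  # 0.0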
5 Conclusion

We introduced an approach for developing an information-providing system to support knowledge acquisition. The system can transfer knowledge to humans even while the person is doing something else or is not concentrating on listening to the voice. Our approach does not create a summary of the key points of what is being read out, but focuses on the knowledge-transfer method itself. Specifically, to provide knowledge efficiently, we consider what kinds of conversation are naturally retained in the brain, as such conversations may enable people to obtain knowledge more easily. We aim to construct an intelligent system which can create such conversations by applying natural language processing techniques.
References 1. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company. Oxford University Press, Oxford (1995) 2. Isahara, H., Yamamoto, E., Ikeno, A., Hamaguchi, Y.: Eliza’s daughter. In: Annual Meeting of the Association for Natural Language Processing of Japan (2005) 3. Bilingual Encyclopedia about Japan (in Japanese and English), Kodansha International (1998) 4. Weizenbaum, J.: ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine. Communications of the ACM 9(1), 36–45 (1966) 5. Matsusaka, Y., Tojo, T., Kubota, S., Furukawa, K., Tamiya, D., Hayata, K., Nakano, Y., Kobayashi, T.: Multi-person Conversation via Multimodal Interface –A Robot who Communicate with Multi-user. In: Proceedings of 6th European Conference on Speech Communication Technology (EUROSPEECH 1999), vol. 4, pp. 1723–1726 (1999) 6. Nadamoto, A., Tanaka, K.: Passive viewing of Web contents based on automatic generation of conversational sentences. Japanese Society of Information Processing 2004-DBS-134(1), 183–190 (2004) 7. ALICE: http://alice.pandorabots.com 8. Artificial non-Intelligence UZURA: http://www.din.or.jp/~ohzaki/uzura.htm 9. Yamamoto, E., Kanzaki, K., Isahara, H.: Extraction of hierarchies based on inclusion of co-occurring words with frequency information. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1166–1172 (2005)
On Managing Users’ Attention in Knowledge-Intensive Organizations Dimitris Apostolou1, Stelios Karapiperis1, and Nenad Stojanovic2 1
University of Piraeus, Karaoli & Dimitriou St. 80 Piraeus, Greece 185 34
[email protected],
[email protected] 2 FZI at the University of Karlsruhe, Haid-und-Neu Strasse 10-14, 76131 Karlsruhe, Germany
[email protected]
Abstract. In this paper we present a novel approach for managing users’ attention in knowledge-intensive organizations, which goes beyond informing a user about changes in relevant information towards proactively supporting the user in reacting to changes. The approach is based on an expressive attention model, which is realized by combining ECA rules with ontologies. We present the architecture of the system, describe its main components and present early evaluation results. Keywords: attention management, semantic technologies.
1 Introduction

Success factors in knowledge-intensive and highly dynamic business environments are mostly the ability to rapidly adapt to complex and changing situations and the capacity to deal with a large quantity of information of all sorts. For knowledge workers, these new conditions have translated into the acceleration of time, the multiplication of projects in which they are involved, and increased collaboration with colleagues, clients and partners. Knowledge workers are overloaded with potentially useful and continuously changing information originating from a multitude of sources and tools. In order to cope with a complex and dynamic business environment, the attention of knowledge workers must always be focused on the most relevant changes in information.

Attention management in an organisational context refers to supporting knowledge workers in focusing their cognitive ability only on up-to-date information that is most relevant for their current business context (e.g. the business process they are involved in and the task they are currently resolving). In particular, support is required for searching, finding, selecting and presenting the most relevant and up-to-date information without distracting them from their activities. Information retrieval systems have provided means for delivering the right information at the right place and time. The main issue with existing systems is that they do not cope explicitly with information overload, i.e. it might happen that a knowledge worker “overlooks” important information. The goal of attention management systems is to avoid information overload and to provide proactive, effective recommendations for dealing with changed or new information [5]. Moreover, enterprise attention management is not just about receiving notifications proactively, but also about enabling relevant reaction to this information and to relevant changes in general. Our approach puts forward a comprehensive
reasoning framework that can trigger a knowledge base in order to find out the best way to react to a change. We base such a framework on a combination of ECA (event-condition-action) rules and ontologies. The paper is organized as follows: in the second section we analyze a motivating example and derive requirements for an Enterprise Attention Management System (EAMS). In the third section we outline the approach and implementation of the SAKE EAMS, encompassing various functionalities that address relevant attention-related issues. In the fourth section we present related work and compare it to our approach, while in the fifth section we summarise conclusions and future work.
2 Enterprise Attention Management Framework

In this section we give an indicative but generic motivating scenario from the public administration sector, elaborate requirements for supporting it and generalise them into an enterprise attention management framework.

2.1 Motivating Scenario

In order to keep local laws aligned with federal ones, the mayor of a municipality has to react to any changes and updates in the federal laws. In this process, a General Binding Regulation (GBR) is developed in the local administration by a group of experts. Together they discuss ways to adapt the federal law in the local administration. The outcome of this discussion is a draft GBR, which thereafter is open for public deliberation. Citizens can provide comments on the draft GBR, and the head of the legal department together with the experts assess the comments received. Next, the revised GBR is submitted to the local councillors for approval. If the GBR is approved, it is then signed by the mayor and published.

Changes in the federal law are announced in the federal governmental web portal; a system for alerting the mayor about updates is needed. Nevertheless, the mayor may not be interested in all possible changes, but rather only in some that are relevant for her/his municipality. Moreover, the mayor should be assisted in starting the public deliberation process (i.e. what additional information should be gathered, what the timeline for the deliberation is, etc.). In order to be supported in this decision-making process, the mayor could use process-specific information such as previous deliberation processes. Moreover, relevant information might be hidden in the relationships between some information artefacts, e.g. that there are not many experts who are familiar with the selected domain in this municipality. This information is needed to enable the mayor to react properly to the received alert.

2.2 Requirements

We summarise three basic requirements for an EAMS stemming from the aforementioned scenario:
1. Expressive modelling of information in order to enable focusing of attention on the right level of information abstraction as well as on conceptually similar information. For example, a public servant working in the urban planning department
should be alerted for new information about building refurbishing but the mayor should not.
2. Context-awareness in order to support a particular decision-making process. For example, a new law about sanitation should trigger different alerts in a GBR about pets and in other GBRs less related to sanitary conditions.
3. An expressive formalism for the description of user preferences, including when to alert a user, but also how to react to an alert.

2.3 Enterprise Attention Management Framework

Figure 1 presents an EAMS framework that generalises the aforementioned requirements.
Fig. 1. Enterprise Attention Management Framework
− Information represents all relevant artefacts that can be found in the available information repositories (knowledge sources). In the business environment of an organization, sources of information can be both internal and external to the organization. Moreover, information can be represented either formally (e.g. using information structuring languages such as XML) or informally. Finally, information may be stored in structured repositories such as databases that can be queried using formal languages or in unstructured repositories such as discussion forums. − Context defines the relevance of information for a user. Detection of context is related to the detection of the user’s attentional state that involves collecting information about users’ current focus of attention, their current goals, and some relevant aspects of users’ current environment. In order to form a complete picture of the user’s attentional state, both sensor-based (e.g., observing cues of users’ current
activity and of the environment) and non-sensor based (e.g., users explicitly state what they are working on) mechanisms for detecting user attention can be employed [9].
− Preferences enable filtering of relevant information according to its importance/relevance to the given user’s context. In other words, changes to resources are proactively broadcast to the users who may be interested in them, in order to keep them up to date with new information. Users may have different preferences both about the means by which they want to be notified and about the relevance of certain types of information in different contexts. User preferences can be defined with formal rules or more informally, e.g. by adding keywords to user profiles. Moreover, even when employing mechanisms capable of formalizing the users’ preferences, a certain level of uncertainty about users’ preferences will always remain. For this reason, dealing with uncertainty is an important aspect of attention management systems. Equally important is the way preferences can be derived: by explicitly specifying them or by machine learning techniques.
3 The SAKE Enterprise Attention Management System

The objective of this section is to describe the SAKE EAMS, which tries to address the problem of keeping corporate users’ attention always focused on their current job.

3.1 Attention Model

Figure 2 presents the conceptual model and technical architecture underlying the SAKE EAMS. The model assumes that interactions between users and external/internal information sources are logged; the same applies to the business process context in which user interactions take place. Some log entries can be defined as Events that cause Alerts, which are related to a user and a problem domain and associated with a priority level. Every Alert invokes Actions, which can be purely informative (i.e. an information push) or executable (e.g., execute a business process, start a new discussion forum). At the core of the SAKE approach are ECA (Event – Condition – Action) rules; their general form is:

ON event AND additional knowledge, IF condition THEN do something

Relevant events and actions are usually triggered by interactions taking place in organisational systems, such as the SAKE Content Management System (CMS) and the GroupWare System (GWS), or by external change detectors. The latter are implemented with the Change Notification System (CNS), a component that can be configured to monitor web pages, RSS feeds and files stored in file servers for any change or for specific changes specified by regular expressions (e.g. new web content containing the keyword “sanitary” but not “pets”).
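The ON–IF–THEN form above can be pictured with a small Python sketch. The event vocabulary and the sample condition are invented for illustration and are not the actual SAKE rule syntax.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ECARule:
        event_type: str                    # ON: which kind of event triggers the rule
        condition: Callable[[dict], bool]  # IF: predicate over the event (and context)
        action: Callable[[dict], None]     # THEN: the reaction to perform

    def dispatch(event: dict, rules: list) -> None:
        """Fire every rule whose event type matches and whose condition holds."""
        for rule in rules:
            if rule.event_type == event["type"] and rule.condition(event):
                rule.action(event)

    # Hypothetical rule for the motivating scenario: alert on new web content
    # that mentions "sanitary" but not "pets".
    rules = [ECARule(
        event_type="web_content_added",
        condition=lambda e: "sanitary" in e["text"] and "pets" not in e["text"],
        action=lambda e: print("Alert the mayor:", e["url"]),
    )]

    dispatch({"type": "web_content_added",
              "text": "new sanitary regulation announced",
              "url": "http://example.org/law"}, rules)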
Fig. 2. The SAKE Conceptual Model (left) and High-level Technical Architecture (right)
The log ontology models change events, following a generic and modular design approach so that it is re-usable in other domains as well. Figure 3 shows the four basic types of events: AccessEvent, AddEvent, ChangeEvent and RemoveEvent. The subclasses of AddEvent and RemoveEvent are further differentiated: AddToCMSEvent means the addition of a new resource to the CMS (either by uploading an existing document or by creating a new one). AddToParentEvent and RemoveFromParentEvent refer to the composition of a generic parent/child relationship. For example, the addition of a ForumMessage to a ForumThread, or of a ForumThread to a Forum, is logged using an AddToParentEvent.

The Information ontology contains the domain concepts and relations about which we want to express preferences, such as documents and forum messages. On the top level, the Information ontology separates physical types of information resources (e.g. HTML documents, PDF files, etc.) from conceptual types: some information resources are of an abstract nature, such as persons, while others physically exist in the SAKE system, such as CMS documents, GWS forums or e-mails.

Preferences are generated using the Preference Editor. Each preference is expressed as a logical rule, represented in SWRL (Semantic Web Rule Language, www.w3.org/Submission/SWRL/). Figure 4 illustrates a preference rule: if userA is in processZ, then userA has a preference of value 1.0 for documents created in 2006. Among the preferred values, preferences include the business context of the user, in order to support context-awareness of the whole system. The Preference Editor supports creation of preference rules by providing a GUI for step-wise, interactive rule development, as presented in Fig. 5. The user starts by defining a variable whose properties are further specified (including defining new variables) in several subsequent steps.
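Procedurally, the SWRL rule of Fig. 4 can be paraphrased as below; the dictionary keys and the value 0.0 for “no stated preference” are illustrative assumptions, not the ontology’s actual vocabulary.

    def preference_rule(user: dict, document: dict) -> float:
        """Fig. 4 paraphrased: if the user is in processZ, documents created
        in 2006 receive preference 1.0 (0.0 = no preference stated)."""
        if (user.get("current_process") == "processZ"
                and document.get("year_created") == 2006):
            return 1.0
        return 0.0

    userA = {"name": "userA", "current_process": "processZ"}
    doc = {"title": "draft GBR", "year_created": 2006}
    print(preference_rule(userA, doc))  # 1.0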
Fig. 3. Log ontology: a) Class hierarchy, b) Class diagram
Fig. 4. A Sample Preference Rule Expressed in SWRL
Fig. 5. Preference Editor: Step-wise, interactive rule development
Rules react on events upon their detection. The Reasoner then evaluates additional queries against the log to obtain further information, such as the business context of the user, and then evaluates the condition. The business context is derived using the Context Observer, a component that links to enterprise systems (such as workflows and ERPs) and extracts the current business process, activity or task the user is working on. Finally, the Reasoner executes the action part of rules.

3.2 Technical Implementation and Evaluation

The SAKE prototype is based on J2EE and Java Portlets following a three-tiered architecture. The presentation tier contains Portlets, JavaServer Pages (JSPs) and an auxiliary Servlet. Portlets call business methods on the Enterprise Java Beans (EJBs), pre-process the results and pass them to the JSP pages. The JSPs contain Hypertext Markup Language (HTML) fragments as well as placeholders for dynamic content (such as the results passed from the Portlets). The auxiliary Servlet is used for initializing the connection to the KAON2 ontology management system (http://kaon2.semanticweb.org/, part of the integration tier). The business tier consists mostly of EJBs, which provide the business logic and communicate with the components of the integration tier that comprise a third-party CMS component (Daisy) and GWS component (Coefficient) as well as the Preference Framework. The interface to these components is represented using EJBs which all use the Kaon2DAO in order to access the ontologies: the CMSBean and GWSBean enhance the CMS and GWS with semantic meta-data, and the AMSBean manages the preference rules. KAON2 stores the semantic meta-data for these entities with ontologies and provides the facilities for querying them using SPARQL (www.w3.org/TR/rdf-sparql-query/). The KAON2 reasoner is used for evaluating the user’s preference rules. The integration tier also contains a MySQL relational database, which stores CMS- and GWS-related content, such as forums, discussions, documents etc.

Since the development of the SAKE system has not been completed yet (mainly the integration of components is still pending), a comprehensive user-driven evaluation of the system as a whole is planned but has not been performed yet. Instead, we have performed an early evaluation of the main SAKE components independently. Evaluation has been performed in three case studies: two local administrations and one ministry. We validated the usability of these components and their relevance to the knowledge-intensive work of public servants. We collected useful comments for further improvement regarding the functionality and interface of the SAKE components. Early evaluation of the Preference Framework in particular has revealed a noticeable improvement in the relevance of system-generated notifications when user preferences are taken into account. In the future we plan to perform formal experiments to measure the degree of improvement of notifications. Moreover, as soon as the SAKE system is integrated, we plan to test the system’s ability to not only send relevant notifications to users but also execute relevant actions such as the initiation of a workflow process.

From a conceptual point of view, we have ensured that all components are based on a common ontological model for representing information resources and changes as well as other concepts not presented in this paper, such as context, roles and
preferences. From the technical point of view we ensured standards-based interoperability by using state-of-the-art Semantic Web technologies, such as SWRL and SPARQL.
4 Related Work

There has been considerable research on attention-aware systems that address the information overload problem (e.g., [9]). According to [7], attention-aware or attentive systems are software systems that support users’ information needs by keeping track of what users are writing, reading, looking at or talking about and suggesting information that might have a beneficial influence on them. SUITOR [7] is an attentive system comprising four modules: a) watching the user’s actions to infer the user’s current mental state and needs, b) storing the user’s actions to create and maintain a user model, c) searching information from the digital world and scanning the user’s local hard disk and d) ranking and suggesting relevant information sources through a peripheral display. Having enough input information from these modules, SUITOR can infer the user’s current interests and propose relevant information sources from local and remote databases that it has previously gathered and stored.

The Attentional User Interface project [6] developed methods for inferring attention from multiple streams of information, and for leveraging these inferences in decision making under uncertainty. The project focused on the design of interfaces that take into account visual attention, gestures and ambient sounds as clues about a user’s attention. These clues can be detected through cameras, accelerometers and microphones or other perceptual sensors and, along with the user’s calendar, current software interaction and data about the history of the user’s interests, they provide valuable information about the status of a user’s attention. The same project incorporated Bayesian models dealing with uncertainty and reasoning about the user’s current or future attention, taking all of the above clues as input.

In comparison to attention-aware systems, our system does not include sensor-based mechanisms for detecting the user’s environment. We argue that for enterprise attention management, non-sensory mechanisms provide a wealth of attentional cues such as users’ scheduled activities (e.g. using online calendars), users’ working context (e.g. by querying workflow or enterprise systems) and users’ communication and collaboration patterns (e.g. using groupware and other communication tools).

Our approach is more in line with related commercial systems, such as KnowNow Enterprise Syndication Server (KESS), NewsGator Enterprise Server (NES) and Attensa Feed Server (AFS). These systems leverage RSS technology to find and distribute relevant content to employees. KESS (http://www.knownow.com) persistently monitors both RSS and non-RSS enabled data sources, from either outside or inside the enterprise, for predefined criteria and automatically routes notifications of new and updated information to employees, partners and customers in various output formats, like enterprise portals, RSS readers, mobile devices or email. This server is capable of syndicating and aggregating, ranking, prioritizing and routing enterprise content. NES (http://www.newsgator.com) is a centrally managed and administered RSS aggregation platform providing access to RSS and Atom sources, search tools to find relevant feeds, and multiple user options for
feed reading, and enables users to collaborate and discuss important topics. Besides RSS feeds, NES can aggregate and deliver content from internal portals, enterprise applications and databases (CRM, ERP, HR), premium content providers such as Thomson, blogs, wikis and e-mail applications. AFS (http://www.attensa.com) is a content delivery and notification system with analytics and reporting capabilities that profile user behavior to predict and identify the most effective communication channels between users.

Context-based, proactive delivery of information refers to delivering information to users based on context, e.g. activities, organizational role, and work outputs. Maus [8] and Abecker et al. [1] developed workflow management systems that recommend relevant information to users based on the process context. The Watson system [4] provides users with related documents based on users’ job contents such as word processing and Web browsing. Ahn et al. [3] provide a knowledge context model, which facilitates the use of contextual information in virtual collaborative work. In [2], organizational context is modelled to provide awareness of other users’ activities in a shared workspace. In our work, context-based delivery of information is coupled with attention-aware delivery of information and is also used for triggering actions.
5 Conclusions

In this paper we presented a novel approach for managing attention in an enterprise context by realising the idea of a reactive system that not only alerts a user that something has changed, but also supports the user in reacting properly to that change. In a nutshell, the corresponding system is an ontology-based platform that logs changes in internal and external information sources, observes user context and evaluates user attentional preferences represented in the form of ECA rules. The system has been developed targeting the eGovernment domain. The evaluation process is still ongoing, but the first results are very promising. Future work will be directed toward further refinement of ECA rules for preference description and automatic learning of preferences from usage data using machine learning techniques.
References 1. Abecker, A., Bernardi, A., Hinkelmann, K., Kühn, O., Sintek, M.: Context-aware, proactive delivery of task-specific information: the Know-More Project. Information Systems Frontiers 2, 253–276 (2000) 2. Agostini, A., De Michelis, G., Grasso, M.A., Prinz, W., Syri, A.: Contexts, work processes and workspaces. In: Proceedings of the International Workshop on the Design of Cooperative Systems (COOP 1995), INRIA, Antibes, France, pp. 219–238 (1995) 3. Ahn, H.J., Lee, H.J., Cho, K., Park, S.J.: Utilizing knowledge context in virtual collaborative work. Decision Support Systems 39(4), 563–582 (2005) 4. Budzik, J., Hammond, K.: Watson: Anticipating and contextualizing information needs. In: Proceedings of the Sixty-Second Annual Meeting of the American Society for Information Science (1999)
5. Davenport, T., Beck, J.: The Attention Economy: Understanding the New Currency of Business. Harvard Business School Press (2001) 6. Horvitz, E., Kadie, C.M., Paek, T., Hovel, D.: Models of attention in computing and communication: From principles to applications. Communications of the ACM 46(3), 52–59 (2003) 7. Maglio, P.P., Campbell, C.S., Barrett, R., Selker, T.: An architecture for developing attentive information systems. In: Knowledge-Based Systems, vol. 14, pp. 103–110. Elsevier, Amsterdam (2001) 8. Maus, H.: Workflow context as a means for intelligent information support. In: Akman, V., Bouquet, P., Thomason, R.H., Young, R.A. (eds.) CONTEXT 2001. LNCS (LNAI), vol. 2116, pp. 261–274. Springer, Heidelberg (2001) 9. Roda, C., Thomas, J.: Attention Aware Systems: Theory, Application, and Research Agenda. Computers in Human Behaviour 22, 557–587 (2006)
Two Applications of Paraconsistent Logical Controller Jair Minoro Abe1,2, Kazumi Nakamatsu3, and Seiki Akama4 1
Graduate Program in Production Engineering, ICET - Paulista University R. Dr. Bacelar, 1212, CEP 04026-002 São Paulo – SP – Brazil 2 Institute For Advanced Studies – University of São Paulo, Brazil
[email protected] 3 School of Human Science and Environment/H.S.E. – University of Hyogo – Japan
[email protected] 4 C-Republic Inc., 1-20-1, Higashi-Yurigaoka, Asao-ku, Kawasaki-shi, 215-0012, Japan
[email protected] Abstract. In this paper we discuss two applications of the logical controller Paracontrol. Such a controller is based on Paraconsistent Annotated Logic and is capable of manipulating imprecise, inconsistent and paracomplete data. Keywords: Logical controller, paraconsistent logics, annotated logics, conflicts and automation, temperature sensors.
1 Introduction

A Paraconsistent Logical Controller based on Paraconsistent Annotated Evidential Logic Eτ was introduced in [5]. Such a controller was dubbed Paracontrol. In this paper we sketch two more applications: an electronic device to aid the locomotion of blind and/or deaf people, and an autonomous mobile robot based on two temperature sensors. The Paracontrol is the electric-electronic materialization of the Para-analyzer algorithm [5], which is basically an electronic circuit that treats logical signals in the context of the logic Eτ [1]. The atomic formulae of the logic Eτ are of the type p(μ, λ), where (μ, λ) ∈ [0, 1]2 and [0, 1] is the real unitary interval with the usual order relation and p denotes a propositional variable. There is an order relation defined on [0, 1]2: (μ1, λ1) ≤ (μ2, λ2) ⇔ μ1 ≤ μ2 and λ1 ≤ λ2. Such an ordered system constitutes a lattice that will be symbolized by τ. p(μ, λ) can be intuitively read: “It is believed that p’s favorable evidence is μ and contrary evidence is λ.” So, we have some interesting examples:
• p(1.0, 0.0) can be read as a true proposition.
• p(0.0, 1.0) can be read as a false proposition.
• p(1.0, 1.0) can be read as an inconsistent proposition.
• p(0.0, 0.0) can be read as a paracomplete proposition.
• p(0.5, 0.5) can be read as an indefinite proposition.
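The order relation on annotations can be captured directly in code; a minimal sketch:

    def leq(a, b):
        """Product order on the lattice tau:
        (mu1, la1) <= (mu2, la2) iff mu1 <= mu2 and la1 <= la2."""
        (mu1, la1), (mu2, la2) = a, b
        return mu1 <= mu2 and la1 <= la2

    TRUE, FALSE = (1.0, 0.0), (0.0, 1.0)
    INCONSISTENT, PARACOMPLETE = (1.0, 1.0), (0.0, 0.0)

    print(leq(PARACOMPLETE, TRUE))   # True: (0, 0) <= (1, 0)
    print(leq(TRUE, FALSE))          # False: true and false are incomparable
    print(leq(TRUE, INCONSISTENT))   # True: (1, 0) <= (1, 1)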
Note that the concept of paracompleteness is the “dual” of the concept of paraconsistency. The Paracontrol compares logical values and determines domains of a state lattice corresponding to output values. Favorable evidence and contrary evidence degrees are represented by voltages. Certainty and contradiction degrees are determined by analogues of operational amplifiers. The Paracontrol comprises both
analog and digital systems and it can be externally adjusted by applying positive and negative voltages. The Paracontrol was tested in real-life experiments with an autonomous mobile robot Emmy [6], whose favorable/contrary evidences coincide with the values of ultrasonic sensors and distances are represented by continuous values of voltage.
2 Paraconsistent Autonomous Mobile Robot Hephaestus1

Recently we tested the Paracontrol in another study, namely to verify its performance in robots with temperature sensors. The prototype studied was dubbed Hephaestus [8]. The robot uses a PIC 16f877A microcontroller, two pairs of LM35 temperature sensors (which can measure 0 °C to 100 °C, with an output voltage variation of 10 mV/°C) and two direct-current electric engines with speed reduction. Each 1 °C of temperature variation corresponds to a 10 mV variation of the signal voltage (temperatures ranging from 2 °C to 60 °C). The feed circuit provides the electric voltage necessary to guarantee the performance of the robot’s circuits; we employed 5 V and 12 V supplies.
Fig. 1. Electric diagram of the feed circuit
Information on the robot’s environment is sent to the microcontroller by a sensor circuit and processed according to the Paracontrol, resulting in a decision to be performed by the robot. Next, we describe the operation of the program located in the memory of the microcontroller. The control circuit is responsible for calculating the temperature where the robot is. It is also responsible for transforming this temperature into values of the favorable and contrary evidence degrees, for executing the Para-analyzer algorithm and for generating the signals that start the engines. Two engines supplied by a continuous electric voltage ranging from 0 to 12 V are responsible for the robot’s movements: going back, going forward, and speed control. Paracontrol is also responsible for determining which engine should operate, which way it should turn and which speed should be applied. A D/A circuit was built using an adder chain to convert the incoming digital signals into analog signals, a PWM circuit to help control the speed, and a bridge circuit to control which way the engines turn. Each engine has its own independent circuit, operated by the microcontroller outputs. The logical controller Paracontrol quantifies the values of the favorable evidence (µ) and contrary evidence (λ) degrees corresponding to the physical signal
1 The ancient Greek god of fire.
Fig. 2. Sensors circuit
value originated by the temperature sensors, resulting in an action. This analysis is made according to the output states lattice τ (see Fig. 3). Information about temperature is captured by two pairs of sensors; each pair gives the value of a favorable evidence degree and a contrary evidence degree: (μ1, λ1) and (µ2, λ2). The circuit for measuring and controlling the temperature was designed around a threshold of 50 °C, that is to say, the robot is determined to take an action if the temperature is equal to or greater than 50 °C. The possible actions are going forward, going backward or turning aside. The temperature sensors were placed on the extremities of the base of the robot, in such a manner that different temperatures in front of, behind, on the left side or on the right side of the robot can be identified, which will determine its decision. In this case, the different degrees of favorable or contrary evidence of the process can be submitted to the logic operations NOT, OR and AND in order to obtain resulting values (µR and λR). The logical operation applied depends on the process under control. For each pair of sensors, the logical operation OR (maximization of the terms) was used, so if one of the sensors fails, the measuring of the temperature is performed by the second sensor without loss of efficiency. Table 1 presents the converted values of the continuous voltage of the analog input signals.

Table 1. Values of the voltage of the converted analog input signals
Voltage signal (V)   Evidence degree   Binary signal   Hexadecimal signal
0.0                  0.0               00000000        00
0.3                  0.1               00011001        19
0.6                  0.2               00110011        33
0.9                  0.3               01001101        4D
1.2                  0.4               01100111        67
1.5                  0.5               10000001        81
1.8                  0.6               10011011        9B
2.1                  0.7               10110101        B5
2.4                  0.8               11001110        CE
2.7                  0.9               11011000        D8
3.0                  1.0               11111111        FF
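Reading Table 1 as a linear mapping, the conversion can be sketched as follows. The idealized 8-bit quantization below reproduces the table only approximately (e.g. it yields 80 rather than 81 at 1.5 V), so the exact codes of the prototype are assumed to come from its particular A/D chain.

    def voltage_to_evidence(v: float) -> float:
        """Map a 0-3 V analog sensor signal to an evidence degree in [0, 1]."""
        return max(0.0, min(1.0, v / 3.0))

    def pair_evidence(v1: float, v2: float) -> float:
        """OR (maximization) over a redundant sensor pair, so a single failed
        sensor does not lose the temperature measurement."""
        return max(voltage_to_evidence(v1), voltage_to_evidence(v2))

    def evidence_to_byte(mu: float) -> int:
        """Quantize an evidence degree to 8 bits (0x00..0xFF)."""
        return round(mu * 255)

    mu = pair_evidence(1.5, 0.0)   # one sensor reads 1.5 V, the other has failed
    print(mu, format(evidence_to_byte(mu), "08b"), format(evidence_to_byte(mu), "02X"))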
Fig. 3. Output states lattice – Extreme and Non-extreme states

Table 2. Symbology of the Extreme and Non-extreme states
Extreme states                         Symbol
True                                   V
False                                  F
Inconsistent                           T
Paracomplete                           ⊥

Non-extreme states                     Symbol
Quasi-true tending to Inconsistent     D
Quasi-true tending to Paracomplete     E
Quasi-false tending to Inconsistent    A
Quasi-false tending to Paracomplete    H
Quasi-inconsistent tending to True     C
Quasi-inconsistent tending to False    B
Quasi-paracomplete tending to True     F
Quasi-paracomplete tending to False    G
The decision states are defined by the control values. The analysis of the input parameters starts after the definition of the control limit values (C1, C2, C3 and C4). The parameters of the control values are defined as follows: C1 = upper certainty control value; C2 = lower certainty control value; C3 = upper contradiction control value; C4 = lower contradiction control value. In the prototype studied we worked with C1 = C2 = C3 = C4 = 0.75.

Table 3. Control limit values

C1         C2         C3         C4
11000001   00111111   11000001   00111111
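A minimal sketch of the decision part of the Para-analyzer follows, assuming the usual definitions from the annotated-logic literature of the certainty degree Gc = μ − λ and the contradiction degree Gct = μ + λ − 1 (the chapter does not spell these formulas out), with the lower control values taken as −0.75:

    def para_analyzer(mu, la, c1=0.75, c2=-0.75, c3=0.75, c4=-0.75):
        """Classify an annotation (mu, la) into a state of the output lattice.
        The Gc/Gct definitions are the customary ones for logic Etau and are
        an assumption here, as are the negative lower control values."""
        gc = mu - la          # certainty degree
        gct = mu + la - 1.0   # contradiction degree
        if gc >= c1:
            return "V"            # true
        if gc <= c2:
            return "F"            # false
        if gct >= c3:
            return "T"            # inconsistent
        if gct <= c4:
            return "bottom"       # paracomplete
        return "non-extreme"      # one of the eight border states of Table 2

    print(para_analyzer(1.0, 0.0))   # V
    print(para_analyzer(0.9, 0.9))   # T
    print(para_analyzer(0.5, 0.5))   # non-extreme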
In six tests, lasting from 13 to 60 seconds, the robot was able to avoid a heat source placed in front of it inside a box with an escape opening at a rate of 80%. In the remaining cases, the printed circuit board was heated up or the robot did not escape from the tunnel.
3 Keller – Electronic Device for Blind and/or Deaf People Locomotion

The Paracontrol was also used in the construction of an electronic device, which we dubbed Keller2, for helping blind and/or deaf people in their locomotion. The main components of Keller are a microcontroller of the 8051 family, two ultrasonic sensors, and two vibracalls. Figure 4 shows the basic structure of Keller. The ultrasonic sensors are responsible for verifying whether there is any obstacle in front of the person within the acting area of the sonars. The signals generated by the sensors are sent to the microcontroller. These signals are used to determine the favorable evidence degree μ and the contrary evidence degree λ regarding the proposition “There is no obstacle in front of the person.” The Paracontrol, recorded in the internal memory of the microcontroller, then uses these degrees to determine the person’s movements through signals delivered by the vibracalls, which are also driven by the microcontroller. The vibracalls can be placed wherever is comfortable for the user. In the prototype Keller, the two ultrasonic sensors were placed at the person’s chest (Fig. 4). The main motivation for studying this prototype lies in the very simple implementation of the Paracontrol. Keller is a promising device for aiding blind and/or deaf people in their locomotion. Some tests were made in real situations with promising results. However, we are aware that the problem as a whole requires improvement: more accurate sensors, a way to detect obstacles on the ground, and numerous other questions remain to be solved.
Fig. 4. Keller: position of the sensors
2 Keller is in homage to Helen Adams Keller.
4 Conclusions

The applications considered show the power and even the beauty of paraconsistent systems. They have provided ways other than the usual electronic theory, opening applications of non-classical logics in order to overcome situations not covered by other systems or techniques. All the topics treated in this paper are being improved, as are many other projects on a variety of themes. We hope to say more in forthcoming papers.
References 1. Abe, J.M.: Fundamentos da Lógica Anotada (Foundations of Annotated Logics), in Portuguese, Ph.D. Thesis, Universidade de São Paulo, São Paulo (1992) 2. Torres, C.R.: Sistema Inteligente Paraconsistente para Controle de Robôs Móveis Autônomos, MSc. Dissertation, Universidade Federal de Itajubá - UNIFEI, Itajubá (2004) 3. Abe, J.M.: Some Aspects of Paraconsistent Systems and Applications. Logique et Analyse 157, 83–96 (1997) 4. Abe, J.M., da Silva Filho, J.I.: Manipulating Conflicts and Uncertainties in Robotics. Multiple-Valued Logic and Soft Computing 9, 147–169 (2003) 5. Da Silva Filho, J.I.: Métodos de Aplicações da Lógica Paraconsistente Anotada de Anotação com Dois Valores LPA2v com Construção de Algoritmo e Implementação de Circuitos Eletrônicos, in Portuguese, Ph.D. Thesis, Universidade de São Paulo, São Paulo (1999) 6. Da Silva Filho, J.I., Abe, J.M.: Emmy: a paraconsistent autonomous mobile robot. In: Abe, J.M., Da Silva Filho, J.I. (eds.) Logic, Artificial Intelligence, and Robotics, Proc. 2nd Congress of Logic Applied to Technology – LAPTEC 2001. Frontiers in Artificial Intelligence and Its Applications, vol. 71, pp. 53–61, 287. IOS Press, Amsterdam (2001) [14] Elfes, A.: Sonar based real-world mapping and navigation. IEEE Journal of Robotics and Automation (1987) 7. McKerrow, P.: Introduction to Robotics. Addison-Wesley Publishing Company, New York (1992) 8. Berto, M.F.: Aplicação da Lógica Paraconsistente Anotada Evidencial Eτ no Controle de Sensores de Temperatura na Atuação de Robôs Móveis, MSc. Dissertation, in Portuguese, Paulista University, São Paulo (2007)
Encoding Modalities into Extended Petri Net for Analyzing Discrete Event Business Process Takashi Hattori1,2 , Hiroshi Kawakami1, Osamu Katai1 , and Takayuki Shiose1 1
2
Graduate School of Informatics, Kyoto University Sakyo-ku, Kyoto, 606-8501, Japan {kawakami,katai,shiose}@i.kyoto-u.ac.jp http://www.symlab.sys.i.kyoto-u.ac.jp/ NTT Communication Science Laboratories 2-4, Hikaridai, Seika-cho, “Keihanna Science City”, Kyoto, 619-0237, Japan takashi
[email protected]
Abstract. This paper proposes a method for encoding discrete systems together with their tasks into an extended Petri net. The main feature of this method is the interpretation of tasks, which are first described by modal logic formulae, into what we call “task unit graphs” that can be naturally combined with a Petri net. The combination of a normal Petri net and task unit graphs provides a method for detecting conflicts among tasks. Furthermore, we examine a way for resolving such conflicts and attaining the correct behavior of systems.
1 Introduction

To cope with the recent need for constructing sophisticated reliable systems that are governed in a discrete manner, such as IT process models, this paper proposes a theoretical framework for analyzing discrete event-driven systems. Our framework employs a representational method based on a Petri net [1] and a combination of two kinds of modal logics [2]: “temporal logic [3]” and “deontic logic [4, 5].”

A Petri net is known as a conventional method for modeling the physical structure and event-driven behavior of systems, with rich mathematical analysis [6]. Recent trends in IT business have revitalized Petri nets as one of the frameworks of process modeling, together with conventional Peterson’s ERM (Entity-Relationship Model), UML (Unified Modeling Language) and so on. For example, the research on translating UML into a Petri net [7] can be seen as work to enhance a Petri net’s visibility. The long history of research on Petri nets has clarified not only their ability but also certain problems, such as their visibility and their suitability for modular techniques [8]. To overcome these problems, this paper employs the strategy, as shown in Fig. 1, of dividing a target system into two portions. The portion that is pre-fixed by physical and structural constraints is represented by a simple Petri net with high visibility, and the portion that is conducted through control and systemic constraints is represented by modal logic formulae. One of the
Fig. 1. Overview of Proposed Framework
promising methods for dealing with the latter portion is to follow the rigid syntax, semantics and proof theory of the combination of Tense Logic with Deontic Logic [9]. Our strategy, on the other hand, translates modal logic formulae into what we call “task unit graphs” that mediate the integration of the two portions into an extended Petri net. In these unit graphs, the deontic part of the modal logic formulae is substantiated as “imperative” arcs between the unit graphs, while the temporal part is represented by the structure of the unit graphs. Among several works combining Temporal Logic and Petri nets, such as [10, 11], the main feature of our framework is that each logic formula is directly interpreted into a network representation. Some part of the latter portion may be directly translatable into a Petri net, but we do not adopt such a strategy because it remains to be determined whether it can be embedded into a Petri net. In addition, keeping the two portions of the system separate is better for the system’s modularity than encoding different aspects of the system into a single complicated network.

The rest of this paper consists of the following sections. Section 2 gives an overview of a way to model systems using Petri nets and modal logic formulae. The formulae are then translated into “task unit graphs,” which are defined through a systematic analysis of logical definitions in Section 3. Section 4 describes how this representational framework enables us to elucidate potential “conflicts” among tasks.
2 Modeling Systems by Petri Net and Modal Logic

This section proposes a method for modeling discrete event systems based on a kind of Petri net and modal logic. To illustrate the procedure of system modeling, this paper employs the following example of a simple ID authentication system.

Example System: Generally, ID authentication systems require users to submit a set of an ID and a password. The ID authentication system we employ here is unique in that it accepts not only an ID-password set but also a series of IDs and passwords, such as “an ID and a password submitted alternately twice” or “two ID submissions followed by a password submission.” Such a series may be called a “pass-action.”
Fig. 2. Example of Target System
Unlike ordinary systems, this new system cannot store the ID and password in individual memories. Instead, it has to accept both ID and password in any order, and also has to hand over the sequence of submissions to the authentication engine. Figure 2 shows an example of this system that allows the length of pass-actions to be two at most.

2.1 Petri Net Representation of System’s Pre-fixed Aspect
The first step of our framework is to encode the pre-fixed portion of a system into a Petri net. It is known that a k-bounded standard Petri net can be translated into an equivalent 1-bounded Petri net. We employ a 1-bounded one called the condition/event system (C/E system) [12] where a transition can fire only if all “its output places” are empty. For instance, the example shown in Fig. 2 is the case where the upper-bound of “the number of stored passwords or IDs” is two (2-bounded). Thus, it can be modeled as a C/E system as shown in Fig. 3, where a token in place Pi means that the process Pi is in progress. Places P1 /P6 correspond to the idling state of users/“an authentication engine” respectively. A token in P4 means that data are stored in cache P4, and transition τ5 means the “data transfer from P4 to the memory P5” and “flushing P4.”
Fig. 3. Example of Petri Net
By making each place of a C/E system correspond to a proposition, we can represent the true/false value of the proposition by putting/removing a token in/from the place. In this case, each transition leads to a value alteration of the proposition. For instance, in Fig. 3, the firing of τ6 leads P4, P5, P6 to turn from true to false, and P7 to turn from false to true.
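The C/E firing rule can be stated compactly in code. The following sketch encodes our reading of the arc structure of Fig. 3 (the exact arcs of τ1–τ4 and τ7 are reconstructed from the surrounding text and may differ from the figure in detail):

    # Transitions as (input places, output places). tau5 transfers data from
    # cache P4 to memory P5 while flushing P4; tau6 starts authentication.
    NET = {
        "t1": ({"P1"}, {"P2"}), "t2": ({"P2"}, {"P1", "P4"}),
        "t3": ({"P1"}, {"P3"}), "t4": ({"P3"}, {"P1", "P4"}),
        "t5": ({"P4"}, {"P5"}),
        "t6": ({"P4", "P5", "P6"}, {"P7"}), "t7": ({"P7"}, {"P6"}),
    }

    def enabled(marking: set, t: str) -> bool:
        """C/E rule: all input places marked AND all output places empty."""
        ins, outs = NET[t]
        return ins <= marking and not (outs & marking)

    def fire(marking: set, t: str) -> set:
        assert enabled(marking, t)
        ins, outs = NET[t]
        return (marking - ins) | outs

    m = {"P1", "P6"}                            # initial state S0
    print([t for t in NET if enabled(m, t)])    # ['t1', 't3']
    print(fire(m, "t1"))                        # {'P2', 'P6'}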
2.2 Modal Logic Representation of Tasks
The second step of our framework is to represent each task that states “when the focused state is required to be true” as a proposition by introducing modal logics such as temporal and deontic logics.
Temporal Modalities. A temporal logic is given by the propositional logic, modal operators, and an axiom system. This paper employs the following modal operators:

T A: A will be true at the next state S1,
GA: A will be true from now on S0, S1, S2, · · ·,
F A: A will be true at some time in the future Sj (j ≥ 0),
AUB: B is true at the current state S0 or A will be true from now on until the first moment when B will be the case,
where A, B denote logic formulae, and S0 and Si (i > 0) mean current and future states (worlds) respectively. Axiom systems of temporal logic vary depending on the viewpoint of time. This paper employs one of the discrete and linear axiom systems, KSU [13], which is an extension of the minimal axiom system Kt [3]. Introducing Y (yesterday) as the mirror image of T (tomorrow), the axiom system claims that T¬A ≡ ¬TA, Y¬A ≡ ¬YA, and TYA ≡ YTA ≡ A. Introducing S (since) as the mirror image of U (until), GA ≡ AU⊥, where ⊥ denotes the contradiction, and FA ≡ ¬G¬A, Kt is rewritten as

AUB ≡ B ∨ (A ∧ T(AUB)),   (1)
ASB ≡ B ∨ (A ∧ Y(ASB)),   (2)
{(A ⊃ T(A ∨ B))UC} ⊃ {A ⊃ AU(B ∨ C)},   (3)
{(A ⊃ Y(A ∨ B))SC} ⊃ {A ⊃ AS(B ∨ C)},   (4)
where A, B and C denote logic formulae.

Deontic Modalities. Understanding the system’s behavior by temporal logic gives an “objective” view of the focused proposition. To represent our “subjective” intention or purpose, such as how the propositions should behave, i.e., the control rule (or task), we introduce deontic modalities:

OA: A is obligatory, PA: A is permitted.

The axiom system we adopt here for O and P is that of SDL (standard deontic logic), which defines OA ≡ ¬P¬A, and claims OA ⊃ PA, and O(A ⊃ B) ⊃ (OA ⊃ OB). Control rules and specifications of systems can be translated into combinations of temporal and deontic modes by using “translation templates” such as

OFA: A has to be true in the future, PGA: A can always be true.

Translating Tasks into Tentative Obligations. Assume that the provider of the authentication system shown in Fig. 2 must adopt some privacy policies as control rules in order to guarantee the system’s security, such as:

PP1: submission and authentication must not be done simultaneously;
PP2: password submission is always followed by an ID submission;
PP3: submission is accepted only when the cache is empty;
PP4: the system should request a password submission randomly.

Each task is translated into modal logic formulae, which are activated by the firing of the corresponding transition, as follows.

PP1: This task is easily encoded into the C/E system, but in order to maintain high modularity, we translate it into logic formulae first. This task can be broken down into relatively logical sentences, i.e., after an authentication begins (τ6), input methods have to wait (P1) until the authentication finishes (P6), i.e., τ6 activates O(P1UP6). Also, after an input process begins (τ1 or τ3), the authentication has to wait (P6) until the input process finishes (P1), i.e., each of τ1 and τ3 activates O(P6UP1).

PP2: This task means that after a password submission (τ4), passwords are not accepted (¬P3) until an ID input method works (P2), i.e., τ4 activates O(¬P3UP2).

PP3: This task means that after some data is stored in the cache (τ2 or τ4), input processes should wait (P1) until the cache is flushed (¬P4), i.e., each of τ2 and τ4 activates O(P1U(¬P4)).

PP4: This task means that after an ID process starts (τ1), the password input process (P3) has to work in the future, i.e., τ1 activates OFP3.

Translating Tasks into Resident Obligations: Not every task corresponds to a specific transition. Some tasks are translated into logical forms that are not activated by a transition but are always activated. For example, a rule

PP0: Once the authentication process (P7) is finished, a password (P3) should not be accepted until the cache memory (P4) flushes its contents to the memory P5,

is resident and translated into OG({(¬P3)U(¬P4 ∧ P5)}S(¬P7)).
3 Network Representation of Tasks

The third step of our framework is to translate each task represented by modal logic into an extended Petri net, which we call a “task unit graph,” by introducing the four types of special arcs shown in Fig. 4. These arcs are placed from a place to a transition. They function whenever the place holds a token and the transition satisfies its firing condition, but they differ from the ordinary arcs of the standard Petri net on the following points. First, they do not transfer tokens from places to transitions. Next, if there are multiple special arcs from the same place, all of them are activated simultaneously. As a result, simultaneous firing of multiple transitions is permitted at the same state.
Fig. 4. Special arcs for control of transition firing: (a) prohibit firing; (b) force firing at the next step; (c) force firing in the future; (d) force synchronized firing
Fig. 5. Examples of Task Unit Graph: (a) OTA, (b) OGA, (c) OFA, (d) O(AUB)
Figure 5 shows task unit graphs, which are derived by a systematic analysis of the logical representations of tasks.

OTA: A has to be true at the next state S1. If the current state S0 holds A, OTA forbids the transition from A to ¬A. Otherwise, OTA forces the transition from ¬A to A. In each case, OTA is accomplished at S1.

OGA: A has to be true from now on. If S0 holds A, the transition from A to ¬A is forbidden. Otherwise, OGA cannot be true.

OFA: A has to be true at some time in the future. This task forces the transition from ¬A to A in the future. If A becomes true, OFA is accomplished. So an arc of “compulsion of firing at the next step” is placed from A to the transition transferring the token to the place free at the next state.

O(AUB): A has to be true from now on until B will be the case. If B is the case at S0, this task is accomplished; else if ¬A ∧ ¬B at S0, this task cannot be accepted due to Eq. (1). If A ∧ ¬B at S0, A should be maintained and O(AUB) also has to be the case at S1. In each case, O(AUB) at S0 prohibits the alteration from A to ¬A, so an arc of “prohibition of firing” is placed from the place of O(AUB) to the “transition of the alteration from A to ¬A.”

These general definitions of “task unit graph” are instantiated to specific tasks. Figure 6 shows an extended Petri net where the general task unit graphs are instantiated to tasks PP0, · · ·, PP4. They include transitions that are unified with transitions of the Petri net representation of the pre-fixed portion of the target system in the middle part of Fig. 6. The unifications are indicated by synchronized firing linkages (Fig. 4 (d)). It should be noted that, defining Q ≡ ¬P4 ∧ P5 and H ≡ ¬P3UQ, PP0 can be represented as OG(HS(¬P7)), which yields the hierarchical structure of the extended Petri net representation shown in the upper part of Fig. 6.
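The life cycle of the O(AUB) unit follows Eq. (1) directly; a sketch of the bookkeeping, with the outcomes named after the text:

    def step_O_until(a_holds: bool, b_holds: bool) -> str:
        """One step of an O(AUB) task unit, following Eq. (1):
        AUB = B or (A and T(AUB))."""
        if b_holds:
            return "accomplished"   # the token moves to the 'free' place
        if not a_holds:
            return "violated"       # neither A nor B: the task cannot hold
        return "pending"            # keep A and carry O(AUB) over to S1;
                                    # any firing turning A false is prohibited

    # PP1 after tau6: O(P1 U P6), i.e. input must idle until authentication ends.
    print(step_O_until(a_holds=True, b_holds=False))   # pending
    print(step_O_until(a_holds=True, b_holds=True))    # accomplished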
Fig. 6. Hierarchical Extended Petri net Representation of the System with Tasks
Fig. 7. Task Unit Graph for S and that for ∧ (conjunction)
In the figure, task unit graphs representing S and ∧ (conjunction) are employed. Their general types are defined as shown in Fig. 7.
4 Detecting Conflict among Systems’ Behavior

Typical conflicts are observed among tasks. Figure 8 shows the state transitions of the target system with the initial state S0, which holds P1 and P6 and is in charge of tasks PP0, · · ·, PP4. Table 1 shows the markings of each
Fig. 8. State Transition Diagram
Table 1. State Transition Table: markings of the places P1–P7 and of the task places P6UP1, FP3, P1U(¬P4), (¬P3)UP2, Q ≡ ¬P4 ∧ P5, H ≡ (¬P3)UQ, HS(¬P7) and G(HS(¬P7)) at the states S0 through S7
state where “◦” means a normal token and “•” means an active token, which constrains other tokens. We have already clarified the criteria for governing this kind of state transition [14], but the criteria indicate only which state involves task conflicts. The framework proposed in this paper detects how the state falls into conflicts by tracing synchronized firing linkages (broken lines in Fig. 6) as mutual interferences among tasks. For example, Fig. 9 shows conflicts in state S5, where place O(P1 U(¬P4 )) has a token, which prohibits firing of τ1 and τ3 . On the other hand, the token in place OF P3 requests firing of τ3 . Therefore, there is a conflict of firing τ3 . The only way to resolve this conflict is turning P4 to ¬P4 , which leads the token in P1 U(¬P4 ) to f ree. But the establishment of ¬H in this state prohibits turning P4 to ¬P4 by tracing synchronized firing linkages from OG(HS(¬P7 )). As a result, this conflict cannot be resolved unless ¬H turns to H.
Encoding Modalities into Extended Petri Net
263
HS( P7) (HS( P7))
OG(HS( P7)) HS( P7)
H
P7
P7 P7
H
H
H
P7
(HS( P7)) Target System
τ1
τ3 free P3 OFP3
P3
P3
P2
τ2
P6
τ5 τ6 P7
P1
τ4
or P4
τ7
P5
P1 free P1 O(P1U P4)
P4 P4
Fig. 9. Conflict Detection via Extended Petri Net
5 Conclusions In this paper, we have proposed a novel modeling framework of systems that are governed in a discrete manner. In the framework, tasks that conduct systems’ behavior are represented by task unit graphs that can be unified with the extended Petri net. If there are conflicts among multiple tasks, they can be visually detected using the network representation. Our model also opens the way to discussing the “level of correctness” of the target system. Due to the design of the system, a system may be consistent under any series of operations and need no governance. Or it may need adequate control in order to keep itself consistent. There also may be systems that cannot be kept consistent despite all possible governance. We call these systems “strongly correct,” “weakly correct,” and “incorrect” respectively. We expect that the adequate governance needed for the consistency of weakly correct systems can be derived through analysis of the proposed modeling method. Such a discussion of correct behavior can be extended to multi-agent systems by assuming that agents are in charge of governing their own tasks. Although conflicts mainly occur among tasks, multiple agents are another major cause of conflicts. Agents sometimes interfere with each other’s field of work. Our model has the potential for discussing such conflicts [14]. Also, solutions of such conflicts can be derived through an analysis of the extended Petri net.
References 1. Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs (1981) 2. Hughes, G.H., Cresswell, M.J.: An Introduction of Modal Logic. Methuen (1968)
264
T. Hattori et al.
3. Rescher, N., Urquhart, A.: Temporal Logic. Springer, Heidelberg (1971) 4. von Wright., G.: Deontic logic. Mind 60, 1–15 (1951) 5. Goble, L.F.: Gentzen systems for modal logic. Notre Dame J. of Formal Logic 15(3), 455–461 (1974) 6. Karatkevich, A.: Dynamic Analysis of Petri Net-Based Discrete systems. LINCIS, vol. 356. Springer, Heidelberg (2007) 7. Hu, Z., Shatz, S.M.: Mapping UML diagrams to a Petri net notation for system simulation. In: Proc. of the 16th Int. Conf. on Software Engineering and Knowledge Engineering (SEKE), pp. 213–219 (2004) 8. Saldhana, J., Shatz, S.M.: UML diagrams to object Petri net models: An approach for modeling and analysis. In: Proc. of the Int. Conf. on Software Engineering and Knowledge Engineering (SEKE), pp. 103–110 (2000) 9. ˚ Aqvist, L.: Combinations of tense and deontic modality. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 3–28. Springer, Heidelberg (2004) 10. Penczek, W., et al.: Advances in Verification of Time Petri Nets and Timed Automata. Springer, Heidelberg (2006) 11. Okugawa, S.: Introduction to Petri Nets (in Japanese). Kyoritsu Shuppan Co., Ltd (1995) 12. Reisig, W.: Petri Nets. Springer, Heidelberg (1982) 13. Katai, O., Iwai, S.: A design method for concurrent systems based on step diagram and tense logic under incompletely specified design criteria (in Japanese). Systems, Control and Information 27(6), 31–40 (1983) 14. Katai, O., et al.: Decentralized control of discrete event systems based on extended higher order Petri nets. In: Proc. of the Asian Control Conference, pp. 897–900 (1994)
Paraconsistent Before-After Relation Reasoning Based on EVALPSN Kazumi Nakamatsu1 , Jair Minoro Abe2 , and Seiki Akama3 1
2
3
University of Hyogo, Himeji, Japan
[email protected] Paulista University, Sao Paulo, Brazil
[email protected] C-republic, Kawasaki, Japan
[email protected]
Abstract. A paraconsistent annotated logic program called EVALPSN by Nakamatsu has been applied to deal with real-time safety verification and control such as pipeline process safety verification and control. In this paper, we introduce a new interpretation for EVALPSN to dynamically deal with before-after relations between two processes (time intervals) in a paraconsistent way, which is named bf-EVALPSN. We show a simple example of an EVALPSN based reasoning system that can reason before-after relations in real-time. Keywords: annotated logic program, EVALPSN, bf-EVALPSN, before-after relation, paraconsistent reasoning system.
1 Introduction We have already developed a paraconsistent annotated logic program called Extended Vector Annotated Logic Program with Strong Negation(abbr. EVALPSN), which has been applied to various kinds of process safety verification and control such as pipeline process control [5, 6, 7]. However, the EVALPSN based process control is for each process itself, not for process order. However, we have many systems in which process order control based on its safety verification is strongly required such as chemical plants. In this paper, we introduce a newly interpreted EVALPSN named bf(beforeafter)-EVALPSN and a paraconsistent reasoning system based on bf-EVALPSN that can deal with before-after relations between processes dynamically. Meaningful process before-after relations are classified into 15 kinds according to the before-after relation of the start/finish times of two processes, and they are paraconsistently represented in vector annotations whose components designate before/after degrees. The vector annotation (m, n) to represent beforeafter relations can be dynamically determined according to the order of process start/finish times of two processes. We also show how the paraconsistent reasoning system based on bf-EVALPSN can deal with process order correctly in real-time with a simple example. G.A. Tsihrintzis et al. (Eds.): New Direct. in Intel. Interac. Multimedia, SCI 142, pp. 265–274, 2008. c Springer-Verlag Berlin Heidelberg 2008 springerlink.com
266
K. Nakamatsu, J.M. Abe, and S. Akama (2, 2) q P P PP ∗1 @ (2, 1) ∗3 P (1, 2) q P q @ PP α P @(1, 1) @ P @q ∗2 @q (2, 0) (0, 2) q @ @ @q γP @q β PP (0, 1) @ (1, 0) P P @q ⊥ (0, 0)
Fig. 1. Lattice Tv (2) and Lattice Td
This paper is organized in the following manner: firstly, EVALPSN is reviewed briefly; subsequently bf-EVALPSN is introduced with a paraconsistent beforeafter relation reasoning system; lastly, we conclude the features of bf-EVALPSN with a comparison of bf-EVALPSN and Allen’s interval temporal logic [1, 2].
2 EVALPSN We review EVALPSN briefly[6]. Generally, a truth value called an annotation is explicitly attached to each literal in annotated logic programs [3]. For example, let p be a literal, μ an annotation, then p : μ is called an annotated literal. The set of annotations constitutes a complete lattice. An annotation in EVALPSN has a form of [(i, j), μ] called an extended vector annotation. The first component (i, j) is called a vector annotation and the set of vector annotations constitutes the complete lattice, Tv (n) = { (x, y)|0 ≤ x ≤ n, 0 ≤ y ≤ n, x, y and n are integers } in Fig.1. The ordering(v ) of Tv (n) is defined as : let (x1 , y1 ), (x2 , y2 ) ∈ Tv (n), (x1 , y1 ) v (x2 , y2 ) iff x1 ≤ x2 and y1 ≤ y2 . For each extended vector annotated literal p : [(i, j), μ], the integer i denotes the amount of positive information to support the literal p and the integer j denotes that of negative one. The second component μ is an index of fact and deontic notions such as obligation, and the set of the second components constitutes the complete lattice, Td = {⊥, α, β, γ, ∗1 , ∗2 , ∗3 , }. The ordering(d) of Td is described by the Hasse’s diagram in Fig.1. The intuitive meaning of each member of Td is ⊥ (unknown), α (fact), β (obligation), γ (non-obligation), ∗1 (fact and obligation), ∗2 (obligation and non-obligation), ∗3 (fact and non-obligation), and (inconsistency). Then the complete lattice Te (n) of extended vector annotations is defined as the product Tv (n) × Td . The ordering(e ) of Te (n) is defined as : let [(i1 , j1 ), μ1 ] and [(i2 , j2 ), μ2 ] ∈ Te , [(i1 , j1 ), μ1 ] e [(i2 , j2 ), μ2 ] iff
(i1 , j1 ) v (i2 , j2 )
and
μ1 d μ2 .
Paraconsistent Before-After Relation Reasoning Based on EVALPSN
267
There are two kinds of epistemic negation (¬1 and ¬2 ) in EVALPSN, which are defined as mappings over Tv (n) and Td , respectively. Definition 1. (epistemic negations ¬1 and ¬2 in EVALPSN) ¬1 ([(i, j), μ]) = [(j, i), μ], ∀μ ∈ Td , ¬2 ([(i, j), ⊥]) = [(i, j), ⊥], ¬2 ([(i, j), α]) = [(i, j), α], ¬2 ([(i, j), β]) = [(i, j), γ], ¬2 ([(i, j), γ]) = [(i, j), β], ¬2 ([(i, j), ∗1 ]) = [(i, j), ∗3 ], ¬2 ([(i, j), ∗2 ]) = [(i, j), ∗2 ], ¬2 ([(i, j), ∗3 ]) = [(i, j), ∗1 ],
¬2 ([(i, j), ]) = [(i, j), ].
If we regard the epistemic negations as syntactical operations, the epistemic negations followed by literals can be eliminated by the syntactical operations. For example, ¬1 p : [(2, 0), α] = p : [(0, 2), α] and ¬2 q : [(1, 0), β] = p : [(1, 0), γ]. There is another negation called strong negation (∼) in EVALPSN, and it is treated as classical negation. Definition 2. (strong negation ∼) [4] Let F be any formula and ¬ be ¬1 or ¬2 . ∼ F =def F → ((F → F ) ∧ ¬(F → F )). Definition 3. (well extended vector annotated literal) Let p be a literal. p : [(i, 0), μ] and p : [(0, j), μ] are called weva(well extended vector annotated)-literals, where i, j ∈ {1, 2, · · · , n}, and μ ∈ { α, β, γ }. Defintion 4. (EVALPSN) If L0 , · · · , Ln are weva-literals, L1 ∧ · · · ∧ Li ∧ ∼ Li+1 ∧ · · · ∧ ∼ Ln → L0 is called an EVALPSN clause. An EVALPSN is a finite set of EVALPSN clauses. Fact and deontic notions, “obligation”, “forbiddance” and “permission” are represented by extended vector annotations, [(m, 0), α], [(m, 0), β], [(0, m), β], and [(0, m), γ], respectively, where m is a positive integer.
3 Before-After Relation in EVALPSN First of all, we introduce a special literal R(pi, pj, t) whose vector annotation represents the before-after relation between processes P ri (pi) and P rj (pj) at time t, and the literal R(pi, pj, t) is called a bf-literal.1 Definition 5. (bf-EVALPSN) An extended vector annotated literal R(pi , pj , t) : [μ1 , μ2 ] is called a bf-EVALP literal, where μ1 is a vector annotation and μ2 ∈ 1
Hereafter, the term “before-after” is abbreviated as just “bf” in this paper.
268
K. Nakamatsu, J.M. Abe, and S. Akama
{α, β, γ}. If an EVALPSN clause contains bf-EVALP literals, it is called a bfEVALPSN clause or just a bf-EVALP clause if it contains no strong negation. A bf-EVALPSN is a finite set of bf-EVALPSN clauses. We define vector annotations to represent bf-relations, which are called bfannotations. Strictly speaking, bf-relations are classified into meaningful 15 kinds according to the order of process start/finish times. Suppose that there are two processes, P ri with its start time xs and finish time xf , and P rj with its start time ys and finish time yf . Then we have the following 15 kinds of bf-annotations. Before (be)/After (af) firstly, we define basic bf-relations before/after according to the bf-relation between each start time of two processes, which are represented by the bfannotations be/af, respectively. If one process has started before/after another, then the bf-relations are defined as just ‘before(be)/after(af)’, respectively. The bf-relations also are described in Fig.2 with the condition that process P ri has started before process P rj starts. The order of their start/finish times is denoted by the inequality {xs < ys }.2 xs
P ri ys
xs
P rj
Fig. 2. Bf-relations, Before/After
P ri
xf -
ys
P rj
yf -
and Disjoint Before/After
Disjoint Before (db)/After (da) bf-relations disjoint before(db)/after(da) are described in Fig.2. Immediate Before (mb)/After (ma) bf-relations immediate before(mb)/after(ma) are described in Fig.3.
P ri xs
ys xf
P rj -yf
xs
Fig. 3. Bf-relations, Immediate Before/After
P ri ys
xf -
P rj
yf -
and Joint Before/After
Joint Before (jb)/After (ja) bf-relations joint before(jb)/after(ja) are are described in Fig.3. S-included Before (sb)/After (sa) bf-relations s-included before(sb)/after(sa) are described in Fig.4. Included Before (ib)/After (ia) bf-relations included before(ib)/after(ia) are described in Fig.4. F-included Before (fb)/After (fa) bf-relations f-include before(fb)/after(fa) are described in Fig.5. 2
If time t1 is earlier than time t2 , we conveniently denote their relation by the inequality t1 < t2 in this paper.
Paraconsistent Before-After Relation Reasoning Based on EVALPSN xs ys
xs
xf -
P ri P rj
-
ys
yf
Fig. 4. Bf-relations, S-included Before/After xs
P ri
ys
P rj
-
yf
Fig. 5. Bf-relations, F-included Before/After
xf -
P rj - yf
and Included Before/After xs
xf -
P ri
269
ys
P ri
xf -
P rj
-
yf
and Paraconsistent Before-after
Paraconsistent Before-after (pba) the bf-relation paraconsistent before-after(pba) is described in Fig.5. If we consider the before-after measure over the 15 bf-annotations, obviously there exists a partial order(
μ1 ∈ { db, mb, jb, sb, ib, fb, pba }, μ2 ∈ { da, ma, ja, sa, ia, fa, pba },
db ≡v mb ≡v jb ≡v sb ≡v ib ≡v fb ≡v pba ≡v fa ≡v ia ≡v sa ≡v ja ≡v ma ≡v da, and be ≡v af. If we regard the before-after measure as the horizontal one and the before-after knowledge amount as the vertical one, we obtain the complete bi-lattice, Tv (12)bf = { ⊥12 (0, 0), · · · , be(0, 8), · · · , db(0, 12), · · · , mb(1, 11), · · · , jb(2, 10), · · · , sb(3, 9), · · · , ib(4, 8), · · · , fb(5, 7), · · · , be(8, 0), · · · , pba(6, 6), · · · , af(8, 0), · · · , fa(7, 5), · · · , ia(8, 4), · · · , sa(9, 3), · · · , ja(10, 2), · · · , ma(11, 1), · · · , da(12, 0), · · · , 12 (12, 12) }.
270
K. Nakamatsu, J.M. Abe, and S. Akama 12
knowledge 6
• @
• @ @• @
• @ @• @
• @ @• @ @• @
• @ @• @ @• @
• @ @• @ @• @ @• @
• @ @• @ @• @ @• @
• @
@• @ @• @ @• @ @• @
@• @ @• @ @• @ @• @
@• @ @• @ @• @ @• @
@• @ @• @ @• @
@• @ @• @ @• @
@• @ @• @
@• @
@• @
@• @ @• @ @• @• @• @• @• @• @• @• @• @• • @ @ @ @ @ @ @ @ @ @ @ @• @• @• @• @• @• @• @• @• @• @• • @ @ @ @ @ @ @ @ @ @ @ @ db •@ mb@•@ jb@•@ sb@•@ ib@•@ fb@•@ @•@pba @•@fa @•@ia @•@sa @•@ja @•@ma @• da @• @• @• @• @• @• @• @• @• @• @• @• @ @ @ @ @ @ @ @ @ @ @ @• @• @• @• @• @• @• @• @• @• @• @ @ @ @ @ @ @ @ @ @ @• @• @• @• @• @• @• @• @• @• @ @ @ @ @ @ @ @ @ @• @• @• @• @• @• @• @• @• @ @ @ @ @ @ be @ @ af @• @• @• @• @• @• @• @• @ @ @ @ @ @ @ @• @• @• @• @• @• @• @ @ @ @ @ @ @• @• @• @• @• @• @ @ @ @ @ @• @• @• @• @• @ @ @ @ @• @• @• @• @ @ @ @• @• @• @ @ @• @• @ @• ⊥12 • @
• @
@• @
@• @
@• @
@• @
@• @
@• @
@• @
@• @
@• @
before
after
Fig. 6. The Complete Bi-lattice Tv (12)bf of Bf-annotations
4 Bf-Relation Reasoning System under Incomplete Information We consider a bf-relation reasoning system consisting of three agents and their supervisor who can reason the correct bf-relations under incomplete or contradictory process bf-relations detected by each agent. Suppose that three agents 1,2 and 3 (ids : a1 ,a2 and a3 , resp.) can detect start/finish times of three processes P r1 , P r2 and P r3 in Fig.7, and their supervisor (id : as ) obtains process bf-relations from each agent at time ti (i = 0, 1, 2, . . . , 7). We also assume that: agent 1. (a1 ) fails to detect process P r3 start/finish times, agent 2. (a2 ) fails to detect process P r2 start/finish times, and agent 3. (a3 ) can detect only process P r2 start/finish times. Let P ri (aj ) be the i-th process identified by agent aj and pi(aj ) its process id, where i ∈ {1, 2, 3, . . .} and j ∈ {1, 2, 3, s(supervisor)}. Then, bf-relations(vector annotations) reasoned by each agent and the supervisor are shown in Table 1.
Paraconsistent Before-After Relation Reasoning Based on EVALPSN Q time Q P roc.Q
t0
t1
t2
t3
t4
P r0
t5
t6
271
t7
-
P r1
-
P r2
Fig. 7. Process Time Chart Table 1. Table of Vector Annotations bf-relations
t0
t1
t2
t3
t4
t5
t6
t7
R(p1(a1 ), p2(a1 ), t) (0, 0) (0, 8) (2, 8) (2, 8) (2, 8) (2, 10) (2, 10) (2, 10) R(p2(a1 ), p3(a1 ), t) (0, 0) (0, 0) (0, 8) (0, 8) (0, 8) (0, 8) (0, 12) (0, 12) R(p1(a2 ), p2(a2 ), t) (0, 0) (0, 8) (0, 8) (2, 8) (4, 8) (4, 8) (4, 8) (4, 8) R(p2(a2 ), p3(a2 ), t) (0, 0) (0, 0) (0, 0) (0, 8) (0, 12) (0, 12) (0, 12) (0, 12) R(p1(a3 ), p2(a3 ), t) (0, 0) (0, 0) (0, 8) (0, 8) (0, 8)
(0, 8)
R(p1(as), p2(as ), t) R(p2(as), p3(as ), t) R(p3(as), p4(as ), t) R(p4(as), p5(as ), t) R(p5(as), p6(as ), t)
(6, 6) (6, 6) (6, 6) (2, 10) (2, 10) (2, 10) (5, 5) (6, 6) (6, 6) (4, 8) (4, 8) (4, 8) (0, 12) (0, 12) (0, 12)
(0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
(5, 5) (0, 8) (0, 0) (0, 0) (0, 0)
(5, 5) (2, 8) (5, 5) (0, 8) (0, 0)
(5, 5) (2, 8) (5, 5) (2, 8) (0, 8)
(5, 5) (2, 8) (5, 5) (4, 8) (0, 12)
(0, 12) (0, 12)
The supervisor identifies five processes according to the incorrect bf-relations reasoned by each agent and correctly reasons that there are three different processes. We show the bf-EVALPSN reasoning process of the correct bf-relations under incomplete or contradictory information. At time t1 , agents 1 and 2 have detected processes P r1 (a1 ) and P r1 (a2 ), respectively, however agent 3 could not detect any process, then they have the bf-EVALP clauses, R(p1(a1 ), p2(a1 ), t1 ) : [(0, 8), α],
R(p1(a2 ), p2(a2 ), t1 ) : [(0, 8), α],
R(p1(a3 ), p2(a3 ), t1 ) : [(0, 0), α]. The supervisor recognizes that two processes P r1 (a1 ) and P r1 (a2 ) have started at the same time t1 as processes P r1 (as ) and P r2 (as ), respectively; therefore, it has the bf-EVALP clauses, R(p1(as ), p2(as ), t1 ) : [(5, 5), α],
R(p2(as ), p3(as ), t1 ) : [(0, 8), α].
At time t2 , agents 1 and 3 have detected processes P r2 (a1 ) and P r1 (a3 ), respectively, however agent 2 could not detect any process, then they have the bf-EVALP clauses,
272
K. Nakamatsu, J.M. Abe, and S. Akama
R(p1(a1 ), p2(a1 ), t2 ) : [2, 8], R(p1(a3 ), p2(a3 ), t2 ) : [0, 8].
R(p2(a1 ), p3(a1 ), t2 ) : [0, 8],
The supervisor recognizes that processes P r2 (a1 ) and P r1 (a3 ) have started at the same time t2 as processes P r3 (as ) and P r4 (as ), respectively. Therefore it has the bf-EVALP clauses, R(p2(as ), p3(as ), t2 ) : [(2, 8), α], R(p4(as ), p5(as ), t2 ) : [(0, 8), α].
R(p3(as ), p4(as ), t2 ) : [(5, 5), α],
At time t3 , only agent 2 has detected process P r2 (a2 ), however, agents 1 and 3 could not detect any process, then agent 2 has the bf-EVALP clauses, R(p1(a2 ), p2(a2 ), t3 ) : [2, 8],
R(p2(a2 ), p3(a2 ), t3 ) : [0, 8].
Then the supervisor recognizes that process P r2 (a2 ) has started at time t3 as process P r5 (as ). Therefore it has the bf-EVALP clause, R(p4(as ), p5(as ), t3 ) : [(2, 8), α],
R(p5(as ), p6(as ), t3 ) : [(0, 8), α].
At time t4 , only agent 2 has detected process P r2 (a2 ) finish, however, agents 1 and 3 could not detect anything, then agent 2 has the bf-EVALP clauses, R(p1(a2 ), p2(a2 ), t4 ) : [4, 8],
R(p2(a2 ), p3(a2 ), t4 ) : [0, 12].
The supervisor recognizes that process P r2 (a2 ) has finished at time t4 as process P r5 (as ). Therefore it has the bf-EVALP clauses, R(p4(as ), p5(as ), t4 ) : [(4, 8), α],
R(p5(as ), p6(as ), t4 ) : [(0, 12), α].
At time t5 , agents 1 and 2 have detected that both processes P r1 (a1 ) and P r1 (a2 ) have finished, however, agent 3 could not detect anything, then they have the bf-EVALP clauses, R(p1(a1 ), p2(a1 ), t5 ) : [2, 10],
R(p1(a2 ), p2(a2 ), t5 ) : [4, 8].
The supervisor recognizes that both processes P r1 (a1 ) and P r1 (a2 ) have finished at the same time t5 as processes P r1 (as ) and P r2 (as ), respectively. Therefore it has the bf-EVALP clauses, R(p1(as ), p2(as ), t5 ) : [(6, 6), α],
R(p2(as ), p3(as ), t5 ) : [(2, 10), α].
At time t6 , agents 1 and 3 have detected that processes P r2 (a1 ) and P r1 (a3 ) have finished, respectively, however, agent 2 could not detect anything, then they have the bf-EVALP clauses, R(p2(a1 ), p3(a1 ), t6 ) : [0, 12],
R(p1(a3 ), p2(a3 ), t6 ) : [0, 12].
The supervisor recognizes that both processes P r2 (a1 ) and P r1 (a3 ) have finished at the same time t6 as processes P r3 (as ) and P r4 (as ), respectively. Therefore it has the bf-EVALP clauses, R(p3(as ), p4(as ), t6 ) : [pba(6, 6), α].
Paraconsistent Before-After Relation Reasoning Based on EVALPSN
273
Since the supervisor has the bf-EVALP clauses, R(p1(as ), p2(as ), t6 ) : [(6, 6), α],
R(p3(as ), p4(as ), t6 ) : [(6, 6), α],
it can identify that processes P r1 (as ) and P r3 (as ) are identical to processes P r2 (as ) and P r4 (as ), respectively. Therefore, the supervisor obtains the bfEVALP clauses, R(p1(as ), p3(as ), t6 ) : [jb(2, 10), α],
R(p3(as ), p5(as ), t6 ) : [ib(4, 8), α],
R(p5(as ), p6(as ), t6 ) : [db(0, 12), α], as the correct bf-relations, which say that three processes have been detected and finished, and the bf-relation between those processes are ‘joint before (jb)’, ‘include before (ib)’ and ‘disjoint before (db)’, although the fourth process has not started yet.
5 Conclusions and Future Works In this paper, we have introduced bf-EVALPSN with a dynamic bf-relation reasoning system. The reasoning method based on bf-EVALPSN can be applied to various process control systems requiring real-time performance. An interval temporal logic has been proposed by Allen et al. for knowledge representation of properties, actions and events[1, 2]. In his interval temporal logic, some predicates such as Meets are used for representing bf-relation between two time intervals.3 Although it is sure that the interval temporal logic is a logically sophisticated tool to develop practical planning or natural language understanding systems. It does not seem to be so suitable for real-time processing because before-after relations between two time periods cannot be determined until both of them finish. On the other hand, in bf-EVALPSN reasoning before-after relations are represented by paraconsistent vector annotations more minutely, which are dynamically determined in real time with simple integer computation according to start/finish time information of processes. Therefore, bf-EVALPSN is a more suitable theoretical foundation for dynamic process order control. As another topic, we introduce that bf-EVALPSN can deal with future possibility of bf-relation in vector annotations. Suppose that there are three processes P r1,2,3 being processed sequentially, and only process P r2 has finished and neither processes P r1 nor P r3 have finished yet at time t. Then two bfrelations between processes P r1 and P r2 , and processes P r2 and P r3 have been determined, although the bf-relation between processes P r1 and P r3 has not been determined yet. Therefore, we have two bf-EVALP clauses with complete bf-relations, R(p1, p2, t) : [ib(4, 8), α] and R(p2, p3, t) : [mb(1, 11), α],
(1)
Now, we will infer the incomplete bf-relation between processes P r1 and P r3 from other complete bf-relations. Then it is logically inferred that the bf-relation 3
M eets(m, n) denotes the time periods m and n exist sequentially.
274
K. Nakamatsu, J.M. Abe, and S. Akama
has three possibilities jb, sb and ib, and the bf-literal R(p1, p3, t) has one of three vector annotations jb(2, 10), sb(3, 9) and ib(4, 8), where t denotes the time when processes P r1 or P r3 finish and t < t . Therefore, the bf-EVALP clause, R(p1, p3, t) : [(2, 8), α] (2) whose vector annotation is the greatest lower bound of {(2, 10), (3, 9), (4, 8)} can be infered from the two bf-EVALP clauses (1). Conversely, if we have the bfEVALP clause (2), it can be inferred that the bf-relation between processes P r1 and P r3 is one of three bf-relations jb, sb and ib at time t . As mentioned above, we can deal with incomplete bf-relations implying complete ones and rules for inferring a bf-relation from other ones in bf-EVALPSN, although they have not been fully developed yet. We will develop such reasoning in bf-EVALPSN in our future work.
References 1. Allen, J.F.: Towards a General Theory of Action and Time. Artificial Intelligence 23, 123–154 (1984) 2. Allen, J.F., Ferguson, G.: Actions and Events in Interval Temporal Logic. J.Logic and Computation 4, 531–579 (1994) 3. Blair, H.A., Subrahmanian, V.S.: Paraconsistent Logic Programming. Theoretical Computer Science 68, 135–154 (1989) 4. da Costa, N.C.A., Subrahmanian, V.S., Vago, C.: The Paraconsistent Logics PT . Zeitschrift f¨ ur Mathematische Logic und Grundlangen der Mathematik 37, 139–148 (1989) 5. Nakamatsu, K.: Pipeline Valve Control Based on EVALPSN Safety Verification. J.Advanced Computational Intelligence and Intelligent Informatics 10, 647–656 (2006) 6. Nakamatsu, K., Abe, J.M., Suzuki, A.: Annotated Semantics for Defeasible Deontic Reasoning. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 432–440. Springer, Heidelberg (2001) 7. Nakamatsu, K., Mita, Y., Shibata, T.: An Intelligent Action Control System Based on Extended Vector Annotated Logic Program and its Hardware Implementation. J.Intelligent Automation and Soft Computing 13, 289–304 (2007)
Image Representation with Reduced Spectrum Pyramid Roumen Kountchev1 and Roumiana Kountcheva2 1
Technical University – Sofia, Department of Radio Communications and Video Technologies, Boul. Kl. Ohridsky 8, Sofia 1000, Bulgaria,
[email protected] 2 T&K Engineering Co. Mladost 3, POB12, Sofia 1712, Bulgaria,
[email protected]
Abstract. The paper presents one new approach for image decomposition based on spectrum pyramid with reduced number of coefficients (lower or equal to the number of the image pixels). The decomposition permits multi-layer image transfer with high compression ratio and good visual quality. The computational complexity of the new decomposition is relatively low. Some results, obtained with the simulation of the presented algorithm and the most important application areas are presented in the paper as well. Keywords: Image decomposition, Multi-layer image representation, Reduced spectrum pyramid.
1 Introduction One of the classic approaches for low bit-rate transfer of digital images via narrowband communication channels is based of hierarchical coding with pyramid decomposition [1]. Such decomposition is usually implemented by image resolution reduction with 2D decimation and filtration. In result is obtained a sequence of approximating images of twice smaller size each, which build the consecutive layers of the so-called Gaussian Pyramid (GP). The difference images between the current GP layer approximation and the corresponding 2D-interpolated and filtered one from the preceding layer comprise the pyramid known as Laplacian (LP) [2]. This pyramid is used as a basis for the creation of different systems for progressive image transfer [3]. The basic disadvantage of LP is that the needed memory volume is about 33% larger than that of the original image. In order to make this volume smaller, several LP modifications had already been developed, such as the reduced LP (RLP) [4]; the pyramid based on reduced sums or reduced differences (Reduced-Sum/ReducedDifference Pyramid, RSP/RDP); the S-Transform Pyramid (STP) [5], etc. The general disadvantages of the Laplacian pyramids are: • The principle used for their building: the pyramid base is calculated first and then the next layers up to the pyramid top follow. This approach generates some difficulties because the progressive image data transfer requires the top of the pyramid (the highest layer) to be transferred first; • On account of the multiple decimations and interpolations, followed by low frequency or band filtration used, the restored images usually have some G.A. Tsihrintzis et al. (Eds.): New Direct. in Intel. Interac. Multimedia, SCI 142, pp. 275–284, 2008. © Springer-Verlag Berlin Heidelberg 2008 springerlink.com
276
R. Kountchev and R. Kountcheva
specific distortions - false concentric circles around the high-contrast transitions. Additional distortions appear as a result of the filter window border effect. In order to avoid this, a lower number of pyramid layers should be used, which decreases the abilities for multi-layer image transfer with high compression ratio; • The needed memory volume, which for the RLP is higher than that of the original image (for example, the memory needed by the RSP is with 8.3% larger [5]). This is an additional obstacle to obtain high compression ratio for applications, which need good quality of the restored image. This paper presents new decomposition method in the image spectrum domain with the Reduced Spectrum Pyramid (RSPP), based on the Complex Spectrum Pyramid (CSP) [6]. The CSP permits to avoid the first two disadvantages of the Laplacian pyramids, but unfortunately the needed memory volume is with about 33% larger. The offered new pyramid requires less memory and permits higher compression for retained image quality.
2 Building the Reduced Spectrum Pyramid The processing starts by dividing the original halftone image (represented by the matrix [B]) into square sub-blocks, as shown in Fig. 1. The total sub-blocks number is m×n. In case that the image matrix is not square or the number of pixels could not be divided by 2n, the matrix is extended adding the necessary number of zeros. After that each sub-block is represented by the matrix [B(2n)] of size 2n × 2n, on which in the image spectrum domain is built the multi-layer RSPP. The pyramid elements are calculated recursively, starting from the pyramid top. The pyramid top of each image sub-block corresponds to the layer p = 0. The calculation of the elements, comprising this layer, is presented below.
[ S 0 ( 2 n )] = [ T0 ( 2 n )][ B( 2 n )][ T0 ( 2 n )]
(1)
Here [T0(2n)] is a matrix of size 2n × 2n, defined in accordance with the selected ~ orthogonal transform. After that is defined the matrix [ S 0 ( 2 n )] , containing the spectrum coefficients ~s ( u ,v ) for u, v = 0,1,2,..,2n-1, which comprise the pyramid 0
top and are defined by the equation:
~ s0 ( u , v ) = m0 ( u , v ) s0 ( u , v ) ,
(2)
where m0(u,v) is an element of the binary matrix-mask [M0(2n)] , defined by:
(u, v) ∈V0 ; ⎧ 1, if m0 ( u , v ) = ⎨ ⎩0 - in all other cases.
(3)
Image Representation with Reduced Spectrum Pyramid
sub-im age 1
sub-im age 2
sub-im age 3
.......
sub-im age m
sub-im age 2m
sub-im age m +1
sub-im age 2m +1
sub-im age 3m
.......
277
sub-im age 3m +1
....... sub-im age m xn
Fig. 1. Image division in m×n sub-blocks
Here V0 defines the area of the transform [ S 0 ( 2 n )] , which contains the high-energy spectrum coefficients s0 ( u ,v ) , selected in accordance with Eq. (3). In particular, if m0(u,v) = 1 for the frequencies u, v = 0, 1 only, the total number of the retained coefficients in V0 is equal to 4. In this case the pyramid layer p = 0 contains the coefficients s0(0,0), s0(1,0), s0(0,1) and s0(1,1) of the matrix [ S 0 ( 2 n )] . The coefficients for the next pyramid layer p = 1 are calculated, applying inverse 2D ~ transform on the matrix [ S 0 ( 2 n )] :
~ ~ [ B0 ( 2 n )] = [ T0 ( 2 n )] −1 [ S 0 ( 2 n )][ T0 ( 2 n )] −1
(4)
~ The matrix obtained [ B0 ( 2 n )] , is the coarse approximation of the block [B(2n)]. The approximation error for the pyramid layer p = 0 is defined by the next difference matrix of size 2n × 2n: ~ [ E0 ( 2 n )] = [ B( 2 n )] − [ B0 ( 2 n )]
(5)
This matrix is then divided in 4 sub-matrices [ E0k1 ( 2 n −1 )] , each of size 2n-1 × 2n-1 and with sequence number k1 = 1,2,3,4, which correspond to their sequential processing in accordance with the Peano “Z-scan” [7]:
278
R. Kountchev and R. Kountcheva
⎡ [ E 1 ( 2 n −1 )] [ E0 ( 2 n )] = ⎢ 03 n −1 ⎣[ E0 ( 2 )]
[ E02 ( 2 n −1 )] ⎤ ⎥ [ E04 ( 2 n −1 )] ⎦
(6)
The elements of the next pyramid layer (p = 1) for each difference sub-matrix
[ E0k1 ( 2 n −1 )] are calculated performing recursively the operations, defined by equations (1)-(6). The RSPP layer p = 1,2,..,r (r ≤ n-1) is calculated, applying on each difference submatrix [ E p −p 1 ( 2 n − p )] of size 2n-p × 2n-p and with number kp = 1,2,..,4p, the selected k
orthogonal 2D transform:
[ S p p ( 2 n − p )] = [ T p ( 2 n − p )][ E p −p 1 ( 2 n − p )][ T p ( 2 n − p )]. k
k
(7)
~k Here [ T p ( 2 n − p )] is the orthogonal transform matrix, and [ S p p ( 2 n − p )] is the transform of the matrix [ E p −p 1 ( 2 n − p )] which is the sub-matrix of the difference k
matrix, defined in similarity with Eq. (5) by the relation: ~ [ E p −1 ( 2 n )] = [ E p −2 ( 2 n )] − [ E p − 2 ( 2 n )] for p = 2,3,..,r.
(8)
~ The matrix [ E p −2 ( 2 n )] is of size 2n × 2n and approximates correspondingly the matrix [ E p −2 ( 2 n )] . The difference matrix [ E p −1 ( 2 n )] is quad-tree divided [7] in sub-blocks. In result are obtained 4p sub-matrices [ E p −p 1 ( 2 n − p )] of size 2n-p×2nk
and numbered kp = 1,2,..,4p, in correspondence with their sequential processing, following the “Z-scan”:
p
⎡ [ E 1 ( 2 n−p )] [ E 2p−1( 2 n−p )] p−1 ⎢ p 2 p +1 n−p ⎢ [ E 2p−1+2( 2 n−p )] [ E p−1(2n )] = ⎢ [ E p−1 (2 )] −−−−− −−−−− ⎢ ⎢ 4 p−2 p+1 n−p 4 p−2 p +2 n−p ( 2 )] ⎢⎣[ E p−1 ( 2 )] [ E p−1
[ E 2p−1( 2 n−p )] ⎤ ⎥ p+1 − [ E 2p−1 ( 2 n−p )] ⎥ ⎥ − −−−−− ⎥ p ⎥ − [ E 4p−1( 2 n−p )] ⎥⎦ −
p
(9)
s p p ( u , v ) for u, v = 0, 1 of The RSPP layer p consists of the spectrum coefficients ~ ~k the approximating matrices [ S p p ( 2 n − p )] numbered kp = 1,2,..,4p. These coefficients k
are defined by: k k ~ s p p ( u , v ) = m p ( u , v ). s p p ( u , v )
(10)
Image Representation with Reduced Spectrum Pyramid
279
Here mp(u,v) is the element of the binary matrix-mask [Mp(2n-p)] , for which:
(u, v) ∈V p ; ⎧ 1, if m p ( u ,v ) = ⎨ 0 in all other cases, ⎩
(11)
In this equation Vp is the area of the transform [ S p p ( 2 n − p )] , which contains the k
k
high-energy coefficients s pp ( u , v ) , defined in accordance with (11). The reduction of spectrum coefficients number for pyramid layer р = 1,2,..,r of the sub-block [B(2n)] is done using the relations, existing between coefficients in neighboring layers. For this are analyzed the coefficients of one sub-block [B(2n)], obtained with two-dimensional Walsh-Hadamard transform (2D-WHT) with arranged matrix (the arrangement is done following the ascending number of the elements’ sign change). The basic images with spatial frequencies (u, v), which correspond to 2DWHT of size 4×4 (respectively, n = 2), are shown in Fig.2а. For comparison, in Fig.2b are shown the basic images of the 2D discrete cosine transform (2D-DCT) of size 4×4.
a. 2D-WHT.
b. 2D-DCT
Fig. 2. Basic images of size 4×4, corresponding to 2D-WHT and 2D-DCT
From Eqs. (7) and (8) follows that the transform of the difference sub-matrix k n-p n-p [ E p −p 1 ( 2 n − p )] of size 2 ×2 and with sequential number kp in the pyramid layers p = 1,2, .. ,r , for one block [B(2n)] (or correspondingly, for each of its’ sub-blocks) is defined by the relation:
[ S p p ( 2 n − p )] = [ T p ( 2 n − p )][ E p −p 1 ( 2 n − p )][ T p ( 2 n − p )] = k ~k [ T p ( 2 n − p )][ E p −p 2 ( 2 n − p )][ T p ( 2 n − p )]−[ T p ( 2 n − p )][ E p −p 2 ( 2 n − p )][ T p ( 2 n − p )]= ~k k [ S p −p 1 ( 2 n − p )] − [ S p −p1 ( 2 n − p )]. k
k
(12)
280
R. Kountchev and R. Kountcheva
In particular, if mp(u,v) = 1 for the frequencies u, v = 0, 1 only, then the retained coefficients in the region of Vp of the sub-block kp in the layer p is:
Np =
2 n − p −1 2 n − p −1
∑ ∑ m p ( u , v ) = 4.
u =0
(13)
v =0
k
The relation between coefficients s pp ( u , v ) for u, v = 0, 1 of the matrix
[ S p p ( 2 n − p )] of size 2n-p×2n-p in the layer р and their corresponding coefficients k
k +1
s p p++11 ( u , v ) for same spatial frequencies from the matrix [ S p +p 1 ( 2 n − p −1 )] of size k
2n-p-1 × 2n-p-1 in the next layer (р+1), taking into account the relation (12) and the 2D-WHT basic images (0,0), (0,1), (1,0), (1,1) from Fig. 2.а, is as follows: • For u = v = 0 k k k k +1 k +2 k +3 k s pp (0 ,0 ) = s pp−1 (0 ,0 )−~ s p −p1 (0 ,0 ) = s pp+1+1 (0 ,0 )+s pp+1+1 (0 ,0 )+s pp+1+1 (0 ,0 )+s p p+1+1 (0 ,0 )= 0
(14)
• For u = 0 and v = 1 k k k k +1 k +2 k +3 k s pp (0 ,1 ) = s pp−1 (0 ,1 )−~ s p −p1 (0 ,1 ) = s pp++11 (0 ,0 )+s pp++11 ( 0 ,0 )−s pp++11 (0 ,0 )−s p+p +11 (0 ,0 ) = 0
(15)
• For u = 1 and v = 0 k k k k +1 k +2 k +3 k s pp (1 ,0 ) = s pp−1 ( 1 ,0 )−~ s p −p1 (1 ,0 ) = s pp++11 ( 0 ,0 )−s pp+1+1 ( 0 ,0 )+s pp++11 ( 0 ,0 )−s pp++11 ( 0 ,0 ) = 0
(16)
• For u = v = 1 k k k k k +1 k +2 k +3 s pp ( 1,1 ) = s pp−1( 1,1 )−~ s p −p1( 1,1 ) = s p p++11 ( 0 ,0 ) − s p p++11 ( 0 ,0 )−s p p++11 ( 0 ,0 )+s pp++11 ( 0 ,0 ) = 0.
k
k
(17)
+1
The solution of equations (14)-(17) for coefficients s p p++11 ( 0 ,0 ), s p p++11 ( 0 ,0 ), k
+2
k
+3
s p p++11 ( 0 ,0 ), s p p++11 ( 0 ,0 ) , whose numbers correspond to the “Z-scan” for the subblocks of [ E p −p 1 ( 2 n − p )] , is: k
k
k
+1
k
+2
k
+3
s p p++11 ( 0 ,0 ) = s p p++11 ( 0 ,0 ) = s p p++11 ( 0 ,0 ) = s p p++11 ( 0 ,0 ) = 0.
(18)
From this follows that the defined 4 spectrum coefficients with frequency (0,0) calculated for 4 neighbor sub-blocks in the pyramid layer p ≥ 1 are always equal to “0” and this result does not depend on the content of the corresponding sub-block from the preceding layer. As a result, the number of the necessary coefficients for every 4 adjacent sub-blocks from same pyramid layer (p = 1,2,..,r) could be reduced
Image Representation with Reduced Spectrum Pyramid
281
by 4. Then the total number of spectrum coefficients, necessary to represent the RSPP of n layers for the image sub-block [B(2n)] in the case, when mp(u, v) = 1 for u, v = 0, 1 and p = 0,1,.., n-1, is: n −1
n −1
n −1
p =1
p =1
p =1
∑ 4 p+1−∑ 4 p = 4 + 3∑ 4 p = 4 + 3 3 ( 4 n−1− 1 ) = 4 n
M =4+
4
(19)
ˆ ( 2 n )] is represented by the sum of the components: The restored image block [ B ~ n ˆ ( 2 n )] = [ B [B 0 ( 2 )] +
r
∑ [ E p−1 ( 2 n )] + [ E r ( 2 n )] ~
(20)
p =1
Each component is a matrix of size 2n×2n, which corresponds to the pyramid layer р = 0,1,2,..,r. The last component is the residual difference matrix [Er(2n)] . In case that the RSPP layers number r is (n-1), the corresponding decomposition (20) is full ˆ ( 2 n )] ≡ [ B( 2 n )] . and [ B The RSPP of the block [B(2n)], which corresponds to (20), consists of the following coefficients: ~ • in the layer p = 0 – the coefficients of [ S 0 ( 2 n )] , retained after the reduction; ~k • in layers p = 1,2,..,r-1 – the coefficients of [ S p p ( 2 n − p )] for kp=1,2,..,4p, retained after the reduction. The pyramid of the image, represented by the matrix [B] consists of all coefficients k s pp ( u , v ) from
the reduced sub-blocks pyramids. The coefficients are then arranged
in global massifs (2D frequency bands), in accordance with the following parameters:
• • •
the spatial frequency (u,v) of the coefficient; the pyramid layer number; the sub-block number kp in the pyramid layer p.
The obtained data massifs build the RSPP layers for the halftone image [B]. After this the data obtained are compressed losslessly using some of the famous methods for entropy coding. Therefore, the global number of spectrum coefficients, which build the RSPP, is equal to the number of pixels in the block [B(2n)] and the decomposition, presented by Eq. (12) is full. For same conditions, for the CSP this number is (4/3)4n, i.e. the reduction of RSPP compared to CSP is (4/3) = 1.33. This result was confirmed modeling the pyramid performance for some test images. Our investigation proved that relations similar with these, represented by Eqs. (14)(17) are in force for larger number of coefficients (not limited up to 4 only) from adjacent pyramid layers. This gives additional opportunities for flexible spectrum pyramid reduction and image representation.
282
R. Kountchev and R. Kountcheva
3 Modeling Results In Fig. 3 is shown the 3-layer RSPP representation of the test image “Lena”. The image matrix is first divided in sub-blocks of size 16 × 16 and on each is built a corresponding RSPP. The pyramid is called “inverse”, because the calculation of its coefficients starts from the top and continues to the base. In Fig. 3 are given the peak signal-to-noise ratio (PSNR) in dB and the bitrate (BR), obtained using 2D DCT for the calculation of the coefficients with frequencies (0,0), (0,1), (1,0) and (1,1) in every RSPP layer. The decomposition modeling was done with specially developed software. The BR values were calculated without quantization and entropy coding of the RSPP coefficients. After Huffman coding the BR is reduced to 1 bpp, retaining the quality of the restored image “Lena” to PSNR > 32 dB. The restored images shown in Fig. 4 correspond to different pyramid layers, built on sub-blocks of size 16 × 16 pixels. Further compression enhancement with retained image quality is obtained with post-processing based on the block-artifacts reduction [9].
[S2 ]
~ [S1 ]
~ [S0 ]
[E1 ]
~ [E0 ]
~ [B0 ]
~ ~ [B0]+[E0]+[E1]
~ ~ [B0]+[E0]
~ [ B0 ]
Fig. 3. Three-layer RSPP representation of the test image “Lena” (512×512 pixels, 8 bpp), starting with sub-blocks of size 16×16 pixels and using block-based DCT without entropy coding and block-artifacts reduction (IDCT – Inverse DCT)
Image Representation with Reduced Spectrum Pyramid
a. Original test image (512x512, 8bpp)
c. RSPP layer p = 1
283
b. RSPP layer p = 0
d. RSPP layer p = 2
Fig. 4. Restored images, corresponding to RSPP layers p = 0, 1, 2 (from Fig. 3)
4 Conclusion The basic advantages of the offered RSPP for image representation in the spectrum domain are: 1. The compression ratio for same image quality is about 33% higher than that of CSP; 2. The calculation of the RSPP using the values of the corresponding CSP coefficients is based on several simple relations, which results in relatively low computational complexity of the coder/decoder and offers the ability for real-time applications;
284
R. Kountchev and R. Kountcheva
3. The RSPP could be built using any known orthogonal transform, which makes the offered algorithm for pyramid representation universal and permits multiple applications; 4. The RSPP provides the ability to build systems for multi-layer image transfer with high compression ratio, which is very important for Internet applications; 5. The RSPP permits the creation of image databases with enhanced image search. The research work will continue with an investigation on the influence of the kind of the selected orthogonal transform (KLT, DFT, Haar, Hartley, etc.) on the compression ratio, the comparison with the JPEG and JPEG2000 standards for still image compression, on the enhancement of the image search in large databases [10], pattern recognition in the spectrum domain, etc. Acknowledgment. This paper was supported by the National Fund for Scientific Research of the Bulgarian Ministry of Education and Science (Contr. VU-I 305/2007)
References 1. Rosenfeld, A. (ed.): Multiresolution Image Processing and Analysis. Springer, Heidelberg (1984) 2. Burt, P., Adelson, E.: The Laplacian Pyramid as a Compact Image Code. IEEE Trans. on Commun. COM-31(4), 532–540 (1983) 3. Wang, L., Goldberg, M.: Reduced-difference pyramid: A data structure for progressive image transmission. Optical Eng. 28, 708–716 (1989) 4. Aiazzi, B., Alparone, L., Baronti, S.: A Reduced Laplacian Pyramid for Lossless and Progressive Image Communication. IEEE Trans. on Commun. 44(1), 18–22 (1996) 5. Wang, L., Goldberg, M.: Comparative performance of pyramid data structures for progressive image transmission. IEEE Trans. on Commun. 39(4), 540–548 (1991) 6. Kountchev, R., Milanova, M., Ford, C., Kountcheva, R.: Multi-layer Image Transmission with Inverse Pyramidal Decomposition. In: Halgamuge, S., Wang, L. (eds.) Computational Intelligence for Modeling and Predictions, vol. 2(13), pp. 179–196. Springer, Heidelberg (2005) 7. Cocquerez, J., Philipp, S.: Analyse d’images: filtrage et segmentation, Masson (1995) 8. Gonzales, R., Woods, R.: Digital Image Processing. Prentice Hall, Englewood Cliffs (2002) 9. Luo, Y., Ward, R.: Removing the Blocking Artifacts of Block-Based DCT Compressed Images. IEEE Trans. on Image Processing 12(7), 838–842 (2003) 10. Kountchev, R., Rubin, S., Milanova, M., Todorov, V., Kountcheva, R.: Image Multi-Layer Search Based on Spectrum Pyramid. In: IEEE Conf. IRI 2007, Las Vegas, USA, August 13-15, pp. 436–440 (2007)
Constructive Logic and the Sorites Paradox Seiki Akama1 , Kazumi Nakamatsu2 , and Jair Minoro Abe3 1
2
3
C-Republic Inc., Tokyo, Japan
[email protected] University of Hyogo, Himeji, Japan
[email protected] Paulista University, Sao Paulo, Brazil
[email protected]
Abstract. We show that contractionless constructive logic CLN , which is a subsystem of Nelson’s constructive logic with strong negation, can be viewed as an interesting logic for vagueness. It can formalize vague predicate in a constructive setting and overcome the so-called sorites paradox. We describe a sequent calculus for CLN and related systems. Keywords: the sorites paradox, vagueness, sequent calculus, contractionless constructive logic.
1 Introduction To develop a logic for vagueness is an important topic for several fields. For instance, philosophers are interested in the nature of vague predicates. Computer scientists try to formalize vague information in knowledge representation. However, they must face the sorites paradox which is one of the challenging paradoxes closely related to vagueness. It is paradoxical in the sense that all the steps in the sorites argument are justified in classical logic. The notable instance of the sorites is the paradox of heap. The essential point here is that the predicate involving the paradox is vague. There exist several approaches to vagueness in the literature. One of the most trivial approaches is to claim that logic should not cope with vagueness on the ground that logic works with exactness found in mathematics. This is nothing but a solution. A serious solution naturally consists in setting up some logical system avoiding the paradox. Many-valued logic is a promising framework. In particular, fuzzy logic, which is a version of infinite-valued logic, can provide a smooth solution (cf. Goguen [5]). Fine’s [4] theory based on supervaluation intriguing stuff within classical logic. However, these approaches seem ad hoc in that the motivation is far from a proper solution. Akama [1] proposed a contractionless constructive logic to solve Curry paradox which is an obstacle in the development of the foundation for naive set theory. This paper is to show that contractionless constructive logic is also a viable G.A. Tsihrintzis et al. (Eds.): New Direct. in Intel. Interac. Multimedia, SCI 142, pp. 285–292, 2008. c Springer-Verlag Berlin Heidelberg 2008 springerlink.com
286
S. Akama, K. Nakamatsu, and J.M. Abe
framework for a logic for vagueness. The starting point of this work lies in the fact that in contractionless logics what we call chain does not hold and that consequently the paradox does not arise. The solution to the sorites paradox by means of contractionless logic seems novel and is worth investigating. The paper proceeds as follows. In section 2, we give an overview the sorites paradox. In section 3, we present a contractionless constructive logic based on a sequent calculus. In section 4, we discuss the proposed solution to the paradox in detail. In section 4, we review related approaches. Section 5 gives some conclusions.
2 The Sorties Paradox Philosophers are troubled with the sorites paradox for a long time. The paradox is endorsed by the argument by means of repeated applications of modus ponens to produce the chain of predicates in question. And it is closely related to vagueness. As a consequence, we have to consider both logic and our cognition. The paradox of heap is of the form: heap(10, 000) heap(10, 000) → heap(9, 999) So heap(9, 999) heap(9, 999) heap(9, 999) → heap(9, 998) So heap(9, 998) ··· So heap(0) Here, 10, 000 and 0 are arbitrary, but we can conclude that heap(0) does not hold. A similar argument can be applied to the predicate bold(n) which reads “a man with n hairs on his head is bald”. The paradox can be written in the following way: bald(0) bald(0) → bald(1) So bald(1) bald(1) bald(1) → bald(2) So bald(1) ··· So bald(100, 000) Here, 0 and 100, 000 are arbitrary, but we are again faced with the paradoxical conclusion bald(100, 000) does not hold. The sorites paradox can also be found in other predicates like “tall”, “large”, “red”. In any case, the pattern can be classified into the above two. Both patterns employ the chain of the applications of modus ponens. We call the pattern the chain. They could be generally interpreted as the formula of the form (A ∧ (A → B)) → B.
Constructive Logic and the Sorites Paradox
287
There are probably two interesting points here. One is that the predicates used in the paradox are vague. The other is how valid inference steps are explained. Most people can recognize that the formulation of vague predicate is beyond classical logic. Unfortunately, no one can refute chain reasonably.
3 Contractionless Constructive Logic Contractionless constructive logic CLN was proposed by Akama [1] to overcome Curry’s paradox in relation to naive set theory. It is now widely held that contractionless logics are useful for naive set theory, naive truth theory and related theories (cf. Priest [8]). The language of CLN is the same as the one of standard predicate logic. It is obtained from constructive logic N of Nelson [7] by deleting Contraction of the form: (Contraction) (A → (A → B)) → (A → B). We denote by CLN − the logic which can be obtained from CLN without (EFQ): (EFQ) (A∧ ∼ A) → B CLN − is a subsystem of N − of Almukdad and Nelson [3]. The lack of (EFQ) makes CLN − paraconsistent, in which contradiction never means triviality. Therefore, for some A, A∧ ∼ A could be proved in CLN − . Note that negation ∼ in CLN is a strong negation which satisfies most of classical laws. It should, however, be distinguished from intuitionistic negation. We also point out that (LEM) A∨ ∼ A does not hold since CLN is a constructive logic. Although Akama [1] formulated Prawitz-style natural deduction system for CLN , we here use a sequent calculus. A sequent calculus present below is a variant of the one described in Alumkdad and Nelson [3]. A sequent is an expression of the form Γ ⇒ A, where Γ is a set of multisets (not sets!) and A is a formula. A sequent calculus GCLN for CLN consists of axioms and rules. We assume the reader’s familiarity with sequent calculus. Let Γ, Δ be a multiset of formulas and A, B, C be a formula. (Axiom) (AX1) Γ, A ⇒ A
(AX2) Γ, A, ∼ A ⇒ B
(Rules) Γ ⇒ A Δ, A ⇒ B Γ, Δ ⇒ B Γ ⇒A Γ ⇒B Γ, A, B ⇒ C (∧L) (∧R) Γ, A ∧ B ⇒ C Γ ⇒ A∧B Γ ⇒B Γ ⇒A (∨R) Γ ⇒ A∨B Γ ⇒A∨B
(cut)
288
S. Akama, K. Nakamatsu, and J.M. Abe
(∨L)
Γ, A ⇒ C Γ, B ⇒ C Γ, A ∨ B ⇒ C
(→ L)
Γ ⇒ A Δ, B ⇒ C Γ, Δ, A → B ⇒ C
(→ R)
Γ, A ⇒ B Γ ⇒A→B Γ ⇒A Γ ⇒ ∼∼ A
(∼∼ L)
Γ, A ⇒ B Γ, ∼∼ A ⇒ B
(∼ ∧L)
Γ, ∼ A ⇒ C Γ, ∼ B ⇒ C Γ, ∼ (A ∧ B) ⇒ C
(∼ ∧R)
Γ ⇒∼A Γ ⇒ ∼ (A ∧ B)
Γ ⇒∼ B Γ ⇒ ∼ (A ∧ B)
(∼ ∨L)
Γ, ∼ A, ∼ B ⇒ C Γ, ∼ (A ∨ B) ⇒ C
(∼ ∨R)
(∼→ L)
Γ, A, ∼ B ⇒ C Γ, ∼ (A → B) ⇒ C
(∼∼ R)
Γ ⇒∼A Γ ⇒∼B Γ ⇒ ∼ (A ∨ B)
(∼→ R)
Γ ⇒A Γ ⇒∼B Γ ⇒ ∼ (A → B)
(∀L)
Γ, A(t) ⇒ B Γ, ∀xA(x) ⇒ B
(∀R)
Γ ⇒ A(c) Γ ⇒ ∀xA(x)
(∃L)
Γ, A(c) ⇒ B Γ, ∃xA(x) ⇒ B
(∃R)
Γ ⇒ A(t) Γ ⇒ ∃xA(x)
(∼ ∃L)
Γ, ∼ A(t) ⇒ B Γ, ∼ ∃xA(x) ⇒ B
(∼ ∃R)
Γ ⇒ ∼ A(c) Γ ⇒ ∼ ∃xA(x)
(∼ ∀L)
Γ, ∼ A(c) ⇒ B Γ, ∼ ∀xA(x) ⇒ B
(∼ ∀R)
Γ ⇒ ∼ A(t) Γ ⇒ ∼ ∀xA(x)
Here, t is an arbitrary term and c is a term not occurring in the lower sequent, respectively. We need some comments on structure rules. In this presentation, (weakening) is implicit in (AX1) and (contraction) is not included. We can dispense with (exchange) by assuming that Γ, Δ are a multiset. (cut) is eliminable, since the cut-elimination theorem holds. If we delete (AX2) from GCLN , we can have GCLN − for CLN − . To have GN for N of Nelson [7], contraction: (C) should be added to GCLN . (C)
Γ, A, A ⇒ B Γ, A ⇒ B
One can show that (Contraction) cannot be proved with (C) in GCLN .
Constructive Logic and the Sorites Paradox
289
A ⇒ A A, B ⇒ B (→⇒) A, A, A → B ⇒ B (C) A⇒A A, A → B ⇒ B (→⇒) A, A, A → (A → B) ⇒ B (C) A, A → (A → B) ⇒ B (⇒→) A → (A → B) ⇒ A → B (⇒→) ⇒ (A → (A → B)) → (A → B) Due to the lack of (C), CLN belongs to the family of the so-called BCK logics.
4 Contraction and Sorites In this section, we show that CLN can naturally solve the sorites paradox by the built-in feature. The sorites paradox has the following form: (1) bold(0) (2) ∀n ≥ 0(bold(n) → bold(n + 1) (3) ¬bold(100, 000) Here, there are no objections to (1) and (2). Thus, we must find the problem with (3), i.e. inductive step. In fact, the key to the solution lies in (2) and many solutions in the literature for various motivations. Here, we should be careful to analyze (2). The chain is available in combination with (1) and (2). For each inductive step in the chain, the following reasoning is implicit: (4) (bold(m) ∧ (bold(m) → bold(m + 1)) → bold(m + 1) for any m. One can easily see that (A ∧ (A → B)) → B cannot be proved without using (Contraction), although it is a formula-form of modus ponens. The observation can be checked in the following sequent proof. A ⇒ A A, B ⇒ B (→⇒) A, A, A → B ⇒ B (C) A, A → B ⇒ B (∧ ⇒) A ∧ (A → B) ⇒ B (⇒→) ⇒ (A ∧ (A → B)) → B The fact means that the chain is not a valid inference in CLN . And it provides a formal explanation of the place of the origin of the paradox, i.e. chain. A similar argument can be established by natural deduction of Akama [1]. In addition, CLN has desired features as a logic for vagueness. In a constructive interpretation, neither A(n) nor ∼ A(n) may be true for some n, thus inducing truth-value gaps. This is because A∨ ∼ A is not an axiom in CLN . Then, it would not be possible for us to determine whether some vague sentence is true or false. This implies that our cognition is incomplete. There is also another possibility. For vague sentence B(n), both B(n) and ∼ B(n) are true, i.e. truth-value glut arises. If we adopt CLN − instead of CLN ,
290
S. Akama, K. Nakamatsu, and J.M. Abe
A∧ ∼ A may be true enabling some vague sentences to be both true and false. Therefore, our cognition is regarded ambiguous. From these observations, contractionless constructive logic offers smooth solution to the sorites paradox with some intuitive appeal. However, we must reply to a criticism. Since the deletion of contraction is essential to our solution, many would want to employ standard contractionless logics. Although the term “standard” is different to those who have different motivations, we can take up classical and intuitionistic BCK logics, which are obtainable from classical and intuitionistic logic by deleting Contraction. We can now reject to use classical BCK logic (CBCK) as a candidate since it cannot deal with truth-value gaps. For intuitionistic BCK logic (IBCK), some vague predicate is neither true nor false. In addition, by deleting the axiom (A ∧ ¬A) → B from (IBCK), where ¬ denotes intuitionistic negation, truthvalue glut can be also handled. But, we believe, BCK logics are not always suited to the approach to the sorites paradox. The reason is that they cannot adequately express cognition of vague predicates. Of course, positive vague predicate of the form bold(n) can be justified in BCK logics. However, negative vague predicate receives the representation of the form ¬bold(n), which is defined as bold(n) → f alse, i.e. it is defined via absurdity. The cognition of negative vague predicates seems to be done directly rather than indirectly in general. Probably, we do not interpret the sentence “the man with 1,000,000 hairs is not bold” as the justification of contradiction by assuming that “the man with 1,000,000 hairs is bold”. We indeed admit that there is a case in which the formula of the form ¬bold(n) is useful. But, it is weaker than ∼ A. Fortunately, ¬A can be defined as A → ∼ A in CLN or CLN − , and we can dispense with ¬ as a primitive. These considerations reveal that strong negation plays an important role in our setting. Of course, we cannot deny the solution to the sorites paradox using other types of substructural logics.
5 Related Work

In this section, we compare our approach with existing ones. There are many interesting approaches to vagueness; we review three of them here. One of the simplest approaches to vagueness is to use fuzzy logic (cf. Goguen [5] and Zadeh [10]). It is possible in fuzzy logic to express vagueness by means of degrees of truth, introducing many linguistic truth-values. There are two objections. One is that introducing various truth-values amounts to making sharp distinctions in the degree of vagueness, as opposed to the original intention of fuzzy logic. The other is the lack of a proof theory for fuzzy logic. Without a proof theory, we have no intuitive justification of reasoning about vagueness. Supervaluation, due to van Fraassen [9], is now recognized as an attractive method adopted in various approaches to vagueness. Many writers utilize supervaluation
to capture the notion of vagueness, and the landmark work has been done by Fine [4]. The basis of his idea is, roughly, to interpret a predicate in a world via extended worlds. Based on this technique, vague predicates can allow for truth-value gaps. In addition, the supervaluational approach admits truth-value gaps while keeping the law of excluded middle. We acknowledge the technical elegance of supervaluation. However, we disagree with it for two reasons. First, the law of excluded middle holds for vague predicates, as opposed to our intuition. Second, technically speaking, supervaluation can be incorporated into an arbitrary logic in a semantical setting; thus, it could not act as a logic by itself. It is, however, of some interest to strengthen BCK logic with a supervaluational semantics. We can also find approaches to vagueness using the concept of context-dependency, as in Graff [6]. As the name indicates, this approach claims that the truth of a vague predicate depends on context. The approach can be linguistically motivated and is useful for the semantics of adjectives in natural language. The defect of the context-sensitive approach lies in the fact that it is not obvious how the notion of a context can be formalized in a model of an appropriate logic. In addition, we face difficulty with the formalization of the approach as a logic.
6 Conclusions

We have presented a solution to the sorites paradox based on contractionless constructive logic. The use of this logic in the proposed solution is defensible in that the chain can be naturally blocked for vague predicates. It is interesting that Contraction is the key to the solutions of the sorites paradox as well as of Curry's paradox. We also addressed the point that strong negation enables us to confirm the denial of a vague predicate. However, there are at least three points to be addressed here. First, the lack of Contraction does not enable us to do mathematics properly. This is so because the chain is related to induction. In fact, many mathematical inferences involve induction, but we believe that vague predicates need no use of induction. To accommodate this defect, we may be able to add a standard implication, i.e. intuitionistic implication, for mathematical reasoning. Second, even if the proposed logic allows for truth-value gaps and gluts, these depend on context or situation. A vague sentence may be true in some contexts but false in others. The validity of a sorites argument is surely connected with a context. What does this mean? We suspect that the whole argument of the solution to the sorites paradox is more involved than we suppose. An expanded explanation should be worked out for this issue. Some extensions of the constructive infon logic of Akama and Nagata [2] seem of some use here. Finally, we should be able to advance an intuitive semantics for contractionless constructive logic, both to demonstrate a semantic justification of our solution and to compare our solution with others. But this subject requires a
complicated modification of the standard Kripke semantics. Some work is needed in line with the semantics for substructural logics.
References

1. Akama, S.: Curry's paradox in contractionless constructive logic. Journal of Philosophical Logic 25, 135–150 (1996)
2. Akama, S., Nagata, Y.: Infon logic based on constructive logic. Logique et Analyse 194, 119–136 (2006)
3. Almukdad, A., Nelson, D.: Constructible falsity and inexact predicates. Journal of Symbolic Logic 49, 231–233 (1984)
4. Fine, K.: Vagueness, truth and logic. Synthese 30, 265–300 (1975)
5. Goguen, J.: The logic of inexact concepts. Synthese 19, 325–373 (1969)
6. Graff, D.: Shifting sands: an interest-relative theory of vagueness. Philosophical Topics 28, 45–81 (2000)
7. Nelson, D.: Constructible falsity. Journal of Symbolic Logic 14, 16–26 (1949)
8. Priest, G.: Logic of paradox. Journal of Philosophical Logic 8, 219–241 (1979)
9. van Fraassen, B.C.: Presuppositions, implications, and self-reference. Journal of Philosophy 65, 136–152 (1968)
10. Zadeh, L.A.: Fuzzy logics and approximate reasoning. Synthese 30, 407–428 (1975)
Resource Authorization in IMS with Known Multimedia Service Adaptation Capabilities

Tomislav Grgic, Vedran Huskic, and Maja Matijasevic

University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia
[email protected],
[email protected],
[email protected]
Abstract. The Quality of Service (QoS) mapping in Internet Protocol (IP) based networks is not well-suited for complex multimedia services in a dynamically changing service environment with service adaptation driven by such changes. The work presented in this paper is motivated by the challenge of utilizing additional knowledge about the service and its adaptation capabilities, in the form of the Media Degradation Path, in the signaling across the Diameter interfaces related to resource authorization within the Third Generation Partnership Project (3GPP) IP Multimedia Subsystem (IMS). The proposed approach extends the functionality of the current Diameter Rx and Gx applications in IMS, and illustrates it by using an adaptive video call service as an example.
1 Introduction

Quality of Service (QoS) assurance for networked multimedia services requires the coordination of resources and quality control mechanisms at all points in the system. The mechanisms operating at the multimedia service level (responsible for requesting, authorizing, monitoring, and controlling service-specific QoS parameters) must work together with the network-level QoS mechanisms, which are generic and application-independent. In the 3GPP IP Multimedia Subsystem (IMS) [1], a framework for QoS mapping is applied within the Policy Control and Charging (PCC) architecture [2] and related to the signaling flows between the entities in the PCC architecture [3]. The signaling protocol used in PCC for this purpose is Diameter [4]. As the richness and variety of next-generation multimedia services increase, so do the amount and complexity of QoS-related signaling between the service and network layers, giving rise to questions regarding signaling scalability [5]. The QoS mapping and network resource negotiation in IP-based networks, including IMS-based ones, is well-established (at least theoretically) and fairly straightforward: from session parameters to IP QoS to an access-network-specific QoS class. This model, however, is not well-suited for complex multimedia services in service environments with many possible dynamic changes (e.g., triggered by user preferences, terminal capabilities, or current network conditions) and service adaptation driven by such changes. The work presented in this paper
is motivated by the challenge of utilizing additional knowledge about the service and its adaptation capabilities within the PCC architecture in IMS, to make the process more flexible and, hopefully, to reduce the number of steps (and the time) in which the successful authorization of network resources can be achieved. The paper is organized as follows. Section 2 describes the idea of introducing knowledge about the application into the QoS negotiation process. Section 3 provides an overview of the relevant reference points in IMS PCC based on the Diameter protocol and proposes how they could be extended to support the proposed idea. Section 4 presents an example using a video call service.
2 Knowledge about Adaptation Capabilities of a Multimedia Service

From the network provider's point of view, a multimedia service is a (set of) combination(s) of two or more media components (e.g., audio, video, 3D graphics) within a particular network environment providing that service. Considering that a multimedia service is composed of one or more media components, we specify service versions as initially differing in the included media components (e.g., Version 1 for a desktop computer with streaming media, Version 2 without streaming media for a handheld computer or a mobile phone). Each media component may be configured by choosing from a set of offered alternative operating parameters (e.g., different codecs, frame rates, resolutions, etc.). For a particular service version, we refer to the overall service configuration as the set of chosen operating parameters across all included media components (flows). The one configuration active at any given time is marked as "enforced". In the process of QoS provisioning, operating parameters are mapped to QoS parameters, followed by network resource authorization and reservation. In our previous work, we proposed a model for dynamic service adaptation (DSAM)
[6] and its application in IMS [7]. The QoS matching and optimization process in DSAM is based on including a list of possible service configurations, ranked by utility, in the exchange of data during the QoS negotiation process. This list, named the "Media Degradation Path" (MDP, Fig. 1), represents the application-specific knowledge about the capability of the service to adapt to varying user preferences, terminal capabilities, service requirements, and available resources in the access network. Each configuration specifies the list of network requirements to be fulfilled for the configuration to be enforced. The requirements are specified per (group of) media flows, stating the direction (uplink or downlink), flow description (source/destination IP addresses and ports), bandwidth limitations and QoS parameters.

Fig. 1. Media Degradation Path (for a service version: operating configurations 1..m with per-media operating parameters and matching resource configurations 1..m with per-media requirements, each listing a config identifier, direction, bandwidth range, delay, jitter and loss rate; configurations are ordered by decreasing overall service utility, and one is marked as enforced)
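To make the structure concrete, the following Python sketch shows one possible in-memory representation of an MDP; the types and field names are our own illustration (the paper defines the MDP conceptually, not as an API), and best_fit anticipates the configuration re-selection of Section 4.2.

from dataclasses import dataclass, field

@dataclass
class MediaRequirement:
    direction: str          # "uplink" or "downlink"
    min_bandwidth: int      # kbit/s
    max_bandwidth: int      # kbit/s
    delay: int              # ms
    jitter: int             # ms
    loss_rate: float        # fraction of packets

@dataclass
class Configuration:
    config_id: int
    utility: float
    enforced: bool = False
    media: list = field(default_factory=list)   # MediaRequirement objects

@dataclass
class MediaDegradationPath:
    service_version: int
    configurations: list    # Configuration objects, in any order

    def best_fit(self, available_bandwidth):
        """Highest-utility configuration whose total minimum bandwidth
        fits the currently available bandwidth."""
        feasible = [c for c in self.configurations
                    if sum(m.min_bandwidth for m in c.media) <= available_bandwidth]
        return max(feasible, key=lambda c: c.utility, default=None)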
3 Extending the Relevant Diameter Applications in IMS

The Diameter base protocol was initially specified by the Internet Engineering Task Force (IETF) [4] and adopted by 3GPP for use in IMS. Diameter provides a framework for authentication, authorization and accounting (AAA). The Diameter architecture consists of the Diameter base protocol and Diameter applications (Fig. 2). The base protocol provides only fundamental AAA capabilities, such as capability negotiation and error handling. Diameter applications extend the base protocol by defining application-specific messages and parameters. A Diameter message consists of a Diameter header, followed by information elements called attribute-value pairs (AVPs). An AVP may be of a primitive value type (e.g., Integer, String) or of a "Grouped" value type containing other AVPs. This approach allows building new Diameter applications by adding new AVPs and composing new messages from existing or new AVPs. The Diameter Base Protocol messages referred to in this work are: Re-Authentication-Request/-Answer (RAR/RAA), and Session-Termination-Request/-Answer (STR/STA).

Fig. 2. Diameter Architecture (the Diameter Base Protocol with IETF-specified applications, such as credit control, Mobile IP and SIP, and 3GPP-specified applications for the Rx, Gx, Cx and Dx reference points)

How the current Diameter applications over the Rx [8] and Gx [9] reference points could be extended in order to include the knowledge about the service contained within the MDP is described next. (Some familiarity with the IMS architecture is assumed; the interested reader is referred to [1] for more details.) Figure 3 shows the QoS mapping across the Gx and Rx reference points in the IMS PCC architecture [3].

Fig. 3. QoS Mapping (session signaling via SIP between the User Equipment, the P-CSCF and the application; QoS translation/mapping at the terminal and at the GGSN/PCEF for the bearer service; the PCRF policy engine receives service information over Rx and provides authorized IP QoS parameters over Gx, which become authorized access-specific QoS parameters)

We assume that the selected service configuration and the MDP are signaled by using the Session Initiation Protocol (SIP). The Rx reference point is used for transporting session-related information from
the Proxy-Call Session Control Function (P-CSCF) to the Policy Control and Charging Rules Function (PCRF) in order to reserve the resources in the connectivity layer needed for session establishment. Session information may be received from the P-CSCF due to initial session establishment, session modification, or session termination. The PCRF provides network control regarding service data flow detection, gating, QoS, and flow-based charging towards the Policy and Charging Enforcement Function (PCEF). It is also responsible for informing the P-CSCF of events in the connectivity layer, e.g., a change in network resources. The PCRF may use subscription-specific information as a basis for the policy and charging control decisions, e.g., the highest allowed QoS class or the maximum bit rate. The Gx reference point enables the PCRF to dynamically control the PCEF. Depending on the collected session and subscriber-specific information, the PCRF issues Policy Control and Charging (PCC) rules. The Gx is also used for provisioning and removing PCC rules from the PCRF to the PCEF and for the transmission of connectivity-layer events from the PCEF to the PCRF. In a UMTS RAN, the PCEF is situated at the Gateway GPRS Support Node (GGSN), providing control over the connectivity-layer traffic, reserving the network resources needed for service establishment and delivery, and performing online and offline charging. Diameter messages of interest in this paper include the base protocol messages mentioned earlier, as well as the additional Rx and Gx application-specific messages: Authentication-Authorization-Request/Answer (AAR/AAA), and Credit-Control-Request/Answer (CCR/CCA).
3.1 Rx Diameter Application
In order to use the information stored in the MDP in the PCC decision-making procedure, new MDP-specific AVPs are introduced and integrated into existing Rx application messages. Table 1 lists all (proposed) MDP-specific AVPs and the selected Rx application AVPs relevant for this model. The MDP data is mapped to a group of Diameter AVPs using the MDP-Configuration AVP, shown in Fig. 4.
Table 1. MDP-specific and selected Rx application AVPs

Attribute name               Value type
MDP-Configuration            Grouped
MDP-Utility                  Float
MDP-Enforced                 Enumerated
MDP-Media                    Grouped
MDP-Media-Delay              Integer
MDP-Media-Jitter             Integer
MDP-Media-Loss               Integer
MDP-Media-Max-Bandwidth      Integer
MDP-Media-Min-Bandwidth      Integer
Session-Id                   String
Flow-Description             Grouped
Flow-Usage                   Enumerated
SIP-Item-Number              Integer
Fig. 4. Assigning data contained in the MDP to AVPs (an MDP-Configuration groups SIP-Item-Number, MDP-Utility, MDP-Enforced and one MDP-Media per flow; each MDP-Media groups Flow-Usage, a Flow-Description with direction, source/destination IP addresses and ports and protocol, and the MDP-Media delay, jitter, loss and bandwidth AVPs)

MDP-Configuration is of type Grouped. SIP-Item-Number represents a configuration identifier in an MDP, MDP-Utility stores information about the configuration utility, and the MDP-Enforced AVP marks whether this configuration is enforced (or
not) at the connectivity layer. Each configuration contains a number of MDP-Media AVPs, defining the bandwidth, the network QoS parameters and the flow description (source and destination IP addresses, ports, and transport protocols) for each media flow. Depending on the number of configurations in the MDP, one or more MDP-Configuration AVPs may be included in a Diameter message.
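As a concrete illustration, the sketch below encodes the nesting of Fig. 4 with plain Python tuples; this encoding is our own assumption and not the Open Diameter API. Grouped AVPs are (name, list-of-AVPs) pairs, primitive AVPs are (name, value) pairs.

def mdp_media_avp(flow):
    # One MDP-Media AVP per media flow, grouping QoS values and flow description.
    return ("MDP-Media", [
        ("Flow-Usage", flow["usage"]),
        ("Flow-Description", [
            ("Source-IP", flow["src_ip"]), ("Source-Port", flow["src_port"]),
            ("Destination-IP", flow["dst_ip"]), ("Destination-Port", flow["dst_port"]),
            ("Protocol", flow["proto"]), ("Direction", flow["direction"]),
        ]),
        ("MDP-Media-Delay", flow["delay"]),
        ("MDP-Media-Jitter", flow["jitter"]),
        ("MDP-Media-Loss", flow["loss"]),
        ("MDP-Media-Max-Bandwidth", flow["max_bw"]),
        ("MDP-Media-Min-Bandwidth", flow["min_bw"]),
    ])

def mdp_configuration_avp(config_id, utility, enforced, flows):
    # One MDP-Configuration AVP per configuration in the MDP.
    return ("MDP-Configuration",
            [("SIP-Item-Number", config_id),
             ("MDP-Utility", utility),
             ("MDP-Enforced", enforced)] +
            [mdp_media_avp(f) for f in flows])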
3.2 Gx Diameter Application
The MDP-specific AVPs for the Rx interface, described in the previous section, are reused for the QoS parameter description on the Gx interface. Table 2 lists the Gx application-specific AVPs which had to be extended with MDP-specific AVPs addressing network QoS parameters such as delay, jitter, loss rate, and bandwidth. Figure 5 shows how the MDP may be included in a PCC rule.
Table 2. Modified Gx application-specific AVPs

Attribute name              Value type
Charging-Rule-Install       Grouped
Charging-Rule-Remove        Grouped
Charging-Rule-Definition    Grouped
Authorized-QoS              Grouped
Fig. 5. Structure of Gx-specific AVPs (Charging-Rule-Install groups Charging-Rule-Definition AVPs and a Charging-Rule-Base-Name; Charging-Rule-Remove groups Charging-Rule-Name AVPs; each Charging-Rule-Definition groups Charging-Rule-Name, Service-Identifier, Flow-Description, Flow-Status and an Authorized-QoS AVP carrying the MDP-Media delay, jitter, loss and bandwidth AVPs)

The Charging-Rule-Install AVP is used to activate, install or modify PCC rules, as instructed from the PCRF to the PCEF. The Charging-Rule-Remove AVP is used to deactivate or remove existing PCC rules. Each PCC rule contains a unique identifier
stored in the Charging-Rule-Name AVP. For each media flow specified in the selected configuration, a separate PCC rule is created.
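The per-flow mapping just described could look as follows; this sketch reuses the hypothetical Configuration type from Section 2, and the rule-naming scheme is an assumption made for illustration.

def pcc_rules_for(config):
    # One Charging-Rule-Definition per media flow of the selected configuration.
    rules = []
    for i, m in enumerate(config.media):
        rules.append(("Charging-Rule-Definition", [
            ("Charging-Rule-Name", "cfg%d-flow%d" % (config.config_id, i)),
            ("Authorized-QoS", [
                ("MDP-Media-Max-Bandwidth", m.max_bandwidth),
                ("MDP-Media-Min-Bandwidth", m.min_bandwidth),
                ("MDP-Media-Delay", m.delay),
                ("MDP-Media-Jitter", m.jitter),
                ("MDP-Media-Loss", m.loss_rate),
            ]),
        ]))
    return ("Charging-Rule-Install", rules)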
3.3 Prototype Implementation
In the course of this work, we developed a prototype implementation of the extended Diameter Rx and Gx applications based on Open Diameter. Open Diameter (www.opendiameter.org) is an open-source implementation of the Diameter Base Protocol in the C++ programming language. We extended the existing implementation by adding MDP support to the Rx and Gx Diameter applications. Our next step is a performance evaluation.
4 Example: An Adaptive Video Call Service

The proposed model of MDP support in the Rx and Gx reference points is illustrated by an adaptive video call service as follows. A user initiates a video call session to his colleague by using a softphone on his desktop computer in the office. At some point, he needs to leave for a meeting, yet wants to continue the conversation while on the way. According to his preferences, the video call is transferred without interruption to his mobile phone as he leaves the office. Once he gets to the car, he activates his hands-free set, and the video component is turned off. Finally, the session is terminated. Table 3 presents an example of the received MDP for the negotiated video call. Each media component may be described by a unique MDP-Media AVP, containing all QoS parameters and flow descriptions.
Table 3. MDP for video call service

Configuration   Video codec, frame size   Audio codec   Utility
Config 1        H.263, 640x480 pixel      PCM           0.8
Config 2        H.263, 320x240 pixel      GSM           0.6
Config 3        –                         GSM           0.3

4.1 Session Establishment
For a new video call session establishment in the IMS signaling plane, resources in the connectivity layer must be reserved for the session. Once the session is successfully negotiated in the signaling plane, the P-CSCF receives a SIP message containing the MDP of the negotiated service (Fig. 6). The P-CSCF receives a new SIP message containing the MDP to be applied (1), then collects the MDP data and creates new MDP-Configuration AVPs (2). It then sends a Diameter AAR message containing all MDP information to the PCRF (3), which stores the received service information (4). Next, the PCRF selects the configuration with the highest utility, Config 1 (5), and creates one PCC rule per media component in it (6). It then sends a new Diameter Re-Auth-Request (RAR) message to the PCEF, containing all PCC rules to be applied (7). Depending on the received rules, the PCEF is able to install, remove, or modify PCC rules. In this case, the reservation of network resources for configuration Config 1 is performed (8). The PCEF informs the PCRF of the successful or unsuccessful application of the PCC rules by sending a Re-Auth-Answer (RAA) Diameter message (9).

Fig. 6. Session establishment scenario

The PCRF stores the
information of the successful reservation by setting the MDP-Enforced value of Config 1 to true, and sends an AAA Diameter message to the P-CSCF containing the applied configuration (10).
4.2 Network Resources Modification
A change in network resources may lead to enforcing another configuration of the MDP, for example, when the video call session is switched to a mobile phone (Fig. 7). The PCEF detects the decrease of the available bandwidth for video (1).
Fig. 7. Network resources modification (the PCEF reports the change via a Diameter CCR over Gx; the PCRF chooses the best configuration from the MDP, creates and stores PCC rules, answers with a CCA, and informs the P-CSCF via RAR/RAA)
The PCEF sends a Diameter Credit-Control-Request (CCR) containing information about the currently available bandwidth to the PCRF via the Gx interface (2). The PCRF chooses the highest-utility configuration specified in the MDP, based on the available bandwidth, in this case Config 2 (3). The PCRF creates new PCC rules depending on the chosen configuration (4) and sends a Diameter Credit-Control-Answer (CCA) to the PCEF containing the new PCC rules to be applied (5). The PCRF informs the P-CSCF of the configuration change by sending a Diameter RAR; the MDP-Configuration AVP is used, setting the MDP-Enforced value to true for Config 2 (6). The P-CSCF responds by sending a Diameter RAA (7).
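The PCRF-side re-selection in this scenario can be summarized with the sketch below, which reuses the hypothetical MediaDegradationPath and pcc_rules_for sketches introduced earlier; it illustrates the selection logic only, not the actual PCRF implementation.

def on_bandwidth_report(mdp, available_bandwidth):
    # Pick the highest-utility configuration that still fits (step 3).
    chosen = mdp.best_fit(available_bandwidth)
    if chosen is None:
        return None                      # no feasible configuration left
    for c in mdp.configurations:
        c.enforced = (c is chosen)       # mark the newly enforced configuration
    return pcc_rules_for(chosen)         # new PCC rules toward the PCEF (steps 4-5)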
4.3 Service Requirements Modification
This scenario takes place when the trigger for switching to another configuration comes from the user or an application server, as shown in Fig. 8. When the user in the video call gets to the car, the video component is stopped and the call continues with the audio component only. The P-CSCF receives a SIP message containing a reference to Config 3 in the MDP (1). The P-CSCF sends a Diameter AA-Request to the PCRF containing the received reference (2). The PCRF extracts Config 3 from the MDP (3), creates new PCC rules based on the new configuration (4), and sends the new PCC rules to the PCEF (5).
Fig. 8. Service requirements modification
The PCEF modifies the resources for the audio component, while the resources for the video component are released (6). The PCEF informs the PCRF about the successful resource reservation (7). The PCRF informs the P-CSCF about the successful configuration enforcement (8).
4.4 Session Termination
Fig. 9 shows the session termination scenario.

Fig. 9. Session termination

The P-CSCF receives a session termination request (1) and sends a Session-Termination-Request (STR) to the PCRF (2). The PCRF erases all existing PCC rules as well as the current MDP (3), and issues a Re-Auth-Request (RAR) with the instruction to remove all PCC rules for the session in question (4). The PCEF removes all PCC rules and releases all previously reserved resources (5), and sends a Re-Auth-Answer (RAA) to the
PCRF to confirm the successful release of resources (6). The PCRF then sends a Session-Termination-Answer (STA) to the P-CSCF, thus completing the process (7).
5 Conclusions and Future Work

In this work we proposed a model for including a set of alternative service configuration parameters, in the form of an MDP, within the process of creating and managing policy-based rules in IMS. A case study was presented, illustrating the practical use of the model. Future work will focus on the performance evaluation of the proposed Diameter signaling extensions, in line with the Diameter Maintenance and Extensions working group recommendations.
Acknowledgement The authors gratefully acknowledge the support of the Ministry of Science, Education and Sports of the Republic of Croatia research project no. 036-03620271639, and the Ericsson Nikola Tesla ETK-FER Summercamp 2007.
References

1. Camarillo, G., Garcia Martin, M.A.: The 3G IP Multimedia Subsystem (IMS): Merging the Internet and the Cellular Worlds. John Wiley & Sons, Chichester (2004)
2. 3GPP: TS 23.203: Policy and charging control architecture, Release 7 (December 2007)
3. 3GPP: TS 29.213: Policy and charging control signaling flows and QoS parameter mapping, Release 7 (June 2007)
4. Calhoun, P., Loughney, J., Guttman, E., Zorn, G., Arkko, J.: Diameter Base Protocol. RFC 3588 (Proposed Standard) (September 2003)
5. Agrawal, P., Yeh, J.H., Chen, J.C., Zhang, T.: IP Multimedia Subsystems in 3GPP and 3GPP2: Overview and Scalability Issues. IEEE Communications Magazine 46(1), 138–145 (2008)
6. Skorin-Kapov, L., Matijašević, M.: Dynamic QoS negotiation and adaptation for networked virtual reality services. In: Proceedings of the IEEE WoWMoM, Taormina, Italy, pp. 344–351. IEEE Press, Los Alamitos (2005)
7. Skorin-Kapov, L., Mosmondor, M., Dobrijevic, O., Matijasevic, M.: Application-level QoS negotiation and signaling for advanced multimedia services in the IMS. IEEE Communications Magazine 45(7), 108–116 (2007)
8. 3GPP: TS 29.214: Policy and Charging Control over Rx reference point, Release 7 (January 2007)
9. 3GPP: TS 29.212: Policy and Charging Control over Gx reference point, Release 7 (January 2007)
Visualizing Ontologies on the Web

Ioannis Papadakis and Michalis Stefanidakis

Ionian University, Department of Archives and Library Sciences, Plateia Eleytherias, 49100 Corfu, Greece
Ionian University, Department of Computer Science, Plateia Eleytherias, 49100 Corfu, Greece
{papadakis,mistral}@ionio.gr
Abstract. This paper introduces an ontology visualization model suitable for average web users. It is capable of retaining ontology expressiveness while at the same time hiding the formal terminology that is usually employed in the context of ontology development. According to the proposed ontology visualization model, classes are represented as boxes and their corresponding properties are represented as labeled lines connecting such boxes. By hopping from one box to another, users are able to interactively explore the underlying ontology. Such interaction is facilitated through an intuitive, web-based GUI implemented in Javascript, which communicates with the underlying ontology through a middleware component implemented in Python. Keywords: semantic web, OWL, ontologies, visualization.
1 Introduction
Ontologies nowadays constitute the backbone of what is widely known as the semantic web, defined in 2001 by Tim Berners-Lee [1]. However, ontologies have been around for quite some time, long before they found their way to the semantic web. Researchers belonging to the Knowledge Engineering (KE) community employed ontologies to conceptualize various domains in order to be able to reason about such domains through asserted and/or inferred knowledge. In this context, various applications have emerged that are capable of providing the necessary infrastructure for developing ontologies. Such applications are basically standalone software components capable of visualizing ontologies, mostly for editing purposes. Since the advent of the semantic web, an ever-increasing number of researchers not necessarily belonging to the KE community have become interested in ontologies. They could be described as information experts in various domains wishing to formally model their expertise in the context of a semantic web application. Despite the fact that semantic web applications are regrettably difficult to find on the web, a significant number of ontologies have been produced. However, current ontology visualization systems are mostly addressed to KE experts rather than average web users. The employed vocabulary is only familiar to the KE community, and the visualization techniques rarely make use of the most recent web technologies. This paper aims at providing an ontology visualization model suitable for average web users. It is argued to be capable of retaining ontology expressiveness while at the same time hiding the formal terminology that is usually employed in the
context of ontology development. The underlying ontology is visualized as a linear list of nodes that are interactively chosen by users. The rest of this paper is structured as follows: the next section presents a number of existing efforts in visualizing ontologies. The proposed ontology visualization model is introduced in the following section. In order to test the model in action, two case studies have been conducted, presented in the two subsequent sections. The first one visualizes an ontology for the library domain, while the second one visualizes an ontology for the academic domain. Each ontology has its own special features, highlighting various aspects of the ontology visualization model. Finally, the paper draws conclusions and points out directions for future work concerning the proposed ontology visualization model.
2 Related Work in Ontology Visualization

In order to bring the semantic web closer to the public, the underlying information structures should be visualized in a way that is easily comprehensible to average users. In this context, the problem of visualizing ontologies should be approached from a Human-Computer Interaction (HCI) perspective. The vast majority of ontology visualization tools are based on combinations of form-based GUIs with hierarchical trees (e.g., Protégé, OntoXpl [4], OntoWeb [5]), on graphs with classes as nodes and slots as arcs (e.g., Jambalaya, OntoViz [6]), on clusters (Cluster Map [7], SWHi [2]) or, more recently, on crop circles [8]. Such visualization styles have proved to be suitable for KE experts wishing to engineer ontologies. However, it seems that it is very difficult for average users on the web to take advantage of the expressiveness of ontologies through the aforementioned tools. KE-specific vocabulary finds its way to the screen, causing frustration to non-expert visitors. Most frequently, after a while, users are overwhelmed with the numerous widgets that are squashed into their screen, flooding users' short-term memory and thus infringing basic HCI principles [9]. Moreover, the aforementioned solutions are delivered mostly as client-side applications that cannot be easily integrated into the overall web infrastructure. Consequently, dissemination is limited to closed communities instead of the wider web population. It is the authors' belief that, nowadays, wide dissemination of ontology visualization applications can only be achieved through common web technologies such as the web server and the web browser. Indeed, users are already accustomed to the web and consequently to the corresponding metaphors that have been established. It should be kept in mind, though, that designing an elegant, functional user interface on the web is not an easy task. So far, conventional sites have had to deal with the so-called 'page-reload effect', which refers to the fact that whenever a user triggers a request to the server, the whole page has to reload on screen regardless of the amount of new information that arrives from the server compared to the previous state of the site. Such problems drove web designers to build light-weight sites in order to minimize the page-reload effect. Another problem that web designers have to deal with refers to the fact that not all browsers support the same versions of de facto web standards such as (X)HTML and Javascript. Thus, in order to remain interoperable and compliant with the majority of
web browsers, web designers have to be extremely cautious with the applications they develop. Moreover, many technologies promised at their beginning to revolutionize the web. However, history shows that such technologies eventually failed to fulfill their promises for various reasons. For example, Java applets emerged as a way to integrate desktop applications within the web browser environment. Despite their initial success, Java applets failed due to browser incompatibility, slow download and startup, unpredictable behavior on different operating systems, lack of browser support and the absence of a standard security model¹. Current trends on the web indicate that many of the above issues can be solved through the employment of Ajax [3] technology. Such an approach promotes user interaction by utilizing asynchronous HTTP requests to the server, thus eliminating the page-reload effect. Moreover, Ajax applications rely heavily on Javascript, which, after a rather long period of existence, finally seems to be well supported by the major web browser vendors. With the above thoughts in mind, the ontology visualization model presented in this paper is based on common web technologies. As will be described in the following sections, the GUI that delivers the proposed functionality to average web users interactively visualizes the underlying ontology through the employment of common HTML widgets and the promising Ajax technology.
3 Proposed Ontology Visualization Model

The proposed ontology visualization model is comprised of boxes that correspond to classes, and labeled lines corresponding to the properties linking such classes. Users may hop from one class to a related one by selecting one of the various properties in which each class is explicitly and/or implicitly (through a reasoner) involved. Classes related to a given class through a user-selected property are grouped together within a pop-up context menu. Such a menu is created on demand by clicking on a property. Properties are outlined inside each box. Finally, by selecting an item (i.e., a class) within a context menu, a new box is drawn together with a line that connects the two boxes. Individuals require application-specific elaboration. The whole iterative process is illustrated in Fig. 1. According to Wikipedia², in the context of Computer Science (and Information Science), ontologies are data models representing concepts and their corresponding relations within a specific domain. Such a definition establishes ontologies as highly expressive information schemas. However, as argued earlier in this paper, it seems that ontology visualization research is more concerned with finding ways of aiding researchers in building ontologies than with finding ways to expose the underlying knowledge to average users. An average web user does not care whether a specific ontology classification derives from asserted or inferred statements. Likewise, the type of a property and/or a restriction (e.g., symmetric, functional, etc.) is equally irrelevant to an average user. All that matters is not 'why' but 'what kind of' relations exist for a given concept (formalized as a class).
¹ Available at: http://www.javaworld.com/javaworld/jw-12-2000/jw-1201-soapbox.html
² Available at: http://en.wikipedia.org, accessed: 28.1.2008
Fig. 1. Ontology visualization model (classes such as Class A, Class C and Class D drawn as boxes listing their properties; selecting a property, e.g. property 3 or property 7, opens a context menu of related classes, and choosing one draws a new box connected by a labeled line)
In this context, the proposed visualization model is addressed to average web users instead of KE experts. Thus, the employed vocabulary does not contain any formal terminology. Classes and properties are represented as boxes and labeled lines respectively, thus complying with well-known metaphors on the web. The simplicity of the approach does not imply that the employment of more sophisticated tools such as reasoners and/or query engines (e.g., SPARQL) is not possible. The following sections illustrate the effectiveness of the proposed visualization model through the presentation of two case studies, referring to a simple and a more complex ontology respectively.
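The hop-from-box-to-box interaction can be summarized with a small Python sketch; the ontology data below is illustrative only (it echoes the class and property names of Fig. 1) and is not one of the case-study ontologies.

ONTOLOGY = {
    "Class A": {"property 2": ["Class B"], "property 3": ["Class C", "Class D"]},
    "Class C": {"property 7": ["Class F"]},
}

def context_menu(cls, prop):
    """Classes offered when the user clicks prop inside the box of cls."""
    return ONTOLOGY.get(cls, {}).get(prop, [])

def hop(path, prop, target):
    """Append a new box; the user's focus stays at the right end of the list."""
    assert target in context_menu(path[-1], prop)
    return path + [target]

path = ["Class A"]
path = hop(path, "property 3", "Class C")
path = hop(path, "property 7", "Class F")
print(" -> ".join(path))    # Class A -> Class C -> Class F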
4 Case Study 1: Visualizing an Ontology of Library's Subject Headings

In this case study, average web users are able to explore an ontology comprised of the subject headings used within the Online Public Access Catalog (OPAC) of the library of the Computer Science Department at the Ionian University in Greece³. Libraries are very important memory organizations containing information assets that are, most of the time, formally described. The descriptive information of each information asset participating in the library's OPAC is of premium quality, since it is produced by experienced staff (i.e., librarians). Perhaps the most important semantic descriptor within a library's catalog is the subject descriptor. In order to register an information asset within the library's OPAC, among other things, librarians have to attach one or more suitable subject headings to the particular asset. Such headings are usually drawn from well-known subject heading registries such as the Library of Congress Subject Headings (LCSHs, available at: authorities.loc.gov/). According to current practice from OPAC vendors, users have access to such information through a form-based interface where they are prompted to select "subject" from a drop-down menu and provide a value in the adjacent textbox that best describes their conceptual information needs (see Fig. 2a). After that, the OPAC performs a Boolean search against a subject-based index in order to provide users with information assets displayed as a typical search results list. Upon selection of a particular information asset, users are able to observe the subject headings of the particular asset and click on the one that best matches their information needs. This way, a search for information assets indexed under the specified subject heading is performed (see Fig. 2b). The aforementioned process may be repeated until users find the information assets they are looking for.

³ Available at: http://195.251.111.53/server/entry/index.html

Fig. 2a. Performing subject-based search: step one
Fig. 2b. Performing subject-based search: step two

Such a scenario indicates that there is no way of exploring the subject headings according to their underlying structure without having to visit specific information assets. Thus, finding information assets about a particular subject becomes a rather tedious process, despite the availability of the required knowledge. The application presented in this case study aims at simplifying this process. In order to test the proposed ontology visualization model, an ontology capable of modeling the underlying subject headings was created. The ontology contains a total of about 500 LCSH records (terms), corresponding to the subject headings employed within the Ionian University's library. Such terms constitute the official subject headings (and synonyms) defined by the Library of Congress, or their Greek translations as produced by trained librarians within the library. The ontology is encoded in OWL-DL format and models subject headings as classes and relations as object properties. There are four relations defined, namely 'contains', 'is_part_of', 'inContextOf' and 'seeAlso'. Each relation corresponds to a property. Such properties act as restrictions on specific named classes, according to the official LCSHs. The ontology treats the label of a subject heading, as well as all of its synonyms and Greek translations, as individuals. The ontology is accessed through an application server implemented in Python. The application server receives queries in XML format from the client side and delivers responses in XML format. The client side is implemented in Javascript based on the
XMLHttpRequest (XHR) [3] object. Users interact with the client side and transparently address queries to the application server.

Fig. 3a. Linear list ontology visualization
Fig. 3b. User interaction
Fig. 3c. Application-specific individual elaboration

4.1 User Interaction

Upon initialization of the ontology visualization model, users are presented with a subject heading visualized as a box consisting of a title (i.e., the formal LCSH heading in English), possible subtitles (i.e., alternative subject headings and/or translated subject headings in Greek) and one or more entries corresponding to the named relations of the formal subject heading. Users are able to right-click on a relation in order to observe the related subject headings in a pop-up context menu (see Fig. 3a). By clicking on a subject heading within the context menu, a new box is drawn at the right, containing information about the selected subject heading. The two boxes are connected with a labeled line representing the relation (see Fig. 3b). Simultaneously, the subject headings of the rightmost box appear in the
OPAC's search box. Thus, a search query is addressed to the underlying search engine, which returns a list of information assets described by the selected subject heading (see Fig. 3c). The proposed semantic web application dictates that individuals exist both as alternative headings inside each box and as query phrases within the underlying search engine's search box. The proposed visualization model, as applied to this case study for the library domain, aids average web users in locating information assets about a given subject heading. Users decide which part of the underlying ontology will be presented to them by selecting the appropriate class. It is argued that the employment of common web widgets (i.e., labeled lines, boxes, context menus) enables average users to transparently explore the underlying ontology. Users are unaware of the ontology per se, while at the same time they are able to take advantage of its expressiveness through the employment of visualization widgets. Moreover, despite the rather large size of the ontology, the screen is never overcrowded with too much information. Although user interactions may lead to the construction of numerous boxes, the user's interest is always at the right side of the screen, which contains the most recent class metaphor. Another interesting feature of the described application is the fact that it promotes serendipitous or casual discovery of interesting information (see [10]). During the ontology browsing process, users underpin their cognitive learning, since they are able to witness which information assets correspond to the formally defined subject headings they visit.
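The exchange between the Javascript client and the Python application server can be sketched as follows; the XML vocabulary shown is hypothetical, since the paper specifies only that queries and responses are exchanged in XML format.

import xml.etree.ElementTree as ET

# Toy subject-heading graph standing in for the OWL-DL ontology.
SUBJECTS = {
    ("Computer networks", "contains"): ["Internetworking", "Wireless LANs"],
}

def handle_query(xml_request):
    """Answer one context-menu query: heading + relation -> related headings."""
    req = ET.fromstring(xml_request)
    key = (req.findtext("heading"), req.findtext("relation"))
    resp = ET.Element("response")
    for related in SUBJECTS.get(key, []):
        ET.SubElement(resp, "heading").text = related
    return ET.tostring(resp, encoding="unicode")

print(handle_query(
    "<query><heading>Computer networks</heading>"
    "<relation>contains</relation></query>"))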
5 Case Study 2: Visualizing an Ontology Referring to a University's Department

In this case study⁴, average web users are able to explore an ontology that refers to the Computer Science Department at the University of Piraeus in Greece. Compared to the previous case study, this ontology is more expressive. It consists of nineteen classes organized in three levels of abstraction, six object properties, one datatype property and seventy-nine individuals. Object properties, although not organized into a hierarchy, may have attributes and/or relations (e.g., propA inverse of propB).

⁴ Available at: http://195.251.111.53/server/entry/index2.html

The proposed visualization model presents each class as a box containing relations that correspond to the properties in which the individuals of this class participate. Users may click on a relation in order to obtain a list of its corresponding classes and accordingly decide which class will be visualized at the right of the existing box. It should be noted that, according to the proposed model, a relation between two classes may correspond to either an asserted or an inferred view of the class hierarchy. Users do not need to know the origin of the class hierarchy (i.e., asserted or inferred). In order to accomplish such a task, the ontology is classified offline by a reasoner and stored within a suitable data structure. The internal structure acts as an index and contains information about each ontology asset (i.e., class, property, individual), deriving both from asserted and inferred knowledge. This way, the resulting semantic web
application achieves tolerable response times during user interaction. Otherwise, real-time invocation of a reasoner would result in considerable delays during the interactive ontology visualization process, thus discouraging user interaction. As far as the elaboration of individuals is concerned, this case study follows the search-results-list paradigm (see Fig. 4).
Fig. 4. Ontology visualization model as applied to case study 2
Specifically, in the case of hierarchical relations (i.e., contains, isPartOf), individuals belonging to the user-selected class are grouped together in a list beneath the linear list of boxes. Similarly, in the case of a pair of classes connected through the property a user has selected, the search results list contains the corresponding pairs of individuals. Again, the type of a relation (i.e., an attribute of a property) is irrelevant to the average user.
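The offline classification step can be pictured with the following sketch (the data structure is our assumption): asserted and reasoner-inferred relations are merged into one index, so the GUI can answer every click without invoking the reasoner at interaction time.

def build_index(asserted, inferred):
    """Merge two {(class, property): [classes]} maps into a single index."""
    index = {}
    for source in (asserted, inferred):
        for key, targets in source.items():
            index.setdefault(key, set()).update(targets)
    return index

asserted = {("Staff", "isPartOf"): ["Department"]}
inferred = {("Staff", "isPartOf"): ["University"]}   # e.g., via transitivity
index = build_index(asserted, inferred)
print(sorted(index[("Staff", "isPartOf")]))   # the user never sees which is which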
6 Conclusions

This paper defines an ontology visualization model suitable for average web users. The proposed model intends to bring ontologies closer to the web community. It is the authors' belief that ontologies are highly expressive information structures capable of being visualized in a way that they remain expressive while at the same time hiding their complexity from average web users. Ongoing research in ontology management tools results in applications that are targeted towards KE experts wishing to develop ontologies. Such applications provide ontology visualization models that contain domain-specific terminology and employ visualization widgets that are, most of the time, unfamiliar to average web users. The proposed model breaks ontologies down to their fundamental ingredients, namely classes, properties and individuals. Such components are visualized through common widgets that are well-established on the web. This way, average users are able to transparently explore ontologies. Ontologies are represented as dynamic lists of classes interconnected with their corresponding properties. Users are able to interact with such lists, deciding in this way which part of an ontology will be presented to them. Thus, users are always in control of the ontology visualization process.
The effectiveness of the proposed model is demonstrated through the presentation of two case studies relying on two fundamentally different ontologies. Both case studies refer to web-based applications targeted towards average users. Future work includes the application of the proposed ontology visualization model to a number of existing ontologies, thus discovering possible limitations. Moreover, the authors would like to extend the expressiveness of the model by exposing a bigger part of the underlying ontology to the web-based GUI. Benchmarking the effectiveness of the model in terms of web usage (i.e., scalability, response time, usability) is also under way. Finally, research is being conducted into finding ways of visualizing individuals in a more consistent manner.
Acknowledgements

The authors would like to acknowledge the work of Katerina Tzali and Kyriaki Papoulia in helping to develop the ontologies of the first and second case studies, respectively.
References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5), 34–38, 40–43 (2001)
2. Fahmi, I.: SWHi System Description: A Case Study in Information Retrieval, Inference, and Visualization in the Semantic Web. In: Proceedings of the European Semantic Web Conference, Austria (2007)
3. Garett, J.J.: Adaptive Path – Ajax: A New Approach to Web Applications, Adaptive Path Essay Archive (2005), http://www.adaptivepath.com/publications/essays/archives/000385.php
4. Haarslev, V., Lu, Y., Shiri, N.: OntoXpl: Exploration of OWL Ontologies. In: Proceedings of the Web Intelligence Conference WI 2004, IEEE Computer Society, Beijing (2004)
5. Giunchiglia, F., Gomez-Perez, A., Stuckenschmidt, H., Pease, A., Sure, Y., Willmott, S.: Ontology-based information exchange for knowledge management and electronic commerce. In: Proceedings of the 1st International Workshop on Ontologies and Distributed Systems (ODS 2003), Mexico (August 2003)
6. Sintek, M.: The Ontoviz Tab: Visualizing Protégé Ontologies, http://protege.stanford.edu/plugins/ontoviz/ontoviz.html
7. Fluit, F., Sabou, C., van Harmelen, F.: Ontology-based information visualization: Towards semantic web applications. Springer, Heidelberg (2005)
8. Wang, T.D., Parsia, B.: Cropcircles: topology sensitive visualization of owl class hierarchies. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273. Springer, Heidelberg (2006)
9. Bates, M.J.: Where Should the Person Stop and the Information Search Interface Start? Information Processing & Management 26, 575–591 (1990)
10. García, E., Sicilia, M.A.: Designing Ontology-Based Interactive Information Retrieval Systems. In: Meersman, R., Tari, Z. (eds.) OTM-WS 2003. LNCS, vol. 2889, pp. 223–234. Springer, Heidelberg (2003)
Performance Analysis of ACL Packets Using Turbo Code in Bluetooth Wireless System

Il-Young Moon

School of Internet Media Engineering, Korea University of Technology and Education, Republic of Korea
[email protected]

Abstract. In this paper, the performance of ACL packets using a turbo code scheme in the Bluetooth system is analyzed. In order to improve the transfer capability, the transmission of messages has been simulated using fragmentation that begins with the total message and proceeds incrementally for each layer, with TCP defining the resultant packet size and the level of fragmentation for each succeeding layer. The transmission time over the Bluetooth wireless link is studied according to the DM1, DM3 or DM5 packet type. The turbo code scheme decreases the transmission time of L2CAP baseband packets. From the results, we were able to obtain the packet transmission time, the optimal TCP packet size, and the ACL (DM) packet size in AWGN and Rician channels.
1 Introduction

Lately, a new universal radio interface has been developed, enabling electronic devices to communicate wirelessly via short-range ad hoc radio connections. Bluetooth technology eliminates the need for wires, cables and the corresponding connectors between cordless or mobile phones and other devices, and paves the way for new and completely different devices and applications [1],[2]. The technology enables the design of low-power, small-sized, low-cost radios that can be embedded in existing portable devices. Eventually, these embedded radios will lead toward ubiquitous connectivity and truly connect everything to everything. Radio technology will allow this connectivity to occur without any explicit user interaction. The demand for local area networks is expected to undergo explosive growth as Bluetooth and 802.11 capable devices become more prevalent in the marketplace. Personal Area Networks (PANs) will gradually be deployed in hot-spot traffic areas [3],[4]. These pico-networks (piconets) have the advantage of providing high-bandwidth local connectivity for mobile users at low cost. PAN applications range from simple e-mail/data transfers to high-content web page downloads and real-time video. In this paper, the transmission time over the Bluetooth wireless link is studied according to the DM1, DM3 or DM5 packet type in a Bluetooth piconet environment using a turbo code. From the results, we were able to obtain the ACL (Asynchronous Connection-Less) packet transmission time, the optimal TCP packet size and the DM packet size.
2 Bluetooth Wireless System

Bluetooth is based on a low-cost, short-range radio link that facilitates ad hoc connections for stationary and mobile users. Bluetooth may be used as a cable replacement
technology or as a wireless link for an advanced personal area network. Bluetooth radio operates over the worldwide 2.4 GHz unlicensed ISM band. It uses a frequency hopping scheme to overcome interference and fading, and a Gaussian-shaped binary FM modulation. The raw data rate is 1 Mb/s. A time division duplex (TDD) scheme is used to enable a two-way link. The frequency hopping modulation is based on a hop rate of 1600 hops/s. It covers 79 hop frequencies in most countries and 23 hop frequencies in Japan, France, and Spain. A master/slave configuration is used in order to allow the formation and coordination of the ad hoc network. A Bluetooth device is allowed to send one packet per time slot. Even-numbered slots are generally reserved for use by the master, while odd-numbered slots are used by the slaves. The time slot duration is fixed at 625 μs. A baseband packet may occupy one, three or five slots. The desired baseband packet length can be decided based on criteria such as the quantity of data to be transmitted or the number of contiguous slots available in the presence of voice traffic. Figure 1 shows the packet transmission of multi-slot packets for Bluetooth.
Fig. 1. The packet transmission of multi-slot packets for Bluetooth (master and slave transmissions in 625 μs slots on hop frequencies fk, ..., fk+3; consecutive slots form a frame)
Fig. 2. Bluetooth piconet network (a master device with active slaves and standby devices)
Table 1. The packet types for ACL (DM, DH)

Link Type   Packet Type   Payload FEC Code Rate   User Payload (Bytes)   Burst Length   Occupied Slots
ACL         DM1           2/3                     0-17                   171-366        1
            DM3           2/3                     0-121                  186-1626       3
            DM5           2/3                     0-224                  186-2871       5
            DH1           No                      0-27                   150-366        1
            DH3           No                      0-183                  158-1622       3
            DH5           No                      0-339                  158-2870       5
Bluetooth devices which have been set up using the same frequency hopping channel and clock form a piconet. In every piconet, one Bluetooth device is in charge of setting up the communications, deciding the frequency hopping sequence and synchronizing the network; this device is the so-called master. The other devices join the piconet as slaves. Figure 2 shows a Bluetooth piconet. Bluetooth can operate in several modes and can use different packet formats. Table 1 shows some important properties of the packet types for ACL.
3 Turbo Code

Turbo codes are achieving large success, and their introduction into many international standards is in progress. It is known that turbo codes have exceptional performance at low/medium signal-to-noise ratios (SNR), very close to the Shannon limit. Due to the poor error resilience of some source-coding schemes, some applications may require very low bit error rates (BER). Unfortunately, turbo code performance may be significantly worse at these error rates. In fact, their free distance d_free (the minimum Hamming distance between different codewords) can be low, even with very large interleaver lengths. This causes their BER curves to flatten, following the "error floor" imposed by d_free, after the "waterfall" decrease at low SNR. To analyze the code performance, simulation could be employed. For a C(n, k) code, knowledge of the free distance d_free, its multiplicity N_free, and its information bit multiplicity w_free (defined as the sum of the Hamming weights of the N_free information sequences generating the codewords with weight d_free) allows one to determine the error floor slope. The performance of any binary code at high SNR is well approximated by the expression of the union bound, truncated to the contribution of the free distance. It is worthwhile to note that, for turbo codes, a small penalty must also be taken into account, due to the sub-optimality of iterative decoding. As a result, it can be written as
\mathrm{BER} \cong \frac{w_{free}}{2k}\,\mathrm{erfc}\!\left(\sqrt{d_{free}\,\frac{k}{n}\,\frac{E_b}{N_0}}\right) \qquad (1)
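To make the error-floor estimate concrete, the following minimal Python sketch evaluates the truncated union bound of Eq. (1). The distance-spectrum values d_free and w_free used in the example call are illustrative assumptions, not figures from the paper.

```python
import math

def error_floor_ber(d_free, w_free, k, n, ebno_db):
    """Truncated union bound of Eq. (1): approximate BER at the error floor."""
    ebno = 10.0 ** (ebno_db / 10.0)   # Eb/N0 converted from dB to a linear ratio
    rate = k / n                      # code rate k/n
    return (w_free / (2.0 * k)) * math.erfc(math.sqrt(d_free * rate * ebno))

# Hypothetical distance-spectrum values for a rate-1/3 code, per information bit:
print(error_floor_ber(d_free=18, w_free=6, k=1, n=3, ebno_db=5.0))
```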
This paper analyzes a rate-1/3 turbo code with 8-state constituent encoders to improve ACL packet transmission time [5]. The ACL packet transmission time is simulated over a Rician fading channel. The Rician probability density function and the Rician factor can be written as
f(r) = \frac{r}{b_0}\,\exp\!\left[-\frac{r^2 + A^2}{2 b_0}\right] I_0\!\left(\frac{A\,r}{b_0}\right) \quad \text{for } A \ge 0,\ r \ge 0 \qquad (2)

K_R = \frac{A^2}{2 b_0} \qquad (3)
where the Rician factor K_R is defined as the ratio of the specular power A^2 to the scattered power 2b_0. When K_R = 0 the channel exhibits Rayleigh fading; when K_R = ∞ the channel does not exhibit any fading at all.
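As a minimal sketch of how Rician fading samples with a given K_R can be generated (assuming unit total power; the function name and seed are my own), consider:

```python
import numpy as np

def rician_envelope(k_r, n, rng=None):
    """Draw n Rician fading envelope samples with Rician factor K_R.

    Specular power A^2 and scattered power 2*b0 are normalized so that
    A^2 + 2*b0 = 1, with K_R = A^2 / (2*b0) as in Eq. (3)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.sqrt(k_r / (k_r + 1.0))              # specular amplitude A
    sigma = np.sqrt(1.0 / (2.0 * (k_r + 1.0)))  # per-component std, b0 = sigma^2
    g_i = a + sigma * rng.standard_normal(n)    # in-phase component with LOS mean
    g_q = sigma * rng.standard_normal(n)        # quadrature component
    return np.hypot(g_i, g_q)

# K_R = 0 reduces to Rayleigh fading; large K_R approaches no fading.
print(rician_envelope(k_r=10 ** (9 / 10), n=5))  # K = 9 dB, as in the simulation
```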
4 Simulation Modeling of Bluetooth Wireless System
SAR (segmentation and reassembly) operations are used to increase efficiency by supporting an MTU (Maximum Transmission Unit) larger than the maximum payload of a single packet [6],[7]. This reduces overhead by spreading the data used by higher-layer protocols over several baseband packets covering 1, 3, or 5 slots. The slot limit is defined as the maximum number of slots a packet may span; it may be less than 5 when the bit error rate in the wireless channel is very high. This parameter is passed by the LMP (Link Manager Protocol) to the L2CAP (Logical Link Control and Adaptation Protocol) through a signaling packet [8]. This paper uses the DM1, DM3, and DM5 packets of Bluetooth and the original Bluetooth protocol stack for the simulation: Figure 3 shows the SAR procedure down the stack from TCP, UDP, IP, and PPP to L2CAP.
Fig. 3. Bluetooth protocol stack (higher-layer protocols above L2CAP with its MTU, then the link manager and baseband protocol layers)
Fig. 4. Model of the Bluetooth wireless system (packet assembly of access code, packet header, and payload; GFSK modulation; Rician/AWGN channel; GFSK demodulation; packet reassembly)
The transmitted GFSK (Gaussian Frequency Shift Keying) signal of the simulation model for the Bluetooth baseband can be written as
S(t) = \mathrm{Re}\!\left\{ \sqrt{\frac{2E}{T}}\, \exp\!\left[ j\!\left( 2\pi f_c t + h \int_{-\infty}^{t} g(\tau)\, d\tau \right) \right] \right\} \qquad (4)
where E is the symbol energy, T is the symbol period, f_c is the carrier frequency, h is the modulation index, and g(t) is the output of the Gaussian low-pass filter, expressed as
g(t) = \sum_{k=-\infty}^{\infty} a_k\, v(t - kT), \qquad a_k \in \{+1, -1\} \qquad (5)

v(t) = \frac{1}{2}\left\{ \mathrm{erf}(-\lambda B_b t) + \mathrm{erf}(\lambda B_b (t + T)) \right\} \qquad (6)

where \lambda = \pi\sqrt{2/\ln 2}, \ B_b T = 0.5, and \mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_{0}^{t} e^{-\tau^2}\, d\tau.
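As a quick numeric check of Eq. (6), the sketch below evaluates the Gaussian-filtered pulse v(t) around one symbol. The constant λ = π√(2/ln 2) is the standard GFSK value assumed in the reconstruction above.

```python
import math

LAMBDA = math.pi * math.sqrt(2.0 / math.log(2.0))  # λ in Eq. (6), as reconstructed

def v(t, bb, T):
    """Gaussian-filtered pulse of Eq. (6) for filter bandwidth Bb and period T."""
    return 0.5 * (math.erf(-LAMBDA * bb * t) + math.erf(LAMBDA * bb * (t + T)))

T = 1e-6        # 1 Mb/s symbol period
bb = 0.5 / T    # BbT = 0.5, as in the paper
# Sample the pulse across a few symbol periods; it is ~1 on (-T, 0), ~0 elsewhere.
samples = [v((i / 8.0 - 1.5) * T, bb, T) for i in range(25)]
print([round(s, 3) for s in samples])
```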
When the composite received signal consists of a large number of plane waves, the received complex envelope g(t) = g_I(t) + j g_Q(t) can be treated as a wide-sense stationary complex Gaussian random process. Some scattering environments have a specular or line-of-sight component; in this case, g_I(t) and g_Q(t) are Gaussian random processes with non-zero means. To simulate the BER performance of the Bluetooth piconet, AWGN and Rician channel models are used in this paper. In addition, the total transmission time T_MSG achieved with the multi-slot scheme for ACL packets in a Bluetooth network is defined as
T_{MSG} = (K-1)\,T_{PKT}(q) + T_{PKT}(r) = (K-1)\,\frac{S_{TIME} \times q}{p} + \frac{S_{TIME} \times r}{p} \qquad (7)
where K is the total number of message packets, q and r are the numbers of time slots occupied by a full fragment and by the remainder fragment, T_PKT(q) and T_PKT(r) are the transmission times of ACL packets fragmented into q and r slots, S_TIME is the slot time, and p is the probability that a data frame is transferred successfully. Figure 4 shows the model of the Bluetooth system. To obtain the TCP transmission time, the total message data must be transmitted. The Bluetooth modulation is the GFSK signal, and the simulated channel model consists of AWGN and Rician fading.
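The following minimal sketch of Eq. (7) uses the DM payload capacities of Table 1. It assumes, for simplicity, that the remainder fragment uses the same packet type (and thus the same number of slots) as the full fragments, and that p is constant.

```python
# Payload capacities (bytes) and slot counts for DM packets, from Table 1.
DM = {"DM1": (17, 1), "DM3": (121, 3), "DM5": (224, 5)}
SLOT_US = 625  # one Bluetooth time slot (S_TIME per slot) in microseconds

def t_msg_us(total_bytes, pkt="DM5", p=1.0):
    """Total message transmission time of Eq. (7), in microseconds."""
    payload, slots = DM[pkt]
    k = -(-total_bytes // payload)           # total packets K (ceiling division)
    q = r = slots                            # assumption: remainder uses same type
    return (k - 1) * (SLOT_US * q) / p + (SLOT_US * r) / p

for pkt in DM:  # M_TOTAL = 5000 bytes, 90% frame success, as an illustration
    print(pkt, round(t_msg_us(5000, pkt, p=0.9) / 1000.0, 1), "ms")
```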
5 Simulation of Transmission Time for Bluetooth Packet
From the Eb/No in the wireless channel, the packet transmission time is obtained and the performance of Bluetooth is analyzed. The packets used in this simulation are the DM1, DM3, and DM5 ACL packets, which carry data information only; DM stands for Data-Medium rate. These packets cover 1, 3, and 5 time slots, respectively. Additionally, the DM packet payload is protected by an error-correction method called 2/3 FEC, so the transmission time can be obtained for each packet type. S_TIME is defined as 625 μs, 1875 μs, and 3125 μs for 1, 3, and 5 time slots, respectively. To obtain the packet transmission time for Bluetooth, the total message transmission time is simulated at M_TOTAL = 5000 bytes and Eb/No = 5 dB with the rate-1/3 turbo code in the AWGN channel, and at M_TOTAL = 5000 bytes, Eb/No = 5 dB, and K = 9 dB with the rate-1/3 turbo code in the Rician fading channel. In Figure 5, the parameter Eb/No is set to 5 dB in the AWGN channel; when the packet size increases from 1 slot to 5 slots, the transmission time is less than that of the typical 1-slot packet method. In Figure 6, Eb/No is set to 5 dB in the Rician fading channel. Although Rician fading is added, the results of Figure 5 and Figure 6 are almost the same.
Fig. 5. Total transmission time of ACL (DM packet) in the Bluetooth wireless system: total message transmission time (ms) versus TCP packet size (200-1400 bytes) for DS = 35, 140, and 240 bytes and for DM1, DM3, and DM5 packets (Eb/No = 5 dB, rate-1/3 turbo code, AWGN)
Fig. 6. Total transmission time of ACL (DM packet) in the Bluetooth wireless system: total message transmission time (ms) versus TCP packet size (200-1400 bytes) for DS = 35, 140, and 240 bytes and for DM1, DM3, and DM5 packets (Eb/No = 5 dB, rate-1/3 turbo code, Rician)
From these results, the total message transmission time decreases when the 3-slot and 5-slot packet sizes are used rather than the 1-slot packet size. Analysis of Figures 5 and 6 also shows that the TCP packet size must increase to decrease the transmission time in the wireless channel; considering the BER of the wireless channel, an appropriate TCP packet size can be obtained. Moreover, at the optimal TCP packet size (about 600 bytes), the Bluetooth packet transmission time, considering the trade-off between total message transmission time and TCP packet size, is about 200 ms (1-slot packet size) and 100-120 ms (3-slot and 5-slot packet sizes) in the AWGN channel, and about 220 ms (1-slot packet size) and 120-130 ms (3-slot and 5-slot packet sizes) in the Rician fading channel.
6 Conclusion
This paper has simulated Bluetooth packet transmission times using DM1, DM3, and DM5 packets with a turbo code. The turbo-code scheme decreases the transmission time of L2CAP baseband packets. To improve the transfer capability of the Bluetooth system, the TCP total message coming down from the upper layer is fragmented, and the packets are then sent one at a time in the baseband. From the results, it can be seen that using DM3 and DM5 packets decreases the total message transmission time compared with the single-slot packet transmission approach, and that the transmission time in the wireless channel decreases as the TCP packet size increases. Based on the collected data, the packet transmission time and the optimal TCP packet size were obtained.
References
1. http://www.bluetooth.com
2. Haartsen, J.C.: The Bluetooth radio system. IEEE Personal Comm. 7, 28–36 (2000)
3. Zurbes, S.: Analysis of interference on Bluetooth. In: Bluetooth Developers Conference (August 1999)
4. Zyren, J.: Reliability of IEEE 802.11 DSSS and FHSS WLANs in a Bluetooth environment. In: Bluetooth Developers Conference (August 1999)
5. Souissi, S., Meihofer, E.F.: Performance evaluation of a Bluetooth network in the presence of adjacent and co-channel interference. In: IEEE Emerging Technologies Symposium: Broadband, Wireless Internet Access, pp. 6–11 (2000)
6. Barke, A., Badrinath, B.: I-TCP: Indirect TCP for mobile hosts. In: Proceedings of the 15th International Conference on Distributed Computing Systems, pp. 136–143 (June 1995)
7. Das, A., Ghose, A., Gupta, V., Razdan, A., Saran, H., Shorey, R.: Adaptive link-level error recovery mechanisms in Bluetooth. In: PWC-2000, pp. 85–89 (2000)
8. Park, H.S., Heo, K.W.: Performance evaluation of WAP-WTP. The Journal of the Korean Institute of Communication Sciences 26(1), 67–76 (2001)
Design and Implementation of Remote Monitoring System for Supporting Safe Subways Based on USN Seok Cheol Lee and Chang Soo Kim* Interdisciplinary Program of Information Security, Pukyong National University, Korea {host2000,cskim}@pknu.ac.kr
Abstract. This paper describes a prototype model of a remote monitoring system based on ubiquitous sensor networks (USNs). Wireless sensor nodes with temperature, humidity, micro-dust, and water-level sensors are installed to manage the environment of subways. We present a construction example of the wireless sensor nodes, the collection of data, and the transmission of information between sensors. In addition, we construct a model of a web-based integrated system to manage the information effectively on the presented platform. This prototype system has the advantages of reducing cabling costs, being easier to install and repair, and enabling real-time monitoring. Keywords: Remote Monitoring System, Wireless Sensor Networks, Ubiquitous Computing, USN Application.
1 Introduction
Rapid progress in micro-electro-mechanical systems (MEMS) and radio frequency (RF) design has enabled the development of low-power, inexpensive, and network-enabled micro-sensors. These sensor nodes are capable of capturing physical information as well as mapping such physical characteristics of the environment to quantitative measurements. A typical wireless sensor network (WSN) consists of hundreds to thousands of such sensor nodes linked by a wireless medium. WSNs have created new paradigms for reliable monitoring, and their applications span various domains: networks for smart homes, environmental monitoring, interconnection of mobile phones, agricultural services, security management, and so on. A wireless sensor node, called a sensor or mote, consists of a microcontroller, a sensor, and a wireless communicator. It is a tiny and simple device with limited computation and resources. Sensor nodes are designed to detect environmental effects and to collect and send the data back to the user. The general trend in process instrumentation, including sensors and actuators directly contacting industrial processes, can be characterized by the attribute intelligent or smart. Subways, which many people use, need to maintain a comfortable environment because of their characteristic location. Inner spaces such as houses, department stores, schools, public buildings, cars, and airplanes, as well as subways, have to be comfortable. The factors for keeping a comfortable environment are temperature, humidity,
Corresponding author.
micro-dust, noise, illumination, and air pollution, and they are determined by the characteristics of the space [21]. The subway platform has measurement devices connected with environment purification facilities. The essential factors for keeping a comfortable environment are temperature, humidity, and micro-dust [10][21]. Smoke devices for sensing fires and water-level measurements in tanks are also relevant factors. These devices are operated independently. In particular, the measured values are limited to temporary times and specific spaces, because micro-dust is measured by staff with a portable measurement device. Therefore, a system is needed that monitors these values in real time with devices measuring the essential environmental factors. To construct such a system, computer and communication technology must be integrated and connected with the various devices; this environment will make a ubiquitous-platform service possible [3]. We present a model that monitors values extracted from the sensor devices based on web services. For this work, we analyze current environment-monitoring devices and present a method for making sensor nodes using ubiquitous sensor networks and for communicating between nodes using ZigBee (IEEE 802.15.4). This paper is organized in the following manner. In the next section, we describe related work, including ubiquitous sensor networks and the requirements for supporting safe subways. Section 3 presents the design of the proposed system. Section 4 presents implementation strategies and a search example of sensor data based on web services. Finally, we describe the advantages of our wireless sensor networking architecture and present future work.
2 Related Research
2.1 Wireless Sensor Networks
Wireless sensor networks (WSNs) are a subset of wireless networking applications focused on enabling connectivity, without the use of wires, to sensors and actuators in general [1][9][10][11][12]. This kind of technology is driven by three needs. First, there is a need to lower the cost of sensor installation; a wire-based model requires more money and additional materials to construct the system. Second, large controllers or computer systems are required for processing the data and interfacing with the sensors. For instance, a PLC (Programmable Logic Controller) is an efficient machine for connecting sensor devices, but its connections are wired and thus not suitable for spaces with many moving utility machines. In this case, we use a wireless node, called a sink or gateway, that has an interface such as Ethernet or RS-232C. Third, WSNs form the lower layer of intelligent maintenance systems, enabling sensor-rich environments that generate abundant data that may be used to improve industrial machines and industrial systems in general.
2.2 The Features of WSN Nodes
The core technologies for constructing wireless sensor networks are classified into sensors, processors, communications, interfaces, and security. Sensor nodes, router nodes, and sink or gateway nodes are needed for gathering data from sensor devices.
Fig. 1. Sensor Networks Construction (sensor nodes detect an environmental event and relay data through router and sink nodes to a gateway connected to the Internet)
Sensor nodes work to perceive the environment themselves, and router nodes forward information from sensor nodes to gateway or sink nodes, which transform the transmitted data into meaningful information [9][14].
Sensor Nodes. Sensor nodes have basic sensors to capture the environment together with node information, and they transmit the measured data to sink nodes through periodic sensing. Sensors for capturing environmental information, processors for processing data, and wireless communication devices for transmitting data are needed [4][9][14].
Router Nodes. Router nodes are needed to forward the data of sensor nodes to the sink node, because sensor nodes are deployed in units of 30 m to 100 m. These devices are essential for constructing mesh or star-mesh topologies between nodes.
Sink Nodes. Sink nodes integrate the data transmitted from sensor or router nodes and send it out over RS-232C or Ethernet. A sink node can control the sensor and router nodes; it is called a PAN coordinator in ZigBee networks and a gateway node in classic networks [12][14].
Real-Time Operating System for USN. Real-time embedded operating systems with simple task functions are used in USN systems. Representative operating systems for USN are TinyOS from Berkeley, Mantis from Colorado, Eyes OS from Europe, and Nano-Q+ from ETRI in Korea [11]. We use the Nano-Q+ operating system to process data sensing and analog-to-digital conversion and to construct an effective monitoring system.
2.3 Requirement of Monitoring Systems for Safe Subways
Environmental pollution is an urgent and important issue worldwide owing to rapid industrialization. Surveys and environmental monitoring have been carried out many times, but integrated and periodic measurement has not been performed. Monitoring systems should monitor and predict varying environmental information in real time and minimize losses of life and property in disasters [21]. Environmental monitoring data is measured by many different types of measurement equipment, but the data accuracy cannot be trusted because of non-real-time measurement.
Table 1. Standard Values of Environment Data for Indoor Comfort

Items         | Standard Values
Temperature   | 17~28 °C
Humidity      | 40~70%
Wind velocity | 0.5 m/s or below
Micro-dust    | 0.15 mg/m3 or below
CO            | 10 ppm or below
CO2           | 1,000 ppm or below
Ventilation   | 20 m3/person/hour
The best method now is for sensor nodes with a sensing cycle to perceive the data periodically over the sensor network and to transmit the collected data by wire or wirelessly. We are interested in temperature, humidity, micro-dust, CO, CO2, and ventilation quantity for measuring indoor environmental data. Table 1 shows the standard values of these items for a comfortable indoor environment.
3 Design the Remote Monitoring System
3.1 Entire Architecture of the Proposed System
Figure 2 shows the architecture of ubiquitous services for the remote monitoring system managing the subway environment. We use micro sensor devices for measuring temperature, humidity, smoke, micro-dust, and water level, and we design a monitoring system that extracts data from the sensors, constructs sensor networks for communication, and analyzes the extracted information.
3.2 Operating the Sensor Network
First of all, we explain the process for gathering data. First, a sink node broadcasts messages to obtain information about the sensors at the edge. The messages are sent from the sink node with a bound application ID and physical address; these messages are in hexadecimal format. Second, each sensor node responds to the request. Each sensor node tries to find routing information just in time: if routing information exists, the sensor node transfers data along that path; otherwise, it waits for the next request. Through this process, the routing table is set up. Next, the sink node sends a message carrying priority information for the sensor nodes; this message holds the duration of the sensing frame and the priority. The sensing duration is arranged so that the sensors have constant idle time. Finally, the sensor nodes perform sensing after updating their routing tables and then communicate wirelessly. Through this process, the sink node receives the data from the sensor nodes. The main point of this technique is to avoid data redundancy. Figure 3 shows the message procedure, and a small code sketch of the gathering loop follows the figure.
Fig. 2. Construction of Sensor Networks (PC, PDA, and mobile clients access a web-based monitoring program over the Internet; the program comprises web service components with data storage, data analysis, and data gathering modules over a DB; a sink node (gateway) connected via RS-232C communicates wirelessly with temperature, humidity, micro-dust, and water-level sensor nodes, each built on an ATMega128 MCU with a CC2420 RF chip and a sensor)
Fig. 3. Procedure of message broadcasting and response: 1) the sink broadcasts a message to obtain addresses; 2) the sensor nodes respond to the sink node; 3) the sink broadcasts a message with priority and time information; 4) the routing table is set up; 5) the nodes send their sensing data
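The following Python sketch models the five steps above. The class and field names are hypothetical (the actual system runs on Nano-Q+ on the nodes), and the parent IDs mirror Table 4 later in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SensorNode:
    node_id: int
    parent_id: int          # next hop toward the sink
    priority: int = 0
    period_s: float = 0.0

    def sense(self):
        """Step 5: return one (node id, raw ADC value) reading."""
        return (self.node_id, 512)  # placeholder ADC value

@dataclass
class Sink:
    routing: dict = field(default_factory=dict)  # node id -> next hop

    def discover(self, nodes):
        """Steps 1-2 and 4: broadcast an address request, record responses."""
        for node in nodes:
            self.routing[node.node_id] = node.parent_id

    def schedule(self, nodes, period_s):
        """Step 3: broadcast priority and sensing-period information."""
        for prio, node in enumerate(nodes):
            node.priority, node.period_s = prio, period_s

nodes = [SensorNode(5, 3), SensorNode(4, 2), SensorNode(3, 2), SensorNode(2, 1)]
sink = Sink()
sink.discover(nodes)
sink.schedule(nodes, period_s=30.0)
print(sink.routing, [n.sense() for n in nodes])
```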
4 Implementation
4.1 Sensor Devices and Calibration
Sensor Devices. Table 2 shows the specifications of our sensors. The values are determined from electrical output signals of 4~20 mA or 0~5 V.
Sensor Calibration. The data extracted from the sensor devices is a raw physical quantity of electric current. Therefore, we calibrate real data from the raw data.

Table 2. Specification of Sensors

Item               | Measurement Range | Measurement Unit | Output
Temp. Sensor       | -20~80            | °C, °F           | 4~20 mA, 0~5 V
Water Level Sensor | 0~100             | % RH             | 4~20 mA
Smoke Detector     | 0.3~5.0           | ㎛               | 4~20 mA
Micro-Dust Sensor  | 0~40              | cm               | 4~20 mA, 0~5 V
Table 3. Sensor Calibration

Items       | Calibration Formula
Water-level | Wv = (ADC Value - 30) * 0.211
Temperature | Tv = (ADC Value - 4000) * (0.017 + 1000)
Humidity    | Hv = (ADC Value) * 0.05
Micro-dust  | Mv = (ADC Value) * 0.01
There are two methods of sensor calibration. The first compares the raw data against the data-sheet formula provided by the sensor manufacturer; the other compares the raw data with other measurement instruments. We used both methods. The temperature, humidity, and micro-dust sensors come with data sheets, so we can use the formulas computed by the providers. Because no calibration data is provided for the water-level measurement, we calibrated that sensor with a ruler. This method is weaker because the difference between the calibrated data and the real data is large; however, sensor precision will be high if the calibration work is exact. Table 3 shows the formulas induced by experiments against standard measurement devices.
4.2 Node Construction
We use four types of sensors, construct sensor nodes, and collect the data with a gateway node.
Sensor Nodes. A sensor node drives the sensor device under a unique identification, extracts data, and sends the data to the sink node wirelessly. The Nano-Q+ OS provides API modules for sensing temperature, humidity, gas, and infrared rays; we added micro-dust and smoke-detector modules to this basic set. Figure 4 shows the structure of the sensor nodes.
SINK Nodes. Sink nodes integrate the data transmitted from the sensor nodes and send it to the Ethernet or RS-232C interface. Figure 5 shows the structure of the data transmitted from the sensor nodes, and a hypothetical parsing sketch follows the figure.
Fig. 4. Structure of Sensor Nodes (a Nano-Q+ application with sensing and RF communication modules over the Nano-Q+ hardware abstraction layer; hardware: ATMega128L MCU, CC2420 RF chip, and the sensor device)
Fig. 5. Packet structure of the data transmitted via sensor nodes: Node ID, Packet Type, Length, Sensor ID, ADCH, ADCL, Sensor Priority, and Voltage fields (1 byte each), followed by a Control Message
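As a minimal illustration of how such a packet could be decoded and calibrated, the sketch below unpacks the eight 1-byte fields of Fig. 5 and applies the formulas of Table 3. The mapping of sensor IDs to formulas is an assumption for the example, and the temperature formula is omitted because the printed version in Table 3 appears inconsistent.

```python
import struct

# Hypothetical sensor-ID -> calibration mapping, using the formulas of Table 3.
CALIBRATE = {
    1: lambda adc: (adc - 30) * 0.211,  # water level
    2: lambda adc: adc * 0.05,          # humidity
    3: lambda adc: adc * 0.01,          # micro-dust
}

def parse_packet(raw: bytes):
    """Unpack Node ID .. Voltage (eight 1-byte fields) and calibrate the ADC value."""
    node, ptype, length, sensor, adch, adcl, prio, volt = struct.unpack("8B", raw[:8])
    adc = (adch << 8) | adcl            # ADC value split over ADCH / ADCL
    value = CALIBRATE.get(sensor, lambda a: a)(adc)
    return {"node": node, "sensor": sensor, "priority": prio,
            "voltage": volt, "value": value}

print(parse_packet(bytes([5, 1, 8, 1, 0x02, 0x10, 0, 33])))
```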
4.3 Communication between Sensor Nodes
The distance between sensor devices is limited to the range of 30 m to 70 m because of the antenna structure, its design, and obstacles. A routing protocol is needed to collect data over wide ranges [4][12][14]. The routing protocol is designed to have an active routing function that searches for low-power paths, in addition to collecting data from the sensor devices. We construct communication paths between nodes using a flat routing protocol, which is an ad hoc routing method, and we set up routing paths at appropriate distances in advance, as shown in Figure 6. Table 4 shows the node identification between parent and child nodes.
Fig. 6. Routing path in this model (sink node ID 00; router node ID 01; sensor nodes ID 02-05)

Table 4. Parent/Child Node Identification

Child Node ID | Node Name           | Parent Node ID
5             | Water level Node    | 3
4             | Temperature Node    | 2
3             | Micro-Dust Node     | 2
2             | Smoke Detector Node | 1
1             | Router Node         | 0
4.4 Application Modules for Monitoring
The monitoring program is divided into data collection, data conversion, data storage, and web service modules.
Data Collection. The data collection module holds the data-structure format and manages the raw data transmitted from the sink nodes. It is implemented with two classes: a thread senses events on the RS-232C port and stores the data into the sensor data structure.
Data Conversion. The data extracted from the sensor devices is a 4~20 mA or 0~5 V electrical quantity. These values are converted into units of real values: the data conversion module performs the calibration of raw data with the 'GetCalibaration()' function and the conversion formulas.
Data Storage. The data storage module stores the converted data in a database and executes query statements. The accumulated data is used for statistics. The module stores the sensors' setting values, sensor information, and sensor data.
4.5 Result of Implementation
We implemented the monitoring system based on web services, so managers and users can monitor the situation regardless of terminal type. Users can search the information on temperature, humidity, water level, and micro-dust, and can check the operation of the fire alarm. Figure 7 shows a search example for a specific time and a specific sensor. A search of situation alarm logs, after setting threshold values, is also provided.
Fig. 7. Result of Implementation – Remote Monitoring Application
5 Conclusion
In this paper, we designed a remote monitoring application for managing the environment of safe subways based on USN and implemented a prototype based on web services. We used the sensor devices needed for measuring temperature, humidity, micro-dust, smoke, and water level, which affect the comfort of the subway environment. We constructed sensor nodes and communicate between nodes using an ad hoc sensor-network routing protocol. The structure for data collection and the formulas for calibration were also presented.
The advantages of our system include: (1) free mobility of the sensor devices within the radio-frequency transmission range, (2) low power for system operation, with an event-driven method and periodic battery changes, and (3) construction of distributed computing with low-priced sensor devices. As future work, we plan to apply our monitoring system to subways and to continuously improve its performance. We will develop algorithms for low energy consumption in wireless sensor networks. Case studies in the real field and methods for higher sensor precision will also be studied.
Acknowledgement This work was supported by Pukyong National University Research Fund in 2007(PK-2007-044).
References
1. Lee, J.Y.: Ubiquitous Sensor Networking Technology. The 95th TTA Journal, 78–83
2. Bulusu, et al.: Scalable Coordination for Wireless Sensor Networks: Self-Configuring Localization Systems. In: ISCTA 2001, Ambleside, U.K. (2001)
3. Choi, Y.H.: A plan of practical management in U-City. Samsung SDS Inc. (2004)
4. Edgar, H., Callaway Jr.: Wireless Sensor Networks Architectures and Protocols. CRC Press, Boca Raton (2004)
5. Gutiérrez, Callaway, Barrett: Low-Rate Wireless Personal Area Networks. IEEE, Los Alamitos (2004)
6. Kahn, J.M., Katz, R.H., Pister, K.S.J.: Next Century Challenges: Mobile Networking for Smart Dust. In: Proc. ACM MobiCom 1999, Washington, DC, pp. 271–278 (1999)
7. Kang, S.C.: The Future of a Sensor Network Stage. Electronics and Information Center IT Report (2003)
8. Kim, D.y., Hong, S.K.: A technology of smart sensor node operating system. The 97th TTA Journal, 73–80
9. Park, S.C., Nam, S.Y., Ryu, Y.D.: The technology of Ubiquitous Sensor Network. Jinhan M&B (2005)
10. Lee, J.E.: An introduction of Disaster Management. Dae-young Press (2006)
11. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A Survey on Sensor Networks. IEEE Communications Magazine, 103–114 (August 2002)
12. Mohammad, I., Mahgoub, I.: Handbook of Sensor Networks: Compact Wireless and Wired Sensing Systems. CRC Press, Boca Raton (2004)
13. Lee, K.H.: U-City Device Network. A strategy of U-City Construction and service model seminar (2004)
14. Octacomm Inc.: The understanding of Embedded System – Development of Sensor Network. Octacomm Inc. (2005)
15. Jung, W.Y., Lee, H.C., Kwon, S.Y.: Sensor & Interfaces for Ubiquitous Computing. Sung An Dong.com (2006)
16. Park, S.S.: An application development of Sensor network using Nano-24. Octacomm Inc. (2004)
17. Pottie, G.J., Kaiser, W.J.: Wireless Integrated Network Sensors. Commun. ACM 43(5), 551–558 (2000)
18. Son, D.R.: An application of field and specification of sensor. Hannam University (2004)
19. Shen, C., Srisathapornphat, C., Jaikaeo, C.: Sensor Information Networking Architecture and Applications. IEEE Pers. Commun., 52–59 (2001)
20. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: Energy-Efficient Communication Protocol for Wireless Microsensor Networks. In: HICSS (2000)
21. Park, D.S.: A usage of USN technology for environmental monitoring. In: Korea Railroad Research Institute, USN Development Workshop (2006)
22. Woo, A., Culler, D.: A Transmission Control Scheme for Media Access in Sensor Networks. In: MOBICOM (2001)
23. Ephremides, A.: Energy concerns in wireless networks. IEEE Wireless Communication (August 2002)
Evaluation of PC-Based Real-Time Watermark Embedding System for Standard-Definition Video Stream
Takaaki Yamada¹, Yoshiyasu Takahashi¹, Hiroshi Yoshiura², and Isao Echizen³
¹ Hitachi, Ltd. Systems Development Laboratory, Totsuka, Yokohama, Japan
² The University of Electro-Communications, Chofu, Tokyo, Japan
³ National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
{takaaki.yamada.tr,yoshiyasu.takahashi.gq}@hitachi.com, [email protected], [email protected]
Abstract. The effectiveness of a previously proposed system for embedding watermarks into video frames in real time, using software on an ordinary personal computer, was evaluated. The proposed system uses standard-definition (SD) video I/O and is separate from the encoding process, so it can be incorporated into various types of encoding and distribution systems, which makes it well suited for distributing live content. The evaluation results with regard to watermark visibility and survivability, resource consumption, and feasibility show that this system is applicable to various practical uses. Keywords: real-time processing, video watermark, software implementation, system evaluation.
1 Introduction
Digital content—such as images, video, and music—has become widely available because of its advantages over analog content: it requires less space, is easier to process, and is not degraded by aging or repeated use. The wide use of broadband networks and high-performance personal computers enables live digital video content, such as live concerts and distance education, to be distributed over the Internet. A serious problem, however, is that the copyright of digital content is easily violated, because the content can be readily copied and redistributed through the Internet. Digital watermarking, which helps protect the copyright of digital content by embedding copyright information into it, is one countermeasure [1][2], making real-time video watermark embedding an essential requirement for live content distribution. Systems have been developed for embedding video watermarks (WMs) in real time that use dedicated hardware such as field-programmable gate arrays [3] and media processors [4][5]. Real-time embedding using hardware is well suited to applications in which video content must be distributed with high efficiency, such as broadcasting. However, it is problematic in terms of installation and maintenance costs, and version upgrading can be difficult. To deal with these problems, we previously developed a real-time embedding system suitable for content distribution that uses software on an ordinary personal computer (PC) [6]. This software-based system enables real-time processing including watermark embedding, MPEG encoding, and hard-disk-drive recording. However, it can generate only MPEG-4 encoded watermarked files, because the watermark embedding process is combined with the MPEG-4 encoding process. Moreover, it supports only the QVGA (320×240-pixel) format, which is converted from the VGA (640×480-pixel) format of the incoming video signal, to reduce the total processing time. We then developed an improved system for real-time video watermark embedding [7]. It is designed to directly handle the VGA format of the incoming video signal. Moreover, it is separate from the encoding process and thus can be incorporated into various encoding and distribution systems. To determine the effectiveness of the improved system, we evaluated the performance of a prototype in terms of real-time processing, watermarked image quality, and robustness of watermarks against video processing. Section 2 briefly describes our original and improved systems for real-time video watermarking. Section 3 describes our evaluation of the processing performance, and Section 4 describes our evaluation of image quality and watermark robustness. Section 5 briefly summarizes the key points.
2 PC-Based Real-Time Video Watermarking Systems
2.1 Original System [6]
The original system runs on a PC with a video-capture board and provides real-time encoding by software processing on the PC. The system receives an NTSC (National Television System Committee) video signal output from a video camera, VCR (video cassette recorder), DVD (digital versatile disc) player, or other video source. The incoming signal is converted into QVGA format by the driver software on the video-capture board. As the video is encoded, the WM embedding process embeds WMs representing copyright information into each frame of the video. The watermarked frames are then MPEG-4 encoded, and the watermarked MPEG-4 bit stream is recorded on the hard disk. The embedding, MPEG-4 encoding, and hard-disk recording are all done in real time. However, use of the original system for live content distribution is problematic:
1. it generates only MPEG-4 encoded watermarked files, because the watermark embedding process is combined with the MPEG-4 encoding process,
2. it can handle only the QVGA format, which is converted from the VGA format of the incoming video signal, to reduce the total processing time, and
3. the encoded watermarked stream files are stored only on the hard disk (there is no video interface to distribution servers).
With the original system, it takes 7 ms per frame on average to watermark the QVGA images [13]. In addition, encoding the images, done on the same PC, takes from 5 to 23 ms. The total processing time can thus take up to 30 ms per frame. Stable operation, i.e., no more than 33 ms per frame, is possible because NTSC signals have 29.97 frames per second. However, VGA images have four times as much data as
QVGA ones, so dedicated hardware with much higher performance would be required to process VGA images directly.
2.2 Improved System [7]
Our improved system solves the problems described above. It is separate from the encoding process and is equipped with a standard video interface, so it can be combined with various types of encoders and distribution servers. Moreover, it can directly handle the VGA format of the incoming video signal. Figure 1 shows an overview of the improved system, which runs on a PC with a video I/O board. It provides real-time WM embedding by software processing on the PC. The real-time encoding process receives an NTSC video signal output from a video camera, VCR, DVD player, or other video source. The incoming video frames in VGA (640×480-pixel) format are taken from the incoming video signal and loaded into the video memory on the video I/O board by the driver software. The WM embedding process embeds WMs representing copyright information into each frame of the video. The watermarked frames are then uploaded into the video memory on the video I/O board. The watermarked video contents are output to the encoder and server, which multicasts them to clients equipped with a decoder. Figure 2 illustrates the architecture of the software processing of the improved system, from video input through output. To enable the VGA format to be handled in real time, we:
− divided the WM embedding process into a pre-process (WM pattern generation) common to every video frame and post-processes (WM strength generation and WMed frame generation) specific to each frame,
− redesigned the data flow so that the post-processes could reuse the WM pattern output by the pre-process, and
− redesigned the frame flow so that the video frames are saved directly into the video memory on the video I/O board rather than in buffers on the hard disk.
Of the various watermark embedding processes that have been proposed [8][9][10], we based ours on a basic WM algorithm, the patchwork algorithm [8]. The details of the process flow are described in our previous paper [11].
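Since the paper's actual embedding flow is detailed in [11] rather than reproduced here, the following is only a minimal illustration of the basic patchwork idea referenced above, not the authors' process. The set sizes, the strength delta (exaggerated for a clear demonstration), and the pixel selection are arbitrary assumptions.

```python
import numpy as np

def embed_patchwork_bit(y, bit, pairs, delta=8.0):
    """Patchwork sketch: brighten set A and darken set B (or vice versa)
    by delta to encode one bit in the luminance plane y."""
    a, b = pairs                       # two disjoint lists of (row, col) pixels
    s = delta if bit else -delta
    for (r, c) in a:
        y[r, c] = np.clip(y[r, c] + s, 0, 255)
    for (r, c) in b:
        y[r, c] = np.clip(y[r, c] - s, 0, 255)
    return y

def detect_patchwork_bit(y, pairs):
    """Decide the bit from the mean difference between sets A and B."""
    a, b = pairs
    diff = np.mean([y[r, c] for r, c in a]) - np.mean([y[r, c] for r, c in b])
    return diff > 0

rng = np.random.default_rng(1)
y = rng.integers(0, 256, (480, 640)).astype(float)   # one VGA luminance frame
idx = rng.choice(480 * 640, 2000, replace=False)     # secret pixel selection
pairs = (list(zip(*np.unravel_index(idx[:1000], y.shape))),
         list(zip(*np.unravel_index(idx[1000:], y.shape))))
y = embed_patchwork_bit(y, bit=True, pairs=pairs)
print(detect_patchwork_bit(y, pairs))                # True
```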
Fig. 1. Overview of the improved real-time watermark embedding system (an NTSC signal from a camera, VCR, etc. enters the video I/O board; VGA video frames (YUV) pass through the software watermark embedding on the PC and are output over NTSC to an encoder, monitor, VCR, etc.)
Fig. 2. Software architecture of the improved real-time watermark embedding system (WM: watermark): under overall control, frames captured from the NTSC input are placed in internal memory; WM pattern generation (from the embedded information) and WM strength generation feed WMed frame generation, and the watermarked frames are saved to the video I/O board for NTSC output
2.3 Prototype The prototype we used to evaluate our improved system captures the SD (standard definition) video signals output from a video camera through NTSC input, embeds watermarks (WMs) into the video signal, and outputs the watermark-embedded SD video signal to a monitor through NTSC output. The WMs, representing 64 bits of information, are embedded into the luminance component (Y) of the video signal during video input and output (I/O). The WM embedding and I/O are both done in real time. The specifications of the prototype are shown in Table 1. Table 1. Prototype specifications Item
Table 1. Prototype specifications

Item                  | Specification
PC                    | Ordinary PC with video I/O board; CPU: Intel Xeon 2.4 GHz; Memory: 1 GB
Interface             | NTSC input/output
I/O devices           | Input: video camera, DVD player, etc.; Output: TV, VHS recorder, etc.
Resolution            | 640 × 480 pixels (VGA)
Frame rate            | 29.97 fps
Payload               | 64 bits of information
Detection error ratio | less than 10^-10

(Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.)
3 Evaluation of Processing Performance
3.1 Measured Performance Time Inside the System
The prototype system has five processing steps, and the time taken for each was measured. In step 1, the watermarking process is initialized (memory allocation, WM pattern generation, etc.). In step 2, the I/O video board device driver is initialized; the input video stream can now be captured by the system. In step 3, the frame-receiving process captures a video frame and synchronously places it into internal memory. In step 4, a watermark is embedded in the frame data. In step 5, the frame-sending process saves the WMed frame into the output video buffer. The processing times for all five steps were measured inside the system and written to a log file while the video was streaming. The results are summarized in Table 2 and show that real-time processing is achieved. The times for the three iterative steps, 3–5, are averages over 100 frames. Only 100 frames were logged, to keep the log file small: if the file became too large, new entries would be cached in the hard-disk I/O buffer and then flushed out by the PC's operating system. Our application program cannot control the I/O buffer, and since cache flushing is unpredictable, it would disturb our measurements; we therefore avoided long-term measurement. In real-time processing, each frame should be processed within 33 ms, because the video data is streamed at 29.97 frames per second. Since the total average time for the three iterative steps was 32 ms/frame, real-time processing is feasible. Contrary to expectations, the CPU load of the prototype was not very high. The long frame-receiving time is attributed to synchronization, i.e., waiting for video input. The output SD video stream appeared natural when viewed on a TV monitor.

Table 2. Processing time

No. | Step                      | Time
1   | WM process initialization | 54 ms
2   | I/O initialization        | 721 ms
3   | Frame receiving           | 20 ms / frame
4   | WM embedding (VGA)        | 1 ms / frame
5   | Frame sending             | 11 ms / frame
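The real-time claim reduces to simple arithmetic over the per-frame steps of Table 2, as the short sketch below shows (the step names are taken directly from the table):

```python
FRAME_RATE = 29.97
BUDGET_MS = 1000.0 / FRAME_RATE  # ~33.37 ms available per frame

# Per-frame steps 3-5 from Table 2 (initialization steps run only once).
steps_ms = {"frame receiving": 20.0, "wm embedding": 1.0, "frame sending": 11.0}
total = sum(steps_ms.values())
print(f"{total:.1f} ms per frame; real-time feasible: {total <= BUDGET_MS}")
```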
3.2 Measured Performance Time Outside the System
Although real-time watermarking was confirmed as described above, this alone seems insufficient for discussing system performance. Once the frame-sending process (step 5 in Table 2) finishes, we assume that the I/O board converts the frame data into an output video signal. However, if a frame in a long video stream is dropped and replaced with the immediately preceding frame stored in the board's buffer, the log file may not reflect this; if the output board does not handle errors reliably, there may be no interruption event in the application software. The frame rate of the
output signal is regulated to 29.97 fps; that is, the video stream must contain 29.97 frames per second, and even if only one frame is dropped the actual frame rate falls below 29.97 fps. Such frame drops are difficult to detect by glancing at a TV monitor. We thus conducted a test to confirm that the processing could be done in real time, i.e., that a watermark-embedded SD video signal could be output at 29.97 fps. The test was done from outside the watermarking system using a five-step process, as illustrated in Fig. 3:
1. A visible sequence code pattern was overwritten on each frame image (see Fig. 4).
2. The frames were written onto a DVD-Video disc. A DVD player then sent the frames in an SD video signal to the test system.
3. The test system embedded an invisible watermark into every frame and output the frames in an SD video signal to a monitoring PC in real time.
4. The monitoring PC captured the visibly and invisibly watermarked video data, encoded it (QVGA, MPEG1) using lossy compression, and wrote it to a file at 1 Mbps. The reduced image size and bit rate reflect the performance limits of the monitoring PC.
5. The visible code patterns in the monitored data were compared with the originals. If a frame had been dropped, the resulting difference in patterns made it easy to detect; even when dropped frames were hard to spot visually because of limited motion in the original images, the pattern differences made them easy to find.
Testing using 313 continuous frames, i.e., about 10 seconds of video at 29.97 fps, confirmed that the processing could be done in real time. Although we prepared 15 seconds of video frames, about 5 seconds of data is consumed in lead-in and lead-out times required for manual operations such as pushing the play button synchronously. If a mismatch in patterns had revealed a frame drop, the step in which it was dropped could not have been pinpointed, since drops can occur in steps 2, 3, and 4 due to unforeseen events. However, no frame drops were detected in this experiment.
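The pattern comparison in step 5 amounts to checking the recovered frame indices for gaps; a minimal sketch of that check (with hypothetical indices) is:

```python
def dropped_frames(observed_codes):
    """Report gaps in the per-frame sequence codes read from the captured video."""
    drops = []
    for prev, cur in zip(observed_codes, observed_codes[1:]):
        if cur != prev + 1:
            drops.append((prev, cur))  # a jump means frames were lost in between
    return drops

print(dropped_frames([0, 1, 2, 4, 5]))  # [(2, 4)] -> frame 3 was dropped
```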
Fig. 3. Testing using embedded sequence code patterns: (1) overwriting the code pattern on the original video (preparation PC); (2) DVD-Video writing; (3) real-time watermark embedding in the tested system, fed by a DVD player; (4) capturing with encoding to MPEG video on the monitoring PC; (5) comparison of code patterns
Fig. 4. Visible sequence code pattern written into frame image. The left image is the original with a pattern overwritten at the top left. The right image is the same image after processing and capturing by the monitoring PC. The black line down the left side of the processed image was added during the capturing process in step 4.
4 Evaluation of Image Quality and Watermark Robustness
4.1 Image Quality
We subjectively evaluated the quality of images watermarked with our improved system using a procedure based on Recommendation ITU-R BT.500-7 [12]. Watermarked videos were displayed on a monitor and evaluated by ten participants, who rated the image quality using the scale shown in Table 3. A test in which both original and watermarked content is displayed in real time is difficult to arrange: strict evaluation of image quality requires dedicated equipment such as an uncompressed video disk recorder, and it is difficult to operate analog video devices correctly in a synchronous manner when testing multiple items. Therefore, we manually embedded WMs into video files using the same program libraries as those in the proposed system and then encoded the files (MPEG2, 8 Mbps). The evaluators compared the original contents with the watermarked ones for both the original system [6] and the improved system [7]. The videos were the same as those used previously [13] (see Fig. 5). As shown in Table 4, the average scores with the improved system were virtually the same as those with the original one, indicating that the image quality had been maintained.

Table 3. Level of disturbance and rating scale

Disturbance                  | Score
Imperceptible                | 5
Perceptible but not annoying | 4
Slightly annoying            | 3
Annoying                     | 2
Very annoying                | 1
Fig. 5. Scenes from sample videos: (a) EntranceHall, (b) WalkThroughTheSquare, (c) WhaleShow

Table 4. Subjective image quality

Sample               | Original | Improved
EntranceHall         | 4.9      | 5.0
WalkThroughTheSquare | 4.9      | 4.9
WhaleShow            | 4.7      | 4.7
4.2 Watermark Robustness
Watermark robustness was evaluated using the same videos, with their various motion properties, as in the image-quality evaluation. We used the WM detection ratio: the ratio of the number of points at which the embedded 64 bits were correctly detected to the total number of detection points. There were 450 frames in total for each video, and the watermarks of 30 sequential frames were detected at a time, giving 15 detection points per video. A WM was correctly detected at every point; that is, the detection ratio was 100% for all three samples (VGA, MPEG2, 8 Mbps). These results show that the watermarked images satisfied the three commonly used essential requirements: the watermarks should not degrade image quality, they should be robust enough to be reliably detected after common video processing, and they should never be found in unmarked images. The improved system has a general-purpose interface, meaning that the output signal can be connected to a video encoder with a lower compression rate. Moreover, once the WMed content is distributed, a user might make an illegal copy by transcoding from the original format to another, smaller format, such as QVGA MPEG-4. Watermarks should survive not only the regular distribution process, in which WMed video is compressed into a specific format such as MPEG2, but also such illegal copying. As a further study, we also examined watermark robustness in a simulation of illegal-copy tracking: the watermarks created by our improved system had excellent survivability in a simulation of submitting user-generated content to a video-upload site, for instance QVGA resizing with re-encoding at 250 kbps.
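The detection-ratio bookkeeping described above is straightforward; a one-look sketch using the paper's own numbers:

```python
FRAMES, WINDOW = 450, 30            # frames per video, frames per detection
points = FRAMES // WINDOW           # 15 detection points per video
correct = 15                        # all 64-bit payloads detected correctly
print(f"{points} detection points, detection ratio = {correct / points:.0%}")
```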
5 Conclusion
Our improved system for real-time video watermarking has a standard video interface, can directly handle the VGA format of the incoming video signal, and is separate from the encoding process. It can thus be incorporated into various encoding and distribution systems, including those for
broadband video streaming, video clipping for editing, copy management for dubbing to video packages, video authentication for cameras used for monitoring, education, medical operations, and live interviews, and − easily embedding copyright information into early artworks. Real-time processing is achieved by making the watermark-pattern generation process common to every frame a pre-process and shortening the watermark embedding time by reusing the data output from this pre-process and by storing the watermarked video frames into the video memory on the video I/O board, thereby eliminating the need for storing them in buffers on hard disk. Evaluation of a prototype system demonstrated the validity of this approach in terms of watermarked image quality and watermark survivability after video processing.
References 1. Swanson, et al.: Multimedia data-embedding and watermarking technologies. Proc. IEEE 86(6), 1064–1087 (1998) 2. Bloom, et al.: Copy protection for DVD video. Proc. IEEE 87(7), 1267–1276 (1999) 3. Owada, et al.: Development of Hardware based Watermarking System. In: The 2004 Symposium on Cryptography and Information Security (SCIS 2004), 3D5-2 (2004) 4. Terada, et al.: Development of Real-time Video Watermarking System Using Media Processor. Journal of the Institute of Image Information and Television Engineers 58(12), 1820–1827 (2004) 5. Wada, et al.: Development of Real-time Video Watermarking Equipment for HDTV. In: Forum on Information Technology 2003 (FIT 2003), J-031 (2003) 6. Echizen, et al.: Real-time video watermark embedding system using software on personal computer. In: Proc. of IEEE Int’l Conf. on Systems, Man and Cybernetics (SMC 2005), pp. 3369–3373 (2005) 7. Echizen, et al.: PC-based real-time watermark embedding system with standard video interface. In: Proc. of IEEE Int’l Conf. on Systems, Man and Cybernetics (SMC 2006), pp. 267–271 (2006) 8. Bender, et al.: Techniques for data hiding. In: Proc. SPIE, vol. 2020, pp. 2420–2440 (1995) 9. Delaigle, et al.: Watermarking algorithm based on a human visual model. Signal Processing 66, 319–335 (1998) 10. Kundur, et al.: Digital watermarking using multiresolution wavelet decomposition. In: Intl. Conf. Acoustics, Speech and Signal Processing, vol. 5, pp. 2969–2972 (1998)
340
T. Yamada et al.
11. Echizen, et al.: General quality maintenance module for motion picture watermarking. IEEE Trans. Consumer Electronics 45(4), 1150–1158 (1999) 12. Rec. ITU-R BT.500-7, Methodology for the subjective assessment of the quality of television pictures (1995) 13. Yamada, et al.: Evaluation of Real-Time Video Watermarking System on a Commodity PC. In: Müller, G. (ed.) ETRICS 2006. LNCS, vol. 3995. Springer, Heidelberg (2006), http://www.etrics.org/
User Authentication Scheme Using Individual Auditory Pop-Out Kotaro Sonoda and Osamu Takizawa Information Security Research Center, National Institutes of Information Communications and Technology, 4–2–1 Nukui-Kita, Koganei, Tokyo, 184-8795, Japan
Abstract. This paper presents a user authentication scheme which takes advantage of the uniqueness of prover’s auditory characteristics. In this scheme, several audio stimuli are presented at the same time by headphone. The stimuli include a special audio stimulus that only the genuine prover can easily distinguishes from other stimuli because of the uniqueness of auditory characteristics. The prover is made to answer the contents of the special stimulus. The verifier confirms the correct answer as the genuine prover. As the special audio stimulus that distinguishes the genuiness, auto-phonic production is examined in this paper. The advantage of this scheme is that the prover does not need to keep complex sequence in minds like a password authentication. Moreover, the Personal Authentication Information (PAI) is never stolen, because the PAI in this scheme is personal auditory memory or receptor.
1 Introduction With growing number of the service for individual customers, people have to manage a lot of password and keep it complex and fresh on conventional password authentication. The issue of the password management can be forgotten at biometrics authentication. The Personal Authentication Information (PAI) of the biometrics authentication is the prover’s unique biometrics that the prover’s imposter can’t get easy. However, in the biometrics authentication, the prover has expose own biometrics anytime. Therefore, it is possible that these biometricses are unconsciously scanned even if the owner doesn’t have the intention to be authenticated. Whether, at this time, their authentications have necessity to make the body parts closer to the scanner because of the scanning resolution, the threat of replaying PAI from the stolen or scanned biometrics should grow in the future. By the way, how to recognize the prover in case of interpersonal contact? Following three strategies are considered: 1. ask the other the sharing secret word. 2. observe the other’s face, voice, or movements. 3. talk some sharing stories, then test the reaction. G.A. Tsihrintzis et al. (Eds.): New Direct. in Intel. Interac. Multimedia, SCI 142, pp. 341–349, 2008. c Springer-Verlag Berlin Heidelberg 2008 springerlink.com
342
K. Sonoda and O. Takizawa
The first and second strategy correspond with password and biometrics authentication respectively. Human reflex response authentication and mnemonic authentication corresponds with the third strategy. In this authentication, the verifier gives some stimuli which are shared with the genuine prover and are able to be recognized only by the genuine prover, then requires the prover’s responses. The correct responses prove the genuiness. CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Human Apart)[1] Gimpy[2] can be categorized in the above mentioned third strategy although the system distinguishes the human from the machine. Gimpy presents the prover a sequence of distorted fonts which distracted by noise. If the prover is real human, the characters are recognized correctly while computers posing human cannot. Nishigaki et al. proposed a reflex-response-based user authentication scheme using blind spot and saccade response time[3]. The verifier presents the prover a view target on the screen several times. If the view target is presented inside of the genuine prover, any saccades shouldn’t be observed from the response of the genuine prover while it should be observed from the spoof provers. In this authentication scheme, the verifier distinguishes the genuiness of prover by measuring such a difference of existence of saccade. We proposed a response-based user authentication scheme using individual difference of auditory pop-out. In this scheme, several audio stimuli are presented at the same time by headphone. The stimuli should include a special audio stimulus that only the genuine prover can easily distinguishes from other stimuli because the stimulus were heard as poped-out from others by the genuine prover while the stimulus were melted into others by the spoof prover. The prover is made to answer the contents of the special stimulus. The verifier confirms the correct answer as the genuine prover. As such a special audio stimulus by which verifier confirms the genuiness, auto-phonic production is examined in this paper.
2 Authentication Protocol The proposed authentication protocol is followed (Fig.1): Registration: The prover (P ) who want to be authenticated submits the set of stimuli (X) which can be receipt only by himself to the verifier (V ). Authentication: V gives the stimuli s to unknown prover (P ). The s are made by mixing a stimulus xi (∈ X) and several dummy stimuli di . . . . 1. V requires P to answer the content of the stimulus xi . 2. P answers the content. (If P is P , he can answer correctly.) 3. Above 1 and 2 are repeated. Verification: V verify the P as P if the responses meet the necessary requirements. At the authentication phase, the audio stimuli are presented by headphone and located virtual in several direction by the convolution with the general head related transfer functions (HRTFs) of each direction.
User Authentication Scheme Using Individual Auditory Pop-Out
343
(a) Registration
(b) Authentication Fig. 1. Reflex-response-based user authentication scheme based on the auditory characteristics
344
K. Sonoda and O. Takizawa
3 Auto-phonic Production Auto-phonic production speech is the speech sound which is recognized by talker himself while he is speaking. As shown in Fig. 2, when the people speak, the voices are reached to the strangers by transmitting through the air. However, the voice principal heard the voice by both paths of the air and the bones or body. The voice which the stranger recognizes as the principal should be different from the auto-phonic production voice. On the other hand, the autophonic production voice should be the principals unique self-voice. Therefore, it is expected that the auto-phonic production voice must yield the difference of responses between the principal and the strangers. Nakayama et al. studied the
Fig. 2. Auto-phonic production (the stranger hears the air-conducted voice; the principal hears the auto-phonic production)
characteristics of auto-phonic production by having talkers control the amplitudes of several frequency components of the air-conducted voice, recorded near the talker's mouth, so that the heard air-conducted voice approached the auto-phonic production [4]. As a result, it was found that the vocal sound was perceived relatively louder in the low-frequency region (about +5 dB at 100 Hz) and softer in the high-frequency region (about −5 dB at 4 kHz) than the air-conducted voice.
3.1 Experiment I: Familiarity with Auto-phonic Production
First of all, to confirm the difference between the voice principal and strangers in the voice they imagine for the talker, we carried out an auditory experiment. In this experiment, the subject listeners were asked to rate, on a five-grade scale, the distance between the voice they imagined for the talker and each of the following: the air-conducted voice (O), the auto-phonic production (A), a low-boosted voice (L), and a high-boosted voice (H). These stimuli were generated by applying frequency-weighting functions to the recorded air-conducted voice; the functions applied to stimuli A, L, and H are shown in Fig. 3. When the listener is the talker (principal), the voice usually heard should be the auto-phonic production, whereas otherwise (strangers) the voice usually heard should
Fig. 3. Frequency-weighting functions applied to the recorded air-conducted voice O (amplifier level in dB versus frequency on a logarithmic scale, 100 Hz to 4 kHz): stimulus H (high-boosted), stimulus A (auto-phonic production), and stimulus L (low-boosted)
Fig. 4. Familiarity expressed as difference grades from the air-conducted voice: (a) vs. voice of talker 1; (b) vs. voice of talker 2; (c) vs. voice of talker 3
be the air-conducted voice. The word stimuli are Japanese four-mora words cited from a list with balanced phonemes and high familiarity [5]. The results are shown in Fig. 4 as difference grades from the air-conducted voice (O). A high score indicates that the target stimulus was more familiar than
the air-conducted voice, and a difference of one grade or more between target stimuli indicates that those stimuli can be discriminated. Figure 4(a) shows that subject listener 1, the voice principal (talker), can distinguish the auto-phonic production from the other stimuli, whereas the other listeners, the voice strangers, give similar grades to the auto-phonic production and the high-boosted voice. Therefore, the auto-phonic production is expected to be usable for authenticating the subject.
4 Auditory Search on Simultaneous Multiple Audio Stimuli
In our authentication scheme, the presented auto-phonic production stimulus has to be sufficiently masked by the other, simultaneously presented stimuli. The human ability to recognize simultaneously presented speech has been studied by several researchers. Kashino et al. examined a multi-talker recognition test [6], measuring, with simultaneously presented Japanese four-mora words, how many talkers and how many spoken words subjects can recognize. He reported as follows:
Recognition of the number of talkers: For stimuli constructed from up to two talkers, it was possible to judge the number of talkers almost perfectly. For stimuli constructed from three or more talkers, however, there was a tendency to underestimate the number of talkers.
Recognition of the number of words: It was possible to answer only up to about two words, regardless of the number of talkers.
Bronkhorst et al. experimented on the word intelligibility and talker recognition of a target voice under various presentation conditions: monaural, binaural, and three-dimensional auditory presentation [7]. He concluded as follows:
1. There is no difference in performance between a 3D auditory display based on individualized HRTFs and one based on general HRTFs. This conclusion applies to all scores assessed for speech intelligibility, talker recognition (including the time required for recognition), and talker localization. This means that no individual adaptation of a band-limited (4-kHz) communication system is needed in a practical application of an auditory display with many users.
2. Compared to conventional monaural and binaural presentation, 3D presentation yields better speech intelligibility with two or more competing talkers, in particular for sentence intelligibility. Equivalent performance is achieved with 3D presentation compared to binaural presentation when one talker is added, and compared to monaural presentation when two or three talkers are added. However, in specific conditions (all competing talkers on the side opposite the target talker) binaural presentation may be superior to 3D. Within the 3D configurations examined, intelligibility is highest when the target talker is at −45° or +45° azimuth.
3. Talker-recognition scores are higher for 3D than for monaural and binaural presentation, but the differences are small. Recognition scores depend less on the number of competing talkers than intelligibility scores do. The virtual positions of the talkers in 3D are not a relevant factor.
4. For binaural and 3D presentation, the time required to correctly recognize a talker increases with the number of competing talkers. For two or more competing talkers, 3D presentation requires significantly less time than binaural presentation.
5. Absolute localization of a talker is relatively poor and becomes gradually more difficult as the number of competing talkers increases.
These studies address the multi-talker condition, whereas our scheme uses a single talker, so the target stimulus in our scheme should be masked more easily than in the multi-talker condition. We expected four or five locations as the number at which individual differences in the responses occur.
4.1 Experiment II: Mixed Multi-stimuli Task
To confirm the existence of a difference between the talker and non-talkers in the ability to find the auto-phonic production among multiple distracting audio stimuli, an auditory experiment was carried out. In the experiment, subjects were tasked to find the target talker's auto-phonic production among multiple air-conducted recordings of the same talker. Subjects act as genuine provers when the stimuli are their own speech and as spoof provers otherwise. The numbers of directions (stimuli) in the experimental conditions were four and five; one of the stimuli was an auto-phonic production, and the subjects were not told this number. In this experiment, we improved the method of generating the auto-phonic production over that of Experiment I in Section 3.1. Considering the propagation paths of the auto-phonic production, the perception of the bone- or muscle-conducted component is not affected by mouth aperture while the air-conducted component is, so simple tuning of the frequency response may vary considerably with the phonetic features of the utterance. In the experiment below, therefore, the auto-phonic production stimulus was generated by mixing the air- and muscle-conducted voices in a proportion defined by the talker, for each utterance, so that the mix sounded as close as possible to the auto-phonic production, as shown in Fig. 5. These measured proportions indeed varied considerably across utterances and talkers. The source stimuli are Japanese four-mora words selected from the Japanese word-familiarity database; we chose 25 words from the most-familiar group. Two young Japanese men, acquainted with each other, participated in the experiment. The directions from which the stimuli were presented are −90, −45, 0, +45, and +90 degrees in the five-stimuli condition and −90, −30, +30, and +90 degrees in the four-stimuli condition, as shown in Fig. 6. The amplitudes of all presented stimuli were normalized by equalizing their standard deviations. Table 1 shows the answer rates of correct, wrong, and not-presented words in the five-stimuli condition
Fig. 5. Generation of auto-phonic production stimuli; α: amplitude of the air-conducted component when mixing with the body-conducted component to obtain the auto-phonic production
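As a rough numerical illustration of this generation step, the following sketch mixes air- and body-conducted recordings with the talker-defined proportion α of Fig. 5 and applies the standard-deviation normalization described above. The function names are ours, and random signals stand in for the actual recordings.

```python
import numpy as np

def make_autophonic(air: np.ndarray, body: np.ndarray, alpha: float) -> np.ndarray:
    """Mix the air-conducted component (scaled by alpha) with the
    body-conducted component into an auto-phonic stimulus, per Fig. 5."""
    return alpha * air + body

def normalize(stimuli):
    """Equalize amplitudes by equalizing the standard deviations of stimuli."""
    target = np.mean([s.std() for s in stimuli])
    return [s * (target / s.std()) for s in stimuli]

# toy usage with random signals standing in for the recordings
rng = np.random.default_rng(1)
air, body = rng.normal(0, 1, 8000), rng.normal(0, 0.3, 8000)
mix = make_autophonic(air, body, alpha=0.4)
stimuli = normalize([mix] + [rng.normal(0, 1, 8000) for _ in range(4)])
assert all(abs(s.std() - stimuli[0].std()) < 1e-9 for s in stimuli)
```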
Fig. 6. Directions of the presented stimuli (located virtually): five-stimuli condition (45-degree spacing) and four-stimuli condition (60-degree spacing)
Table 1. Answer rates (%) of correct, wrong, and not-presented words

                                     Five Directions                  Four Directions
Prover           Talker  Listener    Correct  Wrong  Not-presented    Correct  Wrong  Not-presented
Genuine prover   01      01           4.0     66.0   30.0              16.7    75.0    8.3
                 02      02           8.0     74.0   18.0              10.4    64.6   25.0
                 total                6.0     70.0   24.0              13.6    69.8   16.7
Spoof prover     01      02          16.0     40.0   44.0              25.0    47.9   27.1
                 02      01           6.0     60.0   34.0               6.3    77.0   16.7
                 total               11.0     50.0   39.0              15.6    62.5   21.9
and the four-stimuli condition. Unexpectedly, neither the genuine provers (listener equals talker) nor the spoof provers (listener spoofing the talker) were able to find the target word among the distracting stimuli. Their correct answer rates were below the chance rates (25.0% and 20.0% in the four- and five-stimuli conditions, respectively). Moreover, no difference in answer rates between genuine and spoof provers was observed this time. We should study conditions with fewer stimuli and other combinations of stimuli.
5 Conclusion
In this paper, we proposed a new approach to user authentication using distinctive auditory characteristics. The scheme takes advantage of the possibility that humans have individual sensitivities to certain kinds of audio stimuli, analogous to individual biometrics. Auto-phonic production speech was adopted as such an inducing stimulus in this paper. The first experiment indicated differences between the voice principal and strangers in sensitivity to the auto-phonic production voice versus the air-conducted voice. In the second experiment, we implemented a prototype of an authentication system using our scheme; so far, however, its effectiveness has not been demonstrated by the experiment. As future work, we plan to measure the response times needed to identify the target audio stimulus among the distracting stimuli, where asymmetries between the genuine prover and impostors are expected to exist.
Acknowledgement
This study was partially supported by a Grant-in-Aid for Young Scientists (B) (#19700123) from the Japanese Ministry of Education and Science.
References
1. von Ahn, L., Blum, M., Hopper, N., Langford, J.: CAPTCHA: Using Hard AI Problems for Security. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 294–311. Springer, Heidelberg (2003)
2. CAPTCHA project: http://www.captcha.net/chaptchas/gimpy/
3. Nishigaki, M., Arai, D.: A User Authentication Using Human Reflex. Transactions of Information Processing Society of Japan 47(8), 2582–2593 (2006)
4. Nakayama, I.: Voice timbre in autophonic production compared with that in extraphonic production. Journal of the Acoustical Society of Japan (E) 18(2), 67–71 (1997)
5. Sakamoto, S., Suzuki, Y., Amano, S., Ozawa, K., Kondo, K., Sone, T.: New lists for word intelligibility test based on word familiarity and phonetic balance. Journal of the Acoustical Society of Japan 54(12), 842–849 (1998)
6. Kashino, M., Hirahara, T.: Judging the number of concurrent talkers for sentence stimuli. In: Proceedings of the Spring Meeting of the Acoustical Society of Japan, III-3-3 (1997) (in Japanese)
7. Drullman, R., Bronkhorst, A.W.: Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation. Journal of the Acoustical Society of America 107(4), 2224–2235 (2000)
Combined Scheme of Encryption and Watermarking in H.264/Scalable Video Coding (SVC)
Su-Wan Park and Sang-Uk Shin
Pukyong National University, Interdisciplinary Program of Information Security, Korea
{music016,shinsu}@pknu.ac.kr
Abstract. This paper presents a combined scheme of encryption and watermarking to provide access rights and authentication of video content simultaneously. The scheme protects content more securely, because the encrypted content is decrypted only when the watermark is exactly detected. The scheme is also appropriate for real-time applications, as it is implemented within the encoding process. In addition, we propose more efficient encryption and watermarking methods for SVC coding based on a close examination of its structural features. For encryption, we propose an efficient selective encryption scheme which encrypts the intra prediction modes of 4×4 luma blocks, the sign bits of the texture, and the sign bits of the motion vector difference values in intra frames and inter frames. The proposed encryption scheme keeps format compliance and is time-efficient. For watermarking, we propose a reversible watermarking scheme using intra prediction modes. The proposed watermarking scheme incurs a small bit-overhead, but no degradation of visual quality occurs. Keywords: Contents protection, Watermarking, Authentication, H.264/SVC.
1 Introduction
Recently, with the development of computer technology, the distribution of digital content over the Internet, and the advent of digital television, digital video content is becoming an important part of the broadcasting, entertainment, and communication industries. Most digital video content is distributed and stored in the form of compressed bitstreams, even though network bandwidth and storage capacity keep increasing. Among various video compression techniques, H.264/SVC [1] was recently standardized by the JVT of ISO/IEC MPEG and ITU-T for video data with streams of various qualities. The base layer of SVC is based on H.264/AVC, a technology for video data with much higher compression efficiency. Thus, issues of content protection and authentication appropriate for AVC and SVC have attracted a great deal of interest in digital video applications. Among existing methods [2-12] for content encryption, Lian [11] proposed a video encryption scheme combined with the AVC codec, but the proposed encoding structure cannot be applied directly to SVC because the bitstream coding method differs slightly. The encryption method proposed by Won [12] is applied to the SVC bitstream; encrypting only the sign bits, however, does not provide full security over the entire process. For authentication of video content, many previous works [13-15] use various digital watermarking techniques. These compressed-domain video watermarking methods have
focused on embedding the watermark into MPEG-2 and AVC bitstream sequences, but the methods [14-15] proposed for H.264/AVC suffer from perceptual quality degradation or bit-overhead. Therefore, we propose efficient encryption and watermarking methods for SVC coding based on a close examination of its coding structure. For encryption, we suggest an efficient selective encryption with format compliance and little bit-overhead. For authentication, we propose a reversible watermarking scheme without degradation of visual quality. Moreover, we propose a combined scheme of these two techniques. Some combined schemes have been proposed in [16-18]. In these existing schemes, the content is watermarked and then the watermarked content is encrypted; that is, the two techniques are performed independently. However, this can cause disclosure of the decrypted content when the schemes are performed at the client's player, because the player decrypts the encrypted content and then detects the watermark. Therefore, we propose a more tightly coupled scheme which protects content more securely, because the encrypted content is decrypted only when the watermark is exactly detected. The proposed scheme provides access rights and authentication of the video content, and it is fast and appropriate for real-time applications as it is implemented within the compression process. This paper is organized as follows. In the next section, we review the architecture of SVC. In Section 3, we propose a combined scheme of encryption and authentication. In Section 4, we describe the experimental results and analyze the proposed scheme. In the last section, we summarize the paper and present our conclusions.
2 Related Works
2.1 H.264/Scalable Video Coding (SVC)
SVC is the name given to a video compression standard developed jointly by ITU-T and ISO. The objective of SVC is to offer content in a "scalable" way, such that the content is coded once while streams of various qualities can be offered. SVC provides three types of scalability: spatial, temporal, and SNR. Spatial scalability is achieved by layered coding, and temporal scalability is achieved by a hierarchical B-picture structure. For SNR scalability, coarse-grain scalability (CGS) and medium-grain scalability (MGS) are employed, which support many more rate points. The three scalability types can be combined to generate a bitstream that supports a variety of spatio-temporal resolutions and rate points. In addition, an SVC bitstream consists of a base layer and several enhancement layers. By decoding the base layer, the lowest quality of the original video can be obtained; enhancement layers are added on top of the base layer to obtain better quality. The base layer of the SVC bitstream is generally coded in compliance with H.264/AVC and includes the most important information in the video content. In addition, video content needs light encryption and fast watermark detection for real-time applications. Thus, the proposed combined scheme of encryption and authentication is applied only to the base layer.
In the architecture of the SVC bitstream, the encoding algorithm is selected between inter and intra coding for block-shaped regions of each picture. Inter coding uses motion vectors (MVs) for block-based inter prediction to exploit temporal statistical dependencies between different pictures. Intra coding uses various intra prediction modes (IPMs) to exploit spatial statistical dependencies in the source signal within a single picture. The prediction residual (texture) is then further compressed using a DCT transform, in order to remove spatial correlation inside the transform block, before it is quantized. Therefore, we use the domains mentioned above (MVs, IPMs, and texture) for encryption and authentication.
3 Combined Scheme of Encryption and Authentication
The proposed method is applied in SVC and works at the MB level of frames in the base layer. Encryption and watermarking are implemented in the encoding process almost simultaneously. In detail, encryption is first performed for access rights, and watermarking is then implemented for authentication. In the decoding process, the receiver's device extracts the watermark from the received bitstream. The extracted watermark (W′) is compared to the original watermark (W). If they match, the received video is trusted and the encrypted bitstream is decrypted. In other words, only authenticated content can be decoded in the decoding process. Fig. 1 shows an overview of the proposed scheme.
Fig. 1. Overview of the combined scheme of encryption and watermarking
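The decoder-side gate of Fig. 1 can be sketched as follows. Here extract_watermark and decrypt are passed in as parameters standing for the detection and decryption procedures of Sections 3.1 and 3.2, so only the authenticate-then-decrypt control flow is shown; the toy "codec" at the end exists purely to make the sketch executable.

```python
def authenticate_then_decrypt(bitstream, W, extract_watermark, decrypt):
    """Decoder-side gate: decrypt only if the extracted watermark W' matches W."""
    W_prime = extract_watermark(bitstream)   # watermark detection (Sect. 3.2)
    if W_prime != W:
        return None                          # content not trusted: no decryption
    return decrypt(bitstream)                # authenticated: decrypt (Sect. 3.1)

# toy usage with an identity "codec" whose watermark is the first byte
out = authenticate_then_decrypt(b"\x01data", b"\x01",
                                lambda bs: bs[:1], lambda bs: bs[1:])
assert out == b"data"
```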
In the proposed method, selective encryption is applied to the IPMs, the motion vector difference values (MVDs), and the texture, and the IPMs are used again for embedding the watermark. Fig. 2 shows the three domains used for encryption and the one domain used for watermarking in the proposed scheme. The keys of the encryption domains, E_keys, and the secret key for watermarking, A_key, are derived from the master key using a cryptographic key derivation function; thus, an independent, different key is used for each domain. The proposed encryption scheme sufficiently degrades the visual quality of the video content while incurring only a small bit-overhead, and the proposed reversible watermarking is useful for authenticating the content without loss of quality. The proposed watermarking is also capable of localizing attacks, as the watermark is embedded per intra period. Therefore, the combined scheme of encryption and watermarking in the H.264/SVC codec is fast and appropriate for real-time applications. We explain the encryption and watermarking methods in detail in the following subsections.
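The paper does not fix a particular key derivation function, so the following is only a plausible sketch using an HMAC-based derivation; the master key value and the domain labels are hypothetical.

```python
import hashlib, hmac

def derive_key(master_key: bytes, label: str, length: int = 16) -> bytes:
    """Derive an independent per-domain key from the master key (sketch)."""
    return hmac.new(master_key, label.encode(), hashlib.sha256).digest()[:length]

master = b"\x00" * 16                      # hypothetical 128-bit master key
E_key1 = derive_key(master, "texture")     # sign bits of residual coefficients
E_key2 = derive_key(master, "ipm")         # 4x4 luma intra prediction modes
E_key3 = derive_key(master, "mvd")         # sign bits of MV differences
A_key  = derive_key(master, "watermark")   # watermark position key
```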
Fig. 2. The combined scheme of encryption and watermarking in the H.264/SVC base layer
3.1 Encryption for the Access Right
Intra Prediction Mode (IPM) Encryption. There are nine IPMs for each 4×4 luminance (luma) block and four IPMs each for a 16×16 luma block and an 8×8 chrominance (chroma) block. Encrypting the IPMs of a 16×16 luma block, however, can break format compliance, because they are integrated into and expressed with the MB mode in the bitstream. Also, since the IPMs of an 8×8 chroma block are coded to variable length using Exp-Golomb codes, bit overhead can occur in the encoding process. Therefore, it is reasonable to encrypt the IPMs of each 4×4 luma block, which are expressed as fixed-length bits in the SVC bitstream.
Fig. 3. IPMs for each 4×4 luma block: (a) labeling of prediction samples; (b) directions of IPMs
The IPMs for a 4×4 block are shown in Fig. 3. The sixteen samples (a~p) in the 4×4 block are predicted using previously decoded samples (A~M) in neighboring blocks, as in Fig. 3(a), and the prediction direction can be any one of the nine IPMs numbered from 0 to 8, as illustrated in Fig. 3(b). The IPMs of the blocks are then mapped into the range of values from −1 to 7 in the bitstream, denoted IPM′. The '−1' value of IPM′ is expressed with one bit, and the other values are expressed as fixed-length 4-bit codes including the sign bit. Therefore, we suggest encrypting the values [0~7] of IPM′ in the bitstream by XORing them with the encryption key E_key2, as shown in Fig. 4.
Fig. 4. Encryption process: (a) IPMs; (b) IPM′s in the bitstream; (c) encrypted IPM′s
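A minimal sketch of this XOR step follows. The keystream construction from E_key2 is our assumption, and the special handling of blocks in the first row and column, noted below, is omitted for brevity.

```python
import itertools, hashlib

def keystream(key: bytes):
    """Endless 0/1 bit stream derived from an encryption key (sketch only)."""
    return itertools.cycle(int(b) for byte in hashlib.sha256(key).digest()
                           for b in format(byte, "08b"))

def encrypt_ipms(ipms, bits):
    """XOR-encrypt a macroblock's IPM' values as in Fig. 4. Values in [0, 7]
    are XORed with 3 key bits each; the one-bit '-1' code is left untouched,
    so no bit-overhead arises. Decryption is the identical operation."""
    out = []
    for ipm in ipms:
        if ipm == -1:
            out.append(ipm)                  # most-probable-mode flag: skip
        else:
            k = (next(bits) << 2) | (next(bits) << 1) | next(bits)
            out.append(ipm ^ k)
    return out

enc = encrypt_ipms([2, -1, 0, 7, 5], keystream(b"E_key2"))
assert encrypt_ipms(enc, keystream(b"E_key2")) == [2, -1, 0, 7, 5]
```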
There is no bit-overhead in the encrypted area, because the encrypted values are encoded as fixed-length three bits excluding the sign bit. Moreover, the scheme adds little delay to the encoding process, as it simply XORs with the key. However, for format compliance in the decoding process, the encrypted IPMs of the blocks in the first row and column must keep decodable values.
Motion Vector Difference (MVD) Encryption. Encrypting only the IPMs in intra coding is insufficient for protecting inter-coded frames, i.e., P-frames and B-frames. Inter coding uses MVs for block-based inter prediction. In MV coding, the MVDs are encoded with Exp-Golomb codes and thus have variable-length codes. Therefore, only the sign bits of the MVDs in inter frames are encrypted, using E_key3, in order to avoid bit-overhead. Sign-bit encryption of the MVDs is a simple encryption step that does not affect coding efficiency and satisfies format compliance.
Residual Data (Texture) Encryption. The texture of the IPMs and the MVs is compressed using a DCT transform and entropy coding. The entropy coding uses powerful compression methods such as CAVLC (Context Adaptive Variable Length Coding), CABAC (Context-Based Arithmetic Coding), and UVLC (Universal Variable Length Coding). However, the domain that can be encrypted in a CAVLC bitstream, which supports all profiles, is very restricted, because some domains cause bit-overhead and others break format compliance when encrypted. Therefore, the proposed scheme encrypts the signs of the non-zero coefficients, using E_key1, in a zigzag ordering of the 8×8 blocks. As a light-weight encryption method, this is applied to both the luma and the chroma components. It may generate some bit-overhead after encryption due to CAVLC; however, the bit rate of encrypted frames is nearly the same as that of unencrypted frames.
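Reusing the keystream helper from the earlier IPM sketch, the sign-bit encryption used for both the texture coefficients (E_key1) and the MVDs (E_key3) can be illustrated as follows. Again only a sketch: the real scheme operates on entropy-coded syntax elements rather than on integer arrays.

```python
def encrypt_signs(values, bits):
    """Sign-bit encryption for non-zero coefficients or MVDs: one key bit per
    non-zero value decides whether its sign flips. Code lengths are unchanged,
    so Exp-Golomb/CAVLC format compliance is preserved."""
    out = []
    for v in values:
        if v != 0 and next(bits) == 1:
            v = -v                           # flip the sign bit
        out.append(v)
    return out

enc = encrypt_signs([3, 0, -1, 2], keystream(b"E_key3"))
assert encrypt_signs(enc, keystream(b"E_key3")) == [3, 0, -1, 2]
```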
3.2 Watermarking for the Authentication
We propose a reversible watermarking scheme that uses the intra prediction described above. Intra prediction uses nine IPMs, from 0 to 8, for each 4×4 luma block, as shown in Fig. 5(a), and the IPM selected in each block takes a value in the range from −1 to 7 in the bitstream, denoted IPM′, as shown in Fig. 5(b). The IPM′ values in the range from 0 to 7 are used for encryption, as discussed in Section 3.1; Fig. 5(b) shows the encrypted blocks as blocks filled with dots. Therefore, we use only the blocks with the '−1' value of IPM′ for watermarking.
Watermark Embedding. To find the block for embedding a watermark bit, we first derive the secret key (A_key) from the master key. The A_key consists of binary bits, and each group of three bits of the A_key is expressed as a decimal number, AKi. As shown in Fig. 5, AKi is used to select one of the blocks with the '−1' value of IPM′ for embedding the watermark. In the ordering of the dotted line in Fig. 5(b), the i-th watermark bit is embedded into the (AKi+1)-th such block after the block in which the (i−1)-th watermark bit was embedded. The shaded blocks in Fig. 5(c) show the watermarked blocks.
Fig. 5. IPM processing for watermark embedding: (a) IPMs in the MB; (b) IPMs expressed in the bitstream, IPM′s; (c) watermarked IPM′s at the positions selected with the A_key
To embed the watermark bit Wi in a block that has the '−1' value of IPM′, the '−1' value of IPM′ is modified as follows:
IPM′ = −1 if Wi ⊕ (PBi−1 mod 2) = 0, (1)
IPM′ = −2 if Wi ⊕ (PBi−1 mod 2) = 1. (2)
Here PBi−1 is the number of blocks with the '2' value of IPM in the previous MB, counted before encryption is applied (Fig. 5(a)). This parameter is important for watermark extraction and for the combination of encryption and watermarking. From a security standpoint, an attacker cannot identify the watermark even if he knows that a '−2' value indicates a watermarked block, because a '−2' value can mean either watermark bit 0 or 1; in addition, it is not easy to find the blocks carrying watermark bits without the A_key. However, the proposed scheme incurs a small bit-overhead for transmitting the watermarked data. A negative IPM′ value in the original bitstream is represented with one bit, while a positive value is represented with four bits. To transmit the watermarked '−1' and '−2' values, we add one bit for each block with a negative value, as shown in Fig. 6. No degradation of visual quality occurs, however, because the watermarking is reversible.
Fig. 6. Altered coding of the IPM bitstream
Watermark Detection. The watermark detection scheme follows straightforwardly from the embedding procedure. First, the secret key A_key is extracted from the master key using the received information. To determine the positions where the watermark bits are embedded, the A_key is used in the same way as in the watermark embedding process. Here, PBi−1 is the number of blocks with the '2' value of IPM in the previous MB, decrypted after the authentication is verified. The watermark bit Wi is derived as follows:
Wi = (|IPM′| − 1) ⊕ (PBi−1 mod 2), (3)
where | | denotes absolute value, ⊕ denotes bitwise XOR (exclusive-OR), and 'mod' denotes the modulo operation. If the extracted watermark is correct, the received content is authenticated; only when the authentication is accepted does decryption proceed. After the authentication is accepted, the '−1' and '−2' values of IPM′ are both restored to the '−1' value to recover the visual quality of the original content, and then the MB goes through the decryption process.
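A small self-check of the embedding/detection pair, written directly from Eqs. (1)-(3) as reconstructed above. Since those equations are our reconstruction from the surrounding prose, the parity mapping of PBi−1 here is an interpretation, not the authors' reference code.

```python
def embed_bit(W_i: int, PB_prev: int) -> int:
    """Eqs. (1)-(2): '-1' encodes W_i XOR (PB_{i-1} mod 2) = 0, '-2' encodes 1."""
    return -1 - (W_i ^ (PB_prev % 2))

def detect_bit(ipm_prime: int, PB_prev: int) -> int:
    """Eq. (3): recover W_i from the watermarked IPM' value."""
    return (abs(ipm_prime) - 1) ^ (PB_prev % 2)

# round trip: detection inverts embedding for either bit and any parity,
# and a '-2' value alone reveals nothing about W_i without PB_{i-1}
for W_i in (0, 1):
    for PB_prev in (0, 1, 2, 3):
        assert detect_bit(embed_bit(W_i, PB_prev), PB_prev) == W_i
```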
4 Experimental Results
We implemented the proposed methods in the JSVM 9.8 reference software. The standard video sequences of MPEG SVC were used in the experiments. The total number of frames used in this
Fig. 7. Experimental results of the proposed scheme: (a)(d) original videos; (b)(e) encrypted videos; (c)(f) decrypted videos after authentication
experiment is 50, and the GOP size and the intra period are set to 8 and 16 frames, respectively. All experiments were conducted on QCIF- and CIF-size video sequences for spatial scalability. Each domain for encryption and watermarking used a different key extracted from the master key. On the encryption side, the videos produced by the proposed scheme are too chaotic to be understood, as shown in Fig. 7, so the proposed scheme provides high perceptual security for video content. Taking various videos as examples, we measured the peak signal-to-noise ratio (PSNR) of the unencrypted and the encrypted videos. As can be seen from Table 1, the PSNRs of the encrypted videos are all lower than those of the unencrypted videos, by about 21 dB, so the proposed encryption scheme is perceptually secure. In addition, we observed that sequences with less movement suffer worse quality after encryption. We also obtain full visual security despite encrypting only the base layer. The IPM and MVD encryption does not cause any overhead, but the texture encryption method can cause bit overhead through the CAVLC process; the bitstream size of the proposed encryption increases by only about 0.05% compared with the unencrypted video stream.
Table 1. PSNR of the encrypted videos (dB)

                        QCIF                                        CIF
           Unencrypted video    Encrypted video      Unencrypted video    Encrypted video
mobile     32.5  35.2  34.4      9.2  13.1  12.7     30.5  35.2  34.4      9.4  13.1  12.9
mother     37.6  40.4  41.7      7.5   8.9   9.7     37.5  42.0  43.2      6.4  19.6  26.2
tempete    33.8  36.2  38.6      9.9  14.2  20.3     31.5  36.1  38.4     10.0  14.4  20.5
foreman    35.6  40.5  41.7      8.4  24.5  24.0     34.5  40.1  42.3      8.6  24.5  24.1
football   33.8  37.8  39.1     13.8  17.6  26.9     32.5  37.9  39.6     14.2  18.0  27.2
Table 2. Performance comparisons

                     Method of [14] in AVC                  Proposed method in SVC
Video        Embedded bits  Bit-rate increase (QCIF)   Embedded bits  Bit-rate increase (QCIF / CIF)
mobile            85              0.23%                     174            0.66% / 0.16%
mother            42              0.69%                     183            1.59% / 0.47%
tempete           81              0.44%                     184            1.21% / 0.30%
football           -                -                       180            0.43% / 0.15%
foreman            -                -                       190            1.49% / 0.45%
Table 3. Relative time overhead

Encoding (encryption + watermark embedding): 0.82%
Decoding (watermark extraction + decryption): 0.20%
On the watermarking side, the number of watermark bits embedded in an I-frame of each video sequence and the percentage increase in the video bit rate after watermarking are given in Table 2, which compares the proposed method with the method of Noorkami [14]. The number of watermark bits of the proposed method is larger than that of [14], while the bit rate increases by only a few percent; moreover, as the total number of frames and the intra period grow, the percentage increase in the video bit rate becomes smaller, so the increase in real applications may be lower. Note that the scalable result sequences of the proposed scheme are CIF-size sequences, and the QCIF-size sequences are extracted from the CIF-size sequences. A further advantage of the proposed scheme is that no visual quality degradation occurs, thanks to the reversible watermarking. The performance of the proposed scheme for real-time streaming of video content should not be impaired by computational complexity; as shown in Table 3, the time overhead is indeed very small.
5 Conclusion
We proposed a combined scheme of encryption and watermarking to provide access rights and authentication of content. The scheme protects content more securely because the encrypted content is decrypted only when the watermark is exactly detected, and encryption and watermarking are implemented in the coding process almost simultaneously, per MB, so the proposed scheme is fast and appropriate for real-time applications. In addition, the proposed encryption and watermarking methods apply effectively to H.264/AVC and to SVC, which was recently standardized for video data. On the encryption side, we proposed an efficient selective encryption scheme which encrypts the IPMs of 4×4 luma blocks, the sign bits of the texture, and the sign bits of the MVDs in intra and inter frames. The proposed encryption scheme keeps format compliance and achieves time efficiency
through light-weight encryption. On the watermarking side, we proposed a reversible watermarking scheme using intra prediction modes. The proposed watermarking scheme incurs a small bit-overhead, but no degradation of visual quality occurs thanks to the reversible watermarking. Currently, the proposed scheme is implemented only in the SVC base layer. In future work, we will extend the encryption and watermarking methods to all layers; a method to reduce the bit-overhead of the watermarking is also under investigation.
Acknowledgement. This research was supported by the Program for the Training of Graduate Students in Regional Innovation, conducted by the Ministry of Commerce, Industry and Energy of the Korean Government.
References
1. ISO/IEC JTC 1/SC 29/WG 11 N8750: Joint Scalable Video Model (JSVM), Marrakech, Morocco (January 2007)
2. Shi, C., Bhargava, B.: A fast MPEG video encryption algorithm. In: Proc. ACM Multimedia 1998, pp. 81–88 (September 1998)
3. Zeng, W., Lei, S.: Efficient frequency domain selective scrambling of digital video. IEEE Trans. Multimedia 5(1), 118–129 (2003)
4. Wen, J., Severa, M., Zeng, W., Luttrell, M., Jin, W.: A format compliant configurable encryption framework for access control of multimedia. In: IEEE Workshop on Multimedia Signal Processing, Cannes, France, pp. 435–440 (October 2001)
5. Wang, C., Yu, H., Zheng, M.: A DCT-based MPEG-2 transparent scrambling algorithm. IEEE Transactions on Consumer Electronics 49(4), 1208–1213 (2003)
6. Yuan, C., Zhu, B.B., Wang, Y., Li, S., Zhong, Y.: Efficient and fully scalable encryption for MPEG-4 FGS. In: Proc. IEEE Int. Symp. Circuits and Systems, Bangkok, Thailand, vol. 2, pp. 620–623 (May 2003)
7. Taubman, D.S., Marcellin, M.W.: JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer Acad. Pubs., Dordrecht (2002)
8. Li, W.: Overview of fine granularity scalability in MPEG-4 video standard. IEEE Trans. Circuits and Syst. Video Technol. 11(3), 301–317 (2001)
9. Zhu, B.B., Yuan, C., Wang, Y., Li, S.: Scalable Protection for MPEG-4 Fine Granularity Scalability. IEEE Trans. Multimedia 7(2), 222–233 (2005)
10. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC Video Coding Standard. IEEE Trans. Circuits and Syst. Video Technol. 13(7), 560–576 (2003)
11. Lian, S., Liu, Z., Ren, Z., Wang, H.: Secure Advanced Video Coding Based on Selective Encryption Algorithms. IEEE Transactions on Consumer Electronics 52(2), 621–629 (2006)
12. Won, Y.G., Bae, T.M., Ro, Y.M.: Scalable Protection and Access Control in Full Scalable Video Coding. In: Shi, Y.Q., Jeon, B. (eds.) IWDW 2006. LNCS, vol. 4283, pp. 407–421. Springer, Heidelberg (2006)
13. Kim, H.M., Cho, I.H., Cho, A.Y., Jeong, D.S.: Semi-fragile Watermarking Algorithm for Detection and Localization of Tampering Using Hybrid Watermarking Method in MPEG-2 Video. In: Hao, Y., Liu, J., Wang, Y.-P., Cheung, Y.-m., Yin, H., Jiao, L., Ma, J., Jiao, Y.-C. (eds.) CIS 2005. LNCS (LNAI), vol. 3802, pp. 623–628. Springer, Heidelberg (2005)
14. Noorkami, M., Mersereau, R.M.: Compressed-domain Video Watermarking for H.264. In: IEEE International Conference on Image Processing, September 2005, vol. 2, pp. 11–14 (2005)
15. Kim, S.M., Kim, S.B., Hong, Y., Won, C.S.: Data Hiding on H.264/AVC Compressed Video. In: Kamel, M., Campilho, A. (eds.) ICIAR 2007. LNCS, vol. 4633, pp. 698–707. Springer, Heidelberg (2007)
16. Simitopoulos, D., Zissis, N., Georgiadis, P., Emmanouilidis, V., Strintzis, M.G.: Encryption and watermarking for the secure distribution of copyrighted MPEG video on DVD. Multimedia Systems 9(3), 217–227 (2003)
17. Xu, X., Dexter, S.D., Eskicioglu, A.M.: A hybrid scheme for encryption and watermarking. In: Proceedings of the SPIE Security, Steganography, and Watermarking of Multimedia Contents VI Conference, vol. 5306, pp. 19–22 (January 2004)
18. Wu, T., Wu, S.: Selective Encryption and Watermarking of MPEG Video. In: Proceedings Int. Conference on Image Science, Systems, and Technology, CISST 1997 (February 1997)
Evaluation of Integrity Verification System for Video Content Using Digital Watermarking
Takaaki Yamada1, Yoshiyasu Takahashi1, Yasuhiro Fujii1, Ryu Ebisawa1, Hiroshi Yoshiura2, and Isao Echizen3
1 Systems Development Laboratory, Hitachi, Ltd., 292, Yoshida-cho, Totsuka-ku, Yokohama, 244-0817, Japan
{takaaki.yamada.tr,yoshiyasu.takahashi.gq,yasuhiro.fujii.sj,ryu.ebisawa.st}@hitachi.com
2 Faculty of Electro-Communication, The University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, 182-8585, Japan
[email protected]
3 National Institute of Informatics, 2-1-2, Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
[email protected]
Abstract. The effectiveness of a previously proposed system for verifying video content integrity was investigated using a prototype system. Unlike conventional verification systems using digital signatures and fragile watermarking, the proposed verification system embeds timecodes in consecutive frames of the content, extracts them some time later, and checks their continuity to distinguish attacks from regular modifications. Testing demonstrated that the system maintained picture quality, that the watermarks were robust against regular modifications, and that attacks could be detected reliably and distinguished from regular modifications. This integrity verification system is thus well suited for a variety of applications using video content. Keywords: integrity verification, video watermarking, evaluation.
1 Introduction
Digital video content has become widely available through various media such as the Internet and digital broadcasting because of its advantages over analog video content: it requires less space, is easier to process, and does not degrade over time or with repeated use. A serious problem, however, is that the integrity of digital video content is easily violated, because the content can be easily modified using software editing tools. Methods for verifying the integrity of video content by detecting changes in the content are thus becoming increasingly important. Since video is regularly encoded and transcoded in various ways, verification methods should be able to distinguish between illegal modifications, i.e., attacks against the content, and regular modifications. Conventional verification methods using digital signatures and fragile watermarking schemes cannot do this.
We previously proposed a system for verifying and protecting the integrity of video by using digital watermarking [1, 2]. The proposed system distinguishes attacks from regular modifications by extracting timecodes embedded as watermarks (WMs) in consecutive frames of the content and checking their continuity. We have now investigated the effectiveness of the proposed system by evaluating the abilities of a prototype, focusing on watermarked picture quality, the robustness of the WMs against regular modifications, and the detectability of attacks. Section 2 briefly overviews conventional methods for verifying the integrity of video content and presents the requirement for a video verification system. Section 3 briefly describes our proposed verification system. Section 4 describes our investigation of the abilities of the prototype and presents the evaluation results for the important performance measures. Section 5 briefly summarizes the key points.
2 Conventional Methods
There are two types of conventional methods for verifying the integrity of video content:
(a) Methods using digital signatures, which are widely used for content verification [3, 4]. Digital signatures are generated by calculating a hash value from the data values in the content, encrypting the hash, and adding it to the content header. Verification is done by recalculating the hash value from the content, decrypting the one in the header, and comparing them. If the values match, content integrity has been maintained; if not, it has been broken.
(b) Methods using fragile or semi-fragile watermarks, which are also widely used [5, 6, 7]. Watermarks are embedded in each frame and are easily broken by a change in the content. Semi-fragile watermarks are likely to survive JPEG and MPEG compression at high bit rates, while fragile ones are not. Verification is done by checking for broken watermarks; if any are found, content integrity is assumed to have been broken.
The first type is well suited for small content, such as text and document files, that is not modified by an application. The second type is well suited for still images that are not modified or are modified only restrictively. Neither type is well suited for video content, because video content is regularly encoded, transcoded, and converted in various ways, such as MPEG encoding, resizing, filtering, and D/A-A/D conversion, depending on the application. As shown in Table 1, methods using digital signatures or fragile WMs are not resistant to any change, including regular modifications: even a D/A-A/D conversion is detected as an attack. Methods using semi-fragile WMs are resistant to specific regular modifications such as MPEG compression at high bit rates, but a change could still be detected as a malicious attack if the latest video format is used. A method for verifying video content must be able to distinguish between regular and irregular modifications, i.e., attacks.
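Method (a) in miniature: the following sketch uses an HMAC as a stand-in for the public-key signature, since only the hash-and-compare logic matters here. It also shows why any bit change, including a regular re-encoding, is flagged as an attack.

```python
import hashlib, hmac

def sign_content(content: bytes, key: bytes) -> bytes:
    """Hash the content data and 'encrypt' the hash (an HMAC stands in for
    the public-key signature placed in the content header)."""
    digest = hashlib.sha256(content).digest()
    return hmac.new(key, digest, hashlib.sha256).digest()

def verify_content(content: bytes, signature: bytes, key: bytes) -> bool:
    """Recompute the hash and compare with the one carried in the header."""
    return hmac.compare_digest(signature, sign_content(content, key))

video = b"frame-data..."
sig = sign_content(video, b"k")
assert verify_content(video, sig, b"k")             # integrity maintained
assert not verify_content(video + b".", sig, b"k")  # any change, even a
                                                    # regular re-encoding, breaks it
```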
Table 1. Conventional methods for verifying integrity of video content

                     Regular modification        Attack
Digital signatures   Attacked                    Attacked
Fragile WMs          Attacked                    Attacked
Semi-fragile WMs     Normal (limited mods.)      Attacked (limited attacks)
3 Proposed System
Our proposed integrity verification system uses robust video watermarking to identify attacks against video content and distinguish them from regular modifications such as video encoding and transcoding. There are many target applications; here we describe two.
Medical Operations. Operations in hospitals are now commonly recorded with a video recorder. If the surgeon makes a mistake, he or she might be tempted to later edit the recording to excise any damaging evidence. An auditor checking the consistency of the recording using our system could determine whether it had been edited.
Public Works. The progress of public works projects is often documented by video recording. If progress falls behind schedule, the site manager can, using a simple PC editing tool, replace some of the content with a recording of work completed elsewhere. Again, an auditor checking the consistency of the recording using our system could determine whether content had been replaced.
3.1 Attack Types
We consider three types of attacks against video recordings: deletion, addition, and replacement. A deletion attack removes some of the content, as in the medical operations example. An addition attack adds content between frames. A replacement attack is a combination of deletion and addition: frames are deleted and replaced, in exactly the same number, at the same position. This is the type of attack described in the public works example.
3.2 Process Flow
Our system uses an encoder, which embeds watermarks, and a detector, which extracts the watermarks and checks content integrity. The encoder is implemented in a video camera. At the same time as the camera records the image, the encoder embeds a watermark representing the timecode in each frame and encodes the frame. The timecode is not necessarily the actual
Fig. 1. Attack types: (a) deletion; (b) addition; (c) replacement
Fig. 2. Watermark embedding (consecutive N-frame segments on the time axis in seconds, with timecodes t0 = 0, t1 = d, ..., tk = (k − 1)·d)
time. The same watermark is embedded in N consecutive frames¹, as shown in Fig. 2. The watermark for each N-frame segment is the timecode corresponding to the encoding time (beginning at t = t1) of the segment's first frame.
The detector is implemented in a video recorder or in an auditor's equipment. It identifies and locates any attack on the video content. Using a detection window (the width of n = N/2 consecutive frames), it first obtains the timecode by extracting it from each n-frame segment. For each segment, the WMs representing the timecode are detected by accumulating the n consecutive frames, as described in detail in Section 4.1. It then checks for consistency between the timecodes by checking their order and counts the number of windows in which no WM was detected. In this way, it can detect an attack, determine the type of attack, and determine how many frames are affected (with n-frame precision). The parameters used are listed in Table 2. There are four steps in the detection process.
Step 1: Set the initial values of TCcur, TCpre, TCold, i, and nD to zero:
TCcur = TCpre = TCold = i = nD = 0. (1)
Step 2: For the i-th detection window, accumulate n frames and extract their WMs. (1) If no WMs are detected, increment nD and i (nD = nD + 1; i = i + 1).
¹ N is an even number given by N = 2n, where n is the accumulation number used in WM detection.
Table 2. Parameters used in detection

TCcur: current detected timecode
TCpre: previous detected timecode
TCold: timecode detected before TCpre
i: ordering number of detection windows
nD: number of detection windows where WMs are not detected
(2) If WMs are detected, set TCold to TCpre, TCpre to TCcur, TCcur to the detected timecode, and nD to zero:
TCold = TCpre, (2)
TCpre = TCcur, (3)
TCcur = "detected timecode", (4)
nD = 0. (5)
Step 3: Check for consistency between TCold, TCpre, and TCcur, and check the value of nD. If the values satisfy the specified conditions, the content has been attacked.
Step 4: Increment i (i = i + 1) and go to Step 2.
3.3 Attack Identification Method
As described above, the detector uses the values of TCcur, TCpre, TCold, and nD to identify the type of attack. The detector first determines whether three different timecodes appear one after the other (TCold + d = TCpre and TCpre + d = TCcur). If they do, fewer than N frames have been deleted. If they do not (TCold = TCpre or TCcur = TCpre), the type of attack is identified using the values of TCpre, TCcur, and nD (α, β > 1), as shown in Table 3. A replacement attack is identified when successive addition and deletion attacks are detected; such an attack has occurred if β = α or β = α + 1, and it is detected after the timecode for the next window is extracted.
Table 3. Correspondence between parameter values and attack type

              TCcur − TCpre    nD
No attack     d or 0           0 or 1
Addition      d or 0           α
Deletion      α·d              0 or 1
Combination   β·d              α
Fig. 3. Addition attack: M frames are added between two N-frame segments with timecodes t1 and t2; detection windows 3-5 yield no WM (nD counts up to 3) and are identified as an addition attack, while the surrounding windows report no attack
Figure 3 illustrates how an addition attack is identified. If consecutive windows without a WM are detected (nD = 3 in the example shown) and no gap is detected between the preceding and current timecodes (t1 and t2), the detector identifies an addition attack.²
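The window-scanning logic of Steps 1-4 and Table 3 can be condensed into the following sketch. This is our own simplification: it buffers only TCpre and nD, and the replacement case is reported as a combination candidate rather than resolved across two successive gaps.

```python
def classify_window(TCcur, TCpre, nD, d):
    """Apply Table 3 to the buffered values once a WM is detected again."""
    diff = TCcur - TCpre
    if diff in (0, d):
        return "no attack" if nD <= 1 else "addition"
    # diff = alpha*d, i.e. timecodes were skipped: frames were deleted
    if nD <= 1:
        return "deletion"
    return "combination (replacement candidate)"

def scan(timecodes, d):
    """Walk the detection windows; None means no WM was detected (Step 2(1))."""
    TCpre = None
    nD = 0
    results = []
    for tc in timecodes:
        if tc is None:            # Step 2(1): no WM in this window
            nD += 1
            continue
        if TCpre is not None:     # Step 3: apply Table 3 once a WM reappears
            results.append(classify_window(tc, TCpre, nD, d))
        TCpre = tc                # Step 2(2): shift buffer, reset nD
        nD = 0
    return results

# Fig. 3's addition attack: three windows without a WM, no timecode gap
print(scan([10, 10, None, None, None, 20, 20], d=10))
# -> ['no attack', 'addition', 'no attack']
```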
4 Evaluation
4.1 Prototype System
The prototype we used for evaluating our proposed system is based on the process flow described in Section 3.2. Figure 4 shows an example of the display when the detector identified attacks (deletion shown by |; addition shown by A) in 38 seconds of video. We considered various video watermarking methods [9, 10, 11, 12] for our prototype and developed one based on a basic, correlation-based algorithm [12]. It embeds and detects 64 bits of information, representing the timecode (32 bits) and the content ID (32 bits). WM embedding and detection were implemented in the prototype's encoder and detector, respectively. The process flow of embedding and detection for each N-frame segment is as follows. To simplify the description, we describe a 1-bit-WM scheme; in the 64-bit scheme, each frame is divided into 64 regions and the 1-bit process is applied to each region.
² Note that the third window is not identified as an attacked one but simply as one without a WM.
Fig. 4. Example detector display for attacks
WM embedding. The luminance set of the f-th frame, consisting of M pixels, is y(f) = {yi(f) | 1 ≤ i ≤ M}. The process flow for 1-bit-WM embedding is described below.
Step E1: Do the following steps over f = 1, 2, ..., N.
Step E2: Generate the watermarked frame ỹ(f) by adding the WM pattern, m = {mi ∈ {−1, +1} | 1 ≤ i ≤ M}, to the original frame, y(f), according to the embedded bit, b. Each watermarked pixel is given by
ỹi(f) = yi(f) + μi(f)·mi if b = 1; ỹi(f) = yi(f) − μi(f)·mi if b = 0, (6)
where μi(f) is the WM strength at pixel i.
WM detection. WMs are detected by correlating the WM pattern with the accumulated frame:
Step D1: Input n watermarked frames ỹ(f) (f = 1, ..., n).
Step D2: Accumulate the n frames into the frame ȳ = {ȳi | 1 ≤ i ≤ M}, where ȳi = (1/n)·Σf=1..n ỹi(f).
Step D3: Calculate the correlation value c by correlating the WM pattern m with the accumulated frame ȳ. That is,
c = (1/M)·Σi mi·ȳi = (1/(Mn))·Σi,f mi·yi(f) ± μ, (7)
where μ is the averaged WM strength given by μ = (1/(Mn))·Σi,f μi(f).
Step D4: Determine the embedded bit, b, by comparing c with the threshold value T (> 0):
b = 1 if c ≥ T; b = 0 if c ≤ −T; "not detected" if −T < c < T. (8)
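A numerical sketch of Eqs. (6)-(8) with synthetic frames follows. We additionally remove the DC component of the accumulated frame so that the content term in Eq. (7) stays negligible; the parameter values and the synthetic luminance model are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, mu, T = 4096, 30, 2.0, 0.5         # region size, frames, WM strength, threshold
m = rng.choice([-1, 1], size=M)          # WM pattern m_i in {-1, +1}

def embed(frames, b):
    """Eq. (6): add +mu*m for b = 1 and -mu*m for b = 0 to every frame."""
    return [y + (mu if b == 1 else -mu) * m for y in frames]

def detect(frames):
    """Eqs. (7)-(8): accumulate n frames, correlate with m, threshold at T."""
    y_bar = np.mean(frames, axis=0)               # Step D2: frame accumulation
    c = np.mean(m * (y_bar - y_bar.mean()))       # Step D3 (DC removed)
    if c >= T:
        return 1
    if c <= -T:
        return 0
    return None                                   # "not detected"

frames = [rng.normal(128.0, 20.0, M) for _ in range(n)]  # stand-in luminance
assert detect(embed(frames, 1)) == 1
assert detect(embed(frames, 0)) == 0
```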
Fig. 5. Scene from evaluated video
Table 4. Level of disturbance and rating scale
5: Imperceptible
4: Perceptible, but not annoying
3: Slightly annoying
2: Annoying
1: Very annoying
4.2 Evaluation of Prototype
We evaluated the performance of our system by using a standard video sample [13] (Walk through the Square (“Walk”): people walking in a town square (low-speed panning shot), 450 frames of 720 × 480 pixels). See Figure 5. Picture quality. We subjectively evaluated watermarked picture quality using the procedure described in Recommendation ITU-R BT.500-7 [14]. Watermarked videos generated by the process flow of WM embedding in Section 4.1 were displayed on a monitor, and ten people rated the picture quality based on the scale shown in Table 4. The average score for “Walk” was 4.1, better than “Perceptible but not annoying,” indicating that the proposed system maintained picture quality. Robustness against regular modifications. Since the proposed system should be able to distinguish between regular modifications and attacks, WMs used in the proposed system need to survive regular modifications such as video encoding and transcoding. We therefore evaluated WM robustness against representative video encodings of MPEG-2 and H.264 at different bit rates. The watermarked videos were encoded by MPEG-2 or H.264 at two different bit rates each (MPEG-2: 3 and 4 Mbps; H.264: 700 kbps and 1 Mbps). After decoding, the 64-bit information was sequentially detected using four accumulation numbers (n = 1, 5, 10, and 30) from 450 frames of the watermarked pictures as described in the process flow of WM detection in Section 4.1. Figure 6 shows the correct detection ratios for n = 1, 5, 10, and 30 (the numbers of detected points were, respectively, 450, 90, 45, and 15). The correct
Fig. 6. Correct detection ratios r [%] versus accumulation number n (= 1, 5, 10, 30): (a) MPEG-2 at 3 and 4 Mbps; (b) H.264 at 700 kbps and 1 Mbps
detection ratio r is given by the number of detected points where all 64 bits were correctly detected divided by the total number of detected points (15, 45, 90, or 450). From Fig. 6, we can see that the detection ratios were 100% for all the evaluated bit rates when n was 30. We therefore set the accumulation number used in the detector, n, to 30 and the number of frames in each segment, N (= 2n), to 60; these values were used in the following evaluation.
Detectability of attacks. We evaluated the prototype's ability to detect attacks by first generating a watermarked video using the prototype's encoder and compressing it with MPEG-2 or H.264 at two different bit rates each (MPEG-2: 3 and 4 Mbps; H.264: 700 kbps and 1 Mbps). We then applied two attacks on different parts of the same watermarked video. Each possibility was tested: two deletions, two additions, two replacements, deletion-addition, deletion-replacement, and finally replacement-addition. The detector identified every attack for all the encoded watermarked videos and also determined the positions where they occurred. Our system thus meets the requirement noted in Section 2: it can distinguish between regular modifications and attacks.
5 Conclusion We tested a prototype of our previously proposed system for verifying video integrity by using video watermarking. We evaluated it with regard to watermarked picture quality, robustness against regular modifications, and attack detectability. It was able to detect and identify attacks on video content even when the content suffered multiple types of attacks. It is thus usable by various types of applications using video content as evidence.
References
1. Echizen, I., et al.: Integrity Verification System for Video Content by Using Digital Watermarking. In: Proc. IEEE International Conference on Service Systems and Service Management (ICSSM 2006), pp. 1619–1624 (2006)
2. Echizen, I., et al.: Improved Video Verification Method Using Digital Watermarking. In: Proc. IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP 2006), pp. 445–448 (2006)
3. Pramateftakis, M., et al.: Authentication of MPEG-4-based surveillance video. In: Proc. IEEE International Conference on Image Processing (ICIP 2004), vol. 1, pp. 33–37 (2004)
4. Morito, H., et al.: Digital Camera for Taking Evidential Photographic Images. In: Proc. IEEE International Conference on Consumer Electronics (ICCE 2001), pp. 118–119 (2001)
5. Wu, M., et al.: Watermarking for Image Authentication. In: Proc. IEEE International Conference on Image Processing, vol. 2, pp. 437–441 (1998)
6. Lin, C.-Y., et al.: Robust Image Authentication Method Surviving JPEG Lossy Compression. In: Proc. SPIE, vol. 3312, pp. 296–307 (1998)
7. Lin, C.-Y., et al.: Issues and Solutions for Authenticating MPEG Video. In: Proc. SPIE, vol. 3657, pp. 54–65 (1999)
8. Echizen, I., et al.: Perceptually Adaptive Video Watermarking Using Motion Estimation. International Journal of Image and Graphics 5(1), 89–109 (2005)
9. Bender, W., et al.: Techniques for data hiding. In: Proc. SPIE, vol. 2020, pp. 2420–2440 (1995)
10. Delaigle, J., et al.: Watermarking algorithm based on a human visual model. Signal Processing 66, 319–335 (1998)
11. Kundur, D., et al.: Digital watermarking using multiresolution wavelet decomposition. In: Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing, vol. 5, pp. 2969–2972 (1998)
12. Echizen, I., et al.: General quality maintenance module for motion picture watermarking. IEEE Trans. Consumer Electronics 45(4), 1150–1158 (1999)
13. Evaluation video sample (standard definition), The Institute of Image Information and Television Engineers
14. Rec. ITU-R BT.500-7: Methodology for the subjective assessment of the quality of television pictures (1995)
Improving the Host Authentication Mechanism for POD Copy Protection System Eun-Jun Yoon and Kee-Young Yoo Department of Computer Engineering, Kyungpook National University, 1370 Sankyuk-Dong, Buk-Gu, Daegu 702-701, South Korea Tel.: +82-53-950-5553; Fax: +82-53-957-4846
[email protected],
[email protected] Abstract. In 2004, SCTE approved the POD (point-of-deployment) copy protection system as an ANSI standard. In the POD copy protection system, before the POD module removes the scrambling, it first authenticates the host device to confirm its legitimacy. This confirmation procedure is known as the "host authentication mechanism". However, in 2005, Tian et al. showed that the host authentication mechanism in the current POD copy protection system has the following three weaknesses: (1) The re-authentication procedure cannot withstand a simple replay attack. (2) The host authentication protocol makes a DH (Diffie-Hellman) private key in any run as significant as a device private key. (3) The credentials of the POD module are futile. This paper proposes an improved secure host authentication mechanism for the current POD copy protection system that not only withstands the above three weaknesses but also provides the same efficiency. Keywords: Host authentication, POD copy protection system, Diffie-Hellman key agreement, Replay attack.
1 Introduction
In 2004, SCTE (Society of Cable Telecommunications Engineers) [1] approved the POD (point-of-deployment) copy protection system [2] as an ANSI standard, and CE (consumer electronics) manufacturers must conform to the standard to implement a POD-Host interface. In the POD copy protection system, high-value movies and video programs ("content") are protected by a conditional access scrambling system when they flow through a POD-Host interface. Before delivering these contents to the consumer receivers or set-top terminals (host devices), the POD module always removes the scrambling and, according to the content control information, may rescramble these contents. Recently, many researchers [3][4][5][6] have designed and implemented POD security module chip-sets based on the POD copy protection system. Before the POD module removes the scrambling, it first authenticates a host device to confirm its legitimacy. This confirmation procedure is
Corresponding author.
known as the "host authentication mechanism". In general, this mechanism consists of a host authentication protocol and a re-authentication protocol. The host re-authentication protocol is designed to accelerate the authentication procedure for a pair of POD module and host device that have already authenticated each other. The host authentication mechanism across the POD-Host interface is based on the exchange of Host and POD certificates. Using digital signature verification techniques, each device verifies the other's certificate, and then the Host and POD IDs are reported to the Headend. Finally, the Headend compares the IDs against a revocation list and takes appropriate revocation action against compromised devices. In 2005, however, Tian et al. [7] pointed out that the host authentication mechanism in the current POD copy protection system has the following three weaknesses: (1) The re-authentication procedure cannot withstand a simple replay attack. That is, the re-authentication protocol is only a retransmission of a previous credential, so any attacker can retransmit the credential to re-authenticate successfully. (2) The host authentication protocol makes a DH (Diffie-Hellman) private key [8] in any run as significant as a device private key. That is, when a POD module authenticates a host device, it relies heavily on one credential computed from a DH private key. This makes the DH private key as significant as the host device's long-term private key: for the purpose of authentication, the leakage of one DH private key is equivalent to the leakage of the device's long-term key. (3) The credentials of the POD module are futile: they do not serve the goal of host authentication, so these credentials could be removed without loss of security. Based on Tian et al.'s security analysis, this paper proposes an improved secure host authentication mechanism for the current POD copy protection system that adopts the timestamp technique [8][9][10] to eliminate the three weaknesses. We also prove that the proposed host authentication mechanism withstands the three weaknesses. This paper is organized as follows: In Sections 2 and 3, we briefly review the host authentication mechanism and Tian et al.'s security analysis. In Section 4, we present an improvement of the current host authentication mechanism. In Section 5, we prove the security of our improvement. Finally, our conclusions are presented in Section 6.
2 Review of Host Authentication Mechanism
This section reviews the host authentication mechanism in the POD copy protection system [1]. The host authentication mechanism includes a host authentication protocol and a host re-authentication protocol.
2.1 Host Authentication Protocol
The host authentication protocol is designed to authenticate a host device. It involves three parties: the POD security module, the host device, and the Cable Headend, and it consists of three phases: certificate verification and DH (Diffie-Hellman) key exchange, authentication key verification, and the headend report back.
Message flow between the Headend, the POD module, and the Host (POD generates x; Host generates y):
1. POD → Host: g^x, Sig_KP(g^x), POD Cert List
2. Host → POD: g^y, Sig_KH(g^y), Host Cert List
3. POD computes AuthKey_P = SHA-1((g^y)^x | Host ID | POD ID); Host computes AuthKey_H = SHA-1((g^x)^y | Host ID | POD ID)
4. POD → Host: Request the authentication key
5. Host → POD: AuthKey_H
6. POD verifies AuthKey_H =? AuthKey_P
7. POD → Headend: POD ID, Host ID
8. Headend → POD: ID Validation Message (the Headend verifies the certificate revocation lists)
Fig. 1. Host authentication protocol
At the end of a host authentication protocol run, the POD security module can be sure that the host device holds a valid certificate and the corresponding private key. In addition, the POD security module can confirm that the host device has derived a common authentication key. Figure 1 shows the current host authentication protocol. The message flows of a host authentication protocol run are as follows:
Certificate verification and DH key exchange phase:
1. POD → Host: g^x, Sig_KP(g^x), POD Cert List
The POD security module sends its certificate data (POD Cert List), the newly generated DH public key (g^x), and the Diffie-Hellman key signature (Sig_KP(g^x)) to the host.
2. Host → POD: g^y, Sig_KH(g^y), Host Cert List
The host device replies with its host certificate data (Host Cert List), the newly generated DH public key (g^y), and the Diffie-Hellman key signature (Sig_KH(g^y)).
3. After Steps 1 and 2, the POD module and host device compute their respective authentication keys AuthKey_P and AuthKey_H as follows: AuthKey_P = SHA-1((g^y)^x | Host ID | POD ID), AuthKey_H = SHA-1((g^x)^y | Host ID | POD ID).
Authentication key verification phase:
1. POD → Host: Request the authentication key
The POD module sends a request message to the host device for the authentication key.
2. Host → POD: AuthKey_H
The host device sends its authentication key AuthKey_H to the POD module.
3. The POD module confirms that its derived authentication key AuthKey_P is the same as the AuthKey_H received from the host device. If they match, the POD module accepts the host device as a legitimate one.
Headend report back phase: The POD module and Cable Headend communicate through a private authenticated CA (conditional access) system.
1. POD → Headend: POD ID, Host ID
The POD module sends the POD ID and Host ID to the Cable Headend.
2. Headend → POD: ID Validation Message
The Cable Headend checks the certificate revocation lists for certificate validation.
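To make the key agreement concrete, the following Python sketch reproduces the AuthKey derivation of Fig. 1 under toy parameters; the DH group, identifiers, and helper names are illustrative assumptions, not the values mandated by the SCTE specification:

```python
import hashlib
import secrets

# Toy DH group for illustration only: a 127-bit Mersenne prime is far too
# small for real use. The standard mandates its own group parameters.
P = 2**127 - 1
G = 3

def dh_keypair():
    x = secrets.randbelow(P - 2) + 1      # DH private key
    return x, pow(G, x, P)                # (private key, public key g^x mod P)

def auth_key(shared: int, host_id: bytes, pod_id: bytes) -> bytes:
    # AuthKey = SHA-1(g^xy | Host ID | POD ID), as in Fig. 1.
    data = shared.to_bytes((shared.bit_length() + 7) // 8 or 1, "big")
    return hashlib.sha1(data + host_id + pod_id).digest()

# One protocol run: both sides derive the same authentication key.
x, gx = dh_keypair()                      # POD side
y, gy = dh_keypair()                      # Host side
auth_p = auth_key(pow(gy, x, P), b"HOST-1", b"POD-1")
auth_h = auth_key(pow(gx, y, P), b"HOST-1", b"POD-1")
assert auth_p == auth_h
```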
2.2 Host Re-authentication Protocol
The host re-authentication protocol is designed to accelerate the authentication procedure for a pair of POD module and host device that have already authenticated each other. Figure 2 shows the current host re-authentication protocol. Its messages are as follows:
1. POD → Host: Request the authentication key
When the POD module finds a valid authentication key in its non-volatile memory, the POD module requests the host device to send its AuthKey_H.
2. Host → POD: AuthKey_H
The host device replies with its AuthKey_H, if available.
3. The POD module compares the received AuthKey_H with the locally stored AuthKey_P. If AuthKey_H = AuthKey_P, the POD module believes that it is communicating with a legitimate host device.
POD (holds AuthKey_P) and Host (holds AuthKey_H):
1. POD → Host: Request the authentication key
2. Host → POD: AuthKey_H
3. POD verifies AuthKey_H =? AuthKey_P
Fig. 2. Host re-authentication protocol
3 Security of Host Authentication Mechanism
In 2005, Tian et al. [7] showed three security flaws in the host authentication mechanism. First, the host re-authentication protocol is too simple to fulfill the authentication task: a simple replay attack can overthrow the authentication objective. Second, the effect of a DH private value leaking is the same as that of the host device's private key leaking. Finally, the DH public key signature and POD Cert List are useless for host authentication or POD module authentication. In this section, we briefly review these security flaws [7].
3.1 Attack on Host Re-authentication Protocol
The goal of the re-authentication protocol is to authenticate the communication peer as one that has been authenticated before. However, Tian et al. showed that an active attacker can simply overthrow this goal as follows:
1. In a run of the re-authentication protocol or of the host authentication protocol, the attacker eavesdrops and records AuthKey_H.
2. When the POD module requests the authentication key in another host re-authentication procedure, the attacker sends the recorded AuthKey_H as the response.
As a result, the POD module believes that the communicating peer is a legitimate host device authenticated before, but in fact it is communicating with an attacker that holds only an eavesdropped AuthKey_H. The result is an authentication failure: although the attacker cannot decrypt the high-value content from the POD module, the attacker does authenticate successfully and causes the POD module to remove the scrambling of the CA system.
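A minimal simulation of this replay (the network, key value, and helper names are hypothetical stand-ins):

```python
import hashlib

# Stand-in for an authentication key agreed in an earlier protocol run.
auth_p = auth_h = hashlib.sha1(b"g^xy|HOST-1|POD-1").digest()

recorded = {}

def eavesdrop(message: bytes) -> None:
    recorded["auth_key"] = message        # a passive attacker records the response

def pod_verify(response: bytes) -> bool:
    return response == auth_p             # the POD module's only check

# Legitimate re-authentication run: the host sends AuthKey_H in the clear.
eavesdrop(auth_h)
assert pod_verify(auth_h)

# Later run: replaying the recorded value is accepted, although the
# attacker never learned any private key.
assert pod_verify(recorded["auth_key"])
```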
3.2 Too Significant DH Private Key
Tian et al. pointed out that the DH private key is as significant as the device private key, as follows:
1. Assume that an attacker finds out a DH private key y. The corresponding DH public key g^y may appear in Step 2 of any run of the host authentication protocol, and the Step 2 message is recorded by the attacker.
2. When a POD module initiates a new run of the host authentication protocol, the attacker replies with the recorded message.
3. When the POD module requests the authentication key, the attacker computes and replies with AuthKey_H: AuthKey_H = SHA-1((g^x)^y | Host ID | POD ID).
As a result, if the Host ID in the recorded message is still valid, the whole authentication procedure will complete. The practical effect is that a non-licensed device with a recorded message and a compromised DH private key can receive high-value contents as if it were a licensed device.
3.3 Useless Messages in the First Step
Tian et al. pointed out that the signature and certificate list are useless in the first message of the host authentication protocol because they contribute nothing to the objective of host authentication or to any putative POD module authentication. That is, Tian et al. claimed that there is no need for a POD module to show any credentials for the goal of host authentication, for the following two reasons. (1) Any active attacker can record the first message of the host authentication protocol and replay it to a host device. The host device cannot detect this replay, since it verifies nothing in the remaining steps. (2) Any attacker can use a compromised certificate from a certificate revocation list to produce a valid Step 1 message. The host device has no chance to learn the certificate status, since that status is checked by the Cable Headend, which has a secure channel only to the POD module.
4 Proposed Host Authentication Mechanism
This section proposes an improvement of the host authentication mechanism in the current POD copy protection system that withstands the above security problems. Unlike the original host authentication mechanism, ours includes the timestamp technique [8][9][10] to prevent replay attacks and to provide freshness of the messages in a protocol run. After the POD module and host device check the freshness of the received timestamp, they proceed to the next step to agree on the authentication key.
4.1 Host Authentication Protocol
The proposed host authentication protocol also consists of three phases: certificate verification and DH key exchange, authentication key verification, and the headend report back. Figure 3 shows the proposed host authentication protocol. Its message flows are as follows:
Certificate verification and DH key exchange phase:
1. POD → Host: g^x, Sig_KP(g^x|T_P), POD Cert List
The POD security module sends its certificate data (POD Cert List), the newly generated DH public key (g^x), and the Diffie-Hellman key signature (Sig_KP(g^x|T_P)) to the host, where T_P is the POD's current timestamp.
Message flow between the Headend, the POD module, and the Host (POD generates x and picks up timestamp T_P; Host generates y and picks up timestamp T_H):
1. POD → Host: g^x, Sig_KP(g^x|T_P), POD Cert List
2. Host → POD: g^y, Sig_KH(g^y|T_H), Host Cert List
3. Host aborts if (T_P* − T_P) > ΔT, else computes AuthKey_H = SHA-1((g^x)^y | Host ID | POD ID); POD aborts if (T_H* − T_H) > ΔT, else computes AuthKey_P = SHA-1((g^y)^x | Host ID | POD ID)
4. POD → Host: Request the authentication key
5. Host → POD: SHA-1(AuthKey_H|T_H), T_H
6. POD aborts if (T_H* − T_H) > ΔT, else verifies SHA-1(AuthKey_H|T_H) =? SHA-1(AuthKey_P|T_H)
7. POD → Headend: POD ID, Host ID
8. Headend → POD: ID Validation Message (the Headend verifies the certificate revocation lists)
Fig. 3. Proposed host authentication protocol
2. Host → POD: g^y, Sig_KH(g^y|T_H), Host Cert List
The host device replies with its host certificate data (Host Cert List), the newly generated DH public key (g^y), and the Diffie-Hellman key signature (Sig_KH(g^y|T_H)), where T_H is the host device's current timestamp.
3. On receiving the Step 1 message from the POD module at time T_P*, the host device verifies the validity of the time interval between T_P* and T_P. If (T_P* − T_P) > ΔT, the host device rejects the POD module, where ΔT denotes the expected valid time interval for transmission delay. Otherwise, the host device computes an authentication key AuthKey_H as follows: AuthKey_H = SHA-1((g^x)^y | Host ID | POD ID).
4. On receiving the Step 2 message from the host device at time T_H*, the POD module verifies the validity of the time interval between T_H* and T_H. If (T_H* − T_H) > ΔT, the POD module rejects the host device. Otherwise, the POD module computes an authentication key AuthKey_P as follows: AuthKey_P = SHA-1((g^y)^x | Host ID | POD ID).
Authentication key verification phase:
1. POD → Host: Request the authentication key
The POD module sends a request message to the host device for the authentication key.
2. Host → POD: SHA-1(AuthKey_H|T_H), T_H
POD (holds AuthKey_P) and Host (holds AuthKey_H):
1. POD → Host: Request the authentication key
2. Host picks up timestamp T_H. Host → POD: SHA-1(AuthKey_H|T_H), T_H
3. POD aborts if (T_H* − T_H) > ΔT, else verifies SHA-1(AuthKey_H|T_H) =? SHA-1(AuthKey_P|T_H)
Fig. 4. Messages of the proposed host re-authentication protocol
The host device computes a hash value over its authentication key AuthKey_H and the timestamp T_H, where T_H is the host device's current timestamp, and sends it together with T_H to the POD module.
3. On receiving SHA-1(AuthKey_H|T_H) and T_H from the host device at time T_H*, the POD module verifies the validity of the time interval between T_H* and T_H. If (T_H* − T_H) > ΔT, the POD module rejects the host device. Otherwise, the POD module compares its derived authentication key AuthKey_P with the AuthKey_H computed by the host device, using the received SHA-1(AuthKey_H|T_H). If SHA-1(AuthKey_H|T_H) = SHA-1(AuthKey_P|T_H), the POD module accepts the host device as a legitimate one.
Headend report back phase: The POD module and Cable Headend communicate through a private authenticated CA (conditional access) system.
1. POD → Headend: POD ID, Host ID
The POD module sends the POD ID and Host ID to the Cable Headend.
2. Headend → POD: ID Validation Message
The Cable Headend checks the certificate revocation lists for certificate validation.
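The timestamped verification step can be sketched as follows; the timestamp encoding, the ΔT value, and the helper names are illustrative assumptions, since the paper does not fix them:

```python
import hashlib
import time

DELTA_T = 5.0   # expected transmission-delay window, in seconds (illustrative)

def host_response(auth_key_h: bytes):
    t_h = time.time()                              # host's current timestamp T_H
    digest = hashlib.sha1(auth_key_h + repr(t_h).encode()).digest()
    return digest, t_h                             # SHA-1(AuthKey_H|T_H), T_H

def pod_verify(digest: bytes, t_h: float, auth_key_p: bytes) -> bool:
    t_star = time.time()                           # arrival time T_H*
    if t_star - t_h > DELTA_T:                     # freshness check: reject stale T_H
        return False
    expected = hashlib.sha1(auth_key_p + repr(t_h).encode()).digest()
    return digest == expected                      # AuthKey never travels in the clear

auth_key = hashlib.sha1(b"demo-auth-key").digest()
d, t = host_response(auth_key)
assert pod_verify(d, t, auth_key)                  # a fresh response is accepted
assert not pod_verify(d, t - 2 * DELTA_T, auth_key)  # a stale timestamp is rejected
```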
4.2 Host Re-authentication Protocol
Figure 4 shows the proposed host re-authentication protocol. Its messages are as follows:
1. POD → Host: Request the authentication key
When the POD module finds a valid authentication key in its non-volatile memory, the POD module requests the host device to send its AuthKey_H.
2. Host → POD: SHA-1(AuthKey_H|T_H), T_H
The host device computes a hash value over its authentication key AuthKey_H and the timestamp T_H, where T_H is the host device's current timestamp, and replies with it and T_H to the POD module.
3. On receiving SHA-1(AuthKey_H|T_H) and T_H from the host device at time T_H*, the POD module verifies the validity of the time interval between T_H* and T_H. If (T_H* − T_H) > ΔT, the POD module rejects the host device. Otherwise, the POD module compares the received AuthKey_H with the locally stored AuthKey_P by using SHA-1(AuthKey_H|T_H). If SHA-1(AuthKey_H|T_H) = SHA-1(AuthKey_P|T_H), the POD module believes that it is communicating with a legitimate host device.
5 Security Analysis
This section provides a proof of correctness of the proposed host authentication mechanism.
Theorem 1. In the proposed re-authentication protocol, an active attacker cannot overthrow the goal of the re-authentication protocol.
Proof. The weakness of the current host authentication mechanism against replay attacks is due to the fact that AuthKey_H is sent directly to the POD module, without any protection, in Step 2 of the authentication key verification phase of the host authentication protocol and in Step 2 of the host re-authentication protocol. The proposed mechanism, however, protects the key with the timestamp T and SHA-1 to prevent such replay attacks. That is, instead of sending AuthKey_H directly to the POD module, the proposed re-authentication protocol sends SHA-1(AuthKey_H|T_H) and T_H. An attacker cannot obtain AuthKey_H, because it is protected by the SHA-1 hash function and the timestamp T, which also provides protocol freshness. As a result, an active attacker cannot mount a replay attack on the re-authentication protocol.
Theorem 2. In the proposed host authentication protocol, an active attacker cannot perform a replay attack using a recorded AuthKey_H.
Proof. In Step 4 of the current host authentication protocol, an active attacker can directly obtain the authentication key AuthKey_H and can therefore perform a replay attack. That is, since the authentication key AuthKey_H used in the host re-authentication protocol is the same as the recorded one, the attacker can freely replay it to pass the POD's authentication process. In the proposed host authentication and host re-authentication protocols, however, the timestamp T_H is used to prevent the replay attack and provides protocol freshness. That is, when a POD module receives the Step 2 message of the proposed host authentication protocol, it can easily check whether this message was recently created by checking the freshness of
the timestamp T_H. Also, in the proposed host re-authentication protocol, the POD module can check the freshness of the timestamp T_H to detect a replay attack. Because the authentication key AuthKey_H is never exposed to the attacker in the proposed host authentication protocol or the proposed host re-authentication protocol, the attacker cannot succeed with a replay attack.
Theorem 3. In the proposed host authentication protocol, an active attacker cannot perform a replay attack using a recorded first message, i.e., the DH public key g^x, the signature Sig_KP(g^x|T_P), and the certificate list POD Cert List.
Proof. Tian et al. pointed out that the signature Sig_KP(g^x) and certificate list POD Cert List are useless in the first message of the host authentication protocol because they contribute nothing to the objective of host authentication or to any putative POD module authentication. That is, Tian et al. pointed out that any active attacker can record the first message of the host authentication protocol and replay it to a host device, and the host device cannot detect this replay since it verifies nothing in the remaining steps. In the proposed host authentication protocol, however, the timestamp T_P is used to prevent the replay attack and provides message freshness. That is, when the host device receives the Step 1 message of the proposed host authentication protocol, it can easily check whether this message was recently created by checking the freshness of the timestamp T_P. In addition, since the first message of the proposed host authentication protocol is protected against replay attacks, it is not useless, contrary to Tian et al.'s claims.
6 Conclusions
In 2005, Tian et al. first showed that the host authentication mechanism in the current POD copy protection system has the following three weaknesses: (1) The re-authentication procedure cannot withstand a simple replay attack. (2) The host authentication protocol makes a DH (Diffie-Hellman) private key in any run as significant as a device private key. (3) The credentials of the POD module are futile. To withstand these three weaknesses, this paper proposed a new secure host authentication mechanism for the current POD copy protection system that adopts the timestamp technique. We also described how the proposed host authentication mechanism withstands the three weaknesses.
Acknowledgements
This research was supported by the MKE (Ministry of Knowledge Economy) of Korea, under the ITRC support program supervised by the IITA (IITA-2008C1090-0801-0026).
References
1. ANSI/SCTE: POD Copy Protection System. Society of Cable Telecommunications Engineers, ANSI/SCTE 41 (2004)
2. ANSI/SCTE: POD Copy Protection System. Society of Cable Telecommunications Engineers, ANSI/SCTE 41 (2001)
3. Zhang, C.N., Li, H., Zhang, N.N., Xie, J.S.: A DSP Based POD Implementation for High Speed Multimedia Communications. EURASIP Journal on Applied Signal Processing 2002(9), 975–980 (2002)
4. Kim, W.H., Song, W.J., Kim, B.G., Ahn, B.H.: Design and Development of a New POD Security Module Chip-set Based on the OpenCable Specification. IEEE Trans. on Consumer Electronics 48(3), 770–775 (2002)
5. Traw, C.B.S.: Protecting Digital Content within the Home. Computer 34(10), 42–47 (2001)
6. Eskicioglu, A.M., Delp, E.J.: An Overview of Multimedia Content Protection in Consumer Electronics Devices. Elsevier Signal Processing: Image Communication 16, 681–699 (2001)
7. Tian, H.-B., Zhan, Y., Wang, Y.-M.: Analysis of Host Authentication Mechanism in Current POD Copy Protection System. IEEE Transactions on Consumer Electronics 51(3) (August 2005)
8. Menezes, A.J., van Oorschot, P.C., Vanstone, S.A.: Handbook of Applied Cryptography. CRC Press, New York (1997)
9. ISO/IEC: Information Technology - Security Techniques - Entity Authentication Mechanisms - Part 2: Entity Authentication Using Symmetric Techniques. International Organization for Standardization and International Electrotechnical Commission, ISO/IEC JTC 1/SC 27 N739 DIS 9798-2, 1993-08-13 (1993)
10. Mao, W.B.: Modern Cryptography: Theory and Practice. Prentice-Hall, Englewood Cliffs (2003)
User Stereotypes Concerning Cognitive, Personality and Performance Issues in a Collaborative Learning Environment for UML Kalliopi Tourtoglou and Maria Virvou Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou St., 18534 Piraeus, Greece
[email protected],
[email protected]
Abstract. In this paper, we describe the user modelling mechanism of AUTO-COLLEAGUE, an adaptive Computer-Supported Collaborative Learning system. AUTO-COLLEAGUE provides a personalised and adaptive environment for users to learn UML. Users are organized into working groups under the supervision of a human coach/trainer. The system constantly traces the performance of the learners and makes inferences about user characteristics, such as performance type and personality. All of these attributes form the individual learner models, which are built using the stereotype theory. User modelling is applied in order to offer adaptive help to learners and adaptive advice to trainers, aiming to support them mainly in forming the most effective groups of learners. Keywords: Collaboration, learning, collaborative learning, CSCL, stereotypes, student modelling, user modelling, software engineering, UML.
1 Introduction
The term "collaborative learning" refers to an instruction method in which students at various performance levels work together in small groups toward a common goal [1]. Research has demonstrated the advantages of collaborative learning for its participants [11], [12]. These advantages have been the spark for the establishment over the last decades of a new category of software: Computer-Supported Collaborative Learning (CSCL) systems. These systems are learning environments that provide users with a communication system through which they can collaborate with each other. UML (Unified Modelling Language) is well established both in educational institutes and in the market. Therefore, an adaptive learning environment for UML is useful both to educational institutes that offer software engineering courses and to enterprises that would save money through faster training of their employees. AUTO-COLLEAGUE (AUTOmated COLLaborativE leArning Uml Environment) is an adaptive CSCL system for UML. The adaptivity of AUTO-COLLEAGUE is based on creating individual learner models using stereotypes [6]. Learner modelling is the key concept in applying artificial intelligence to educational software [10]. The stereotypes used in AUTO-COLLEAGUE describe the personality, the performance type, and the level of expertise of the learners. Stereotypes constitute a
widely used technique for building user models [6] and have been widely and efficiently applied to educational software [7].
2 Related Work
AUTO-COLLEAGUE is not the first CSCL system. For example, COLER [5], LECS [8], COLLECT-UML [9], DEGREE [3], HABIPRO [2] and CoLeMo [4] are indicative CSCL systems. In fact, COLLECT-UML [9] and CoLeMo [4] are learning environments for UML, like AUTO-COLLEAGUE. All of these systems have an inference mechanism in common: all of them evaluate the type and frequency of the users' participation in chat systems. Unlike AUTO-COLLEAGUE, which builds more complicated user models, they do not include user characteristics such as personality or performance type. AUTO-COLLEAGUE provides an environment for drawing UML diagrams and a subsystem that constantly traces and evaluates the users' actions and characteristics with the aim of assisting them in the learning process. There are already very effective, productive and successful systems on the market for drawing UML diagrams, such as Rational Rose. These systems offer much greater capabilities as far as UML design is concerned. Our objective is not to compete with such software. AUTO-COLLEAGUE is a CASE tool that is both productive and educational at the same time, providing users with the potential of collaborating with each other. The educational purpose is supported by the system, which is responsible for advising and helping the trainer and the trainees in a user-adaptive way. The innovation of AUTO-COLLEAGUE is related to the complexity of the characteristics that are included in the user models (such as personality and performance type). Moreover, the user characteristics are constantly traced by the system and are associated with each other. Another important innovation is the subject of the advice. Specifically, the advice is not limited to the cognitive context or the level of expertise of the users: the system's advice is a tool that assists the trainer in organizing the learners into groups effectively and productively. Moreover, most of the internal calculation data of the system are not hard-coded; there are administrative data-entry forms for the trainer to define the specific parameters, such as knowledge and UML tests.
3 Overview of AUTO-COLLEAGUE
3.1 User Interface - Front End Functionality
AUTO-COLLEAGUE is a learning environment for UML. There are two kinds of users: the trainer and the trainee. The trainees are the users whose aim is to learn UML. The trainer is the coach/administrator in this process. The trainees can log in to the system and design UML diagrams under the supervision of their human trainer. They can either design diagrams on a workspace or run the test wizard. The test wizard, which is illustrated in Fig. 1, is a form where the user has to follow specific steps in order to solve a problem. In the first step, the trainee is given a problem (the description of the problem is shown in the upper
side of the window). After the trainee reads it carefully, s/he has to select the UML classes that should be included in the solution diagram by checking them in a given checklist box. This checklist box contains a set of possible answers. Some of them are correct. After checking the classes, the trainee can press the “Next” button. Then, in the same way, the user will have to select the correct attributes from a list. Pressing the “Next” button again, s/he will be able to define the properties and methods and,
Fig. 1. Test wizard
Fig. 2. Main form of AUTO-COLLEAGUE
similarly, the relationships (the associations and the generalizations). At the end, the trainee has to press the "Submit" button. At this point, the system checks the given solution and compares it to the correct one. The results and the resultant UML diagram are shown in the same form, after it is enlarged. In the main form of AUTO-COLLEAGUE, which is illustrated in Fig. 2, the trainees can draw UML diagrams mainly for their own testing purposes or as exercises under the supervision of their trainer. At the left of the form, there is the message board. Users can send messages to each other by pressing the buttons corresponding to the type of dialog they want to have. At the right part of the main form, the users can draw UML diagrams in the workspace.
3.2 Back End Functionality
The characteristics explained in the previous section compose the front-end functionality. More or less, these features have already been deployed for educational, scientific or commercial purposes. The innovative characteristic of AUTO-COLLEAGUE, however, lies in its back-end functionality. In the background, AUTO-COLLEAGUE continuously tracks the actions of the users, makes inferences about their progress, and produces advice and help messages concerning their most effective arrangement into groups. The advice and help offered by the system are adaptive; this adaptivity is achieved by building individual user models. The user modelling process follows the stereotype-based theory and is accomplished mainly by the User Modeller, which will be explained in Section 5.
4 Architecture
AUTO-COLLEAGUE is developed in Borland Delphi and uses MS-ACCESS as its database engine. The architecture of AUTO-COLLEAGUE is based on five main modules: the Domain Module, the Tracker, the User Modeller, the Collaboration Module and the Advisor. The functional relationships between these modules, the database, and the entities of the trainee and the trainer are shown in Fig. 3. The Domain Module is responsible for storing and handling the UML knowledge. The UML knowledge is registered in the database in the form of UML concepts. Examples of UML concepts are: Class Definition, Attributes Definition, Associations and Generalizations. For instance, Attributes Definition concerns the knowledge of correctly defining the attributes of a UML class. Beyond the UML concepts, the Domain Module is responsible for handling the exercises, the solutions, the possible answers to them and the error-checking compiler. The exercises are problems-tests given by the trainer. The Tracker is the module that keeps track of every action and movement of the user. Starting from the user's logging in to the system, the Tracker stores any available information about the user, such as: the log data (name and the date and time of logging in and out), the actions, the mouse movements, the errors, the groups' structure and the update history of role and level of expertise. The data the Tracker stores are especially useful for the User Modeller, which feeds input data to the Collaboration Module and the Advisor.
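A possible shape for the records the Tracker stores, sketched in Python rather than the system's Delphi; the paper lists the kinds of tracked data but no schema, so every field name here is hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TrackedEvent:
    user: str
    timestamp: datetime
    kind: str                  # e.g. "login", "action", "mouse_move", "error"
    details: dict = field(default_factory=dict)

# Example: the Tracker records an error made while defining attributes.
event = TrackedEvent(user="john", timestamp=datetime.now(), kind="error",
                     details={"uml_concept": "Attributes Definition"})
```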
Fig. 3. Architecture of AUTO-COLLEAGUE
The User Modeller is the module that builds the individual models of the users. It is based on the stereotype theory and will be explained in detail in the subsequent section. The Collaboration Module is the component that provides the mechanism that makes AUTO-COLLEAGUE a collaborative environment: the chat system. The other vital functionality of the Collaboration Module is reasoning about the most productive collaboration schemes between users. To achieve this, it uses the individual user models built by the User Modeller. The conclusions of the Collaboration Module are passed to the Advisor. The Advisor is the module through which AUTO-COLLEAGUE communicates with the users, in order to advise them based on the inferences of the Collaboration Module and the User Modeller. The advice is oriented both to the trainee and to the trainer. The advice to the trainee may concern either individual progress or group-work progress. For example, individual-progress advice relates to what the trainee should pay attention to, such as UML concepts that s/he needs to learn. On the other hand, group-work advice involves the strengths and weaknesses of the trainees of a group. For example, it could be advice that a specific trainee should collaborate with a specific other one, as their weaknesses and strengths are found to complement each other. The advice to the trainer can take both of the aforementioned roles (individual and group-work) but in another form of information. Specifically, the advice to the trainees is given in the form of window messages, but the advice to the trainer is given in more elaborate forms containing text and statistical information.
5 User Modeller
A system that applies stereotype-based user modelling needs two kinds of information: the stereotypes and their triggers. The stereotypes are collections of user facets. These facets are used by the system to make inferences about the user.
The triggers are the conditions under which a stereotype is activated for a user, leading to the configuration of the individual user's model. In the following, we describe the structure and functionality of the stereotypes used in AUTO-COLLEAGUE.
5.1 Stereotypes
In a collaborative learning environment such as AUTO-COLLEAGUE, the user characteristics that may be useful to the inference mechanism of the User Modeller should not be oriented only to the level of expertise. There are other characteristics as well that can reveal various aspects of the user's behaviour. The stereotypes in AUTO-COLLEAGUE concern three aspects of the user entity: the Level of Expertise, the Performance Type and the Personality. The Level of Expertise category of stereotypes describes the cognitive characteristics of the user, that is, the knowledge of UML. There are four stereotypes of this kind: Basics, Junior, Senior and Expert. Each of these stereotypes is linked to specific UML Concepts. This means that the system searches for the UML concepts that a user evidently knows to a certain degree and then classifies him/her in the corresponding Level of Expertise. For example, the Basics level of expertise is linked to the UML concepts of Class Definition (with a degree of 1 out of 5) and of Attributes Definition (with a degree of 1 out of 5). This parameterization of the Levels of Expertise and the UML Concepts is not hard-coded; it is stored in the database of the system, which gives the trainer the flexibility to change and adjust it according to his/her judgment. The Performance Type category of stereotypes concerns the user's behaviour in a learning environment. There are four stereotypes in this category: Sceptical, Hurried, Unconcentrated and Efficient. A Sceptical user is someone who hesitates to act. Hurried is the user who seems to act more quickly than average and tends to be thoughtless; s/he may have the knowledge to proceed successfully with specific tasks, but eventually fails to do so, probably due to carelessness. The Unconcentrated user is one who seems to act slowly and make irrelevant actions or, for example, is distracted because s/he is occupied sending greeting messages to his/her colleagues. Efficient is the user who accomplishes the tasks successfully in a legitimate period of time. The Personality category of stereotypes is related to the characteristics that influence the behaviour of the user, not only as far as the possession of knowledge is concerned, but also the communication and collaboration with others. There are four Personality stereotypes: Self-confident, Diligent, Participative and Willing-to-Help. Self-confident describes the user who has self-esteem and a high perception of him/herself; this does not mean that this person really has knowledge of and efficiency in UML. Diligent is the user who shows willingness to complete the tasks s/he is assigned. A Participative user is a person who aims at communicating and cooperating with colleagues. The Willing-to-Help stereotype describes the user who appears to be available to others when they need help or a second opinion.
5.2 Facets
In order to determine the stereotypes, it is necessary to define a collection of facets. A facet represents a characteristic of the user along with a value. The facets we use in AUTO-COLLEAGUE are related to attributes that provide clues for identifying the user stereotypes explained in the previous section. Specifically, the facets used in AUTO-COLLEAGUE are: useless mouse movements and clicks frequency, average idle time, number of actions, error frequency, same error frequency, correct frequency, help utilization frequency, average time between successive help readings, advice given frequency, help given to a member/a non-member of the group, help request from a member/non-member of the group, communication frequency and number of upgrades/downgrades in level of expertise. Each facet can take the value 1, 2, 3, 4 or 5, where each value represents an interval: 1 represents a low value, 2 a low-to-medium value, 3 a medium value, 4 a medium-to-high value and 5 a high value. For every facet, three values are stored: Low-Limit, Medium-Limit and High-Limit. They correspond to the degrees at which the system considers an actual facet value of a user to be of low, medium or high extent, and they determine the aforementioned intervals of low, low-to-medium, medium, medium-to-high and high values. For example, the facet communication frequency, that is, the frequency of sending greeting messages, is defined to have Low-Limit = 0, Medium-Limit = 6 and High-Limit = 12. If a user is found to have sent 10 such messages, then according to the limits 0, 6, 12, the value of the communication frequency facet will be medium-to-high, that is, 4. For the definition of the stereotypes, there is a table in the database of AUTO-COLLEAGUE which defines the rules for belonging to a stereotype. These rules concern combinations of facet values, and they constitute the triggering and retraction conditions of the stereotypes. For example, the Hurried stereotype is assigned the facets: number of actions = 5, correct frequency = 1, error frequency average idle time = 1, useless mouse moves = 3, help utilization frequency = 1 and average time between successive help readings = 1.
5.3 Triggers
A trigger leads to the activation or deactivation of a stereotype. A trigger is a set of rules/conditions: if these conditions are satisfied/dissatisfied for a user, then the corresponding stereotype is activated/deactivated. For the previous example of the Hurried stereotype, there is the HURRIED_TRIG trigger, whose rule is the simultaneous satisfaction of the facet values cited above. The rules of a trigger may concern not only facet values, but also the activation of another stereotype. For example, a trigger in our system is EFFICIENT_EXPERT_TRIG, which activates the Efficient stereotype because the Expert stereotype has been activated. This represents the inference that if a user is an expert, then s/he is probably efficient too.
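The facet valuation and trigger evaluation just described can be sketched as follows. The paper gives the three stored limits and one worked example (limits 0/6/12 with a raw count of 10 yielding value 4), so the interval boundaries and data structures below are one consistent reading, not the system's actual implementation:

```python
def facet_value(raw, low, medium, high):
    """Map a raw facet measurement to the 1-5 scale via the stored limits."""
    if raw <= low:
        return 1                            # low
    if raw >= high:
        return 5                            # high
    if raw <= (low + medium) / 2:
        return 2                            # low-to-medium
    if raw <= (medium + high) / 2:
        return 3                            # medium
    return 4                                # medium-to-high

# The paper's example: 10 greeting messages with limits (0, 6, 12) -> 4.
assert facet_value(10, 0, 6, 12) == 4

# Triggers: a stereotype is activated when its facet conditions hold, or
# (as with EFFICIENT_EXPERT_TRIG) when a prerequisite stereotype is active.
TRIGGERS = [
    ("Hurried", {"number of actions": 5, "correct frequency": 1,
                 "useless mouse moves": 3, "help utilization frequency": 1,
                 "average time between successive help readings": 1}, set()),
    ("Efficient", {}, {"Expert"}),          # expert users are presumed efficient
]

def update_stereotypes(facets, active):
    changed = True
    while changed:                           # one activation may enable another
        changed = False
        for stereotype, conditions, prerequisites in TRIGGERS:
            fires = (all(facets.get(f) == v for f, v in conditions.items())
                     and prerequisites <= active)
            if fires and stereotype not in active:
                active.add(stereotype)
                changed = True
    return active
```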
5.4 Building the User Model
The aim of user modelling is to build individual user models. The structure of a user model in AUTO-COLLEAGUE includes the collection of the user's facet values and the stereotypes those values have activated. The system also stores the updates to the user model: a user model of a specific user is not deleted when another user model starts to take effect. This enables the system to make further inferences about the user, as it offers more information about the user's progress over time. Whenever a user action is stored, the User Modeller re-evaluates the facet values of the user. The combinations of these values activate the corresponding triggers, and the activation of the triggers then causes the activation of the corresponding stereotypes. Finally, the User Modeller updates the user model.
6 Example of Operation
We present an example of operation to show how the system provides adaptive help to the trainees according to their user models. There are three users: John, George and Mary. One of them (John) has been found to have made a mistake in the definition of attributes on three different occasions. The system has also registered that the two other users (George and Mary) know how to define attributes correctly. Now, the Advisor is triggered to show John a help message.
Fig. 4. First help message
Fig. 5. Second help message
The first step in the process of configuring the text of this message is to notify him about the lack of knowledge of this specific UML concept. The second step is to acquaint him with the appropriate topics in the help subsystem of AUTO-COLLEAGUE that could help him. The third and most complicated task for the Advisor is to decide with whom, among the users found to know what he does not, he should be advised to make contact. The primary characteristic that the system searches for is the personality: the combination of the two personalities should be such that he will be helped in the quickest and easiest way. Therefore, among all the other users, a trainee found to belong to the stereotype Willing-to-Help has priority in the selection. If no such trainee is found, the next to seek is someone who is Participative, then Self-confident and, last, Diligent. In our example, George is found to be Willing-to-Help and not Participative; Mary is Participative and not Willing-to-Help. So, the Advisor will choose George. Combining the outcomes of these three steps, John will successively be given the two help messages that are shown in Fig. 4 and Fig. 5.
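The helper-selection policy just described amounts to a priority scan over personality stereotypes; a minimal sketch, with a hypothetical user-model representation:

```python
PRIORITY = ["Willing-to-Help", "Participative", "Self-confident", "Diligent"]

def choose_helper(candidates):
    """candidates maps user name -> set of active personality stereotypes;
    every candidate is already known to master the missing UML concept."""
    for trait in PRIORITY:
        for name, stereotypes in candidates.items():
            if trait in stereotypes:
                return name
    return None

# George is Willing-to-Help, Mary only Participative: George is chosen.
assert choose_helper({"Mary": {"Participative"},
                      "George": {"Willing-to-Help"}}) == "George"
```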
7 Conclusions
In this paper, we described AUTO-COLLEAGUE, an adaptive collaborative learning environment for UML. Its innovation consists of two characteristics. The first is the quality of the user characteristics that the User Modeller of AUTO-COLLEAGUE evaluates: these are related not only to the cognitive state of the learners, but to their personality and performance type as well. This is of great importance, as the evaluation outcome of the system is more accurate and better adjusted to the individuals. The other characteristic that makes AUTO-COLLEAGUE innovative is the advice it provides, mainly to the trainer. This advice concerns the most effective organization of users into groups.
References
1. Gokhale, A.A.: Collaborative Learning Enhances Critical Thinking. Journal of Technology Education 7(1) (1995)
2. Vizcaíno, A., Contreras, J., Favela, J., Prieto, M.: An Adaptive, Collaborative Environment to Develop Good Habits in Programming. In: Gauthier, G., VanLehn, K., Frasson, C. (eds.) ITS 2000. LNCS, vol. 1839, pp. 262–271. Springer, Heidelberg (2000)
3. Barros, B., Felisa Verdejo, M.: Analysing Student Interaction Processes in Order to Improve Collaboration. The DEGREE Approach. International Journal of Artificial Intelligence in Education 11 (2000)
4. Chen, W., Pedersen, R.H., Pettersen, Ø.: CoLeMo: A Collaborative Learning Environment for UML Modelling. Interactive Learning Environments 14(3), 233–249 (2006)
5. Constantino-González, M.A., Suthers, D.D.: A Coached Collaborative Learning Environment for Entity-Relationship Modeling. In: Gauthier, G., VanLehn, K., Frasson, C. (eds.) ITS 2000. LNCS, vol. 1839, pp. 325–333. Springer, Heidelberg (2000)
6. Rich, E.: User Modeling via Stereotypes. Cognitive Science 3, 329–354 (1979)
7. Beaumont, I.H.: User Modelling in the Interactive Anatomy Tutoring System ANATOM-TUTOR. User Modeling and User-Adapted Interaction 4, 21–45 (1994)
8. Rosatelli, M.C., Self, J.: A Collaborative Case Study System for Distance Learning. International Journal of Artificial Intelligence in Education 14, 1–29 (2004)
9. Baghaei, N., Mitrovic, A.: Collect-UML: Supporting Individual and Collaborative Learning of UML Class Diagrams in a Constraint-Based Intelligent Tutoring System. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds.) KES 2005. LNCS (LNAI), vol. 3684, pp. 458–464. Springer, Heidelberg (2005)
10. Dillenbourg, P., Self, J.: A Framework for Learner Modelling. Interactive Learning Environments 2(2), 111–137 (1992)
11. Johnson, R.T., Johnson, D.W.: Action Research: Cooperative Learning in the Science Classroom. Science and Children 24, 31–32 (1986)
12. Totten, S., Sills, T., Digby, A., Russ, P.: Cooperative Learning: A Guide to Research. Garland, New York (1991)
Intelligent Mining and Indexing of Multi-language e-Learning Material Angela Fogarolli and Marco Ronchetti Dipartimento di Ingegneria e Scienza dell'Informazione, Università di Trento, Via Sommarive 14, 38050 Povo di Trento {angela.fogarolli,marco.ronchetti}@unitn.it
Abstract. In this paper we describe a method to automatically discover important concepts and their relationships in e-Lecture material. The discovered knowledge is used to display semantics-aware categorizations and query suggestions that facilitate navigation inside an unstructured multimedia repository of e-Lectures. We report on an implemented approach for dealing with learning materials that refer to the same event in different languages. The information acquired from the speech is combined with documents such as presentation slides, which are temporally synchronized with the video, to create new knowledge through a mapping onto a taxonomy representation such as Wikipedia. Keywords: Content retrieval and filtering: search over semi-structural Web sources, Multimedia, Wikipedia, e-Learning.
1 Introduction
In our work, we address the problem of accessing different kinds of unstructured or semi-structured information sources by taking advantage of the semantics provided by Wikipedia. The semantic result of the approach described in the next section has been applied to the e-Learning context, specifically to enhanced streaming video lectures (see [8, 5]), because of the peculiarity in this scenario of combining different kinds of unstructured or semi-structured sources of information. Our target repository collects different kinds of media (video, audio, presentation slides, text documents), which can be searched and presented in combination. For each recorded event (e.g. lecture, seminar, talk, meeting…) we provide not only the video but also related materials, which can consist of presentation slides, documents or Web sites the speaker points to. All the resources are temporally synchronized with the video. A common problem in indexing e-Learning material is that more than one language can be combined in the same event. For instance, in some cases presentation slides are written in English while the speech is delivered in another language (e.g. in Italian). This is particularly true in technical fields, where it is usually preferable not to translate technical terminology. In this paper we report on how we took advantage of the link structure of Wikipedia to offer semantic support for mining and navigating multimedia resources, which may also be in different languages. Our approach is domain-independent, based on the knowledge exposed by Wikipedia rather than on Semantic Web technologies (specifically, it uses no ontologies). A subject should be represented
in the form of structures which show what is to be learned. To improve their learning performance, students should concentrate on the relationships between concepts [9]. Our library acts as a conversation facilitator; using a mix of technologies, it offers a multimodal, integrated view of topics and the relations between them. Using external sources of knowledge in Information Retrieval is not a new idea [6, 4], even in combination with ontologies [3]; an example in e-Learning of combining semantic annotations derived from a domain ontology with e-Lecture retrieval can be found at the University of Potsdam [11]. What is new in our work is the usage of a domain-independent, publicly available semantic source to automatically describe content in different kinds of media. The structure of this paper is organized as follows: in the next section we describe the existing approaches based on information extraction from Wikipedia. Then we describe our approach and an evaluation of the theory at the base of our study. Our conclusions summarize the findings.
2 State of the Art
Wikipedia contains a vast amount of information; therefore, there have been mainly two approaches to exploring its content and making it machine-readable. The first approach consists in embedding semantic annotations in its content [13, 7], while the other deals with information extraction based on an understanding of how the Wikipedia content is structured [1, 12, 14, 10, 15]. The Semantic Wikipedia project [13] is an initiative that invites Wikipedia authors to add semantic tags to their articles in order to make them machine-interpretable. The wiki software behind Wikipedia (MediaWiki [7]) itself enables authors to represent structured information in an attribute-value notation, which is rendered inside a wiki page by means of an associated template. The second main stream of Wikipedia-related work automatically extracts knowledge from the Wikipedia content, as in [1, 12, 14, 10, 15]. DBpedia [1] is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia supports sophisticated queries against Wikipedia and other linked datasets on the Web. The DBpedia dataset describes 1,950,000 "things", including at least 80,000 persons, 70,000 places, 35,000 music albums and 12,000 films. It contains 657,000 links to images, 1,600,000 links to relevant external web pages and 440,000 external links into other RDF datasets. Altogether, the DBpedia dataset consists of around 103 million RDF triples. DBpedia extracts [2] RDF triples from the Wikipedia information presented in page templates such as infoboxes and hyperlinks. Yago [12] is a knowledge base which extends the relationships of DBpedia, extending the standard RDF notation. As of December 2007, Yago contained over 1.7 million entities (such as persons, organizations and cities). A YAGO query consists of multiple lines (conditions); each line contains one entity, a relation and another entity. DBpedia or Yago could replace Wikipedia as the source of knowledge in our semantic discovery approach, although at the time of this writing these knowledge bases contain only entities (such as persons and places) and not abstract concepts such as
the ones we have in e-Learning material. In addition, we do not know a priori with which properties a term can be searched, so in our domain replacing the Wikipedia free text would not be beneficial. ISOLDE [14] is a system for deriving domain ontologies by running a named-entity tagger on a corpus and combining the extracted information with Wikipedia and Wiktionary. The results show that this kind of approach works better with semi-structured information such as dictionaries. KYLIN [15] is another project whose aim is to automatically complete the information presented in the Wikipedia infoboxes by analyzing disambiguated text and links in Wikipedia pages. Ponzetto et al. [10] have explored information extraction on Wikipedia for creating a taxonomy containing a large number of subsumptions, i.e., is-a relations.
3 Intelligent Analysis
The aim of the intelligent analysis described in this section is to map concepts in the multimedia e-Learning material onto a taxonomy representation such as Wikipedia. This enables different applications for improving the navigation and delivery of multimedia. In our use case scenario, we utilized the new knowledge gathered through this mapping to offer a global view of a searched topic. In particular, we display for every search the related topics as query suggestions, and we summarize the learning material by means of annotations that describe the core content of each lecture. The annotations are descriptive, since they are mapped to summarized Wikipedia definitions which can be viewed by the user; in this way, there is no ambiguity about their meaning. We want to understand the relationships between topics in our multimedia corpus based on the content of Wikipedia, for which reason the intelligent analysis is strongly based on the link structure of Wikipedia itself. The link structure in Wikipedia forms a huge network between pages which facilitates the navigation and understanding of concepts; in particular, we are interested in the following kinds of links: interwiki, interlanguage and strong links. The links between Wikipedia pages are called interwikis1. Another kind of link we strongly rely on is the interlanguage2 link. Interlanguage links act as an internationalization mechanism; they are connections between Wikipedia articles on the same topic but in different languages. In this way it is possible to relate a Wikipedia page in one language to other Wikipedia internationalizations. The third kind of link we are interested in is "strong links". We define a strong link as a bidirectional connection between two pages: a link in Wikipedia is considered to be strong if the page it points to has a link back to the starting page. For instance, Rome and Italy are strongly linked, since the page on Rome says that it is the capital of Italy, while the page on Italy reports that Rome is the capital of the state. A minor town located in Italy will instead have a weak link with Italy, since its page will state that the town is in Italy, but the page for Italy will most likely not mention the minor town.
^1 http://en.wikipedia.org/wiki/InterWiki
^2 http://en.wikipedia.org/wiki/Wikipedia:Interlanguage_links
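To make the notion of a strong link concrete, the following sketch (our illustration, not the authors' actual code; the function strong_links and the dictionary links are hypothetical) detects bidirectional links given a mapping from each page title to the set of titles its article links to:

    # A minimal sketch of strong-link detection. `links` maps each page
    # title to the set of titles that its article links to.
    def strong_links(page, links):
        """Return the pages connected to `page` by a bidirectional link."""
        return {target for target in links.get(page, set())
                if page in links.get(target, set())}

    # Toy data reproducing the Rome/Italy example from the text.
    links = {
        "Rome": {"Italy", "Tiber"},
        "Italy": {"Rome", "Europe"},
        "Minor town": {"Italy"},  # weak link: Italy does not link back
    }
    print(strong_links("Rome", links))        # {'Italy'}
    print(strong_links("Minor town", links))  # set()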
In our use case, strong links are candidates for topics related to the searched term, and they will be used for giving suggestions to the user and in the process of generating summaries of Wikipedia definitions.
3.1 Concept Extraction from Multimedia
The data our system intelligently indexes and navigates are of different types. Each event is usually composed of, at least, a video and presentation slides, but other documents may refer to the same event as well. We performed Information Extraction (IE) on the speech transcript of the video and on the text of the presentation slides. Since we were already using Lucene^3 as a search engine for searching inside the learning materials, we also took advantage of its statistical functionality to analyze the indexed text. In this way, we were able to obtain for each indexed event a term vector containing the event's important words with associated weights. Each document in the index contains all the materials of one lecture. Any other tool that implements statistical IE could be used for the same task. We discarded tools implementing a linguistic approach due to the characteristics of the e-Learning domain: the work necessary to adapt a linguistic approach would be excessive when dealing with material in different languages. Moreover, storytelling does not play an important role in e-Learning, at least not in the disciplines we considered, and this makes it difficult to locate and classify atomic elements of text into the predefined categories used for Entity Recognition.
3.2 Concept Mapping on a Taxonomy
Our semantic approach is based on the individuation and mapping of Wikipedia definitions to the concepts expressed in the e-Learning corpus. The process first extracts concepts from the multimedia lecture material; next it assigns a Wikipedia definition to each concept. Since a word can have different meanings depending on the context, we have to select for every concept a Wikipedia definition tailored to the domain of the learning material from which the concept was extracted. The Wikipedia definition is then analyzed to understand its relationships with other topics in the same material.
Page retrieval and disambiguation process. The output of Information Extraction is a term vector T_i for every lecture i = 1..n, consisting of the fifty most important terms of the lecture: T_i = {w_ij}, j = 1..50. The words in the term vector have been stemmed. We process every term w_ij in T_i (the term vector of lecture i). The goal of the process is to find, for every word w_ij in the term vector, a Wikipedia page p_ij whose definition semantically matches the meaning of w_ij in lecture i. For achieving this goal we first look up all the page titles that contain w_ij. Next, we perform page disambiguation, since the different senses of a word are represented in Wikipedia through a disambiguation page. Each article in Wikipedia is identified by its title, which consists of a sequence of words separated by underscores. When the same concept exists in different domains, that name is
^3 http://lucene.apache.org/
concatenated with a parenthetical expression which denotes the domain where the word has that specific sense. For example, a Wikipedia query for the word "Collection" returns the Wikipedia disambiguation page Collection, which points to other pages such as Collection (horse), Collection (museum), Collection (Joe Sample album), Collection (agency), Collection (computing) and Collection class. The string between parentheses identifies the domain. In order to choose the right definition of w_ij for the lecture domain, we analyze the hyperlinks present inside the pages of all the candidate definitions listed in the Wikipedia disambiguation page. For each candidate definition p_ij we consider only its strong links S_z(p_ij), z = 1..n. A page P_o has a strong link with page P_d if in P_o there exists a link to P_d and in P_d there is a link back to P_o (P_o ↔ P_d). Hence, a strong link represents a bidirectional relation between two Wikipedia pages. The strong links of every term w_ij are taken into account in the disambiguation process and in the query suggestion task. The best definition among the candidates is the one having the largest number of words of the lecture material T_i in common with the target article names anchored by its strong links. Writing S(p) for the set of target article names of the strong links of a candidate page p, we can express this as a function f that selects the page with the maximum intersection with the term vector T_i of the lecture:

    f(p) = |T_i ∩ S(p)|,   p_ij = p_ijk, where k is such that |T_i ∩ S(p_ijk)| = max_v |T_i ∩ S(p_ijv)|.

For instance, for the link "Collection (computing)" listed in the disambiguation page, we analyze the page's strong links and count the number of elements in common with the words in the term vector of the lecture under examination:

    S(Collection (computing)) = {object-oriented, class, map, tree, set, array, list};
    T_1 = {set, map, array, list, java, computer, collection, casting}.

The set CE contains the elements common to the two sets: CE = T_1 ∩ S(Collection (computing)). Since the words in a term vector are stemmed, the strong links must be stemmed as well before comparing them with the keywords in the term vector. We choose as the Wikipedia definition page, among the candidate pages, the one which has the maximum number of elements in CE. The expected result of the process is a completely disambiguated term vector Td_i composed of disambiguated words wd_ij. Every disambiguated word wd_ij acts as a strong link and maps to a Wikipedia definition page p_ijk, which describes the meaning of the word in the corpus. In other words, after a successful disambiguation, p_ij ≡ wd_ij.
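The selection rule amounts to a few lines of code. The sketch below is illustrative only (the function disambiguate and its arguments are hypothetical); it assumes the term vector and, for each candidate page, the set of stemmed strong-link target names are already available:

    def disambiguate(term_vector, candidates):
        """Pick the candidate page whose strong-link targets overlap
        most with the lecture's term vector (both sides stemmed)."""
        def overlap(page):
            return len(term_vector & candidates[page])
        return max(candidates, key=overlap)

    # Toy data from the "Collection" example above.
    T1 = {"set", "map", "array", "list", "java",
          "computer", "collection", "casting"}
    candidates = {
        "Collection (computing)": {"object-oriented", "class", "map",
                                   "tree", "set", "array", "list"},
        "Collection (museum)": {"art", "exhibit", "curator"},
    }
    print(disambiguate(T1, candidates))  # Collection (computing)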
3.3 Multi-language Processing
The multi-language support consists in recognizing relations between terms of the corpus that are not in English. As a first step, we look at the links to the Wikipedia instances in other languages. In most cases, pages in the English Wikipedia have links to pages with the same content in other languages (interlanguage links). Since these links were created manually by the page authors, in most cases there is no ambiguity in the translation. In case a link to the target language of interest is not present, we can resort to freely available, albeit less trustworthy, external sources for translating from and to English. The Wikipedia process described in the previous paragraphs does not change, but language-dependent processing modules, such as language-specific stemmers, must be added to enable the comparison of the related Wikipedia content found in English with the terms contained in the multimedia content repository. In more detail, for each successful disambiguation, i.e. when a lecture term has been associated with a Wikipedia definition, we invoke the internationalization process. The algorithm consists of the following steps. For every disambiguated term, the associated page is analyzed to extract interlanguage links. If such links are present, we use the titles of the pages they anchor to populate a table which keeps track of the correspondence between the disambiguated term and the Wikipedia pages that refer to the same content in other languages. At the moment the supported languages are English, Italian, French, German and Spanish, but the approach is extendable. An example of the translation of the word "Algorithm" using interlanguage links is displayed in the following table. All the translations correspond to a Wikipedia page with the same content in a different language. The "id" field works as an identifier of the context, in our case the material of one lecture.

    id  en         it         fr          de           es
    1   Algorithm  Algoritmo  Algorithme  Algorithmus  Algoritmo
For every word for which the disambiguation process failed, there is a possibility that the term in the lecture material is not English. So we first check the internationalization table for a word already disambiguated for that lecture with a matching translation. In the negative case, the next step is to translate the word. We use a free online translator that can be queried^4 through the HTTP protocol. We look up English translations of the word, passing as input the root of the word as it appears in our lecture material. To the multiple translations obtained, we apply the Wikipedia lookup as we did at the beginning of our analysis (Sec. 3.2) with the words extracted from the corpus. In this way, we apply the disambiguation process also to pick the translation that best fits the lecture domain.
^4 http://sapere.alice.it/traduzioni/traduciparola?word=interfacc&traduci.x=0&traduci.y=0&method=1&from=it&to=en
The resulting translation is saved in the internationalization table. As can be seen below, not all the language fields of the table are filled in, but only the one for the language of the term found in the lecture. In fact, we do not complete all the fields of the table because it is unlikely that the lecture will contain material in all the supported languages. If another term with the same meaning but in yet another language were found, it would be looked up through the online translator and the corresponding language field of the internationalization table would be updated.

    id  en                it           fr  de  es
    1   Interface_(Java)  Interfaccia  -   -   -
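The fallback logic just described can be outlined as follows. This is a sketch under stated assumptions, not the authors' implementation: translate and wikipedia_lookup are hypothetical helpers standing in for the online translator query and for the lookup-and-disambiguation step of Sec. 3.2.

    def internationalize(term, lang, lecture_id, i18n_table,
                         translate, wikipedia_lookup):
        """Resolve a non-English lecture term to its English Wikipedia page."""
        # Reuse a translation already stored for this lecture, if any.
        row = i18n_table.get((lecture_id, term))
        if row is not None:
            return row["en"]
        # Otherwise query the online translator and disambiguate each
        # candidate translation against the lecture's term vector.
        for candidate in translate(term, source=lang, target="en"):
            page = wikipedia_lookup(candidate)
            if page is not None:
                # Only the field of the language found in the lecture is filled.
                i18n_table[(lecture_id, term)] = {"en": page, lang: term}
                return page
        return None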
4 Evaluation of the Link Analysis
The idea behind our approach is based on the link analysis of Wikipedia definition pages. For this reason, in this section we explain how we evaluated our "strong link" theory. The evaluation has been done manually, to verify that if two pages are connected by a strong link, their content belongs to the same context. This association gains strength as the number of strong links connecting the two pages increases. We manually assessed the strength of the association and the importance of the links between Wikipedia definitions. In particular, we noticed that strong links connect definitions within the same context, and that the labels of the strong links inside the text of a definition are usually the most important words for describing the object of the definition itself. It follows that it is important to take into consideration the sentences where a strong link is located. Selecting all the sentences of a Wikipedia definition where a strong link is present, we obtain a paragraph which represents the core of the page content. From our experiments we also discovered that the first paragraph of a Wikipedia definition is always important and is always selected, because it contains the name of the definition and, with high probability, also some strong links. We report an example case to demonstrate how we evaluated the strong link assumption. For the evaluation we manually processed a considerable number of Wikipedia definitions, following the steps of the actual algorithm as described in Sec. 3.2. To clarify the process, we show the evaluation of the link theory for the word "interface", which has been extracted from an e-Lecture about Java programming. Following the steps proposed in Sec. 3.2, we disambiguated the word based on the other words contained in the corpus. The disambiguated definition points to a Wikipedia definition page with the title Interface (Java). For the description of the process we consider the content of the article Interface (Java) after the first paragraph, which is selected by default as explained above. The examined paragraph is shown below, with the sentences containing a strong link highlighted. The highlighted text is around 30% of the total^5:
^5 Definition from Wikipedia: http://en.wikipedia.org/wiki/Interface_(Java)
"As interfaces are abstract, they cannot be directly instantiated. Object references in Java may be specified to be of an interface type; in which case they must either be null, or be bound to an object which implements the interface. The keyword implements is used to declare that a given class implements an interface. A class which implements an interface must either implement all methods in the interface, or be an abstract class. One benefit of using interfaces is that they simulate multiple inheritance. All classes in Java (other than java.lang.Object, the root class of the Java type system) must have exactly one base class; multiple inheritance of classes is not allowed. However, a Java class may implement any number of interfaces."
What we store in our system is the association between the word "interface", in the context of the lecture material from which the word has been extracted, and a summarized definition of the Interface (Java) Wikipedia page. The summarized definition for this page contains the first paragraph plus the sentences which contain a strong link. In this case the extracted sentences are the following^6:
"As interfaces are abstract, they cannot be directly instantiated. The keyword implements is used to declare that a given class implements an interface. One benefit of using interfaces is that they simulate multiple inheritance."
As we can see from the text above, the summarization expresses the core of the Wikipedia article. If we proceed with a successive paragraph taken from the Wikipedia definition of Interface (Java), we can see that the same pattern occurs again:
"Interfaces are used to collect similarities which classes of various types share, but do not necessarily constitute a class relationship. For instance, a human and a parrot can both whistle, however it would not make sense to represent Humans and Parrots as subclasses of a Whistler class, rather they would most likely be subclasses of an Animal class (likely with intermediate classes), but would both implement the Whistler interface. Another use of interfaces is being able to use an object without knowing its type of class, but rather only that it implements a certain interface. For instance, if one were annoyed by a whistling noise, one may not know whether it is a human or a parrot, all that could be determined is that a whistler is whistling. In a more practical example, a sorting algorithm may expect an object of type Comparable. Thus, it knows that the object's type can somehow be sorted, but it is irrelevant what the type of the object is."
^6 Definition stored in the Needle system.
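The summarization rule just illustrated (the first paragraph plus every later sentence containing a strong link) can be expressed compactly. The following is a rough sketch, not the Needle implementation; it assumes paragraphs are already split into sentences elsewhere and uses a naive sentence split for brevity, with strong-link labels and sentences stemmed consistently:

    def summarize(paragraphs, strong_link_labels):
        """First paragraph plus every later sentence mentioning a strong link."""
        summary = [paragraphs[0]]  # the first paragraph is always kept
        for paragraph in paragraphs[1:]:
            for sentence in paragraph.split(". "):
                if any(label in sentence for label in strong_link_labels):
                    summary.append(sentence.strip().rstrip(".") + ".")
        return " ".join(summary)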
5 Conclusion
In this paper we described an approach to understanding the content of, and the relationships between, concepts inside e-Learning multimedia material. This has been done by combining the terms extracted from the corpus with lexicographic relationships from Wikipedia, which has been used as an alternative to ontologies.
The approach has been used for giving search suggestions in multimedia information retrieval, for multimedia annotation, and for giving a brief description of the topics of a multimedia event. Our approach is domain independent, and it could in theory also be applied to other use cases where there is a need for clustering or annotating a corpus.
References
[1] Aberer, K., Choi, K.-S., Noy, N.F., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P.: ISWC 2007. LNCS, vol. 4825. Springer, Heidelberg (2007)
[2] Auer, S., Lehmann, J.: What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519. Springer, Heidelberg (2007)
[3] Bertini, M., Bimbo, A.D., Torniai, C.: Enhanced ontologies for video annotation and retrieval. In: MIR 2005: Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval, pp. 89–96. ACM, New York (2005)
[4] Bontcheva, K., Maynard, D., Cunningham, H., Saggion, H.: Using human language technology for automatic annotation and indexing of digital library content. In: Agosti, M., Thanos, C. (eds.) ECDL 2002. LNCS, vol. 2458. Springer, Heidelberg (2002)
[5] Dolzani, M., Ronchetti, M.: Video streaming over the internet to support learning: the LODE system. WIT Transactions on Informatics and Communication Technologies 34, 61–65 (2005)
[6] Dowman, M., Tablan, V., Cunningham, H., Popov, B.: Web-assisted annotation, semantic indexing and search of television and radio news. In: Proceedings of the 14th International World Wide Web Conference, Chiba, Japan (2005)
[7] Ebersbach, A., Glaser, M., Heigl, R.: Wiki: Web Collaboration. Springer, Heidelberg (2005)
[8] Fogarolli, A., Riccardi, G., Ronchetti, M.: Searching information in a collection of video-lectures. In: Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2007, Vancouver, Canada, pp. 1450–1459. AACE (2007)
[9] Pask, G.: Conversation, cognition and learning: A cybernetic theory and methodology. Elsevier, Amsterdam (1975)
[10] Ponzetto, S.P., Strube, M.: Deriving a large scale taxonomy from Wikipedia. In: Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI 2007), Vancouver, B.C. (July 2007)
[11] Repp, S., Linckels, S., Meinel, C.: Towards to an automatic semantic annotation for multimedia learning objects. In: Emme 2007: Proceedings of the international workshop on Educational multimedia and multimedia education, pp. 19–26. ACM, New York (2007)
[12] Suchanek, F., Kasneci, G., Weikum, G.: Yago: A large ontology from Wikipedia and WordNet. Research Report MPI-I-2007-5-003, Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany (2007)
[13] Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic Wikipedia. In: Proceedings of the 15th international conference on World Wide Web, WWW 2006, Edinburgh, Scotland, May 23–26 (2006)
[14] Weber, N., Buitelaar, P.: Web-based ontology learning with ISOLDE. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273. Springer, Heidelberg (2006)
[15] Wu, F., Weld, D.: Autonomously semantifying Wikipedia. In: ACM Sixteenth Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal (November 2007)
Classic and Multimedia Based Activities to Teach Colors for Both Teachers and Their Pre-school Kids at the Kindergarten of Arab Schools in South of Israel
Mahmoud Huleihil^1 and Huriya Huleihil^2
^1 Academic Institute for Training Arab Teachers – AITAT, Beit Berl College, Doar Beit Berl, 44905; Tel.: 09-7476333; Fax: 09-7454104
^2 Kindergarten Alhuzayyel A, Rahat, South Israel
Abstract. This study reports our findings about the lack of knowledge about color systems among teachers of preschool kids. Based on these findings, a model of action is suggested to enhance the knowledge of both the kids and their teachers. First we review the basic knowledge about color: its importance and the classic methods of teaching this subject. Basically, the classic activities for teaching colors include reading, using colored toys, playing games, talking, pointing to things and asking about the names of their colors, and painting. In this study we suggest teaching colors as light and waves. Computers as a multimedia center play an important role. In practice, a combination of classical methods and new technology-based methods is used to enhance the knowledge of both the kids and their teachers.
1 Introduction
Colors are everywhere around us and fill our lives. Color, without our realizing it, can have a profound effect on how we feel, both mentally and physically. Dr. Morton Walker, in his book The Power of Color, suggested that the ancient Egyptians as well as the Native American Indians used color and colored light to heal. Humans tend to have emotional associations with certain colors, and these are important to keep in mind in order to create the mood one is seeking [1]. Colors are important because they can affect our lives in different ways [2]. According to Wikipedia [3], color is important to brand recognition, but should not be an integral component of the logo design, as this would conflict with its functionality. Some colors are associated with certain emotions that the designer wants to convey. For instance, loud colors, such as red, that are meant to attract the attention of drivers on highways are appropriate for companies that require such attention. In the United States red, white, and blue are often used in logos for companies that want to project patriotic feelings. Green is often associated with health foods, and light blue or silver is often used to reflect diet foods. For other brands, more subdued tones and lower saturation can communicate dependability, quality, relaxation, etc. Color is also useful for linking certain types of products with a brand. Warm colors (red, orange, yellow) are linked to hot food and thus can be seen integrated into many fast food logos. Conversely, cool colors (blue, purple) are associated with lightness and weightlessness; thus many diet products have a light blue integrated into the logo. Colors may also influence the taste of food [4].
Colors are very important for kids: they can affect their moods and behavior. When a baby is born he sees only black, white and gray. Within a week or so he can see red and begins to reach out to the color, as it helps him develop his perception skills. Since children see red before blue, it is best to decorate a baby's room using soft tones instead of bright primary colors, as these may confuse his sensory skills and overwhelm him. As a result, kid room paint colors, and the particular colors to paint a room, should be well thought out prior to giving birth [5]. For many, the brightest of reds, for example, may represent a loud, unsettling color that makes focusing on a task difficult; for others it could provide a sense of comfort and security, since they relate it to a favorite stuffed friend or blanket. So no matter what our age, we relate to colors based on personal experiences and interests. Choosing kid room paint colors in general, and deciding on what colors to paint a room, need not be overwhelming. All that is required is a little forethought and some fundamental knowledge on which to base your kid room colors. The youngest children are not yet affected by the cultural influences of color. Even though adults have had more experiences with color, we may not all respond the same. For example, for many in the world white represents purity, but in Japan it represents mourning and death. More than that, colors can have healing effects on the body. Simply put, colors match, respond to and support certain body functions. I have seen this in action myself when my boys were born, one prematurely, requiring blue light treatment for jaundice. Hence, kid room paint colors should be selected for the positive effect they have on the child. As mentioned earlier, colors are very important in our lives, so it is important to teach colors at early stages. About 85% of the brain develops between the ages of 3 and 5 years [6]. Children can distinguish different skin colors, hair textures and facial features from as early as six months of age. At this age they begin to understand that they are a separate person and begin to see the differences and separateness of others. As children develop from infants to toddlers, around eighteen months of age, they begin to recognize their own features and, if given a choice, will often choose the doll of their own color. Researchers tracking the development of racial attitudes in children found that almost half of the 200 children they studied had racial biases by age six [7]. This illustrates that the foundations for hatred are formed at a very early age and that diversity and anti-bias training are critical when children are young. Teaching colors is also important for creating and understanding art, which matters because kids, even at the youngest ages, need to learn to express themselves in ways beyond just talking. They need to feel the emotion on paper when painting or drawing. It helps to get the energy flowing too, and art is fun for kids! Even fingerpainting is important for toddlers, so they can learn color [8]. Teaching colors is also important because colors are related to reality: it is important to develop our understanding of the world around us [9], and to develop our thinking and brain [10]. At what age should we teach colors to kids? As mentioned earlier, the brain can distinguish colors at a few months of age, so it is possible to talk about colors from the first day.
There are several ways of teaching colors to kids: by touching [11], or by following different creative approaches [12, 13]. Among these methods are reading books about colors,
buying toys with different colors, talking about colors and naming the colors of goods in the house, pointing to things outside and asking about their colors, playing color games and painting. In order to teach colors, it is important to define color. Color [14] is the visual perceptual property corresponding in humans to the categories called red, yellow, blue, black, etc. Color derives from the spectrum of light (distribution of light energy versus wavelength) interacting in the eye with the spectral sensitivities of the light receptors. Color categories and physical specifications of color are also associated with objects, materials, light sources, etc., based on their physical properties such as light absorption, reflection, or emission spectra. Typically, only features of the composition of light that are detectable by humans (a wavelength spectrum from roughly 400 nm to 700 nm) are included, thereby objectively relating the psychological phenomenon of color to its physical specification. Because perception of color stems from the varying sensitivity of different types of cone cells in the retina to different parts of the spectrum, colors may be defined and quantified by the degree to which they stimulate these cells. These physical or physiological quantifications of color, however, do not fully explain the psychophysical perception of color appearance. How do we see colors? Colors are sensed by the three types of cone cells (which sense red, green and blue) in the eye. The different colors are different combinations of these basic colors [15]. When all three components are missing, the color is called black: the absence of light photons. By experimenting with prisms as early as 1672, Isaac Newton made the fundamental discovery that ordinary "white" light is really a mixture of lights of many different wavelengths, as seen in a rainbow. Objects appear to be a particular color because they reflect some wavelengths more than others. A red apple is red because it reflects rays from the red end of the spectrum and absorbs rays from the blue end. A blueberry, on the other hand, reflects the blue end of the spectrum and absorbs the red [16].
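Since mixing colored light (addition) behaves differently from the paint mixing most people know, a tiny computation, purely illustrative and written by us, makes the point that later proved difficult for the teachers surveyed: adding the red, green and blue light primaries at full intensity gives white, while the absence of all three gives black.

    # Additive (light) mixing: a color is a triple of channel intensities 0-255.
    def mix_light(*colors):
        """Combine light sources by summing channel intensities, capped at 255."""
        return tuple(min(255, sum(color[i] for color in colors)) for i in range(3))

    red, green, blue = (255, 0, 0), (0, 255, 0), (0, 0, 255)
    print(mix_light(red, green, blue))  # (255, 255, 255): white
    print(mix_light(red, green))        # (255, 255, 0): yellow
    # With no light at all, the result is (0, 0, 0): black.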
2 Motivation
Although the knowledge about colors is enormous, it was interesting to discover that there is still a lack of knowledge among teachers of pre-school ages. During a class on "teaching science for kids" we asked some questions about colors: What is color? What is black? What is white? What do we get if we mix the basic colors in equal amounts? How do you explain the colors of the objects around you? What is the opposite of red? What are the basic colors you know? Most of the teachers did not know the answers to these questions. In order to find out how serious this finding is, we decided to check it on a larger sample of teachers. Kids have a healthy imagination. From past experience, kids are able to give their own explanations about colors, and it is possible to explain theories of colors to them without limitation. We believe that it is possible to discuss with kids the features of colors as light and waves. After all, there should be no limit to the imagination.
3 Goals of the Study
In view of the findings mentioned above, we decided to make a move towards enhancing the understanding of colors in our neighborhoods. The goals of this study can be summarized in the following items: a) to find out how severe the lack of knowledge about colors is among pre-school teachers; b) to develop activities for both teachers and kids, and to find out how these activities affect their knowledge; c) to make these activities a part of the pre-school curriculum.
4 Methods
4.1 Collecting Data
In order to collect data about the knowledge of teachers, we prepared a questionnaire which includes 9 questions. The questionnaire is anonymous, to make it reliable. It includes the following questions:
1) What is color? Define color!
2) What are the basic colors according to your knowledge?
3) Is black a color?
4) How do we get black?
5) Is white a color?
6) How do we get white?
7) What color do we get when the basic colors are mixed?
8) What is the opposite of red? Of blue? Of yellow?
9) How do you explain the color of objects?
In order to enhance the knowledge of teachers about colors, we suggested performing simple experiments and letting them report their findings. The report should include a comparison of the results of the experiments with the knowledge found in books.
4.2 Suggested Activities
4.2.1 Following the Method of Teaching Science and Colors at Huriya's Kindergarten
For many years at Huriya's kindergarten the subject of science and colors has been taught in different ways. Following the classical methods, the kids classify goods according to their colors; they paint, talk and read stories. The kids follow the colors in nature and its different "dresses" throughout the year. A major activity in the kindergarten is teaching science and colors through agricultural activity. The kids plant wheat and peppers of different colors; they learn to measure and collect data over several months. The kids enjoy tracing the development of the plants day by day, week by week. At the end every kid gets his own "baby", i.e., plant. At this kindergarten, the kids have also touched a little on color as light. They are challenged to look at the sky, towards the sun and the moon, and they are asked to try
to explain the different colors they see. It is important to develop independent thinkers starting at the early ages.
4.2.2 Suggesting New Activities
In order to enhance the knowledge of both teachers and kids, we suggest incorporating technology in the process of education. Computers could easily be used as an important tool for education; the many faces of the device add a different layer of knowledge. A major problem arises in our neighborhoods, however, especially when talking about computers: computers are not readily available to most kids and teachers, and as a new tool they can be an obstacle. In fact, at the college we meet teachers with almost zero computer skills, which makes it difficult to incorporate computers in education. So we have to "straighten the line" and close the gap before we start developing computer-based activities. The new activities include computer-based ones, among them preparing presentations and creating personal web pages. The materials summarize the activities in the classroom: pictures of the kids during the agricultural activity and while experimenting with colors.
5 Knowledge about Colors as Collected from Teachers
In this section, we summarize the answers to the questions, collected from 100 teachers and students.
5.1 What Is Color? Define Color!
To this question we got different responses: color is color; color is many colors together; color is reflection of waves; something that defines goods via sensing in the eye; the look of goods; something which is not transparent; I don't know, but I think it is the result of mixing colors; I don't know; color defines the character of a person; color is something which gives matter a property; color is a powder, as a matter of fact I don't know; color differentiates between things; color expresses life and affects the mood; color is; color is life; mixing two colors gives us a new color; in light we can sense colors; color is something beautiful which gives beauty to goods; color is the taste of life; color is a visual sense; color is reflection of the sun rays from objects; color is reflection of objects; color is color; color is an angle of the eye; color is life, spring shiny flowers; color is made of several things; color is a symbol.
5.2 What Are the Basic Colors You Know?
Black, white. Red, yellow, blue. Green, blue, red, yellow. Red, green, blue. Green, red, blue, black, white. Black, green, blue, yellow. Green, red, yellow, blue. Rainbow colors: red, orange, blue, green, yellow. Black, white, red. Blue, red, yellow, green, black, white, brown. Green, red. Red, green, yellow. White, red, blue, green, brown. Red, green, black. White, black, red, dark blue. Red, black, white, blue. I don't know. White, red, green. Red, green, yellow. Blue, green, yellow. As we can see, there is a variety of answers. What is important to stress here is that none of the answers included both definitions, as seen by a physicist and as seen by a chemist.
5.3 When the Basic Colors Are Mixed, What Color Do We Get?
Grey. We get a new color. Red when mixed with green gives orange. Black. I think we get pink or orange. White. Black. I don't know, I didn't try to mix colors, but I think we get grey. Velvet. Red plus yellow plus green give blue. Dark violet. Different colors. Green, red, blue. Brown. Here the answers include some right ones and many wrong ones, but none of the answers was complete enough to explain the two different color systems.
5.4 Is Black a Color?
90% of the respondents were sure that black is a color; 6% claimed that black is not a color; 3% answered "I don't know"; and 1% claimed that black includes all colors.
5.5 How Do We Get Black?
This question drew a great many responses, most of which show that it is not clear to the respondents how black is produced. Some of the comments include: I have no idea; I don't remember; black is a wonderful color; black is the best in the world. The answers are classified into the following items:

Table 1. This table summarizes the responses of 100 teachers and students. 32% answered that they don't know how to get black. The table shows an almost total lack of knowledge about the ways we get black.

    How do we get black?              Percents %
    I don't know                      32
    Mixing green, red, blue           5
    Black is a ready-made color       2
    Mixing all base colors            17
    No colors                         3
    Grey plus white                   2
    Absorbs sun rays                  3
    Blue, red and yellow              2
    Green and blue                    3
    Dark grey plus dark blue          2
    Subtracting color                 2
    Green, red, blue, cyan, pink      5
    Base colors plus black            2
    Mixing of all colors              7
    Mixing of many colors             2
    Mixing of dark colors             2
    Blue and red                      3
    Grey and brown                    2
    Absorbs all colors                2
    Mixing of colors                  2
    Mixing of two colors              2
    Dark blue and brown               2
5.6 Is White a Color?
It is interesting that 95% of the answers said white is a color. But when we look at the answers to the next question, we still see that it is not clear to the respondents how white is produced.
5.7 How Do We Get White?
There were also several comments on this question; most respondents do not know how we get white. Some comments: great; wonderful; it is impossible to get white because it is not a color. In the following table we summarize all the categories:

Table 2. This table summarizes the responses of 100 teachers and students. 52% answered that they don't know how to get white. The table shows an almost total lack of knowledge about the ways we get white.

    How do we get white?                                      Percents %
    I don't know                                              52
    By reflection of air                                      2
    Grey plus orange                                          2
    Yellow plus blue                                          2
    White is a ready-made color                               2
    Base colors are taken as 0%                               2
    Mixing of all base colors                                 7
    White exists in nature                                    5
    Mixing of all colors                                      3
    White is a base color                                     2
    Red, yellow, green, blue                                  2
    White is most brightness of a color                       3
    Red, blue and green                                       7
    Reflects all colors                                       5
    It is impossible to get white cause it is not a color     5
    Mixing of two colors                                      2
5.8 What Is the Opposite Color of Red?
We got a variety of responses to this question. Three of the answers (out of the 15% who said green) explained that green is the opposite of red because red is the color of wars and blood, while green is the color of nature and peace. The following table summarizes the answers:

Table 3. This table summarizes the responses of 100 teachers and students. 25% answered that they don't know what the opposite color of red is. The table shows an almost total lack of knowledge about the opposite color of red.

    What is the opposite of red?    Percents %
    I don't know                    25
    Red, black and white            1.5
    Green                           15
    Black                           7
    There is no such a color        8
    White                           17
    Green and blue                  3
    White yellow                    15
    Blue                            7
    Orange                          1.5
5.9 How Do You Define the Color of Objects, Like a Lemon?
We got seventeen different answers to this question; the majority do not really know what the answer is. The results are summarized in the following table:

Table 4. This table summarizes the responses of 100 teachers and students. 43% answered that they don't know how to define the color of real objects. The table shows an almost total lack of knowledge about the definition of the colors of real objects.

    How do you define colors of objects? (e.g. a lemon)                         Percents %
    I have never thought about it and there is no need to think, it is as is.   2
    I don't know                                                                43
    Mixture of base colors                                                      2
    Following the name of the color                                             18
    Reflection of light from the object                                         12
    Mixture of colors                                                           3
    Natural color                                                               5
    Mixing of orange and red                                                    4
    Property of colors                                                          2
    God gave its names                                                          2
    Beautiful                                                                   2
    Yellow green                                                                2
    A result of a biological process                                            2
    Not much                                                                    2
    Mixture of two colors                                                       2
6 Using Multimedia to Enhance the Understanding of Colors
Multimedia [17] can be used to teach colors to kids and teachers. The fact that computers are available makes it possible to use them as a tool for grasping colors. The activities can be as simple as showing pictures, naming colors, mixing colors and classifying objects according to their colors. It is important to use sound, such as a sound recorder, to talk about colors and record the discussion. Showing the text and names is another way of dealing with colors. A creativity center is also essential to demonstrate the use of colors to produce pieces of art. Drawing and painting are done easily with computer programs like mspaint, and Microsoft PowerPoint makes it easy to draw objects and shapes and to color them. Animation is another way to deal with colors. As a multimedia center, computers can easily show videos of nature and many other things. Using a diversity of tools makes it possible to stimulate the senses of kids and adults and thus enhance the perception of many ideas and thoughts.
7 Summary and Conclusion
In this study we briefly reviewed basic knowledge about colors. The motivation behind this effort is the lack of knowledge among school teachers in south Israel: it turns out that the teachers can name colors but are unable to define them correctly. In order to check this phenomenon, we gave a questionnaire to different groups of people: students at the college and teachers in the field. In this study we asked 100 people, 50% of them teachers in the field and the other 50% students training to become teachers. The findings are very interesting: none of them was able to define colors from both points of view, physics and chemistry. At this stage we explained to the teachers some basic facts about colors. We suggest carrying the study further, trying to improve understanding by developing activities for the preschool kids.
References
1. http://iit.bloomu.edu/vthc/Design/psychology.htm
2. http://creativitylove.wordpress.com/2006/09/08/what-certain-colors-mean/
3. http://en.wikipedia.org/wiki/Logo
4. http://faculty.washington.edu/chudler/coltaste.html
5. http://www.decorating-kids-rooms.net/painting-kid-rooms-and-colors-to-paint-a-room.html
6. http://www.adl.org/education/miller/q_a/answer2.asp?sectionvar=2
7. Kase, L.M.: Talking to Kids About Race. Parents Magazine, New York (2001)
8. http://au.answers.yahoo.com/answers2/frontend.php/question?qid=20071125172930AAMPkPl
9. Philosophical Studies 108, 213–222 (2002). Kluwer Academic Publishers, Netherlands
10. http://faculty.washington.edu/chudler/words.html
11. http://www.preschoolcurriculum.com/Free/Fall/BrownPuppet.pdf
12. http://www.fitwatch.com/mom/teaching-children-colors-117.html
13. http://www.ehow.com/how_9836_teach-toddler-colors.html
14. http://en.wikipedia.org/wiki/Color
15. http://www.tvtechnology.com/features/Tech-Corner/f-rh-white.shtml
16. http://www.hhmi.org/senses/b120.html
17. The use of computers to present text, graphics, video, animation, and sound in an integrated way. Long touted as the future revolution in computing, multimedia applications were, until the mid-90s, uncommon due to the expensive hardware required. With increases in performance and decreases in price, however, multimedia is now commonplace. Nearly all PCs are capable of displaying video, though the resolution available depends on the power of the computer's video adapter and CPU.
TeamSim: An Educational Micro-world for the Teaching of Team Dynamics
Orazio Miglino^{1,2}, Luigi Pagliarini^3, Maurizio Cardaci^4, and Onofrio Gigliotta^{2,4}
^1 Department of Relational Sciences "G. Iacono", University of Naples "Federico II", Naples, Italy
^2 Laboratory of Autonomous Robotics and Artificial Life, Institute of Cognitive Sciences and Technologies, National Research Council, Rome, Italy
^3 Adaptronics Group, The Maersk Mc-Kinney Moller Institute for Production Technology, University of Southern Denmark
^4 Department of Psychology, University of Palermo, Palermo, Italy
Abstract. In this paper we present an educational micro-world in which a learner can manipulate variables affecting the efficiency with which a team adapts to its environment. These include the structure of hierarchical relations within the team, the structure of the communications network, and other environmental parameters. Using the micro-world, the learner can design experiments (simulations) exploring notions in the dynamics of small groups. A freeware version of TeamSim is available from http://laral.istc.cnr.it/gigliotta/.
Keywords: Small group dynamics, small team communications network, educational simulation, virtual laboratory.
1 Introduction
Today, ever increasing computational power makes it possible to implement theories in the social sciences in the form of computer simulations. The social sciences have much to gain from this approach. In many cases, computer simulations complement existing "verbal" theories with formal definitions and quantitative measures, providing new insights into complex phenomena involving interaction among multiple agents. Computer models in the social sciences can be considered as "laboratories" where the scientist can directly manipulate and observe the origins and "flow" of social phenomena, which formerly they could only describe in print. One side effect of this approach is the opportunity to use research tools for purposes of education, supplementing traditional book and lecture-based approaches with direct, "hands-on" experience [6]. Using simulation for teaching reduces the gap between theory and practice, study and work; learners can gain familiarity with the intellectual tools of basic research: formulating hypotheses, designing experiments, observing results, and reformulating hypotheses. For example, simulations of social organization in a factory or work group, etc., can be used to explore notions in sociological theory
and to compare the consequences of differing hypotheses and scenarios in areas in which experimental investigation would be impossible or impractical. In this paper, we describe an educational micro-world designed to teach notions in team dynamics. The objective of the work was to build an e-learning laboratory where learners could explore how individual characteristics, communication and organizational structures influence the efficiency with which a team adapts to its environment. The simulator was developed on behalf of Trainet S.p.A. (Telecom Italia Group), which tested a Java version of the model in one of the company's training workshops on human resource management. A freely downloadable version is available from http://laral.istc.cnr.it/gigliotta/. The simulator is based on familiar notions in team dynamics [1, 3, 4, 5], drawing its main inspiration from Moreno's sociograms [5]. In Moreno's work a researcher uses a standard questionnaire to elicit information about relations of esteem and affect within a group. These are then depicted as a network, in which individuals are "nodes" and relations between individuals are directed links between nodes. In the work reported here we implemented Moreno's model using an agent-based strategy in which individual agents use input from other agents and from a simulated environment to compute an output. This in turn modifies the behavior of the other agents and/or the environment. In what follows we describe the model and report a number of experiments (i.e. simulations of different scenarios) exploring basic notions in team dynamics.
2 The Model
2.1 The Team and the Environment
The model represents the team as a group of hierarchically organized agents, located on a two-dimensional grid of "cells". Each cell is either "black" or "white". Part of the environment (see Fig. 1) is designated as the "target" and is colored black. Other areas may also be colored black; these represent "decoy targets". Each agent occupies a single cell in the environment and can move in any direction within a given range. In deciding their moves agents use information from their "sensory organs". In particular, every agent receives information about the position of higher-ranking individuals (see below for details).
2.2 Power Relationships: The Team Hierarchy
As described earlier, the behavior of individual agents is influenced by their rank in the "Team Hierarchy" (defined by the learner). Different individuals may be assigned the same rank (see Fig. 2); indeed it is possible to model an egalitarian society in which all individuals have the same rank. The technique used in the model allows the representation of any possible hierarchical structure.
2.3 The Communications Network
In the model, individuals can exchange information using a "communications network". Communications channels can be set up between any agent and any other agent.
Fig. 1. The initial configuration of a ‘world’ with four agents. The large black region on the left is the target, the other black zones represent “decoy” targets.
Fig. 2. One possible configuration of the Team Hierarchy. In this example, all agents occupy the same hierarchical rank: the lowest.
Channels may be symmetrical, allowing two-way communications, or asymmetrical, with communications flowing in a single direction. They are potentially noisy; a noise parameter (on a 0...5 scale) defines the probability that a message will be transmitted incorrectly. By manipulating these parameters learners can explore a broad range of different communications architectures. They can model groups with no communications, groups where everybody communicates with everybody else, or any other structure they care to investigate. By tuning the noise factor they can model both reliable and unreliable communications.
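A noisy channel of this kind can be illustrated in a few lines. The sketch below is our own, not the TeamSim source, and assumes the 0...5 scale maps linearly to an error probability, which matches the example given in Sec. 2.4 of level 3 yielding a 60 per cent error rate:

    import random

    def transmit(cell_color, noise_level):
        """Send a cell color over a channel; with probability noise_level/5
        the receiver gets the opposite color ('black' <-> 'white')."""
        if random.random() < noise_level / 5:
            return "black" if cell_color == "white" else "white"
        return cell_color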
Fig. 3. Agents are connected by means of a researcher-defined communications network
2.4 Receptive Fields and Exchange of Information
Agents can "see" the color of the cell they occupy and those of surrounding cells within a given range. In what follows we will refer to the portion of the environment the agent can detect as its "receptive field" (see Fig. 4a). The number of cells in an agent's receptive field (1...36) is defined by the learner. Once agents have extracted the information contained in their receptive field, they can transmit it to other agents. It should be noted, however, that noise may distort the transfer of information. For example, if an agent transmits the information that it is currently occupying a white cell over a communications channel with a noise level of 3, there is a 60 per cent probability that the receiving agent will receive the wrong message (i.e. that the transmitting agent is located on a black cell).
Fig. 4. An agent's knowledge of its environment. (a) The agent's receptive field (gray cells) and the flow of information between agents. (b) The agent's map of the environment, putting together information from other agents with information from its own receptive field.
Each simulation involves a certain number of "cycles". During each cycle, agents put together information from other agents with information from their own receptive field. In this way they create a map of their environment (see Fig. 4b).
2.5 Decision Making
During each cycle of the simulation, each agent "decides" to move a certain number of steps in a given direction. To decide its move, the agent sums two vectors: (a) a vector directed towards the area on the map with the highest number of black cells; the more black cells in an area, the greater the attraction it exerts on the agent; (b) a second vector directed towards the cell occupied by the next highest ranking agent(s) in the Team Hierarchy (see Fig. 5). In this case, the "attractiveness" of the agent is proportional to the difference between its own rank and the rank of its "superior". When the agent has a number of "superiors" of the same rank, the vector is computed using the location that minimizes the sum of distances between the agent and each of its superiors. The agents act synchronously. During each cycle of the simulation, each agent performs the following operations: (a) it "sees" the color of the cells within its receptive field; (b) it exchanges information about cell color with other agents; (c) it updates its map of the environment; (d) it "decides" how far to move, and in what direction.
Fig. 5. A decision (vector "c"). The agent's decision is given by the sum of two vectors, pointing respectively to the most attractive target (vector "b") and to the location of the agent's immediate "superiors" (vector "a").
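As an illustration of the decision rule, the following is a simplified sketch and not the actual TeamSim code: it approximates both attractions by centroids (rather than the densest black area or the distance-minimizing point among superiors), weights the superiors' pull by the rank gap, and returns the summed vector "c" of Fig. 5.

    def decide(agent_pos, black_cells, superior_positions, rank_gap):
        """Return the move vector c = a + b of Fig. 5."""
        ax, ay = agent_pos

        def toward(points, weight=1.0):
            # Vector from the agent toward the centroid of `points`.
            if not points:
                return (0.0, 0.0)
            cx = sum(x for x, _ in points) / len(points)
            cy = sum(y for _, y in points) / len(points)
            return (weight * (cx - ax), weight * (cy - ay))

        b = toward(black_cells)                   # attraction of targets
        a = toward(superior_positions, rank_gap)  # attraction of superiors
        return (a[0] + b[0], a[1] + b[1])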
3 Experiments with TeamSim
The experiments reported below used the environment described in Fig. 1. The environment was populated with 10 agents. The agents' initial locations were the same in each run of the simulation. During the simulation we monitored the average distance between the agents and the target (i.e. the large black zone in Fig. 1), using this measure as an indicator of team efficiency.
3.1 Scenario 1: No Team Hierarchy, No Communications Network
We began with a basic simulation, with no hierarchy (all team members had the same rank) and no communications between team members. All team members had the same sized receptive field. In the absence of hierarchy and communications, decision-making by individual agents is independent with respect to decisions by other agents and is based exclusively on the agent’s own perceptions. Where, as often happens, an agent is unable to perceive a single black cell, it chooses its move randomly. Simulations in these conditions generally produced very poor results. In the sample run, shown in Fig. 6, the average distance between agents and the target (which started at 80 units) was still 50 units, even after 500 cycles. The (small) improvement with respect to the starting condition is due to the fact that a few agents find the target by chance and remain close to it.
Fig. 6. Average distance between agents and target in Scenario 1
3.2 Scenario 2: No Team Hierarchy; Varying Communications Architectures
Early psycho-sociological studies [5, 7, 8, 9] paid much attention to the influence of communications networks on team efficiency, proposing a number of mathematical models. Scenario 2 addresses the same issue, which the research tools available at the time were unable to handle.
Fig. 7. Average distance between agents and target on different runs of Scenario 2
In Scenario 2, there is no hierarchical structure to the team and all agents have the same sized receptive fields. What changes is the communications network. In the simulations described below we investigate three different kinds of network: in Simulation 2a, every agent has a symmetrical communication channel with every other agent; Simulation 2b reproduces Moreno's classical "star structure" [5]: all agents except one have a single, bi-directional channel of communication to an individual at the center of the network; in Simulation 2c one agent receives information from the others but does not communicate its knowledge. Fig. 7 compares the results from the three simulations. As can be seen, the team performs most efficiently on the fully interconnected network: on this network it takes only 100 cycles to reduce the average distance to the target to 10 units. The star network achieves the same result, but 100 cycles later. The third group, with a one-way flow of communications to a single individual, performs very badly; the results are, in fact, almost as bad as those for the non-communicating team in Scenario 1.
3.3 Scenario 3: No Communications Network; Variable Hierarchy Structures; Differentiation of Receptive Fields between Individual Agents
Scenario 3 highlights the role of receptive fields and hierarchical structure in determining the efficiency of team work. To show these effects more clearly, the communications network was eliminated. Preliminary investigations showed that in the absence of hierarchy the size of individuals' receptive fields had relatively little effect on team efficiency, but in the presence of hierarchy the size of receptive fields took on new significance. Figure 8 illustrates this effect, comparing two communication-free simulations in which a single "chief" occupies the top rank in the hierarchical structure and all other agents have identical low rank (see Fig. 2).
Fig. 8. Average distance between agents and target in Scenario 3
In both experiments all agents except the chief have identical, small receptive fields. The only difference between them is the size of the chief's receptive field: large in Simulation 3a, small in Simulation 3b. The results of the simulations (see Fig. 8) show that "chiefs" with a large receptive field have a relatively high probability of finding the target zone and attracting other agents to the area; chiefs with smaller receptive fields, on the other hand, have much less chance of finding the target and leading their team in the right direction.
4 Discussion
Since the time of Moreno, many researchers have investigated the factors determining the dynamics and efficiency of teams. The theories advanced by these groups are often plausible. Usually, however, they have been verbal theories; and even when theories have been couched in mathematical terms, their key theoretical constructs often have no observable counterpart. TeamSim illustrates an alternative road to theory-building. In TeamSim, simple rules governing interactions within a team determine the efficiency of the team. As in many complex systems, "local rules" produce emergent global behavior. The advantage of the approach taken in TeamSim is that the mechanisms on which the model is based are completely defined in the simulation code. For the scientist, this means that the model is open to criticism and falsification. From an educational viewpoint the simplicity of the model helps learners to grasp the underlying theoretical assumptions. This does not mean, of course, that these assumptions are true. Very obviously, TeamSim is a vast simplification of social reality. But we would argue that in the last analysis, the validity of the assumptions is of secondary importance. TeamSim, and other simulation-based tools in the social sciences, are useful, we would argue, not so much as a way of teaching "notions" and
“facts”, but as a point of departure for critical thinking: a key goal for humanistic as much as for scientific education.
Acknowledgements

This research, funded by the Leonardo Da Vinci Program, is part of preliminary work within the EUTOPIA-MT (LLP-LDV/TOI/2007/IT/160) project.
References

1. Coffey, R.E., Athos, A.G., Reynolds, P.A.: Behavior in organizations. Prentice Hall, New York (1975)
2. Hegselmann, R., Mueller, U., Troitzsch, K.G.: Modelling and Simulation in the Social Sciences from the Philosophy of Science Point of View. Kluwer Academic Publishers, Dordrecht (1996)
3. Homans, G.C.: The human group. Routledge and Kegan Paul, London (1952)
4. Lewin, K.: Group dynamics: Research and theory. Peterson and Company, New York (1953)
5. Moreno, J.: The sociometric view of the community. Journal of Educational Sociology 19, 540–545 (1946)
6. Parisi, D.: Scuola@.it. Mondadori, Milano (2000)
7. Shaw, M.E.: Some Effects of Problem Complexity upon Problem Solution Efficiency in Different Communication Nets. Journal of Experimental Psychology 48, 211–217 (1954)
8. Shaw, M.E.: Communications networks. In: Berkowitz, L. (ed.) Advances in Experimental Social Psychology. Academic Press, New York (1964)
9. Shaw, M.E.: Group Dynamics. McGraw-Hill, New York (1976)
The Computerized Career Gate Test K.17

Theodore Katsanevas

University of Piraeus
Abstract. The article analyses the basic structure and function of a modern computerized automated test for career guidance, named Career Gate Test K.17, which is provided through personal computers and the internet. The test incorporates a sincerity test and outlines occupational interests and the levels of self-image, determination and consistency. The test is based on many years of research and, more specifically, on Holland’s personality theory and the original notion of “classification K.17”, according to which occupations are grouped in seventeen categories and subcategories derived from economic, psychometric and educational elements. Part of the algorithm of the test is also presented, as well as findings of its empirical evaluation over the last four years.
1 The Automated Career Gate Test K.17

The automated, computerized Career Gate Test K.17 for career guidance explores personality, inclinations, preferences and interests, as well as the levels of self-image, determination and consistency. It incorporates a sincerity test: if a user is not honest, serious and reliable when answering the questions, the system software will not issue results and the user must repeat the test. The basic questionnaire contains 315 questions of interest, such as: “Would you be interested in learning how an electrical machine works?”, “Would you be interested in teaching children?”, “Would you like to take a photograph of a landscape?”, etc. It also contains 42 situation questions, such as: “I insist on my opinion” or “I’m not easily annoyed”, for which the user must answer whether they apply to him/her or not. The answers are assembled in a file and, via an automated process implemented in the system software, conclusions are produced electronically and printed out in a personal report. The main element of the test is the dispersion of the user’s interests and preferences over the seventeen categories and subcategories of “classification K.17”, under which come occupations mainly compatible with two levels of studies: a) education of higher or even postgraduate level (H.E.) and b) vocational education (V.E.). Furthermore, it includes the dispersion of the user’s personality factors over Holland’s six psychological types: realistic, investigative, artistic, social, enterprising, conventional. The reliability and validity of the test are very high, as it has been tested in practice for many years; the same holds for the relevance of the original notion of “classification K.17” for career guidance purposes. The test has been designed according to economic, psychometric and educational data and is based on several years of research and
on international career guidance tests, and especially on John Holland’s personality theory.1 The seventeen occupational categories and subcategories of classification K.17 are the following:

The 17 vocational categories and subcategories of Classification K.17
1. Agriculture, Livestock, Fishery, Mining, Geology, Forests
2. Construction, Engineering, Metallurgy, Carpentry, Glass industry, Textile industry, Clothing, Shoes
3. Chemistry, Energy, Medicines, Foods, Drinks
4. Informatics, Telecommunications
   4.1 Software
   4.2 Hardware
5. Economics, Administration, Banks
6. Trade, Public Relations, Insurance, Commerce
7. Law occupations
8. Transport, Shipping
9. Tourism
10. Sports
11. Information and Mass Media
12. Fine, Applied and Graphic Arts
13. Health, Care
   13.1 Health
   13.2 Care
14. Education, Humanities
The research on occupations, the labour market and the “Classification K.17” is based on a multi-year research project and publications of the author and his scientific research associates. Such previous works include: (1989). Unemployment, demand for skilled labour and vocational training in the regional labour market. Ministry of Labour, Manpower Employment Organization (1991 republications, 1993, 1996, 1998, 2002, 2004). Labour economics and labour relations. Stamoulis (1997). Prospects of the labour market in the wider Athens area for the next five years 1998-2002. University of Piraeus and Manpower Employment Organization. (1998). Professions of the future and the past. Papazisis. (2001). Research for the prospects of the labour market and the determination of the demand of specialities in the 13 regions of the country. University of Piraeus and Manpower Employment Organization. (2001). Education and employment. Ministry of Labour, Pedagogical Institute. (2005). Inflows and outflows of students and graduates of the High Education and Technical Education. Education Research Centre of Greece (in collaboration with Ilias Livanos 2004). An analysis of Holland’s theories. (in collaboration with Tania Kavroulaki 2002 and 2005). Choice of study and profession (2002 Republication 2003 and 2007). Professions of the future and the past and career guidance. Patakis. (2003). Professions of the future. European Conference on Information and Communication Technologies. European Commission, Brussels. Research paper in English. (2001). Labour Market Perspectives and Occupational Choice. 1st International Conference of the National Centre for Vocational Orientation (E.K.E.P.). Research paper. (2003, November). The Future of Work. 10th Mediterranean Twinned Cities Conference with the Support of the European Commission and the Cyprus Government. Famagusta, Cyprus. Research paper, etc.
   14.1 Pedagogy, Literature, Foreign Languages
   14.2 History, Archaeology, Ethnography, Geography
   14.3 Sociology, Ethnology, Social Sciences
15. Physics and Mathematics Sciences
16. Military, Police
17. Clerical Occupations – Functions

Table 1. Sample of the questionnaire of the test as it appears on the screen of the computer
The final personal report of the test defines which occupational categories and professions are best suited to the user’s personality and interests, according to the analysis of his/her answers to the test’s questionnaire. For this purpose, 33 occupations connected to higher or even postgraduate education (H.E.) and 20 professions of vocational education are shown in the relevant tables. These professions are derived from the six predominant categories and subcategories of classification K.17, which are also included in the personal report. Furthermore, the projections about the positive or limited occupational prospects in the labour market for all the above professions are mentioned. Moreover, the report includes the dissemination of the user’s personality in the six categories of J. Holland’s personality types. A young or even a mature person, with the essential contribution of career counsellors, can then decide on the studies, the occupation or occupations and the career path that suit him/her and that he/she wishes to follow. The test is taken on a computer, easily and quickly, in about half an hour, while the sincerity of the answers is monitored throughout. A sample from the questionnaire, as it appears on the screen of the computer, is presented in Table 1. The sincerity control of the answers, which should not fall below 75%, is presented in the personal report in Histogram 1:
Histogram 1. Sincerity control of the answers
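The article does not disclose how the sincerity percentage itself is computed, so the following sketch only illustrates the gating behaviour around the 75% threshold; the idea of consistency-checked control-question pairs, and every name in the code, are assumptions made for the example.

def sincerity_score(answers, control_pairs):
    # Assumed mechanism: pairs of control questions probe the same trait;
    # the score is the percentage of pairs answered consistently.
    consistent = sum(1 for a, b in control_pairs if answers[a] == answers[b])
    return 100.0 * consistent / len(control_pairs)

def evaluate(answers, control_pairs, threshold=75.0):
    score = sincerity_score(answers, control_pairs)
    if score < threshold:
        # Below 75%: the software issues no results; the test is repeated.
        return None
    return {"sincerity": score}   # ...plus the K.17 and Holland profiles

answers = {1: "yes", 2: "yes", 17: "yes", 42: "no"}
print(evaluate(answers, control_pairs=[(1, 17), (2, 42)]))
# -> None (only 50% of the control pairs were answered consistently)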
Furthermore, the levels of self-image and determination are presented in the personal report in Histogram 2, followed by explanatory remarks.
Histogram 2. Degree of self-image and degree of determination
Histogram 3. Dissemination of the factors of personality, preferences and talents of a random sample in Holland’s six types
The personal report of the test also includes a histogram with the dissemination of the user’s personality factors in Holland’s six types (cf. Histogram 3)2 followed by relevant texts with appropriate explanations.
Histogram 4. Dispersion of personality factors in the seventeen general occupational categories of the Classification K.17
Histogram 5. Dissemination of a random sample of the factors of personality in the six prevailing categories and subcategories of classification K.17
There are numerous publications regarding the internationally recognized theory of John Holland and here the following are mentioned indicatively: Holland, J. (1959), (1973) and (1992). Cf. also: Katsanevas, Th. (2002, 2004, 2007).
In Histogram 4, we present the user’s occupational preferences and inclinations in the 17 categories of “Classification K.17”. Histogram 5 then shows the dispersion of personality factors and occupational inclinations in the six prevailing categories and subcategories of classification K.17. From these principal categories and subcategories result the 33 prevailing occupations best suited to the user’s personality and preferences related to higher and/or postgraduate studies (cf. Table 2), and the 20 occupations related to vocational education (Table 3).

Table 2. Dissemination of the personality and preferences of a random sample in 33 prevailing occupations
                                            Rate of                 Prospects
Profession                                  matching (%)   Greece   Cyprus   International
Informatics-Electronics                         100          ***      ***        ***
Informatics-Electrician-Mechanic                100          ***      ***        ***
Automatism technician                           100          ***      ***        ***
Networks Informatics                            100          ***      ***        ***
Informatics mechanic                            100          ***      ***        ***
Electronic office machines technician           100          ***      ***        ***
Radiotelevision and networks technician         100          ***      ***        ***
Robotic informatics                             100          ***      ***        ***
Informatics programmer                           90          ***      ***        ***
Economy and administration informatics           90          ***      ***        ***
Telecommunication & networks informatics         90          ***      ***        ***
Informatics                                      90          ***      ***        ***
Multimedia informatics                           90          ***      ***        ***
Information systems informatics scientist        90          ***      ***        ***
Internet informatics scientist                   90          ***      ***        ***
Mathematician                                    70           *        **          *
Physician                                        70           *        **          *
Mathematician-Informatics                        70          ***      ***        ***
Statistician                                     70          ***       **         **
Mathematician-Statistician                       70           **       **         **
Actuary                                          70           *
Mechanic engineer                                59          ***      ***        ***
Civil engineer                                   59          ***      ***        ***
Architect engineer                               59          ***      ***        ***
Technologist-Designer – Informatics              59           **       **        ***
Surveyor-Engineer                                59           **       **         **
Teacher (for primary education)                  59           **       **         **
Literature teacher                               59           *        *           *
Nursery teacher                                  59           *        **         **
Nursery care attendant                           59           *        **          *
Economist                                        57           **       **         **
Tax consultant-accountant                        57          ***      ***        ***
Project manager economist                        57          ***      ***        ***
Table 3. Dissemination of the personality factors, preferences and talents of a random sample user in 20 prevailing occupations, related to vocational education
                                            Rate of                 Prospects
Profession                                  matching (%)   Greece   Cyprus   International
Computer technician                             100          ***      ***        ***
Car electronics technician                      100          ***      ***        ***
Electronic micro device technician              100          ***      ***        ***
Medicine informatics technician                 100          ***      ***        ***
Informatics network technician                  100          ***      ***        ***
Airplane electronics technician                 100           **       **        ***
Multimedia informatics technician                90          ***      ***        ***
Data base and network technician                 90          ***      ***        ***
Website technician                               90          ***      ***        ***
Computer geographic systems technician           90          ***      ***        ***
Application informatics scientist                90          ***      ***        ***
Statistical researches assistant                 70           *        *           *
Professional chess player                        70
Actuary’s assistant                              70
Constructional projects technician               59          ***      ***        ***
Electrician                                      59          ***      ***        ***
Plumber                                          59          ***      ***        ***
Baby sitter                                      59           **       **        ***
Bookseller                                       59          ***
2 The Balance of Demand and Supply of Professions

As noted, the above tables include estimated forecasts for the next decade for Greece, Cyprus and the international level, in the following order:
Very positive prospects ***
Positive prospects **
Limited prospects *
These forecasts are the result of the research carried out by our scientific team and are renewed every year. They are based on a special methodology and, more specifically, on the so-called “balance of the demand and supply of professions”. Pilot studies for various other countries have also been conducted, and forecasts for them could be produced using the same model. International forecasts refer to the average world trends. These are based on pilot surveys as well as on scientific assumptions of our model, which is further analyzed in other publications.3
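As a rough illustration of how such a balance could be mapped onto the three-star scale, consider the sketch below; the thresholds on the demand/supply ratio are invented for the example and are not the calibrated values of the authors' methodology.

def prospects_stars(forecast_demand, forecast_supply):
    # Hypothetical thresholds on the demand/supply ratio of a profession.
    ratio = forecast_demand / forecast_supply
    if ratio >= 1.2:
        return "***"   # very positive prospects
    if ratio >= 0.9:
        return "**"    # positive prospects
    return "*"         # limited prospects

print(prospects_stars(1500, 1000))  # -> *** (demand clearly exceeds supply)
print(prospects_stars(800, 1000))   # -> *   (supply exceeds demand)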
3 The Software of the Test and Mathematical Data

The Career Gate Test K.17 is based on the proper correlation of the selected occupations with the answers that the user gives to the questionnaire of the test. The mapping of the answers onto the groups of occupations of classification K.17 and onto Holland’s personality types is the core of the intellectual property of the test. The specific algorithmic and mathematical data, and the capability of immediate processing of the answers, can be made available if there is an agreement for the test to be undertaken and conducted entirely by a specific, preferably state and/or university, institution. Scientists who are interested in serious research on the relevant subject, always with regard to the scientific dimension of the issue, can receive proper information after a relevant documented request. Part of the relevant algorithm is given in Table 4. The software and the algorithms of the test have been created, under the guidance of the author, by the exceptional scientist Thanasis Fonias.
For more analytical data see: www.careergatetest.com
Furthermore, during the entire scientific research for the creation, implementation and evaluation of the test and of other similar tests, the following scientists of career guidance, labour economics and sociology have, inter alia, contributed: Nikos Fakiolas, Tania Kavroulaki, Ilias Livanos, Elena Viniou, Maria Tzortzaki, Michalis Petevis, Georgia Kefala, Kostas Saltas, Virginia Makri, Kathrin Vagger, Bellafemine Franseska, Stefanos Karakitsios and George Doukas, together with the computer scientists Apostolos Doukas, Nikos Christodoulakis, Chis Alex, Naoum Karaminas, Spyros Trivizas and George Kyprianou. Gratitude must also be expressed, for their comments and their general advisory contribution, to the scientists of the field Chris Tzekinis, Maria Koutsafti, Naoum Karaminas, Panos Samoilis and Michael Kasotakis.

Table 4. The algorithmic data of the test
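Since the full algorithm is proprietary and Table 4 reproduces only a fragment of it, the sketch below shows merely the general kind of computation such a test could perform: each questionnaire item is keyed to one or more K.17 categories and the answers are aggregated into percentage scores per category. The question-to-category keying shown is hypothetical.

from collections import defaultdict

# Hypothetical keying of questionnaire items to K.17 categories; the real
# mapping is part of the test's intellectual property.
QUESTION_KEYS = {
    1: ["4. Informatics, Telecommunications"],   # electrical machine item
    2: ["14. Education, Humanities"],            # teaching children item
    3: ["12. Fine, Applied and Graphic Arts"],   # landscape photograph item
}

def category_profile(answers):
    # answers maps question id -> interest score in [0, 1];
    # returns the percentage dispersion over the keyed categories.
    totals, counts = defaultdict(float), defaultdict(int)
    for q, score in answers.items():
        for cat in QUESTION_KEYS.get(q, []):
            totals[cat] += score
            counts[cat] += 1
    return {cat: 100.0 * totals[cat] / counts[cat] for cat in totals}

print(category_profile({1: 1.0, 2: 0.5, 3: 0.0}))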
4 The Evaluation of the Test

The evaluation of the dissemination of the personality factors, interests, preferences and inclinations of both sexes in John Holland’s six types of personality and in the 17 occupational categories of «Classification K.17» is given below in Tables 5 and 6. The relevant conclusions were extracted through a scientific process of statistical evaluation of a sample of 10,500 young people in Greece and Cyprus, most of whom were between the ages of 15 and 18.
Table 5. Dissemination of personality per sex, in percentage, in Holland’s six types of personality in Greece and Cyprus

Type            Total   Boys   Girls
Realistic         34     47     21
Investigative     41     44     38
Artistic          45     38     53
Social            44     38     51
Enterprising      41     41     42
Conventional      38     41     33
Table 6. Dissemination of the K.17 interests per sex, in percentage, in the 17 occupational categories of classification K.17 in Greece and Cyprus

Category                                                       Total   Boys   Girls
Agriculture, Livestock, Fishery, Mining, Geology, Forests        27     30     24
Construction, Engineering, Metallurgy, Carpentry                 34     48     19
Chemistry, Energy, Medicines, Foods, Drinks                      34     37     30
Informatics, Telecommunications                                  35     53     17
Economics, Administration, Banks                                 34     30     38
Trade, Public Relations, Insurance, Commerce                     31     28     34
Law occupations                                                  37     31     43
Transport, Shipping                                              34     38     30
Tourism                                                          37     30     43
Sports, Security                                                 38     42     34
Information and Mass Media                                       45     39     51
Fine, Applied and Graphic Arts                                   45     38     53
Health, Welfare                                                  46     40     53
Education, Humanities                                            43     40     45
Physics and Mathematics Sciences                                 41     47     36
Military, Police                                                 39     42     36
Clerical Occupations – Functions                                 17     26      5
5 A Final, Short Remark

The positive practical applications of our research in Greece, Cyprus and other countries provide encouraging evidence that the Career Gate Test K.17 can become a self-standing model applicable at an international level. Most importantly, due to its flexibility, it may be adapted to the needs of particular population and occupational groups. Our research effort will continue to be developed and improved. In an era when these issues are becoming of primary importance, we hope that the C.G.T. will contribute to the international scientific apparatus of career guidance and occupational choices.
References

1. Association of Computer-Based Systems for Career Information: ASCI Standards (2002)
2. Barrett, J.: Career aptitude and selection tests. Kogan Page (2003)
3. Berens, L., Isacchen, O.A.: A quick guide to working together with the sixteen types. Telos Publications (1992)
4. Boer, P.M.: Career counseling over the Internet. Erlbaum, Mahwah (2001)
5. Brown, F.G.: Principles of educational and psychological testing, 5th edn. Holt, Rinehart and Winston, New York (1993)
6. Cattell, R.B., Eber, H.W., Tatsuoka, M.M.: Handbook for the sixteen personality factor questionnaire (16PF). Institute for Personality and Ability Testing (1970)
7. Clark, M.A., Stone, C.B.: Clicking with students: Using online assignments in counselor education courses. Journal of Technology in Counseling 2(2) (2002), retrieved from http://jtc.colstate.edu/vol2_2/clarkstone.htm
8. Costa, P.T., McCrae, R.R.: Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources (1992)
9. Cronbach, L.J.: Essentials of psychological testing, 6th edn. Harper and Row, New York (1990)
10. Dunning, D.: What’s your type of career. Davies-Black Publishing, Mountain View (2001)
11. Gati, I., Kleiman, T., Saka, N., Zakai, A.: Perceived Benefits of Using an Internet-Based Interactive Career Planning System. Journal of Vocational Behavior 62, 272–286 (2002)
12. Gore Jr., P.A., Leuwerke, W.C.: Information technology for career assessment on the Internet. Journal of Career Assessment 8, 3–19 (2000)
13. Hammer, A.L.: Introduction to type and careers. Mountain View (1998)
14. Harmon, L.W., Hansen, J.K., Borgen, F.H., Hammer, A.L.: Strong Interest Inventory: Applications and technical guide. Consulting Psychologists Press (1994)
15. Herr, E.L., Ashby, J.S.: Kuder occupational interest survey. In: Kapes, J.T., et al. (eds.) A counselor’s guide to career assessment. The National Career Development Association (1994)
16. Holland, J.: Making vocational choices: A theory of vocational personalities and work environments. Psychological Assessment Resources (1992)
17. Holland, J.: Making vocational choices: A theory of careers. Prentice-Hall, Englewood Cliffs (1973)
18. Holland, J.: A theory of vocational choice. Journal of Counseling Psychology 6, 35–45 (1959)
19. Isaacson, L.E., Brown, D.: Career information, career counseling and career development. Allyn and Bacon (1996)
20. Janda, L.: The Psychologist’s book of tests. Adams Media Corporation (2004)
21. Janda, L.: Career tests. Adams Media Corporation (1999)
22. Katsanevas, Th.: Professions of the future and professions of the past and career guidance. Patakis (2002, 2004, 2007)
23. Katsanevas, Th.: Labour market perspectives and occupational choice. In: Trends in vocational guidance. Hellenic National Centre for Career Guidance (2002)
24. Katsanevas, Th.: Professions of the future. In: European Conference on Information and Communication Technologies and new ways of work, EU, Brussels (2001)
25. Katsanevas, Th.: Theoretical assumptions and methodology observations on the concept of the balance of supply and demand of professions. University of Piraeus, research paper (1996)
26. Knouse, S.B., Webb, S.C.: Virtual networking for women and minorities. Career Development International 6(4), 226–228 (2002)
27. Kuder, G.F., Diamond, E.E.: General manual for the Kuder occupational interest survey. SRA (1979)
28. Layne, C.M., Hohenshil, T.H.: Graduate students view the use of the Internet. Career Planning and Adult Development Journal 18, 150–156 (2002)
29. McCarthy, C.J., Moller, N., Beard, L.M.: Suggestions for training students in using the Internet for career counseling. Career Development Quarterly 51, 368–382 (2003)
30. Metcalf, H.: Future skills demand and supply trends. University of Warwick (2001)
31. Miller, K.L., McDaniels, R.M.: Cyberspace, the new frontier. Journal of Career Development 27(3), 199–206 (2001)
32. National Career Development Association: Standards for the use of the Internet for provision of career information and planning services. Author, Columbus, OH (1998)
33. Oliver, L.W., Whiston, S.C.: Internet career assessment for the new millennium. Journal of Career Assessment 8(4), 361–369 (2000)
34. Oliver, L.W., Zack, J.S.: Career assessment on the Internet: An exploratory study. Journal of Career Assessment 8, 323–356 (1999)
35. Osborn, D.S., Zalaquett, C.P.: Seeing career counselling related websites through the eyes of counselor education students. Journal of Technology in Counseling 4(1), 11 (2005), retrieved from http://jtc.colstate.edu/Vol4_1/Index.htm
36. Pindyck, R.S., Rubinfeld, D.: Econometric models & economic forecasts, 3rd edn. McGraw-Hill International Editions, New York
37. Reile, D.M., Harris-Bowlsbey, J.: Using the Internet in career planning and assessment. Journal of Career Assessment 8, 69–84 (2000)
38. Sampson, J.P., Lumsden, J.A.: Ethical issues in the design and use of Internet-based career assessment. Journal of Career Assessment 8(1), 21–35 (2000)
39. Silvestri, G.: Occupational employment projections to 2006. Monthly Labor Review (1997)
40. Strong, E.K.: Vocational interests of men and women. Stanford University (1943)
41. Super, D.A., Thomson, A.S.: A six scale, two factor test of vocational maturity. Vocational Guidance Quarterly 27, 6–15 (1979)
42. Turner, S.M., DeMers, S.T., Fox, H.R., Reed, G.M.: APA’s guidelines for test user qualifications. American Psychologist 56, 1099–1113 (2001)
43. Watkins, C.E., Campbell, V.L. (eds.): Testing in counselling practice. Lawrence Erlbaum, Hillsdale (1990)
44. William, T.R.: Using technology to deliver career services. Association Management 50, 12 (1998)
45. Wilson, R.A. (ed.): Projections of occupations and qualifications, 2000–2010. Institute for Employment Research, University of Warwick (2001)
46. Zunker, V.G., Norris, D.S.: Using assessment results for career development. Brooks/Cole, Pacific Grove (1998)
Fuzzy Logic Decisions and Web Services for a Personalized Geographical Information System

Constantinos Chalvantzis and Maria Virvou

University of Piraeus, Department of Informatics, Karaoli Dimitriou St 80, 18534 Piraeus, Greece
[email protected],
[email protected]
Abstract. This article describes a software navigation system which provides location based services in a personalized way, taking into account the preferences and the interests of each user. The system is called Smart Earth and it combines the geographical position of users with their behavior, actions and profile so as to provide services of added value to the user. The technology of web services is used to provide location based services; the use of web services constitutes an alternative approach to the development of navigation systems. The personalization mechanism is based on fuzzy logic decisions. Keywords: Personalization, location-based services, fuzzy logic, fuzzy decisions, web services, user modeling, GPS, GIS, navigation.
1 Introduction

Given the development of portable phones, PDAs and laptops, there is a growing need for more personalized services. As wireless communication technology advances rapidly, personalization technology can be incorporated in the mobile Internet environment, building on location-based services to support more accurate personalized services. Location-based services (LBS) assist people in decision-making while they perform tasks in space and time. LBS pose new challenges to software applications and benefit from research in geographic information science and its founding disciplines, geography and information technology. Location-based navigation services provide decision support by answering spatial queries, e.g. “find the shortest route from current location to target location”, and combined spatial and attribute queries, e.g. “find the nearest Italian restaurant from current location”. However, decision support methods in geographical information systems (GIS) go beyond simple querying in that they enable users to evaluate and rank decision alternatives based on multiple criteria [2],[3]. In view of the above, our aim was to create a navigation system which provides location based services in a personalized way, taking into account the preferences and the interests of each user. To this end we have developed a personalized geographical information system that is called Smart Earth. Smart Earth combines the geographical position of users with their behavior, actions and profile so as to provide services of added value to the user. The location based services are provided
via web services, and the personalization mechanism is based on fuzzy logic decisions. The use of web services constitutes an alternative approach to the development of navigation systems: it is not necessary for the application to store the maps of the regions that interest the user, or any other geographic information (points of interest, etc.). Any information the application needs is retrieved using web services. The determination of a user’s location is achieved via a connection with a GPS receiver. Furthermore, the system personalizes the information given to the user according to his/her interests, in combination with the analysis of the user’s actions on the system, the history, and the analysis of other users’ preferences. Personalization is achieved in the following way: the user’s recorded actions as well as other users’ preferences are analyzed based on Fuzzy Logic Decisions, and Smart Earth generates hypotheses about what interests the current user most. In this way the GIS of Smart Earth is able to provide recommendations about routes that may be of interest to the particular user. Personalization is possible because the GIS knows which user it serves. However, so far in the literature there is a shortage of personalized GISs that analyze users’ data on the fly and adapt dynamically to each individual user. The remainder of this paper is structured as follows. Section 2 presents related work in location based services, fuzzy logic decisions and personalization in GIS. Section 3 describes the architecture of Smart Earth and some of the key features of the system. Section 4 presents our approach to personalization, adaptivity and location-based support via fuzzy logic decisions in Smart Earth. Sections 5 and 6 draw the main conclusions and outline future work.
2 Related Work

2.1 Location Based Services

The term “location-based services” (LBS) is a rather recent concept that integrates geographic location with the general notion of services. Examples of applications include emergency services, navigation systems, or information delivery for tourists. With the development of mobile communication, these applications represent a novel challenge both conceptually and technically (Schiller, J., Voisard, A., 2004) [2]. The five categories described next characterize what may be thought of as standard location-based services; they do not attempt to describe the diversity of services possible.
− Traffic coordination and management. Based on past and up-to-date positional data on the subscribers to a service, the service may identify traffic jams and determine the currently fastest route between two positions, it may give estimates and accurate error bounds for the total travel time, and it may suggest updated routes for the remaining travel. It also becomes possible to automatically charge fees for the use of infrastructure such as highways or bridges (termed road-pricing and metered services).
− Location-aware advertising and general content delivery. Users may receive sales information (or other content) based on their current locations when they indicate to the service that they are in “shopping-mode.” Positional data is used
together with an accumulated user profile to provide a better service, e.g., ads that are more relevant to the user.
− Integrated tourist services. This covers the advertising of the available options for various tourist services, including all relevant information about these services and options. Services may include over-night accommodation at camp grounds, hostels, and hotels; transportation via train, bus, taxi, or ferry; cultural events, including exhibitions, concerts, etc. For example, this latter kind of service may cover opening-hour information, availability information, travel directions, directions to empty parking, and ticketing. It is also possible to give guided tours to tourists, e.g., tourists that carry on-line “cameras.”
− Safety-related services. It is possible to monitor tourists traveling in dangerous terrain, and then react to emergencies (e.g., skiing or sailing accidents); it is possible to offer senile senior citizens more freedom of movement; and it is possible to offer a service that takes traffic conditions into account to guide users to desired destinations along safe paths.
− Location-based games and entertainment. One example of this is treasure hunting, where the participants compete in recovering a treasure. The treasure is virtual, but is associated with a physical location. By monitoring the positions of the participants, the system is able to determine when the treasure is found and by whom. In a variation of this example, the treasure is replaced by a “monster” with “vision,” “intelligence,” and the ability to move. Another example in this category is a location-based ICQ service [2].

2.2 Fuzzy Logic Decisions

The term fuzzy set was coined by Zadeh (1965) [8] as a generalized form of set theory. Unlike traditional Boolean logic, which defines whether or not an element belongs to a crisp set (1 or 0), a fuzzy set defines a degree of belonging through a membership function. In effect, fuzzy set theory deals with sources of uncertainty that are vague or non-statistical in nature, such as operational definitions based on “rules of thumb”, estimations of natural processes, classification of environment types and the like. The theory of fuzzy sets is used more and more widely in the description of uncertainty. Indeed, very often some poorly formalizable notions or expert knowledge are readily expressed in terms of fuzzy sets. In particular, fuzzy sets are extremely convenient in descriptions of linguistic uncertainties [1]. On the other hand, fuzzy notions themselves often admit flexible linguistic interpretations. This makes the exploitation of fuzzy sets especially natural and illustrative [5]. Applications of fuzzy sets within the field of decision making have consisted of fuzzifications of the classical theories of decision making. While decision making under conditions of risk has been modeled by probabilistic decision theories and game theories, fuzzy decision theories attempt to deal with the vagueness and nonspecificity inherent in human formulation of preferences, constraints, and goals. Applications which may be generated from, or adapted to, fuzzy set theory and fuzzy logic are wide-ranging. Basically, the Fuzzy GIS approach is to apply different fuzzy membership functions to data layers. Fuzzy operations are then applied, using either map algebra
or user-defined mathematical algorithms, along with weighting functions, to combine the fuzzified data and produce personalized maps. There are many GIS applications in the literature, however, that either use Boolean logic overlay or demand extensive and detailed input data, making them difficult to apply where data are limited or absent, or difficult to interpret when the outputs include multiple maps [5]. The Fuzzy GIS model (Smart Earth) described here takes a different approach, compensating for data gaps by incorporating, or codifying, expert knowledge. In this way an assessment can be developed using the available relevant datasets (e.g. terrain, generalized classifications, etc.) [5].

2.3 Personalized GIS

One of the most basic characteristics of LBS is their potential for personalization, as they know which user they are serving, under what circumstances and for what reason. Moreover, services can be offered according to information such as the age of the user, the time of day, the weather conditions, the user’s location or his/her destination. Indeed, personalization of GIS has been attempted already in many systems (e.g. [1],[3],[4],[7]). However, so far the focus of each system is placed on different factors of potential personalization.
− The thematic preferences of the user: The adjustment of the supplied services according to the thematic preferences of the user is widespread. A representative example is the CRUMPET project, which researched how the visitors of a town can benefit from personalized services about sightseeing, museums, hotels, etc. In fact, it can provide information to tourists according not only to the user’s preferences, but also to his location.
− The season and the age of the user: Both the time of the year and the age of the user play an important role in the personalization of the information provided to the user. The GiMoDig project studied the possibility of personalized maps according to the time of the year. In these maps, different signs are used for possible activities, according to the user’s age and the season of the year.
From the systems described above, Crumpet is the personalized GIS that addresses the largest number of possible personalization functions. Similarly to Crumpet, Smart Earth tries to address quite a lot of personalization functions. However, unlike Crumpet, Smart Earth uses the technology of Web Services, which allows it to provide real-time location-based information to users. Real-time information is very important for personalization and makes the system more dynamic in the way it provides responses to users. Moreover, in Smart Earth, Web Services provide a more flexible way to store the reasoning model that is based on fuzzy logic decisions. A more comprehensive comparison of other personalized GIS with Smart Earth is illustrated and explained further in Section 4.
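Before turning to the system itself, the fuzzify-then-combine idea of Section 2.2 can be pictured with a minimal sketch; the layer names, membership parameters and weights below are illustrative choices, not values used by Smart Earth.

def trapezoid(x, a, b, c, d):
    # Standard trapezoidal membership function: 0 outside (a, d),
    # 1 on [b, c], linear on the shoulders.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def suitability(cell, weights):
    # Weighted fuzzy combination of per-layer memberships for one map cell.
    mu = {
        "slope_deg":  trapezoid(cell["slope_deg"], -1, 0, 5, 15),      # gentle
        "distance_m": trapezoid(cell["distance_m"], -1, 0, 200, 800),  # near
    }
    return sum(weights[k] * mu[k] for k in mu) / sum(weights.values())

cell = {"slope_deg": 3.0, "distance_m": 350.0}
print(round(suitability(cell, {"slope_deg": 1.0, "distance_m": 2.0}), 3))
# -> 0.833: fully "gentle", 0.75 "near", combined with weights 1 and 2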
3 Smart Earth Description

The purpose of this section is to describe the characteristics and the operations of the Smart Earth system.
The user can watch his location on the map in real time as he moves, and he can define a route for which he wants to be informed (duration, distance) and, furthermore, be given directions for his navigation. Smart Earth also has the possibility to offer personalized geographical information, such as best routes to desired locations and information about events relevant to his profile. All the information about points of interest, the history of the routes, the user’s interests and the user’s profile is saved in a database. This database is useful for the development of personalized information. The maps of the system are provided through the Virtual Earth API and the geographical information is obtained from web services. The above characteristics, combined with the service of fuzzy logic decisions, constitute an alternative model for the development and availability of geographical information for navigation systems. Figure 1 illustrates the architecture of the approach we propose.
Fig. 1. Functional Schema of the Smart Earth
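The client side of this schema can be sketched as follows: the position is read from the Bluetooth GPS receiver as a NMEA sentence and the points of interest are fetched from a web service instead of local map storage. The endpoint URL and its query parameters are hypothetical placeholders, not Smart Earth's actual service interface.

import requests

def parse_gpgga(sentence):
    # Minimal decoding of a NMEA GGA sentence from the GPS receiver.
    f = sentence.split(",")
    lat = float(f[2][:2]) + float(f[2][2:]) / 60.0
    lon = float(f[4][:3]) + float(f[4][3:]) / 60.0
    if f[3] == "S":
        lat = -lat
    if f[5] == "W":
        lon = -lon
    return lat, lon

def nearby_pois(lat, lon, category, radius_m=500):
    # No maps are stored locally; everything is fetched on demand from a
    # (placeholder) POI web service.
    r = requests.get("https://example.org/smartearth/pois",
                     params={"lat": lat, "lon": lon,
                             "cat": category, "r": radius_m},
                     timeout=10)
    r.raise_for_status()
    return r.json()

lat, lon = parse_gpgga(
    "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47")
print(round(lat, 4), round(lon, 4))  # -> 48.1173 11.5167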
4 Fuzzy Logic Decisions in Smart Earth

Personalization in Smart Earth includes:
− adaptation to the user’s related interests and other preferences,
− adaptability, i.e. automatic update of the user model based on the user’s interactive history,
− location-awareness, i.e. the awareness of the system of a user’s current spatial context.
The interest model, unlike the domain model of the Crumpet system [4], is based on Fuzzy Logic Decisions. The user’s interests are saved in a database and structured as a set of fuzzy decisions, each with an attribute which declares the priority p of that decision. Firstly, the user model is initialized using stereotypes. A stereotype is a (small) set of demographic data correlated to a set of typical interests. The most appropriate stereotype to start with can be identified by a few demographic attributes that the user states when he registers with the system. In addition, a user’s profile is adjusted over time (implicitly) by learning, and can also be corrected by the user explicitly, in case it has been initialized with an inappropriate stereotype. Subsequently, the adaptive user model learns the user’s interests from the user’s interaction with the system. When a user asks for more information about an object, this adds a small amount to the decision that the user is interested in objects with these features, more than in others. If a user asks for more and more details about the same object, or even asks for directions to a location, this adds a greater amount to the decision that she or he is interested in such services. So, for example, when a user visits a hotel, this suggests the user is interested in hotels. Learning interests from user interaction relies on offering a broad scope of information and services to the user, who then reveals his or her personal interests, for instance, by asking for more details about some of these objects. An essential usability requirement for user modeling is that users can inspect their model and can override the model’s assumptions. Smart Earth allows users to inspect their model and change it through a dialog interface. The user can also override the system’s default values by explicitly specifying his or her current interest. If user interests are known, the request can be specialized automatically to get better results. This would result in a query outcome that contains only objects matching the assumed user interest. However, it might be inappropriate to confine users to offers that match their assumed interest. Instead, we prefer to sort the outcome of a query according to a ranking that considers all user interests relevant to the required service [3],[4]. The Fuzzy Decision classes enable Smart Earth to make decisions that take a number of different influences on the decision-making process into account, and they provide a flexible, reusable framework for making decisions. The Fuzzy Decision algorithm that we have developed firstly examines whether the user has passed through the same location or region before (user history). This is implemented through the database, in which all previous user routes are recorded. The location of the region is calculated through the use
of a variable parameter which declares the radius of the surrounding region. Next, the demographic profile of the user is examined. This information is provided upon his/her registration, e.g. country, age, family status, etc. The above parameters are taken into account by the system so as to generate the nearest places which are hypothesized to be of interest to the user. Each of these parameters represents a criterion that can be taken into account for the dynamic calculation of the place that may interest the user most. These criteria are assigned respective weights by the system, and these weights are dynamically given values by the fuzzy decision algorithm. The collection of places produced by the above operation is then filtered based on the user’s preferences. More specifically, each preference in the user’s profile is examined in the order in which its priority (decision value) is defined, and the appropriate points of interest are selected accordingly. Each user interest is translated into a Fuzzy Decision, and when we want to decide the priority order of the decisions, we stack them into a Decision Set and compare them to each other. The user’s preferences are influenced by his/her interaction with the system. Specifically, they are determined by the following actions:
• The user’s searches, e.g. when a user searches for hotels, this user is interested in hotels.
• The user’s points of interest, e.g. when a user records a restaurant, this user is interested in restaurants.
• Other users’ preferences, e.g. when most users prefer a specific bar, this bar is proposed first to the user.
The mathematical treatment of the above parameters is described by the following formulas:
W_{Search_i} = \frac{UserSearches_i \times Ws_i}{\sum_{i=1}^{n} (UserSearches_i \times Ws_i)}   (1)

where W_{Search_i} is the weight of the search action for an interest i, UserSearches_i is the number of the user's search actions for interest i, n is the number of interests, and Ws_i is the weight value for a specific search for points of interest i.
W_{Record_i} = \frac{UserRecords_i \times Wr_i}{\sum_{i=1}^{n} (UserRecords_i \times Wr_i)}   (2)

where W_{Record_i} is the weight of the record action for an interest i,
UserRecords_i is the number of the user's record actions for interest i, n is the number of interests, and Wr_i is the weight value for a specific record of points of interest i.
W_{Ratio_i} = \frac{UserRatio_i}{UsersRatio_i}   (3)

where W_{Ratio_i} is the weight of the ratio parameter for an interest i, UserRatio_i is the ratio of the current user for interest i, and UsersRatio_i is the corresponding ratio of all users for interest i. From formulas (1), (2) and (3) we calculate the weight of an interest with the following formula:
W_{Interest_i} = \frac{W_{Search_i} + W_{Record_i} + W_{Ratio_i}}{\sum_{i=1}^{n} (W_{Search_i} + W_{Record_i} + W_{Ratio_i})}   (4)
In the Tour Guide algorithm with fuzzy decision sets, the W_{Interest_i} value is affected by the user's history parameter and by the user's demographic attributes. The history parameter is calculated with the following formula:
W_{History_i} = \frac{Visits_i}{Visits}   (5)

where W_{History_i} is the weight of the user history parameter for an interest i, Visits_i is the number of visits to a point of interest i, and Visits is the total number of visits to all points. Each demographic attribute j of the user affects the W_{Interest_i} value as follows:
W_{Demographic_j} \times W_{Interest_i}   (6)
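Formulas (1)-(6) translate directly into code. The sketch below is one possible rendering, in which the per-action weights Ws_i and Wr_i default to 1.0, an assumption the article does not fix:

def interest_weights(searches, records, user_ratio, users_ratio,
                     ws=None, wr=None):
    # All arguments are dicts keyed by interest i.
    interests = list(searches)
    ws = ws or {i: 1.0 for i in interests}
    wr = wr or {i: 1.0 for i in interests}
    s_tot = sum(searches[i] * ws[i] for i in interests)
    r_tot = sum(records[i] * wr[i] for i in interests)
    w_search = {i: searches[i] * ws[i] / s_tot for i in interests}      # (1)
    w_record = {i: records[i] * wr[i] / r_tot for i in interests}       # (2)
    w_ratio = {i: user_ratio[i] / users_ratio[i] for i in interests}    # (3)
    tot = sum(w_search[i] + w_record[i] + w_ratio[i] for i in interests)
    return {i: (w_search[i] + w_record[i] + w_ratio[i]) / tot           # (4)
            for i in interests}

def history_weight(visits, i):
    return visits[i] / sum(visits.values())                             # (5)

def apply_demographics(w_interest, demographic_weights):
    # Each demographic attribute j scales the interest weight, as in (6).
    for w_dem in demographic_weights:
        w_interest = {i: w_dem * w for i, w in w_interest.items()}
    return w_interest

w = interest_weights(searches={"hotels": 5, "bars": 1},
                     records={"hotels": 2, "bars": 0},
                     user_ratio={"hotels": 0.8, "bars": 0.2},
                     users_ratio={"hotels": 0.5, "bars": 0.5})
print(apply_demographics(w, demographic_weights=[1.2]))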
Smart Earth also provides personalization based on the physical environment. It checks weather information for a specific location and suggests to the user the appropriate place according to his/her preferences; e.g., if the temperature at the user's location or region is above thirty degrees Celsius, it proposes to the user the nearest beach for swimming. Finally, another personalized feature is the filtering of GeoRSS information. Web services provide location based information which is filtered based on the priority of user preferences, in order to provide personalized location information to the user; e.g., if a user prefers sports, live news and events are filtered for this category.
Fig. 2. Tour Guide Screenshot
In this manner the user profile is updated continually and recorded in the database for future use. Figure 2 shows the execution of the above tour guide service.
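Both environment-based behaviours, the weather rule and the GeoRSS filtering, can be written down compactly. The sketch below follows the description in the text; the data shapes and thresholds are our own illustrative choices.

def weather_suggestion(temperature_c, preferences, pois):
    # The rule from the text: above 30 degrees Celsius, a user who likes
    # swimming is offered the nearest beach.
    if temperature_c > 30 and "swimming" in preferences:
        beaches = [p for p in pois if p["category"] == "beach"]
        return min(beaches, key=lambda p: p["distance_m"], default=None)
    return None

def filter_georss(items, preferences):
    # Keep only GeoRSS items in preferred categories, ordered by the
    # priority (decision value) of each preference.
    ranked = sorted(preferences, key=preferences.get, reverse=True)
    return sorted((it for it in items if it["category"] in preferences),
                  key=lambda it: ranked.index(it["category"]))

pois = [{"category": "beach", "distance_m": 900},
        {"category": "beach", "distance_m": 400},
        {"category": "museum", "distance_m": 150}]
print(weather_suggestion(33, {"swimming": 0.9}, pois))   # -> nearest beach
print(filter_georss([{"category": "sports", "title": "match tonight"},
                     {"category": "culture", "title": "gallery opening"}],
                    preferences={"sports": 0.8}))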
5 Discussion of the Results

It is undeniable that most available navigation systems, commercial or not, do not use web services for geographic information but retrieve it from their own memory. This is a significant disadvantage, as the device must have a lot of memory, and the information (maps, etc.) is static and does not represent real time. The basic advantage of such systems is that they do not demand an internet connection. However, nowadays mobile telecommunication and internet providers update their networks constantly, with connection speed increasing disproportionally to cost, so an internet connection can be assumed in any system. Our target is to use the internet in order to obtain the greatest amount of information and reduce the memory demands. Through web services we achieve a better and more complete presentation of geographical information, at a larger scale and in real time [6],[7]. The second major characteristic of our system is the personalization of the information that is provided to the user. The basic factors of personalization, and how personalization is achieved, were presented in Section 2.3. There we also examined certain efforts of other software systems to achieve personalization of information within specific frameworks, each covering certain factors. Comparing these applications with our system with respect to personalization of information yields the following comparative tables [3],[4].
Table 1. Smart Earth Comparison

Systems compared: Microsoft Pocket Streets, Destinator, Crumpet, Google Maps, Smart Earth.
Features compared: Personalization; Usage of Web Services; Internet Connection is necessary; Saved Maps; Big size of memory; Routing; Searching of POIs; Realistic maps 2D/3D; History of routes; Friendly GUI; Record user activities; Update user profile; Evaluation of POIs; Information of real time; Personalized information of real time; Multipoint routing; GPS Tracking; Searching photographs relevant to user’s location; Automatic search of nearest places and route programming.
Table 2. Smart Personalization Comparison

Systems compared: GeoNotes, GiMoDig, Crumpet, Smart Earth.
Factors of personalization compared: User location; User identity; User direction; User history; Natural environment; Automatic update of the user profile; Other users’ preferences; Architecture (Domain Model for Crumpet, Fuzzy Decisions Model for Smart Earth).
6 Conclusions and Future Work

All in all, the most significant services have been presented:
− Database service for inserting, updating, deleting and selecting POIs
− GPS service for receiving the exact location from the Bluetooth GPS receiver device
− Location based services
− Map presentation service with the best possible presentation of the maps
− Route service for receiving the desired routes with accurate directions
− Personalization service for adapting the information to the user.
Our system offers an alternative approach to the construction of navigation systems. Moreover, further expansion of this project is necessary so that the system becomes a full personalized navigation system in the future. Firstly, the present service should be tested with real data. As far as the optimization of the code is concerned, the following could be done:
− implementation of the users’ profiles
− implementation of the categorization of the users
− implementation of data mining techniques on the database for each user, which would be useful for personalized information
− expansion of the capabilities of the system and of the interests of the users
− prediction of user actions
These factors might be the basic expansions which could be implemented in the future, in order to have a full version of a personalized navigation system. It is also necessary to test the response time of the system in case of rapid scaling of its use.
References

[1] Burigat, S., Chittaro, L., Gabrielli, S.: Navigation techniques for small-screen devices: An evaluation on maps and web pages. International Journal of Human-Computer Studies 66(2), 78–97 (2008)
[2] Sadoun, B., Al-Bayari, O.: Location based services using geographical information systems. Computer Communications 30(16), 3154–3160 (2007)
[3] Fink, J., Kobsa, A.: User modeling for personalized city tours. Artificial Intelligence Review 18(1), 33–74 (2002)
[4] Kramer, R., Modsching, M., Schulze, J., Hagen, K.T.: Context-aware adaptation in a mobile tour guide. In: Dey, A.K., Kokinov, B., Leake, D.B., Turner, R. (eds.) CONTEXT 2005. LNCS (LNAI), vol. 3554, pp. 210–224. Springer, Heidelberg (2005)
[5] Malek, M.R., Karimipour, F., Nadi, S.: Intuitionistic fuzzy spatial relationships in mobile GIS environment. In: Masulli, F., Mitra, S., Pasi, G. (eds.) WILF 2007. LNCS (LNAI), vol. 4578, pp. 313–320. Springer, Heidelberg (2007)
[6] Assunção, L., Osório, A.L.: Teaching web services using the .NET platform. In: Working Group Reports on ITiCSE on Innovation and Technology in Computer Science Education 2006, p. 339 (2006)
[7] Choi, D.-Y.: Personalized local internet in the location-based mobile web search. Decision Support Systems 43(1), 31–45 (2007)
[8] Bellman, R.E., Zadeh, L.A.: Decision making in a fuzzy environment. Management Science 17(4), Application Series, B141–B164 (1970)
Design Rationale of an Adaptive Geographical Information System

Katerina Kabassi¹ and Georgios P. Heliades²

¹ Department of Ecology and the Environment, TEI of the Ionian Islands, 2 Kalvou Sq., 29100 Zakynthos, Greece
[email protected]
² Department of Sound Technology, Technological Institute of Ionian Islands, Stylianou Typaldou Avenue, 28200 Kefalonia, Greece
[email protected]
Abstract. Design rationale consists of a set of methods, techniques and tools that are used to record design decisions. This paper focuses on the design phase of a Geographical Information System (GIS) that provides environmental information used for educational purposes, and shows how design rationale has been used in designing the final product. The main characteristic of the system is that it can adapt its interaction to each user interacting with the system. Therefore, the main design decisions that are retrieved involve user modeling, the way the information is adapted, etc. More specifically, the paper describes how a structure-oriented approach to design rationale has been employed for making decisions regarding the adaptation of the system. The structure-oriented approach is implemented by the QOC notation [11].
1 Introduction

GISADA (GIS ADAptive) is a Geographical Information System (GIS) that holds information concerning Kefallonia, an island of Greece. The information consists of important geological data, data about topographical features and cultural heritage, e.g. monuments and churches, and is adapted to the user’s characteristics, interests and knowledge. This system has been based on another GIS that was developed for Zakynthos, a neighbouring island of Kefallonia [7, 8]. However, the main difference between these two systems is that the information in GISADA is used for educational purposes and, thus, the rationale of the system changes. The necessity of solid communication between teams in the GISADA project, given the physical distance that spanned among them, led us to investigate ways to assist the design process of the GIS. Design rationale [12] consists of a set of methods, techniques and tools that are used to record design decisions. The main gain of this process is to give emphasis to the main points of the system’s design. Additionally, having at hand a record of the decision-making process of a design job can be of multiple interest to software projects similar to the one described here [4]. Depending on the nature of each project setting, one can keep track of design discussions “as they happen” (a synchronous approach, the main advocate being the IBIS
model [13]), thus having a historical record of the deliberation behind the design product in the form of key points of the design process. This leads to what is called a process-oriented approach to design rationale and can be very helpful in cases, for example, where strict project rules necessitate future assessments of the decision-making process and the facilitation of management planning. On the other hand, there are structure-oriented approaches to design rationale, where the emphasis is put on exploring the elements of the argumentation behind design (i.e. the design space [11]) in order to obtain a full picture of what has been discussed, understand the deliberation and perhaps discover new elements. This happens later in the design phase (asynchronous, post-hoc), and the purpose of such an approach is to make sure no big issues are left out of consideration, as it encourages deliberation and explicit consideration of alternatives. In every case, design rationale is a by-product of the design process and can be a helpful piece of documentation alongside the standard software documentation. It can be considered as design metadata, which can serve as a communication mechanism within the design team to communicate past critical decisions, the alternatives that were investigated, and the reason for the chosen alternative [6]. In view of the above, we have used the structural approach, which proved much more convenient for the whole project setting. The structure-oriented approach is implemented by QOC [11] and DRL [10]. We chose the graphical version of the QOC approach as it is simpler, seems more natural to use, and is sufficient to hold the rationale of a GIS. The result of these design decisions was the development of a GIS that has the ability to process information about its users so that it can adapt its interaction to each user dynamically. Therefore, the main design decisions that are retrieved involve user modeling, the way the information is adapted, etc.
2 Background

Earlier applications of design rationale and argumentation-related work that lean toward the notions of organizational memory and issue-tracking include the QuestMap [5] and DRAMA [1] software tools. QuestMap is a collaborative hypermedia system, where Questions, Ideas, and supporting or objecting Arguments are used to visualise discussions, track unresolved issues, and qualitatively assess the strengths and weaknesses of different positions. It also provides links to relevant documents and embedded maps that can encapsulate resolved problems or contain more detailed analyses as backing for a particular node. QuestMap is being used by corporations for strategic planning, environmental planning, business process reengineering, and new product design. Similarly, DRAMA (Design RAtionale MAnagement) is a methodology and associated software tool for recording and managing design rationale. It is based on the IBIS model [13], which it augments in several ways to make it appropriate for engineering design. In particular, support has been added for articulating and tracking goals, for hierarchical structuring into decision trees, and for the use of quantitative (as well as the standard qualitative) argumentation.
More recent advances in the area include the Co-OPR (Collaborative Operations for Personnel Recovery) project [16] and the Memetic project [3]. In Co-OPR, dialogue mapping and visual modelling are used in order to support analysis of the diplomatic, economic and social implications of hostage/personnel recovery missions in a simulated UN coalition scenario. On the other side, “Memetic” (Meeting Memory Technology Informing Collaboration) aims at developing software to expedite the meeting process by enabling annotated recordings of distributed Access Grid meetings that allow the review of those meetings by navigating through index points of key moments, such as agenda items, decisions, actions, questions and ideas, and also by who was speaking at a particular point in time.
Fig. 1. The generic QOC notation
Leaning towards the structure-oriented design rationale approach as the most suitable to our needs, we elaborate on Design Space Analysis (DSA). DSA is a style of analysis proposed by MacLean et al. [11] which creates an explicit representation of a structured space of design alternatives and the considerations for choosing among them. DSA emphasizes the explication of the space of possible designs (options) and the rationale (criteria) for choosing within the space. This type of analysis places a design product in a space of possibilities and seeks to explain why the particular design was chosen from these possibilities. At the heart of the DSA concept is the accumulation and reuse of knowledge through the articulation of coherent design spaces within which different solutions can be located
(e.g. as implemented in other systems). Thus DSA is a process of identifying key problems (Questions) and raising and justifying (via Criteria) design alternatives (Options). The main notational tool that is used to realize DSA is called QOC (Questions, Options, Criteria), employing the three fundamental concepts of DSA. These elements are shown in graphical form in figure 1. An important point to notice is the thickness of the links between Options and Criteria, as it indicates the relative weight of assessment. Although there is also a tabular form of QOC, the graphical one is deemed to be more illustrative and natural to use.
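To make the notation concrete, the sketch below shows one possible in-memory representation of a QOC design sub-space. It is our own illustration rather than part of the original GISADA tooling: the class names, the numeric weight standing in for link thickness, and the example weights are all assumptions invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Option:
    name: str
    # Criterion name -> assessment weight; a larger weight stands in for
    # a thicker (more strongly supporting) Option-Criterion link.
    assessments: dict = field(default_factory=dict)

@dataclass
class Question:
    text: str
    options: list = field(default_factory=list)

    def best_option(self) -> Option:
        # Pick the Option whose supporting links carry the most weight.
        return max(self.options, key=lambda o: sum(o.assessments.values()))

# Illustrative fragment, anticipating the information-refinement issue
# of Sect. 4.1; the weights are invented for the example.
q = Question("In which order is information presented to the user?")
q.options = [
    Option("layer-based view first", {"ease of spotting a place": 1}),
    Option("subject-based view first",
           {"ease of spotting a place": 2, "fit with adaptivity": 3}),
]
print(q.best_option().name)  # -> subject-based view first
```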
3 Requirements Capture of an Adaptive Geographical Information System In order to capture the requirements of the GIS, we conducted an empirical study and performed requirements analysis. The empirical study involved the distribution of questionnaires to 299 potential users of a GIS about Kefallonia and the analysis of the results. The empirical study revealed that the potential users of such a GIS would be environmentalists, ecologists, students of environmental departments, tourists, residents of the islands as well as public authorities. Such users may have different interests, needs and background knowledge. The information collected during the empirical study was used for the design of the stereotypes. In the case of GISADA, stereotypes are used for categorising users with respect to their relation to Kefallonia and their computer skills. Stereotypes categorise users that share similar characteristics and are considered a powerful mechanism for building user models [9]. This is due to the fact that stereotypes represent information that enables the system to make a large number of plausible inferences on the basis of a substantially smaller number of observations [14, 15]. In GISADA, the first stereotype provides information about the users’ interests and background knowledge in environmental matters whereas the second gives information about the users’ knowledge about computers and the Internet.
4 Core Design Issues Four main categories of design issues (“design sub-spaces”) were identified. The design team, in which 4 human experts in software engineering participated, identified the issues, discussed them in group sessions and then recorded them. Then, in subsequent meetings, the design rationale was presented and further discussion confirmed and filed the decisions. The steering team, in which 2 human experts in software engineering participated, had to make sure all arguments were considered and then place the outcome alongside the rest of the project documentation. The main issues considered are presented in the next subsections. 4.1 Degree of Information Refinement The issue that came up first had to do with the initial paths of a user first entering the system, an issue which is central to the acceptability of GISADA and which provided a field of discussion for the user population.
Fig. 2. Design rationale for information refinement issue
The issue was the order in which information is presented to the users. The first option was to provide a layer-based view first, having the user first choose how close to the geographical area the view should be and then pick a view related to their subject of interest. The second option was to provide a view based on the subject of interest of the user. It came out that the concept of an adaptive system, with the subject being determined first by the user’s choice, would fit more naturally with a subject-based order of information presentation. Furthermore, when starting off from a bird’s eye view and zooming into the area of interest without a specific subject, it could be difficult to spot the right place without having the map on the right theme. Therefore, a subject-based approach was adopted. 4.2 Determining Representation At this point, the issue was the type of representation to be used at different views of the system. By following the defined user stereotypes, at each point in the user’s path, the representation to be used is the one determined by the user model. However, certain user groups may be keen on engaging different representations at the same or at different levels. The decision followed the philosophy of the previous questions, i.e. that the system should remain strict about the user model stereotypes and that the adaptivity feature should be supported by the user interface at all system views. More specifically, it was decided that at this point GISADA should be a system that “leads” the user to a path determined by their background. That was judged as an essential feature given the diversity of the user population. 4.3 Type of Adaptation The two main adaptive hypermedia techniques that exist are: (i) adaptive presentation, where adaptation is performed at the content level and (ii) adaptive navigation support, which is performed at the link level [2]. Adaptive presentation techniques adapt the content of a page according to the information that the system maintains about the user accessing that page. The idea of adaptive navigation support, on the other hand, is to augment the links with some form of comments, which can tell the user more about the current state of the nodes behind the annotated links. The issue here was which type of adaptation would be more appropriate for providing environmental information. As GISADA uses information in text form, multimedia elements and links, the decision was to use a combination of adaptive presentation
Fig. 3. Rationale for representation
Fig. 4. Rationale for adaptation
and adaptive link annotation. Adaptive presentation techniques have been used in GISADA only for adapting textual items. Furthermore, the system uses different font types and icons to annotate links of other topics so as to reflect how interesting each particular topic would be for the user interacting with the system. 4.4 Type of Topography The last issue of the initial phase of design space analysis was the type of topography to be used. At this stage, most topographers who participated in the requirements questionnaire would be content with the “basic” information, which would include city planning, parks etc. On the other hand, certain tasks that are part of local development require more advanced, even specialized, information such as exact dimensioning of squares, estates etc. The decision here was to go for the advanced topography, given that there may be some sort of user interest in it, without harming users with a lower level of knowledge of the system; i.e. a novice topographer, or one on a simple task, would not be disoriented by being exposed to a larger volume of topographical information.
Fig. 5. Design space analysis for the issue of topography
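As a concrete illustration of the adaptation decision of Sect. 4.3, the sketch below shows how adaptive link annotation could be realised in code. It is a hypothetical rendering only: the interest thresholds, font styles and icon names are our assumptions, not GISADA’s documented implementation.

```python
# A hypothetical sketch of adaptive link annotation: the link is kept,
# but its font and icon reflect the user's inferred interest in the
# topic. Thresholds, fonts and icons are illustrative assumptions.
def annotate_link(topic: str, interest: float) -> dict:
    if interest >= 0.7:
        return {"topic": topic, "font": "bold", "icon": "star"}
    if interest >= 0.4:
        return {"topic": topic, "font": "regular", "icon": "dot"}
    return {"topic": topic, "font": "small", "icon": None}

# Interests as inferred from the active stereotype (placeholder values).
user_interests = {"monuments": 0.9, "geomorphological features": 0.2}
annotated = [annotate_link(t, i) for t, i in user_interests.items()]
```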
5 Towards a Unified Design In view of the above, the GISADA Geographical Information System was developed. The described GIS operates over the Web and contains data about the physical and anthropogenic environment of Kefallonia, an island of Greece. The particular island has great touristic as well as ecological interest due to its geographical position. For this reason, the information that is maintained in such a GIS would be of interest to a great variety of users. However, different kinds of users may have different interests, needs and background knowledge. For example, tourists and/or residents of the islands would prefer to find cultural information, e.g. about monuments and churches. Environmentalists, ecologists and researchers, on the other hand, may seek low/high resolution satellite data, which are used for the estimation, charting, characterization and classification of various environmental parameters such as land cover/usage, geomorphological features, etc. In view of the above, the main characteristic of GISADA is that it can adapt its interaction to each individual user. In order to adapt the information provided to the interests and background knowledge of each user interacting with GISADA, the system incorporates a user modelling component. This component maintains information about the interests, needs and background knowledge of all categories of potential users. The information that is collected for every category of users has been based on the analysis of the results of the empirical study. The analysis of the questionnaires revealed that the potential users of the Web GIS could be divided into five main categories:
• Residents of the islands
• Tourists
• Local authorities of Kefallonia
• Environmentalists / Researchers
• Students of the department of ecology and the environment in the Technological Educational Institution of the Ionian islands, which is located in Kefallonia.
For every category of users, the empirical study revealed what their interests are. This information was used for designing the stereotypes. Every time a new user is connected to GISADA, s/he has to answer some questions about him/herself. This
information refers to the user’s personal data (name, surname), his/her relation to and interest in Kefallonia, as well as his/her self-assessed level of expertise in computers and the Internet. Based on his/her responses, the system assigns each user to one of five stereotypes according to his/her relation to and interest in Kefallonia (resident, tourist, local authority, environmentalist, student) and to one of three stereotypes according to his/her level of expertise in computer skills (beginner, intermediate, expert). The main reason for the application of stereotypes is that they provide a set of default assumptions, which can be very useful during hypothesis generation about the user. The default assumptions of the resident, tourist, local authority, environmentalist and student stereotypes provide information about the user’s interests related to Kefallonia. For example, a tourist is mainly interested in monuments and churches whereas an environmentalist would probably seek geomorphological features. Default assumptions of the activated stereotypes are used in combination with a decision-making theory for making inferences about the topics that seem most suitable for each user. The topics that have been selected by the above-mentioned algorithms as most appropriate for each user are presented in GISADA using adaptive link annotation techniques.
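The double stereotype assignment just described can be summarised in a short sketch. The five relation stereotypes and three expertise stereotypes come from the text; the default interest values and the helper names are illustrative placeholders, since the paper does not publish the actual defaults.

```python
# Stereotype dimensions taken from the text; default interest values
# are placeholders, as the paper does not publish the actual numbers.
RELATION = {"resident", "tourist", "local authority",
            "environmentalist", "student"}
EXPERTISE = {"beginner", "intermediate", "expert"}

DEFAULTS = {
    "tourist": {"monuments": 0.8, "churches": 0.7},
    "environmentalist": {"geomorphological features": 0.9,
                         "land cover/usage": 0.8},
    # ... defaults for the remaining stereotypes omitted for brevity
}

def activate_stereotypes(relation: str, expertise: str) -> dict:
    """Build an initial user model from the questionnaire answers."""
    if relation not in RELATION or expertise not in EXPERTISE:
        raise ValueError("unknown stereotype")
    return {"relation": relation,
            "expertise": expertise,
            "interests": dict(DEFAULTS.get(relation, {}))}

model = activate_stereotypes("tourist", "beginner")
```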
6 Conclusions In this paper we present the design rationale of an adaptive GIS called GISADA. The necessity to employ design rationale techniques to record the decision-making process behind the design of the GIS came from the very nature of the project: a large number of users of different types and a considerable number of design stakeholders. The QOC notation was used throughout the design in order to serve as a basis for communication among all those involved. In the initial stages, designers used it straight after the meetings in order to agree upon a design. After that, design rationale was used alongside conventional documentation in order to justify the decisions made. It came out that design rationale was very useful as it helped the design team to be confident about decisions. Through the use of design rationale, previous experience was embodied into the design of GISADA, as designers recalled previous implementations in the process of justifying certain ratings on the criteria. Some designers with high expertise in the GIS field even proposed criteria which had not been considered at the time. This shows that QOC helped enhance the design space. Another benefit of the usage of design rationale is that it helped the project team to maintain a consistent course throughout the decision making of the project. In that way, decisions were avoided that would allow an excessive degree of freedom from the point of view of the end user: too open to navigation, too open to different user fits, too open to different interaction styles. It helped us to maintain a system that is consistent throughout the development and “faithful” to its users through the relevant profiles. The last benefit comes from being able to expand GISADA further in the future. Design rationale makes it possible for future enhancements to be made, and for new people involved in the project to quickly get to grips with the logic of the original design.
Future work on GISADA stems from the physical distance among project members, even among members of the same project team, as Kefallonia, Patra and Athens are the bases of project members. What would be interesting to explore in the future is the capability to perform such argumentation sessions remotely. This leads us to employ techniques of web-based issue tracking which enable remote team members to take part in decision-making processes. This can boost the process of rationalizing the project and speed up project flow.
References 1. Brice, A., Johns, B.: Improving process design by improving the design process, QuantiSci Technical Report, Ref. QSL-9002A-WP-001 (1998), http://www.quantisci.co.uk/drama 2. Brusilovsky, P.: Methods and techniques of adaptive hypermedia. User Modeling and User Adapted Interaction 6(2-3), 87–129 (1996) 3. Buckingham Shum, S., Slack, R., Daw, M., Juby, B., Rowley, A., Bachler, M., Mancini, C., Michaelides, D., Procter, R., De Roure, D., Chown, T., Hewitt, T.: Memetic: An Infrastructure for Meeting Memory. In: Proc. 7th International Conference on the Design of Cooperative Systems, Carry-le-Rouet, France, May 9-12 (2006) 4. Buckingham Shum, S., Hammond, N.: Argumentation-based design rationale: what use at what cost? International Journal of Human-Computer Studies 40, 603–652 (1994) 5. GDSS Quest Map [computer software] Group Decision Support Systems, Washington, D.C. (1996), http://www.gdss.com/QM.htm 6. Heliades, G.P., Edmonds, E.A.: On facilitating knowledge transfer in software design. Knowledge-Based Systems 12(7), 390–396 (1999) 7. Kabassi, K., Virvou, M., Tsihrintzis, G., Kazantzis, P.G., Papadatou Th.: Adaptive Presentation of Environmental Information in a GIS. Protection and Restoration of the Environment VIII (2006) 8. Kabassi, K., Virvou, M., Kazantzis, P.G., Desiniotis, C.D.: Personalised Geographical Information System: An Empirical Study. In: Advances in Data Analysis: 30th Annual Conference of the German Classification Society (GfKl), March 8-10, Free University of Berlin (2006) 9. Kay, J.: Stereotypes, student models and scrutability. In: Gauthier, G., VanLehn, K., Frasson, C. (eds.) ITS 2000. LNCS, vol. 1839, pp. 19–30. Springer, Heidelberg (2000) 10. Lee, J.: Extending the Potts and Bruns model for recording design rationale. In: Proceedings of the 13th International Conference on Software Engineering, pp. 114–125. ACM Press, New York (1991) 11. MacLean, A., Young, R., Bellotti, V., Moran, T.: Questions, options, and criteria: elements of design space analysis. Human-Computer Interaction 6(3&4) (1991) 12. Moran, T.P., Carroll, J.M.: Overview of design rationale. In: Moran, T.P., Carroll, J.M. (eds.) Design rationale: Concepts, techniques and use. Lawrence Erlbaum Associates, Hillsdale (1997) 13. Potts, C., Takahashi, K., Anton, A.I.: Inquiry-based requirements analysis. IEEE Software 11(2), 21–32 (1994) 14. Rich, E.: Stereotypes and User Modeling. In: Kobsa, A., Wahlster, W. (eds.) User Models in Dialog Systems, pp. 199–214 (1989)
15. Rich, E.: Users are individuals: individualizing user models. International Journal of Human-Computer Studies 51, 323–338 (1999) 16. Tate, A., Buckingham Shum, S.J., Dalton, J., Mancini, C., Selvin, A.M.: Co-OPR: Design and Evaluation of Collaborative Sensemaking and Planning Tools for Personnel Recovery, Open University Knowledge Media Institute (2006)
Multimedia, User-Centered Design and Tourism: Simplicity, Originality and Universality
Francisco V. Cipolla Ficarra 1,2 and Miguel Cipolla Ficarra 2
HCI Lab. – F&F Multimedia Communic@tions Corp.
1 ALAIPO: Asociación Latina de Interacción Persona-Ordenador
2 AINCI: Asociación Internacional de la Comunicación Interactiva
Via Pascoli, S. 15 – CP 7, 24121 Bergamo, Italy
[email protected], [email protected]
Abstract. In the current work, guidelines are presented for the implementation of multimedia/hypermedia systems with the goal of fostering ecology and cultural heritage in natural and/or rural areas. The content of these systems has three elements which are interrelated in a triadic relation: simplicity, originality and universality. Additionally, elements of the navigation and structure design categories are preferred by those users who inhabit rural areas. An analysis of the guided tour component is carried out in a comprehensive way across two design categories: navigation and structure. Keywords: Multimedia, Human-Computer Interaction, User-Centered Design, Communicability, Usability, Heuristic Evaluation, Semiotics, Tourism.
1 Introduction There are several design models of multimedia/hypermedia systems, such as Tompa [1], AHM [2], OOHDM [3], etc., and plenty of guidelines for the design of interactive systems [4] [5]. However, in none of these has either a communicative strategy or a design vademecum, i.e. a set of guidelines, for hypermedia in rural areas been established. The communicative richness of multimedia systems can boost the use of the different types of structural components at the intersection of three main areas. Firstly, an improvement of access to multimedia information for users in rural environments. Secondly, communication techniques that can fuel interest towards ecological content and cultural heritage. Thirdly, globalization of hypermedia content. In the current work this intersection zone has made it possible to quickly develop a multimedia system on off-line support that is simple and has low production costs. The strategies used have allowed a very satisfactory cost/quality equation to be maintained. A series of experiments in the usability laboratory using such techniques as observation, questionnaires, user surveys, beta-testing and videotaped sessions [6] [7] have confirmed the results obtained with the strategies followed in the implementation of the system, which will be presented further on. Next follows an explanation of each one of the intersection areas.
2 User-Centered Design: Simplicity There are several alternatives available in the provision of access to information in hypermedia systems, mainly related to the support of the database or hyperbase [1] and the potential users. The way of structuring the information always requires a previous analysis of the empathy, or profile, of the potential users of the system. The users can be split into several categories according to age, education, degree of interest in the stored contents, fruition time, context, the user’s physiological condition, etc. (see Annex #1). Moreover, tests carried out with users in the usability laboratory can remarkably improve the final design of the interactive system, but they increase production costs. The alternative to these labs is to resort to special methodologies for the design of hypermedia systems, for example MEHEM (A Methodology for Heuristic Evaluation in Multimedia) and MECEM (Metrics for the Communications Evaluation in Multimedia) [8]. Navigation introduces a new and unfamiliar workload [9]. The large number of commands that are usually necessary to carry out navigation means that in many instances the users of an interactive system feel lost and confused in the information space. A way to avoid these problems is the use of guided tours [10]. According to Nielsen [11], a guided tour is a superlink which binds a sequence of nodes chosen by the designer of the hypermedia system. A guided tour is a linear path through the information space and it can help users to get acquainted with the contents of the system. An interesting feature is that it allows the user to leave it whenever he/she wants, and to keep on navigating through the rest of the content. Although it prevents disorientation, its main drawback is that it does not allow the user to freely explore the information space. A way to guide the user in this kind of system is the inclusion of maps and diagrams. The goal is to facilitate navigation through the correct use of metaphors in the interface [12] [13]. The basic principle consists of using concepts and models of the modern world that mimic that environment [14]. The use of the map, based on the metaphor of the tourist map, is a very valid option to provide information in that information space. These maps have been named in different ways in the design of hypertext, multimedia and hypermedia systems: global maps [11], conceptual maps [14] and graphical navigators [15]. The maps provide a schematic portrayal of the structure of the interactive system. Within the simple and straightforward style of the third generation of cultural heritage promotion multimedia, the user has access to the three main collection areas [1]. In the first collection we have the guided tours, in the second the map of Little World, and finally the puzzle game area (see Annex #2).
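Following Nielsen’s definition quoted above, a guided tour can be modelled as a designer-chosen linear sequence of nodes that the user may traverse in both directions and leave at any moment. The minimal sketch below is our own illustration; the node names and method names are assumptions, not part of the Little World implementation.

```python
class GuidedTour:
    """A guided tour as a 'superlink': a designer-chosen linear
    sequence of nodes that the user may leave at any point."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.position = 0

    def current(self):
        return self.nodes[self.position]

    def forward(self):
        if self.position < len(self.nodes) - 1:
            self.position += 1
        return self.current()

    def back(self):  # the guidelines recommend bi-directional tours
        if self.position > 0:
            self.position -= 1
        return self.current()

    def leave(self):
        # Returning the current node lets the user resume free
        # navigation of the hyperbase from where the tour stopped.
        return self.current()

tour = GuidedTour(["square", "clock tower", "port", "old map"])
tour.forward(); tour.back()
```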
3 Original Communication Techniques for Rural Tourism From a communicative point of view, the hypermedia system can be regarded as an interactive process, in which there is a set of actions that have been organized by the designer and that must be carried out by the user. One of the functions of the interaction is to establish access to each of the components. Consequently, there is an omnipresent role of the designer towards the user at the moment at which the latter has to carry out a series of actions. These actions relate the interaction to the navigation. The interaction in the navigation consists of overcoming the user’s pauses
and of keeping the communicative process advancing in the pre-selected direction. There are two purposes in this interaction process. The first is the communication of the contents, establishing feedback between the user’s actions and the components of the system. The second purpose is to build up the access relationship to the several components of the static media (i.e., pictures, graphics, drawings, maps, etc.) and dynamic media (audio, video and animations) of the hypermedia system. To speed up this interaction process, it is necessary to resort to the isomorphism and isotopies that are studied by semiotics. The use of isotopies is very positive among users who have little previous experience either in the use of computers or in the navigation of interactive systems on-line and off-line [8]. The quality criteria for the design of interactive systems named isomorphism and isotopies have been used to bolster the ecological content and rural patrimony. Isomorphism determines the same form of presentation (topology in the distribution of elements in the frame), the behaviour of dynamic media, the different ways of structuring the organization of the content (passive media such as text, photos, maps, graphics, etc. and active media such as animation, video, sound) and the synchronization between passive and dynamic media [10]. An example is the same place being maintained inside the frame for the set of navigation keys, be it in frames that belong to a manual or automatic guided link, a sequential collection or a perspective. These formal features act as common denominators for some of the constituents of an entity, thereby establishing a global and distinctive wholeness in the application. Isomorphism seeks regularity within irregularity. This is the main difference between it and consistency. Consistency means verifying that those elements which are conceptually equal have similar behaviours, while those which are different have different behaviour patterns. There is a direct underlying relationship which can be shown by two relationships: an equality relationship and an inequality relationship. For example, if A = B and B = C, then A = C (an equality relationship between A, B, C); if A ≠ B and B = C, then A ≠ C (an inequality relationship between the components). Isomorphism goes beyond this regularity or direct relationship because it seeks in the inequality relationship (in the example, the variable factor C as related to A and to B) those elements that remain equal and that somehow make it possible to maintain homogeneity and coherence between the constituents of the categories of the multimedia/hypermedia system. These common denominators are called isotopies [16]. Greimas borrowed the term isotopy [17]: in structural semantics, isotopy describes the coherence and homogeneity of texts. Greimas develops the theory of textual coherence on the basis of this concept of contextual semes. The isotopies are sense lines that act upon the structures. That is to say, from a semantic point of view, lines are drawn which unite several components of the multimedia in order to help comprehension of the rest of the multimedia system. The sense lines are independent of our location inside a multimedia, since they draw a unity in relation to the four basic categories used in the heuristic analysis of the system.
For example, if we are inside a guided tour or on the first frame of a certain entity type, we can detect that there is a set of elements belonging to the presentation of the content which do not change (typography, the background to the frames, the positioning of the navigation keys, the icons that represent the navigation keys, the kind of transitions between different frames, the speaker's voice, etc.). The lines link those elements which remain
identical among themselves and which belong to the different categories of design: presentation or layout, content, dynamism, panchronism and structure [8]. Examples are the equality existing in the guided tours and in collections; the colour and the typography in the presentation; the organization of the textual content, such as the inverted pyramid; the activation and deactivation of the dynamic media; and the way to reach the hyperbase and the structure of the whole set of nodes. By means of the isomorphism notion it can easily be detected whether the system is easy to use, easy to learn and whether it avoids mistakes in the interaction with a user. These invisible lines which link the design categories also have an influence on the quality of the system. In the CD-Rom, isomorphism and isotopies are distributed in each of the main categories of design: presentation or layout, content, navigation, structure and panchronism. For example, in the layout, isomorphism is present from the topological point of view in the distribution of the navigation keys, the area for the photographs, etc.; these are maintained in an identical way in each one of the languages that make up the system. Isotopy is present in the use of colours, typography, colour pictures, etc. Some of the communicative (stylistic) strategies used in the elaboration of the interactive system were the following. While working in rural areas it is usually advisable to create different versions of the interface, but always keeping a simple and sober style. The colours of the sand, the terracotta bricks of the buildings in the area, the mixture of light brown and grey blue of the river water of the plain, the green of the trees, the soft light blue colour of the sky, and the clouds are well accepted by the population and potential users of Southern Europe’s rural areas. Introducing elements of the worked field (ploughs, tractors, carts, animals, etc.) is very positive because there is a quick integration of the user with the daily context, even if the content makes reference to other elements of the cultural heritage (museums, historical archives, statues, monuments, etc.), irrespective of the geographical and political limits, i.e., province, region. The area to which the CD-Rom refers encompasses four provinces of the Emilia Romagna region (Reggio Emilia, Parma, Piacenza and Mantova), where the plain prevails and where numerous watercourses flow which are tributaries of the river Po. Consequently, the use of an old map from the 18th century, with its rivers, mountains, etc., has made acceptance of the interface easier. It is advisable to follow the natural limits (rivers, mountains, etc.) together with the area’s cultural heritage, for example a clock tower, a port, a school, a square, a council building, etc. The illustrations that should predominate are pictures of spring, especially photographs. The historical constructions that reflect cultural heritage have been photographed from different angles (25-30 degrees in relation to the horizontal of the ground) to give them greater importance; that is to say, they are not “face-on” photographs parallel to the horizontal line of the ground. The purpose of this technique is to break away from the horizontality of the plain. Additionally, in the usability tests carried out by the users it was found that they preferred photographs with different angles instead of the classical photographs taken face-on to the subject, with a long or general take.
In the pastime area, 27% of the users accepted photographs taken face-on. Moreover, the pictures that are presented in the initial animation have also been slanted as if they were stamps or scattered postcards. The final objective was to intensify this union between nature and people's identification with their natural context.
4 Design and Universality of the Contents In the last few years, because of the widespread international circulation of content related to tourism, there has been greater bi-directional feedback between social sciences and empirical sciences. Currently there is a trend in software engineering [18] [19], in the design of hypermedia systems [20] and in the environment of social communication [21] [22] to infer more about the potential requirements of international users [23]. Some of these works are aimed at disabled people or those with little experience in the use of computers [24]. These users make it necessary in some cases to verify the efficiency of the tools and assessment techniques used in the design of the systems and the guidelines [25]. Currently there are no guidelines for the development of hypermedia systems directly related to ecology and cultural heritage. In our case, we have used a test carried out with users from several rural areas of Europe. The results we obtained have allowed us to make continuous adjustments in the design to reach the required globalization. In the first place, languages. The CD-Rom has six languages: English, French, Spanish, German, Italian and Portuguese. The length of the texts has been kept consistent with the number of nodes that make up each collection. Additionally, a journalistic style has been chosen, where information is summed up as succinctly as possible. The international inverted pyramid system has been used, with a third of the content omitted, since the user can go deeply into the subject, if interested in it, through the links to websites. The typography used is standard for those languages: Arial, Times and Verdana. Secondly, the iconography used for navigation is very standard (arrows). No special icon has been made that could generate ambiguity in its interpretation [13]. Thirdly, it was decided to insert an internationally understood game, the puzzle. In this game the potential types of user have been taken into account, establishing different numbers of pieces, for example 12, 20, 35 and 80. Fourth, the music that accompanies each one of the collections of the hypermedia system belongs to the jazz genre. Some well-known musical themes that prompt the user to keep on navigating through the system have been included, for instance a classical piece at the moment of driving a car. Fifth, the photographs that make up the different frames have been chosen according to a criterion of mixing local aspects, such as gastronomy in the case of tourism, with global landscapes typical of the plain areas. The visual content has been organized in such a fashion that there is a continuous two-way street relationship between local and global. Sixth, the austerity and design simplicity of the different interfaces stands in sharp contrast with the richness of the static contents, such as the colours of the photographs.
5 Guidelines for Promotion of Rural Tourism Throughout the evolution of interactive systems and basic ergonomic principles, usability engineering has determined a series of design principles focused on the user, where there are several quality attributes, such as accessibility. However, users have now left behind the stage of merely learning to use multimedia/hypermedia systems. Currently we are in the era of qualitative communication for all. Many traditional notions in the
design of hypermedia systems are, for marketing reasons, being reformulated with new terms, introducing ambiguity in definitions: for example, user-centered design, use-centered design, participative design, etc. Therefore, each one of the items in the guide that is described in what follows focuses on the era of communicative quality. Additionally, the notions of accessibility, consistency, fruition control, naturalness of the metaphor, etc. [8] are structural quality attributes which make up current hypermedia systems and not autonomous entities that can be added to or removed from the interactive systems. In the interactive communication process, the user of the new millennium expects these attributes in the hypermedia/multimedia systems as something that occurs naturally as part of the evolution of the computer science era. It is also important to stress that the task of heuristic assessment in the current work has been carried out by a new profile of professional in interactive communication, where there is an intersection of the formal sciences with the factual sciences. Consequently, the results obtained with the different techniques of heuristic assessment of usability engineering (i.e., observation, user surveys, questionnaires, interviews, beta testing, videotaped sessions, etc.) and the application of quality criteria for communication in the interactive systems [8] have made it possible to draw up the following vademecum, or set of guidelines, taking into account the following design areas: panchronism, content, layout, structure and dynamism:
• To combine access to the information through interactive maps and guided tours.
• To use the highest possible number of bi-directional guided tours, with an average of 20 frames for each of them.
• To resort to the use of ancient maps to eliminate the current divisions of the territory that is to be promoted from the tourist point of view.
• To use the white colour as the background of the interface.
• To insert some object representative of the toil on the earth (preferably ancient) in the interface: for example, some farming machinery, some utensil for the production of typical foods, etc.
• To bolster the use of classical typography and iconography.
• To translate the textual content into several languages for a more widespread international circulation. In the European case, the following are recommended: English, French, German, Spanish, Portuguese and Italian.
• To respect the length of the texts in each one of the languages, that is, not to divide the frames, generating guided tours with different numbers of nodes between one language and another.
• To write the texts in a concise way and according to the technique of the inverted pyramid. To go deeper into the information it is advisable to incorporate links to websites, for example.
• To sum up the whole content aimed at the promotion of the cultural and ecological patrimony in a guided tour. The ideal is to insert images with a small explanatory text that does not go beyond 7-10 words. The text should follow the existing rules for writing an advertising slogan, for example.
• To avoid content related to either political or religious symbols.
• To resort to the use of colour photography, instead of animated or static graphics.
• To include cinematographic framing techniques in the making of the photographs, especially to create images from different angles.
• To widen spring-related contents and incorporate anonymous people in their daily tasks.
• To reduce the use of animations, except those that explain some production process or have didactic purposes.
• To add classical music “with rhythm” for the fruition of the images in the guided tours.
• To favour seriousness rather than humour in this kind of system. Pastimes with a paper support are very positively valued: puzzles, card games, etc.
• To analyze, before starting the hypermedia project, all the legal aspects related to obtaining the permits for the reproduction of the images, especially photographs and films. For example, the law that regulates these subjects in Italy dates from 1941. In many instances, filming and photographing the cultural heritage by the production team of the hypermedia system is not enough to solve all the legal aspects. A way to solve these problems in the multimedia field is to resort to the use of drawings, outlines or paintings of statues, monuments, buildings, etc., made by the members of the production team.
The purpose of these rules is to design hypermedia systems in rural areas aimed at the promotion of cultural heritage and ecology, with an excellent level of quality, in the least possible time and with few financial resources.
6 Conclusions The design and production of hypermedia systems in rural areas requires a series of guidelines in order to speed up the production process while maintaining quality and cutting down production costs. The costs of those systems with cultural content can be higher for legal reasons, due to the reproduction of digital images, on top of the digital support on-line and/or off-line. Sometimes, to overcome these obstacles, it is necessary to resort to 2D and/or 3D reconstruction of monuments, statues, buildings, etc. In our case, digital photography has enabled us to fill in the visual contents of the hypermedia system at very low cost and with great acceptance from the users of the system. A set of these photographs has composed a summary in the shape of guided tours, so that the potential users have a global view of the whole hypermedia in the least possible time and without getting lost in the structure. Access to information in the shape of guided links has made it possible to divide the content of the whole system into three great collections. Using the basic resources in the structure of hypertext systems, such as nodes, links, frames, etc., and allowing bi-directional navigation is very beneficial in this kind of system. Additionally, the incorporation of a real, old map reduces the disorientation problem considerably. The potential users of these systems are adults and older people with little experience in the use of computers. The richness of the dynamic media has been cut down to the
minimum, as has the length of the texts in the different languages. Links to the internet have been inserted for the most experienced users. The results obtained have made it possible to establish a set of rules to be followed when it comes to designing this kind of system, keeping the two-way street relationship between simplicity and universality. The tests that have been made have proven the efficacy of the presented system, and in the near future its content will be transferred, after adaptation, to PDAs. Acknowledgments. A special thanks to Emma Nicol (Strathclyde University) and Carlos and Marco Fredianelli (Alaipo & Ainci – Italy and Spain) for their help.
References 1. Tompa, F.: A Data Model for Flexible Hypertext Database System. ACM Transactions on Information Systems 1, 85–100 (1989) 2. Hardman, L., et al.: The Amsterdam Hypermedia Model: Adding Time and Context to the Dexter Model. Communications of the ACM 2, 50–62 (1994) 3. Schwabe, D., Rossi, G.: The Object-Oriented Hypermedia Design Model. Communications of ACM 8, 45–47 (1995) 4. Apple: Macintosh Human Interface Guidelines. Addison-Wesley, Massachusetts (1992) 5. Helander, M., et al.: Handbook of Human-Computer Interaction. Elsevier, Amsterdam (1997) 6. Nielsen, J., Mack, R.: Usability Inspection Methods. Wiley, New York (1994) 7. Holzinger, A.: Usability Engineering Methods for Software Developers. Communications of ACM 1, 71–74 (2005) 8. Cipolla-Ficarra, F.: Communication Evaluation in Multimedia: Metrics and Methodology. In: Universal Access in HCI, pp. 567–571. LEA, London (2001) 9. Berk, E., Devlin, J.: Hypertext/Hypermedia Handbook. McGraw-Hill, New York (1991) 10. Trigg, R.: Guided Tours and Tabletops: Tools for Communicating in a Hypertext Environment. ACM Transactions on Office Systems 4, 85–100 (1988) 11. Nielsen, J.: Multimedia and Hypertext. AP Professional, Cambridge (1995) 12. Väänänen, K., Henderson, D.: Interfaces to hypermedia: Communicating the structure and interaction possibilities to the users. Computers & Graphics 3, 219–228 (1993) 13. Cipolla-Ficarra, F.: A User Evaluation of Hypermedia Iconography. In: Computer Graphics, GRASP, Paris, pp. 182–191 (1996) 14. Barker, P.: Exploring Hypermedia. Kogan Page, London (1993) 15. McKnight, C., et al.: Problems in Hyperland? A Human Factors Perspective. Hypermedia 1, 167–178 (1989) 16. Nöth, W.: Handbook of Semiotics. Indiana University Press, Indianapolis (1995) 17. Colapietro, V.: Semiotics. Paragon House, New York (1993) 18. Robson, R.: Globalization and the Future of Standardization. Computer 7, 82–84 (2006) 19. Damian, D., Moitra, D.: Global Software Development: How Far Have We Come? Software 5, 17–19 (2006) 20. Nielsen, J., del Galdo, E.: International User Interfaces. Wiley, New York (1996) 21. Ramakrishnan, R., Tomkins, A.: Toward a People Web. Computer 8, 63–72 (2007) 22. Sears, A., Lund, A.: Creating Effective User Interfaces. Software 3, 21–24 (1997) 23. Alm, N., et al.: A Communication Support System for Older People with Dementia. Computer 5, 35–41 (2007)
24. Tscheligi, M., Reitberger, W.: Persuasion as an Ingredient of Societal Interfaces. Interactions 14, 41–43 (2007) 25. Jeyaraj, A., Sauter, V.: An Empirical Investigation of the Effectiveness of Systems Modeling and Verification Tools. Communications of ACM 6, 63–67 (2007)
Annex #1: Classification of the Potential Users
• Age
• Autonomy of interaction
• Education or previous experience with computers
• Aims of fruition
• Localization or internationalization of content
• Time of fruition or interaction.
We can establish different groups of users by age: children and young people (we consider that the minimum age of interaction with a PC or another microcomputer system is four years), adolescents, adults and old people. The physical condition of the user can create a division among users that are 100% autonomous, semi-autonomous or dependent on other people. Knowledge and/or experience in the use of PCs divides users into expert, inexpert, inclined and, obviously, combinations of these groups. The aim of fruition can be consultation, study, analysis, or pastime. If the content of a multimedia system is local, then there are one or two languages or dialects. On the contrary, if the multimedia system is internationally widespread, the number of languages and/or dialects is more than two. Time spent using a system can be divided into: very short (less than 30 minutes), short (more than 30 minutes and less than two hours) and unlimited (more than two hours).
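The fruition-time grouping above is the only dimension with explicit numeric thresholds, so it can be captured directly. The sketch below is a straightforward illustration of those three intervals; how exact boundary values (30 minutes, two hours) are assigned is our assumption, since the text leaves it open.

```python
def classify_fruition_time(minutes: float) -> str:
    """Map time spent using the system onto the three groups above.
    Boundary handling (exactly 30 or 120 minutes) is assumed."""
    if minutes < 30:
        return "very short"
    if minutes <= 120:
        return "short"
    return "unlimited"

assert classify_fruition_time(15) == "very short"
assert classify_fruition_time(90) == "short"
assert classify_fruition_time(180) == "unlimited"
```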
Annex #2: Little World CD-Rom
Fig. 1. Fruition of nature through guided tours. Little World CD-Rom. Blue Herons Editions & Contacto Visual. Bergamo, Italy – Esposende, Portugal (2007).
Fig. 2. Naval museum: area guided tour. Slide show – full screen.
Fig. 3. Interactive maps in ‘Little World’. The importance of photos with angles in cultural heritage, i.e., a square and an old statue – Gonzalo Ferrante defeats the envy.
Fig. 4. Puzzle. Levels for potential users: children, junior, adult and senior.
Dynamically Extracting and Exploiting Information about Customers for Knowledge-Based Interactive TV-Commerce Anastasios Savvopoulos and Maria Virvou University of Piraeus, Department of Informatics, Karaoli Dimitriou St 80, 18534 Piraeus, Greece {asavvop,mvirvou}@unipi.gr
Abstract. Today many e-commerce systems try to extract information about users in order to help users buy the products that they really need more easily. On the other hand, large TV networks are moving towards TV-shopping because customers may find it difficult to shop for products through the internet due to a lack of familiarity with computers. In this paper we propose an interactive TV-shopping application, called iTVMobi, that extracts and exploits users’ information in order to provide automatic personalized recommendations and adaptive help responses. Our system combines recommendations and adaptive help to maximise the efficiency of the system towards customer support. iTVMobi has been evaluated by real users. The evaluation shows that the combination of these two functions in an interactive TV environment really helped customers buy products they really needed. Keywords: Interactive TV, personalized recommendations, intelligent help, user models.
1 Introduction Information about customer needs and preferences is very difficult to acquire. Even human salesmen sometimes find it difficult to fully understand what customers need. Nowadays many e-commerce systems (e.g. Amazon, Buy.com) try to extract information about users through their behavior and exploit this information in order to provide product recommendations. On the other hand, customers may find it difficult to use and buy products through these systems, mainly because a lot of prospective buyers do not have sufficient familiarity with the computers which are used for e-commerce applications. To this end, many large TV networks try to overcome these problems by providing TV-shopping through their channels. They create TV shopping channels and implement TV-shopping applications with the use of interactive-TV technology. However, most existing TV-shopping applications are generic and do not address specific needs, preferences and attributes of individual customers or of certain groups of people. In view of the above, our purpose was to create a novel application for the interactive TV that could automatically help customers buy mobile phones. In this kind of application customers may have problems both with the interaction with the e-shopping application itself as well as with the products to be sold to them, because for both they need to have familiarity with recent technological advances (Internet,
mobile phones etc.). We chose interactive TV because people feel more comfortable with TV than with computers. The application extracts user information from the users’ moves and actions throughout the system. Then it exploits this information and provides automatic recommendations and help to users in order to make shopping easier. The iTVMobi application has two major functions. The first is a recommendation function that provides recommendations about mobile phones through the use of adaptive hypermedia [1] and user modeling techniques. The second function is adaptive help concerning the usage of the system. The system helps customers through multiple user interfaces and an animated help agent with adaptive behavior controlled by a user model that uses the extracted user information. Both functions are based on user grouping techniques through the usage of clustering algorithms. After the creation of iTVMobi, the system was evaluated. For this purpose, we used 50 users in order to evaluate the adaptive recommendation and help facilities of our application. These 50 users had different educational backgrounds and differed considerably in their knowledge of computer usage. The main body of the paper is organized as follows. Section 2 describes related work concerning adaptive help systems, recommendation systems and systems that combine adaptive recommendations and help. Section 3 describes the TV-shopping system we created. Section 4 presents the construction of the recommendation model and adaptive help model. The next section discusses results from the evaluation process of the system. The last section gives conclusions and future work.
2 Related Work The functions provided by iTVMobi are based on two major domains. The first is adaptive recommendation and the second is adaptive help. The use of recommendation systems is widely spread and can be found in many fields such as e-commerce, TV-commerce, program recommendation and many others. 2.1 Adaptive Help Systems Adaptive help systems are systems that try to help users perform tasks by exploiting information that they acquire implicitly through users’ actions. There are many adaptive help systems in various fields of science, such as hydrological time series [11], medical support and patient life quality [12] and economics [13], but very few in the fields of computer support, web support and recommendation. Such a system is AdaptHelp by Iglezakis [6]. She presents an approach that uses the techniques of plan recognition not only to infer short-term plans and goals, but also to infer the long-term procedural knowledge of a user in a non-binary way. The information about the procedural knowledge in terms of activation builds the user model of AdaptHelp, an adaptive help system for web-based systems. AdaptHelp creates user models based on web logfiles and tries to help users by using XML. On the other hand, iTVMobi does not use XML data to create user profiles. Instead it combines client-server architecture and databases to exploit user information extracted from users’ behavior throughout the system. Another system in this field [8] is a multimodal navigational system that learns from the cognitive load of users and then categorizes them into two different stereotypes:
elderly and average-aged adults. Despite the fact that their application has many modes of functioning, including speech, that can help elderly people, it does not focus on the topic of group categorization. This means that a person may be elderly but can have different problems from another elderly person. Our system uses a user model that is based on a clustering technique in order to categorize elderly people according to their own specific problems concerning navigation, thus addressing the specific needs of one elderly user as against another. In [10], a system called the Unified User Interface is presented. That system is a framework that can adapt to users depending on their age and kind of disability by creating polymorphic user interfaces. In their work they apply the Unified User Interface to a health application scenario, namely the MediBridge C-Care web-based EHR system. The polymorphic interfaces are produced through rules over the “tasks” that the user performs. On the other hand, iTVMobi, instead of having specific tasks by which to categorize the users, creates dynamic groups of users based on their navigational mistakes while they interact with the system. Our system creates these groups through the use of a user model that uses clustering techniques to dynamically produce groups of similar users. 2.2 Recommendation Systems E-commerce recommendation systems are widespread in our days. Many approaches have been made towards useful product recommendation. Such an approach has been made in [7]. That paper proposes personalized recommendation agents called fuzzy cognitive agents. Fuzzy cognitive agents are designed to give personalized suggestions based on the user’s current personal preferences, other users’ common preferences, and experts’ domain knowledge. Fuzzy cognitive agents are able to represent knowledge via extended fuzzy cognitive maps, to learn users’ preferences from the most recent cases and to help customers make inferences and decisions through numeric computation instead of symbolic and logic deduction. In contrast, in our system we use a clustering algorithm to group users according to the information that iTVMobi gets through their behavior. In particular, our system extracts representatives from a clustering of users’ tastes and then uses these representatives to recommend its products. Another interesting approach has been made by Choi et al. [2]. They chose a multi-attribute decision making method to find similar products. In their system, the customer must first order a product in order for the system to propose a similar one. On the other hand, in our system a product can be proposed even if a customer has not bought anything. Moreover, our system uses a clustering technique in order to acquire representative vectors of taste percentages. Another recommending system that uses clustering techniques in order to group products is the system proposed by Guan et al. [5]. In their system they use an explicit method of ranking to acquire generic attributes from products and then cluster new attributes into the different groups of generic attributes using the k-NN algorithm. On the other hand, in our system, instead of asking customers to rank products, we observe their behavior and interaction with the system. In our system we do not cluster similar product attributes but similar taste percentages in product attributes. In this way, instead of creating a system that is based on product attributes, we create a system that
is based on customer taste percentages. Moreover, instead of asking the customer to define which attribute s/he likes more, our system learns it from his/her high taste percentage. An important field in recommendation that has a large affinity with interactive TV is TV program recommendation. Important steps have been made in this field too, e.g. [9, 4]. However, in those approaches the aim is to help users find a program of interest to them, which is quite a different domain from the one in our system. In iTVMobi, the product to be sold (mobile phones) and the use of I-TV in a non-passive way that is different from the one the elderly know, such as watching programs, raise research problems that are also different. 2.3 Combined Systems Despite the fact that there are systems that have been built either for recommendation purposes or for help purposes towards specific groups of people, very few systems have been constructed to combine the above two functions. However, in the case of elderly people, both functions would be needed, based on the individual user model created for a particular user. This is so because the elderly need extra help both on the kind of product they need to buy as well as on the use of the TV-shopping application itself. Otherwise the aim of the elderly buying something may not be achieved and the elderly may abandon the application. One of the very few systems that try to combine recommendation and help for the elderly is the system constructed by Fink and Kobsa [3]. Their system is called AVANTI and its aim is to evaluate distributed information and provide recommendations, through adaptive hypermedia, about a metropolitan area for a variety of users including the elderly. On the other hand, our system is an interactive TV commerce application that recommends products. Our system can be accessed by all users that possess an interactive TV.
3 The Experimental Personalized Interactive TV System – iTVMObi

iTVMObi is an adaptive mobile-phone shop, created for interactive television, that learns from customer preferences (Figure 1). iTVMObi was built on Microsoft TV (MSTV) technology. The core components of MSTV are available in the Windows XP operating system and can run on personal computers (PCs). MSTV technology can be used within a familiar and mature Integrated Development Environment (IDE): Microsoft Visual Studio offers a multitude of tools for designing, developing, testing and deploying an application. iTVMObi can be used by a telemarketing channel to sell mobile phones in a personalized way. Its aim is to help customers of all ages by suggesting the best mobile phone for them. To help customers make the best buy, iTVMObi performs two functions: the first is a recommender system that makes suggestions concerning mobile phones, and the second is an adaptive help system that generates help tailored to the users. Both functions are based on user modeling. The system learns about the users' preferences and mistakes and provides increasingly helpful responses. User models are created using clustering algorithms; these techniques are explained more thoroughly in the next section.
For every user, iTVMObi creates a separate record in a database. In iTVMObi every customer can visit a large number of mobile phones. For the purposes of our research we have implemented the system for five popular mobile brands: Nokia, Sony Ericsson, LG, Sharp and Samsung. Every customer has his/her own personal shopping cart. Customers who intend to buy a phone simply move the phone into their cart by pressing the corresponding button, or they can press the buy button on their remote control while the specific product is shown on their TV screen. They can also remove one or more phones from their cart by choosing to delete them. After deciding which phones to buy, a customer can easily purchase them by pressing the "buy" button in the shopping cart. All navigational moves of a customer are made through the TV remote control and are recorded by the system in the statistics database. In this way iTVMObi saves statistics concerning the visits to the different brands and to specific phones individually. The same type of statistics is saved for every customer and every phone that is moved to the buyer's cart, and likewise for the mobile phones that are eventually bought by every customer. All of these statistical results are scaled to the unit interval [0, 1].
Fig. 1. iTVMobi infrastructure framework
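To make this bookkeeping concrete, the following minimal sketch (our own illustration with hypothetical names, not the authors' implementation) keeps per-customer counters for visits, cart moves and purchases, and scales them to the unit interval:

from collections import defaultdict

class CustomerStats:
    def __init__(self):
        # Raw per-brand counters: page visits, cart placements, purchases.
        self.visits = defaultdict(int)
        self.carted = defaultdict(int)
        self.bought = defaultdict(int)

    def record(self, event, brand):
        # Every navigational move made with the remote control is counted.
        {"visit": self.visits, "cart": self.carted, "buy": self.bought}[event][brand] += 1

    @staticmethod
    def scaled(counters):
        # Scale raw counts to [0, 1], as the statistics database stores them.
        total = sum(counters.values())
        return {brand: count / total for brand, count in counters.items()} if total else {}

The same structure can be kept per individual phone as well as per brand, mirroring the two levels of statistics described above.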
In particular, iTVMObi interprets users' actions in a way that serves two different functions. The first is the calculation of users' interest in individual phones and production companies, and the second is the interpretation of users' actions with respect to possible navigation mistakes. Each user action contributes to the individual user profile, either by showing degrees of interest in one or another company or individual phone, or by showing the likelihood of a specific mistake. For example, a user's visit to a phone icon shows that user's interest in the particular phone and its brand. If the user puts this phone into the shopping cart, this shows more interest in the particular phone and its brand. If a user buys this phone, this shows even more interest, whereas if the user takes it out of the shopping cart before payment, the interest counter is not increased. On the other hand, if a user follows a
different pattern of navigational moves, such as repeated clicks on the same brand name, the system interprets this action as a confusion navigational mistake rather than as a high degree of interest in this brand. Thus, in this case, the system decides to intervene with an adaptive help action. Apart from the brands already presented, other features taken into consideration by iTVMObi for the customer interest degree are the following: phone price range, phone technical features, phone size, phone connectivity, phone display features, phone memory capabilities and phone battery autonomy. The price of every phone belongs to one of five price ranges: 100 to 250 €, 251 to 400 €, 401 to 600 €, 601 to 800 € and over 801 €. As for mistake degrees, we consider the following: the difficulty of the user in seeing brands' names, the difficulty in seeing phones' names and pictures, and the confusion degree. Suggested phones are presented in the suggestions window with the help of adaptive hypermedia [1]. Moreover, iTVMObi uses an animated agent to inform and help the users throughout the system (Figure 2). This agent can give information about the usage of the system if a user cannot understand some sections. It is also used by the adaptive help system to help users when they are confused.
Fig. 2. Left: Screenshot of iTVMobi user interface. Right: Sample shots of the films database.
4 Building Individual User Models

iTVMObi consists of two modules, the recommender module and the adaptive help module. Both modules are based on user models that are constructed using the k-means algorithm. The recommender function is based on the principle that many customers tend to have similar interests. Every customer's interest in one of the phone features described above is recorded as a percentage of his/her visits to the respective phone pages. For example, the degree of interest of a customer in a particular phone brand is calculated by the following equations:

DegreeOfInterestInCompany1 = VisitsInPhonesBelongToCompany / VisitsInAllPhones    (1)

DegreeOfInterestInCompany2 = PhonesPlacedInBasketBelongToCompany / AllPhonesPlacedInBasket    (2)

DegreeOfInterestInCompany3 = PhonesBoughtBelongToCompany / AllBoughtPhones    (3)

DegreeOfInterestInCompany = WC1 · DegreeOfInterestInCompany1 + WC2 · DegreeOfInterestInCompany2 + WC3 · DegreeOfInterestInCompany3    (4)
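As a concrete illustration, equations (1)-(4) could be evaluated from the recorded counts as in the following sketch (ours, not the authors' code; the weights are supplied by the system as described below):

def degree_of_interest_in_company(visits_company, visits_all,
                                  basket_company, basket_all,
                                  bought_company, bought_all,
                                  wc1, wc2, wc3):
    # Equations (1)-(3): each partial degree is a simple proportion in [0, 1].
    d1 = visits_company / visits_all if visits_all else 0.0
    d2 = basket_company / basket_all if basket_all else 0.0
    d3 = bought_company / bought_all if bought_all else 0.0
    # Equation (4): weighted sum of the three partial degrees.
    return wc1 * d1 + wc2 * d2 + wc3 * d3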
As the previous equations show, the interest in a phone feature (for example, the company that the phone belongs to, in the case above) is measured in three ways. The full interest is then obtained as a weighted sum of the three different degrees of interest: the degree corresponding to the user's visits to the phone pages, the degree corresponding to the phones the user placed in his/her basket, and the degree corresponding to the phones the user bought. The weights used by the system are different for every phone feature and are calculated based on the way the system is used. For example, a user who chooses to visit a phone through its icon has no way of knowing the display capabilities of that phone before opening the specific phone page; opening a phone page through its icon may therefore not mean that the user is interested in the phone, but merely that s/he is browsing several phones. On the other hand, every company's name is displayed from the very beginning to every user, so the user is aware of the company s/he selects to visit, which makes the selection more accountable. As a result, the weight WC1 used to measure the interest in a company from the user's visits is bigger than the weight WD1 used to measure the interest in phone display from the user's visits to different phones.

The recommender module uses the k-means clustering algorithm to create representatives of customer groups, which the system uses to make buying suggestions. The recommender takes as input the statistical data, described above, on the navigational moves of every customer and feeds them to the clustering algorithm. The clustering algorithm provides the recommender with clusters (groups) of customers who have similar tastes. The recommender module takes these results and calculates the representative of every group. Every time a customer uses the system, the recommender module finds his/her representative and proposes phones based on the representative's taste percentages through the use of adaptive hypermedia. For example, if the recommender finds a phone that is very close to the representative's tastes, this phone is marked as a "recommended" product and is given a different type of indicator than a phone that is farther from the representative's tastes. If a new user enters the system, the recommender classifies him/her into the group that has the largest number of members, on the grounds that if many users have similar tastes, a new user is probably going to have tastes similar to the majority of them. The degree of recommendation is presented through adaptive hypermedia: the product that has the highest degree of interest for this user is marked as a "recommended" product, and one with a lower degree is marked as a "check this too" product.

The adaptive help module is responsible for the adaptive help responses. This module tries to identify mistakes in the navigational moves of every user and is based on the principle that many users tend to make similar navigational mistakes. Again, the k-means algorithm is used to group users, but in this case a different set of input data is used, consisting of the mistake degrees introduced in the section above:

Hard_to_see_Company = Mistakes_in_Companies_buttons / Times_in_Companies_page    (5)
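The grouping and representative-extraction steps described above can be sketched as follows (our illustration using a generic k-means implementation from scikit-learn; the paper does not publish code, and the number of clusters shown is a hypothetical choice):

import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer's vector of interest (or mistake) degrees in [0, 1].
X = np.random.rand(200, 12)  # placeholder; the real input comes from the statistics database

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# The representative of each customer group is its cluster centroid.
representatives = km.cluster_centers_

def representative_for(customer_vector):
    # A returning customer is matched to the representative of his/her cluster.
    return representatives[km.predict(customer_vector.reshape(1, -1))[0]]

# A brand-new customer is assigned to the most populated cluster, on the
# assumption that the majority's tastes are the safest default.
default_representative = representatives[np.bincount(km.labels_).argmax()]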
Mistakes are identified as different "wrong" navigational patterns. For example, a user may exhibit "confusion navigation", such as continuously visiting two neighboring production-company buttons; this pattern raises the possibility of a vision problem. Another example is "navigation without a purpose", which is a pattern of pressed buttons and clicked areas that leads nowhere; this pattern raises the possibility of confusion. Degrees are calculated as the percentage of specific mistakes committed on a specific page. For example, the degree of difficulty in seeing companies' buttons is calculated by equation (5). If a user makes many navigational mistakes, the system responds and tries to assist the user with help actions customized to his/her mistakes. These actions can vary widely. For example, "confusion navigation" results in actions such as automatically changing the size of brand-name buttons or phone links, which can yield a clearer presentation of the user interface. Another action taken by the system is changing the location of brand-name buttons in order to avoid confusion. Other actions involve speech synthesizers and agents pointing at the screen to help customers locate the components of the user interface. "Navigation without a purpose" can result in actions such as showing an options message and asking users directly what they want to do. An example of a wrong navigational pattern is the following: a user clicks on a company button, then clicks the adjacent company button, then clicks the previous company again, and then clicks the same adjacent company again. These four moves are interpreted by the system as a possible confusion between company buttons. Every time a customer uses the system, the adaptive help system finds his/her representative and responds with adaptive help actions. An example can be seen in Figure 3. In this particular example the system observes the user's navigational moves between two neighboring company buttons and counts his/her mistakes. Note that this particular customer makes no mistakes regarding the phone pictures, only the category buttons. At the first stage the system enhances the category buttons and enlarges them to make them clearer. If the user continues to repeat similar navigational mistakes, such as continuously clicking neighboring categories, the system changes the position of the neighboring categories on the screen. If the problem persists, the system enlarges only the "mistaken" category and increases the distance between this category and its neighboring ones.
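For example, the four-move confusion pattern just described could be detected with a check of the following kind (a simplified sketch of ours; the actual rule set of iTVMObi is not published in this form):

def is_confusion_pattern(clicks, neighbours):
    # Detect the A, B, A, B alternation on two adjacent company buttons.
    if len(clicks) < 4:
        return False
    a, b, c, d = clicks[-4:]
    return a == c and b == d and a != b and b in neighbours.get(a, ())

# Hypothetical layout: the "Nokia" and "LG" buttons are adjacent on screen.
neighbours = {"Nokia": ("LG",), "LG": ("Nokia",)}
print(is_confusion_pattern(["Nokia", "LG", "Nokia", "LG"], neighbours))  # True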
Fig. 3. Stage 1: Original size and position. Stage 2: Increased button size. Stage 3: Changed button position. Stage 4: Changed size of "mistaken" button and moved neighboring buttons away.
5 Discussing Evaluation Results

As a first step in our evaluation studies, we have evaluated the standalone version of the system that incorporates the user modeling and reasoning mechanisms. In order to evaluate iTVMObi we asked 50 men and women to use the system and then answer a 16-question questionnaire, so that the system's results could be compared with the answers given to the questionnaire. The group consisted of 25 men and 25 women; 20 of them had prior knowledge of computers and the rest had very little or no knowledge of computers. The questionnaire had three sections of questions. The first section consisted of general questions. The second section concerned tastes in phone technical features: battery, memory, display, etc. These questions correspond to the phone features measured by the system by observing users' navigational moves. We compared the answers of every user in the questionnaire with the degree of interest of the representative of the group that this user belonged to. The diagram in Figure 4 (left) compares users' interests extracted from the questionnaire with those generated automatically by iTVMObi. The diagram in Figure 4 (right) shows user mistake degrees in some sections.
[Figure 4 data. Left: bar chart comparing "Users opinion" with "iTVmobi predictions" (vertical axis from 0 to 0.25) over the features SIZE, DISPLAY, MEMORY, CONNECT, FEATURES and BATTERY. Right: pie chart of user mistake degrees: Hard to See Company 58%, Navigation 33%, Hard to See Phone 9%.]
Fig. 4. Left: Diagram showing user interest percentages from the questionnaire compared with those from iTVmobi. Right: Diagram showing user mistakes.
6 Conclusions and Future Work

In this paper we proposed a novel system, called iTVMObi, that combines two adaptive functions. Our system extracts user information from the users' usage behavior, and then exploits this information by creating user models based on clusters of similar users. This information concerns phone features and user mistakes. The customer's degree of interest in specific phone features and the mistake degrees are measured through weighted equations. The exploited information and user models help iTVMObi recommend phones to specific customers and help them use the system. Our evaluation showed that combining the weighted interest calculation and mistake degrees with the animated agent and adaptive hypermedia techniques motivated customers to use the system and buy phones. It also showed that iTVMObi predicted customer interests to a large extent and resolved mistakes most of the time.
References

1. Brusilovsky, P.: Adaptive Hypermedia. Journal of User Modeling and User-Adapted Interaction 11, 87–110 (2001)
2. Choi, S.H., Kang, S., Jeon, Y.J.: Personalized recommendation system based on product specification values. Journal of Expert Systems with Apps. 31, 607–616 (2006)
3. Fink, J., Kobsa, A.: Adaptable and Adaptive Information Provision for All Users, Including Disabled and Elderly People. Journal of the New Review of Adaptive Hypermedia and Multimedia 4, 163–188 (1998)
4. Goren-Bar, D., Glinansky, O.: FIT-recommending TV programs to family members. Journal of Computers and Graphics 28, 149–156 (2004)
5. Guan, S.-U., Chan, T.K., Zhu, F.: Evolutionary intelligent agents for e-commerce: Generic preference detection with feature analysis. Journal of Electronic Commerce Research and Applications 4, 377–394 (2005)
6. Iglezakis, D.: Adaptive Help for Web-based Applications. In: De Bra, P.M.E., Nejdl, W. (eds.) AH 2004. LNCS, vol. 3137, pp. 304–307. Springer, Heidelberg (2004)
7. Miao, C., Yang, Q., Fang, H., Goh, A.: A cognitive approach for agent-based personalized recommendation. Journal of Knowledge-Based Systems 20 (2007)
8. Muller, C., Rainer, W.R.: Adapting Multimodal Dialog for the Elderly. In: ABIS Workshop on Personalization for the Mobile World, pp. 31–34 (2002)
9. O'Sullivan, D., Smyth, B., Wilson, D.C., McDonald, K., Smeaton, A.: Improving the Quality of the Personalized Electronic Program Guide. Journal of User Modeling and User-Adapted Interaction 14, 5–36 (2004)
10. Savidis, A., Antona, M., Stephanidis, C.: Applying the Unified User Interface Design Method in Health Telematics. In: Stephanidis, C. (ed.) Universal Access in Health Telematics. LNCS, vol. 3041, pp. 115–140. Springer, Heidelberg (2005)
11. Zounemat-Kermani, M., Teshnehlab, M.: Using adaptive neuro-fuzzy inference system for hydrological time series prediction. Journal of Appl. Soft Comput. 8(2), 928–936 (2008)
12. Tokmakçi, M., Ünalan, D., Soyuer, F., Öztürk, A.: The reevaluate statistical results of quality of life in patients with cerebrovascular disease using adaptive network-based fuzzy inference system. Journal of Expert Syst. Appl. 34, 958–963 (2008)
13. García-Barrios, L.E., Speelman, E.N., Pimm, M.S.: An educational simulation tool for negotiating sustainable natural resource management strategies among stakeholders with conflicting interests. Journal of Ecological Modelling 210, 115–126 (2008)
Caring TV as a Service Design with and for Elderly People Katariina Raij and Paula Lehto Laurea University of Applied Sciences Metsänpojankuja 3, 02130 Espoo, Finland Tel.: +358 40 8306172/ + 358400541479 Fax: +358988687501
[email protected],
[email protected]
Abstract. The increasing number of elderly people and the decreasing number of employees challenge us to seek new solutions in the field of health care and social welfare. The changing structures of welfare organizations and service processes also demand new approaches in order to respond to future challenges. CaringTV aims to design a virtual, interactive service concept with and for elderly people in order to support their wellbeing and quality of life. In this article the concept of quality of life is defined and described as a client-driven method based on the findings of two research projects, Coping at Home and HOME, which in turn are based on the development of the Caring TV concept at the Well Life Center at Laurea University of Applied Sciences.
1 Introduction

In the next few decades the number of elderly people will increase rapidly. According to recent statistics, Finland will have over one million elderly people in 2020. This means that the need for elderly services will increase and, because the number of caregivers appears to be decreasing, real challenges lie ahead. It also means an increase in costs in the health and social sector. Innovative application of technology is needed in order to provide elderly people with high-quality services in the future. (1)

Caring TV is one of the promising innovations discovered and developed as a service design with and for elderly people by Laurea University of Applied Sciences, TDC Song, Videra Oy and Espoo City. Caring TV is a concept in which Laurea operates as a researcher, developer and content provider. The concept includes two research and development projects, Coping at Home and HOME, funded by the TEKES/FinnWell programme and the EU/InnoElli Senior programme, which aim to find new solutions for elderly people living at home and for municipalities dealing with current problems in health and social services. In the Coping at Home project our pilot group consists of 25 family caregivers living in Espoo, whereas in the HOME project there are 60 high-risk clients from Vantaa City and 40 elderly people using services delivered by special service houses in Espoo, Turku and Lappeenranta. The interest is in discovering new, technology-based solutions which support elderly people in staying at home and improve their quality of life by allowing them to have more control of their own lives.
Caring TV is a two-channel interactive TV system through which guidance and support services are given as various participative programmes to improve and promote the capacities of elderly people living at home. The content of the guidance and support services and the participative programmes is planned together with clients, according to their own expectations, and under the supervision of experts. In planning these services an elderly person is regarded as an active, participative partner and as a holistic being (2) with his or her own knowledge base, skills and abilities, values and experiences (3). We call this a client-driven model. In the first phase a municipality buys the TV channel for selected elderly people. In the second phase it will be offered to the private sector, which means that everyone living at home will be able to buy a product that includes both the technology and the content production (4). Laurea is responsible for the research and development of the Caring TV concept and for participative content production, while TDC Song and Videra Oy, as private companies, provide the technology, and the participating municipalities (e.g. Espoo and Vantaa) provide the guidance and support services. Other private companies are also included, e.g. Medixine Oy as the developer of tools for advanced e-services, and PhysioSporttis Oy and Lääkärikeskusyhtymä Oy as the developers of physiotherapeutic and medical services, as well as experts from the third sector.

The development of Caring TV opens new doors and gives us valuable knowledge on how to proceed towards the development of a virtual clinic. Caring TV also offers a learning environment for students, where they can achieve new competences by working together with educators, working-life experts and clients, according to the Learning by Developing action model developed at Laurea (5). Caring TV has also given us valuable knowledge of how to introduce a new technological innovation to an end user by proceeding from a user-centric to a user-driven action model. In this process, defining the concept of an environment as a physical, symbolic and social environment has helped (6). The development of Caring TV seems to open new doors by inviting new groups of people to participate, and it also opens doors for new business. It means integrating welfare expertise with technology, business, and research and development expertise.
2 Action Research Leading to the Client-Driven Methods of the Service Design

Action research is based on interventions. It can be seen as a practical, participative, reflective and social process, meant to study social reality in order to change it, and to change reality in order to study it again. Action research proceeds in cycles that follow each other, and in each cycle four phases can be identified: observing, reflecting, planning and implementing. A whole study usually consists of three implementation cycles (7). In action research the interest of knowledge is defined by the target of the research. Following the accounts of Habermas and of Heikkinen et al., efficiency and effectiveness are guided by a technical interest of knowledge, whereas the interpretation
of social action is related to a practice-hermeneutic interest of knowledge, and influencing the world is linked to an emancipatory interest of knowledge. According to these authors, action research can serve all three interests of knowledge, and all of them can be seen to be included in the development of the Caring TV concept. It serves the technical interest of knowledge in the development of the technology needed to deliver new, virtual services, which aim to add efficiency and new kinds of technically effective solutions. The practice-hermeneutic and emancipatory interests of knowledge are present in the content development of the virtual Caring TV services: interpretation of the social actions of participative clients is essential when developing new services together with clients, and influencing the development of new action models is one purpose of the Coping at Home and HOME studies. (8, 7)

In the Caring TV research projects the purpose is to enhance the quality of life of elderly people living at home. They also aim to find new solutions for municipalities which have to face the future challenges in health care and social services. Rapley concludes, in his book about quality of life and its research, that the concept of quality of life as a formally operationalized and measurable construct is very problematic, and that it rather offers a sensitizing concept for thinking through the delivery of services or ways of enhancing the liveability of human communities. As a sensitizing concept, subjective well-being and life satisfaction, and the subjective and objective descriptors identified in many studies (see e.g. Rapley 2003, 215), could have a special meaning if they are defined by the clients who need human services and the experts who are responsible for the delivery of services. All this led us, in our projects, to design and develop new services for elderly people together with them and with home care professionals as a focus group. The development process is based on clients' expectations, their views on how and what kind of services should be developed, and on how clients could be involved. Elderly people, as clients, with their own home environments, are seen to form a living lab where service design and development can take place in a client-driven way. Because of this, action research was chosen as the research method. Elderly people and focus groups consisting of experts have been interviewed three times from autumn 2005 to autumn 2007: at the beginning, in the middle of the process and at the end. All three cycles have consisted of the phases of observing, reflecting, planning and implementing, as mentioned above. (9)
3 The Main Findings of the Caring TV Concept

Caring TV has been developed as a service production with and for clients by integrating welfare, technology and service competences. In the beginning all the clients were interviewed, their expectations were identified, and participative, interactive programmes were planned and broadcast through Caring TV. In the programmes the clients also interact with each other. They were offered an opportunity to share their own experiences with each other, and this led to another, more active level: participating in the development of their own programmes for themselves.
[Figure 1 diagram. User groups: active, participating users; silent users; occasional users; and transferors. Caring TV services: interactive instruction and advice; personal instruction; expert services (home doctor, home physiotherapy, religion, heart association); Caring TV programmes with themes promoting safety, action competence and participation, and increasing the competency of family caregiving; and technological services (following coping at home, VAS pain line, functional assessment, sleep-awakening rhythm, blood pressure, blood sugar, weight, etc.).]
Fig. 1. Family care givers as Caring TV users and Caring TV as a service production
The first research project, Coping at Home, was carried out during the years 2005–2007. Its purpose was to develop new kinds of services for family caregivers (n=25) in a virtual, interactive TV environment. Based on Piirainen's research findings, different user groups were identified: active users, silent users, occasional users and transferors (Figure 1). (10, 11)

The second research project, Going Home, aims to develop a new kind of model of safe discharge from the hospital and virtual, participative services with and for elderly home care clients. Safe discharge is considered through the ongoing Going Home research project. The participants in the research are home care clients (N=60), over 65 years old, with a high risk of hospitalization. The challenge is to respond to issues such as loneliness and insecurity, self-care and monitoring, activities of daily living, and rehabilitation at home after discharge from hospital. The client-driven approach in CaringTV means, as mentioned above, that based on the findings, the contents of the programmes and the new virtual services are planned together with the clients and their significant others, and also with home care professionals. The voice of the client is seen as the most important evidence when planning and broadcasting programmes through CaringTV. The client-driven method means that the client is seen as an active participant and co-partner in planning, implementing and evaluating the broadcasting process of CaringTV.
[Figure 2 diagram. CaringTV supportive services at home: promoting safety and mental health; support for rehabilitation (e.g. physical exercises, breathing exercises, relaxation); promoting activities of daily living (e.g. nutrition, sleeping); supportive methods (e.g. consultation, monitoring); support in managing self-care (e.g. medication, pain); and activating situational support (e.g. peer group, discussions, participation).]
Fig. 2. The model of supportive methods through CaringTV
Based on the clients' and professionals' feedback, the programmes and supportive services delivered through CaringTV have been planned and renewed during the process. In order to manage at home after discharge, tailor-made, supportive and safe services are needed more and more. Based on the feedback from the ongoing research, including the programmes and broadcasting, the model of holistic client-driven services can be illustrated through the supportive methods (Figure 2). (12)
4 The Indicators of Quality of Life

Based on the material collected by interviewing our clients (N=85 x 3), it has also been possible to identify indicators of quality of life. Because the aim has been to improve quality of life, these indicators form a structure for planning future participative programmes. The Caring TV service concept has been divided into six categories, with subcategories, in relation to the indicators of quality of life: health, mental health, nutrition, activity, social support and habitat. These, in turn, have been divided into subcategories, as the following figure shows. Elderly people seem to be very interested in knowing more about their illnesses and they favour programmes in which they can consult physicians and other experts. They like to participate in morning gym sessions, they wish to have cooking advice meant for them, and they like to look back on their lives, sharing experiences. The indicators also include security, safety and action competences, as well as increasing possibilities for participation in social interaction. The indicators of quality of life have been validated by the clients (Figure 3).
[Figure 3 diagram. Indicators of quality of life and habilitation:
• Health: knowledge of illnesses, good sleep, right medication, assessment and control, hygiene
• Mental health: sense of belonging, absence of fear, mental stimulation, memory activation
• Nutrition: cooking skills, healthy food
• Activity: physical balance, physical workload, physical activation, right tools, empowerment
• Habitat: security, safety, obstacle-free
• Social support: availability of services, religious services, significant others, peer support, participation]

Fig. 3. Indicators of quality of life
5 Conclusions

The Caring TV concept has been presented as an environment in which two research projects have been carried out. In both of them the aim is to improve the quality of life of elderly people. Action research was chosen as the research method because the elderly people, as the clients, were supposed to be active developers in the projects. In the first project, virtual, interactive and participative programmes were designed with and for family caregivers. The ongoing Caring TV research project deals with supportive methods delivered through Caring TV with and for high-risk patients. The supportive methods have been developed with elderly people and home care professionals; new methods have emerged during the process and have been identified. The supportive methods are congruent with the content themes of the programmes sent through Caring TV. Finally, the indicators of quality of life, based on the conceptions of elderly people, have been described. They show us how an elderly person looks at his or her own life situation. They are much in line with the determinants of quality of life found by
Wilkinson in the European definitions. There are, however, some differences. In our projects elderly people have recognized nutrition as one of the indicators; according to them, they have met new challenges in buying food and preparing meals, in relation to their health status and perhaps also to their present income. Our clients have also described more details related to activity, and they like to emphasize the meaning of physical workload and physical balance in their daily activities. (13)

In our earlier writing, quality of life has been described in relation to the theory of the person existing as a holistic being. According to Rauhala, a human being exists as an organic, conscious, situational and spiritual being. Raij's view of seeing a person as a holistic being with his or her own knowledge base, skills and abilities, values and experiences was also added when developing our action research setting. (14, 15)

If we compare the determinants of quality of life presented by Wilkinson, the indicators of quality of life we have identified and the conception of a human being, some links can be seen. Existing as an organic and conscious being is related to physical and mental health and activity. Existing as a situational being, in turn, is related to social relationships, income, housing, technology applications and participation, and in our findings to habitat, nutrition and also activity. In the definitions there are, however, no determinants linked to a human being as a spiritual being, which we found to be one of the factors in elderly people's wellbeing.

The indicators presented here are useful in designing services with and for elderly people. In our case they are used as guides in the development of virtual services through CaringTV. They also offer a tool for municipalities in deciding what kinds of services could be delivered by means of CaringTV. The concept seems to be successful in improving the well-being of elderly people living independently at home.
References

[1] Kettunen, T.: Neuvontakeskustelu: tutkimus potilaan osallistumisesta ja sen tukemisesta sairaalan terveysneuvonnassa. Acta Universitatis Jyväskylä Sport and Health Sciences, Jyväskylä (2001)
[2] Rauhala, L.: Tajunnan itsepuolustus. Yliopistopaino, Helsinki (1995)
[3] Raij, K.: Osaamisen tuottaminen ammattikorkeakoulun päämääränä. In: Kotila, H. (ed.) Ammattikorkeakoulupedagogiikka, Edita, Helsinki (2003)
[4] Piirainen, A., Raij, K.: Coping at Home. Annals of the Kansei Fukushi Research Centre, Special Issue, Refurbishing the Elderly Care - Evidence and Theoretical Targets, 13–21 (2006)
[5] Raij, K.: Learning by Developing. Laurea Publication A*58. Edita, Helsinki (2007)
[6] Kim, H.S.: The Nature of Theoretical Thinking in Nursing. Appleton-Century-Crofts, Norwalk (1983)
[7] Heikkinen, H., Kontinen, H.P.: Toiminnan tutkimuksen suuntaukset. In: Heikkinen, H., Rovio, E., Syrjälä, L. (eds.) Toiminnasta Tietoon, Dark Oy, Vantaa (2006)
[8] Habermas, J.: Knowledge and Human Interests. Beacon, Boston (1973)
[9] Rapley, M.: Quality of Life Research. Sage Publications, London (2003)
[10] Piirainen, A., Sarekoski, I.: Coping at Home (2007), http://akseli.tekes.fi/opencms/opencms/OhjelmaPortaali/ohjelmat/FinnWell/fi/Dokumenttiarkisto/Tuloksia/Coping_at_Home_-hankkeen_raporttix1x.pdf
[11] Raij, K., Piirainen, A., Lehto, P.: Quality of Life. New Interpretation of Rehabilitation. In: The Third Sendai-Finland Seminar. Refurbishing the Elderly Care - New Health/Social Services and Network (2007)
[12] Lehto, P.: CaringTV as a Context for Learning Client-driven Services. In: The International Conference Learning by Developing - New Ways to Learn, Conference on Innovative Pedagogical Models in Higher Education, Laurea University of Applied Sciences (2008), http://www.laurea.fi/net/en/06_Laurea/01_Communications/Seminars/LbD_conference_Feb_2008/03_Workshops/Workshop_3_2/LbDconference_3.2Lehto.pdf
[13] Wilkinson, R.: Unhealthy Societies. Routledge, London (1999)
[14] Rauhala, L.: Ihmisen ykseys ja moninaisuus. SHKS, Helsinki (1989)
[15] Raij, K.: Toward a Profession. Clinical learning in a hospital environment as described by student nurses. An Academic Dissertation, University of Helsinki, Department of Education. Research report no. 166 (2000)
A Biosignal Classification Neural Modeling Methodology for Intelligent Hardware Construction

Anastasia Kastania1, Stelios Zimeras2, and Sophia Kossida1

1 Biomedical Research Foundation of the Academy of Athens, Greece
2 Department of Statistics and Actuarial-Financial Mathematics, University of the Aegean, Greece
[email protected],
[email protected],
[email protected]
Abstract. Herein, we investigate methodologies for the construction of intelligent hardware based on neural networks for biosignal classification. The full concept of machine implementation on a chip is also explored. Experimental results based on biosignals are produced using extensive training and evaluation of adaptive neural networks to extract efficient logic functions for hardware construction. The VHDL modeling methodology for the hardware implementation of the derived logic functions on a chip is also presented. In conclusion, we have succeeded in formulating a methodology to map neural networks onto hardware architectures.

Keywords: neural networks, classification, intelligent hardware construction.
1 Introduction

Adaptive logic networks (ALNs) use an evolutionary mechanism to grow neural networks that succeed as classifiers. Herein, the rules for designing intelligent engines based on a simulation performance analysis of ALN classifiers are investigated. The behavior of an ALN classification engine (I2CE) in different states is simulated and evaluated using experimental results. The focus is on revealing principles and concepts for the fast, efficient design of ALN-based neurochips. An ALN is an artificial neural system which uses only simple logical functions to build a decision tree architecture during training [1]. The knowledge is stored in the network architecture and in the functions of the system nodes. ALNs use hard limiters instead of sigmoids, and the hidden layers above the first one are all logic gates, AND or OR [2]. During training, the ALN applies least-squares fitting using many linear pieces at once; after training, a decision tree is created that partitions the input space into blocks, in each of which the function is represented by a simple expression containing a few linear pieces connected by MAXIMUM and MINIMUM operations, as described by Armstrong et al. [3]. A challenging biosignal engineering problem is the classification of patients with essential hypertension [4-6] into prognostic classes of physiological and pathological behavior on the basis of 24h biosignal Holter measurements. The integrated intelligent classification engine (I2CE) is defined [7] to be a system constructed with neural network data mining techniques, capable of recognizing and
classifying specific diagnostic output states. The basic entity of the I2CE system is a classification function, which can be produced [7] using an adaptive logic network (ALN) [1]. This function is capable of determining whether or not biosignal input and diagnostic output quantities [7] are in a certain relationship to each other. This investigation focuses on the simulation-based extensive validation of the combined biosignal engineering ALN architectures included in the I2CE system [7] specification, and aims at the identification of the combined ALN architecture that detects with 100% classification accuracy the hypertensive patients without target organ damage (prognostic prediction). The simulation-based performance analysis models of the I2CE system are also compared herein with experimental results obtained from a challenging biosignal engineering dataset.
2 Biosignal Quality Issues

Two hundred and thirty patients with untreated essential hypertension, as determined from casual Blood Pressure (BP) readings, were studied. All subjects were leading a normal, active life. They entered the study provided they met the following criteria: (a) Blood pressure readings by sphygmomanometer were above 140/90 mmHg. In each patient, BP was measured daily by the cuff method, using the 1st and 5th Korotkoff sounds to identify systolic and diastolic BP, respectively. Blood pressure readings were obtained from the right or left arm with the patients lying supine for at least 10 min. (b) They had not received any antihypertensive drug treatment at any time. (c) They had not been given any other kind of drug that might have affected their BP level for at least 3 weeks before entering the study. (d) Apart from the raised BP and its possible adverse effects on the cardiovascular system, such as left ventricular hypertrophy and vascular fundi changes, no patient suffered from arrhythmia, heart failure or any other cardiovascular disease. (e) Secondary hypertension had been excluded by a negative physical examination and by normal results of the following laboratory tests: renal function tests, serum electrolytes, catecholamine excretion, radionuclear studies of the kidneys and/or intravenous pyelography.

The two hundred and thirty patients [4] underwent 24-hour BP and Heart Rate (HR) ambulatory monitoring. A Spacelabs 90202 ABP monitor (Spacelabs Inc., Redmond, Wash.) was used for the 24-hour ambulatory monitoring (Table 1). Measurements of the hypertension waveform parameters were time-shifted to start at the same time for all patients. For each patient, 48 measurements per 24 hours were available, measured at 30 min intervals. The structural changes in the patients' hearts were assessed by means of echocardiographic, ECG and chest x-ray examinations. Echocardiographic measurements were made by M-mode echocardiography performed with a Sigma-1C echocardiograph (Kontron Instruments Inc., Everett, Mass.) with patients in the partial left lateral position (Table 2). Patients were characterized as pathologic in cases where PWD > 1.1, IVSD > 1.1 and LVmass > 110; otherwise they were characterized as physiologic.
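Stated as code, the clinical characterization rule above amounts to the following check (our paraphrase; the text does not state explicitly whether the three conditions are conjunctive, so a conjunction is assumed here):

def characterize(pwd, ivsd, lv_mass):
    # Pathologic if PWD > 1.1, IVSD > 1.1 and LVmass > 110 (conjunction assumed).
    return "pathologic" if (pwd > 1.1 and ivsd > 1.1 and lv_mass > 110) else "physiologic"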
Table 1. Ambulatory monitoring variables

Variable   Explanation
SBP24      mean 24h Systolic BP
DBP24      mean 24h Diastolic BP
HR24       mean 24h Heart Rate

Table 2. Echocardiographic left ventricular hypertrophy indices

Index      Explanation
IVSD       end-Diastolic IntraVentricular Septum thickness
PWD        left ventricular Posterior Wall Diastolic thickness
LVmass     Left Ventricular mass
3 Specification of the I2CE System

The training patterns of pathological and physiological patients consisted of 24-hour systolic arterial pressure (SAP), diastolic arterial pressure (DAP) and heart rate (HR) values, measured every 30 minutes. Each linear piece Li in the ALN architecture under exploitation accepts the input features of the essential hypertension waveform parameters as described by the following equation:

ALN-input-variables = (x1, x2, ..., x144, x145)    (1)
where the input variables represent the systolic arterial pressure input pattern (x1,...,x48), the diastolic arterial pressure input pattern (x49,...,x96), the heart rate pattern (x97,...,x144) and the clinical characterization as pathologic or physiologic (x145). Various ALN architectures were built on the basis of the combined patterns, to learn functions from the empirical data presented to them during training and testing for a specific number of epochs. Each ALN extracted after simultaneous training and testing for a specific number of epochs is a type of feedforward multilayer perceptron that uses linear threshold units in a first hidden layer, and logic gates AND and OR in the other hidden layers and in the output layer. The units form a tree with a single Boolean output. In the ALN approach used, I2CE learned [7] the linear piece functions from the available empirical data during training. For ALN training and testing we have used the ALNBench software [8].
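As an illustration of the general ALN idea (ours, following [1], not the actual I2CE code), a trained network of this kind can be evaluated recursively, with linear threshold units at the leaves, AND/OR gates above them, and one Boolean output:

def eval_aln(node, x):
    # Evaluate an ALN node on the input vector x (145 features in this application).
    if node["kind"] == "ltu":
        # Leaf: a linear threshold unit fires when its linear form is non-negative.
        return sum(w * xi for w, xi in zip(node["weights"], x)) + node["bias"] >= 0
    results = (eval_aln(child, x) for child in node["children"])
    return all(results) if node["kind"] == "and" else any(results)

# Tiny hypothetical two-leaf network: output = LTU1 AND LTU2.
net = {"kind": "and", "children": [
    {"kind": "ltu", "weights": [0.4, -0.2], "bias": 0.1},
    {"kind": "ltu", "weights": [-0.3, 0.5], "bias": 0.0}]}
print(eval_aln(net, [1.0, 0.8]))  # True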
4 Simulation Based Performance Analysis in the I2CE System

The quality metrics for accepting a neural classifier as reliable and efficient are the maximization of classification accuracy on unseen data sets and the minimization of the errors produced in the training, test and validation phases. Based on this
requirement, we have extracted the optimal DTREE architecture from the ALNBench platform [8]. For the VHDL implementation and simulation of the DTREE for the I2CE system we used VHDL Simili 3.1 Sonata Professional Edition [9] (Fig. 1).

Fig. 1. VHDL Simili 3.1 Sonata implementation of the I2CE system
5 Hardware Neural Network Building Block

Neural networks in hardware cover the areas of high performance, embedded applications and neuromorphic systems. Large networks, and also fast learning, impose requirements that will eventually demand hardware architectures [10]. There are various hardware implementations of neural networks involving FPGAs: neurocomputers, arithmetic precision studies, field programmable neural arrays, very large associative memories, neocognitrons, self-organizing feature maps, non-linear predictors and reconfigurable architectures [11]. ALNs represent real-valued continuous functions by means of piecewise linear approximants, and programmable hardware can be used to allow an efficient flow of control [12].

The basic components of an ALN are: (i) the Linear Threshold Units (LTUs), which are the leaves of the decision tree architecture, and (ii) the other interior nodes, which are AND and OR operators with two or more inputs [1].

Fig. 2. Micro-architecture for the calculation of the ALN Linear Threshold Units

Once the ALN is trained, several linear-form equations are produced. The form of these equations is given by the following equation:

LTU_j = Σ_{i=1}^{145} a_ij · (x_i − b_ij)    (2)
where the coefficients a_ij, b_ij are estimated during the training phase for each input measurement, with a 99% confidence interval. In order to map equation (2) to hardware building blocks, we developed and tested, in Quartus II 7.2 (32-bit) by Altera [13], a VHDL micro-architecture (Fig. 2) which performs addition, subtraction, multiplication and division based on the IEEE standard for Binary Floating-Point Arithmetic [14]. Our work on the calculation of the ALN Linear Threshold Units is based upon the adaptation and reuse of the Floating Point Unit Core [15,16]. The full hardware implementation of the DTREE as a neurochip also involves the implementation of two different hardware modules, named MIN and MAX respectively. We intend to develop VHDL building blocks for these modules and test them extensively with real biosignal data.
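Before committing the DTREE to silicon, its behavior can be prototyped in software; the sketch below (ours, with hypothetical coefficients) computes equation (2) for each linear piece and combines the pieces with the MIN and MAX operations that the planned hardware modules would implement:

def ltu(x, a, b):
    # Equation (2): LTU_j = sum over i of a_ij * (x_i - b_ij).
    return sum(ai * (xi - bi) for ai, xi, bi in zip(a, x, b))

def eval_dtree(node, x):
    # Leaves hold LTU coefficients; interior nodes are MIN/MAX operators.
    if node[0] == "leaf":
        _, a, b = node
        return ltu(x, a, b)
    op, children = node
    values = [eval_dtree(child, x) for child in children]
    return min(values) if op == "min" else max(values)

# Hypothetical tree: max(min(L1, L2), L3) over a two-feature input.
tree = ("max", [("min", [("leaf", [1.0, 0.5], [0.0, 0.0]),
                         ("leaf", [-0.5, 1.0], [0.1, 0.0])]),
                ("leaf", [0.2, 0.2], [0.0, 0.5])])
print(eval_dtree(tree, [0.3, 0.6]))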
6 Conclusions The proposed biosignal classification neural modeling methodology for intelligent hardware construction based on adaptive logic networks will permit the digital implementation of the I2CE engine as a neurochip to be embedded in biomedical engineering devices for 24-hour blood pressure and heart rate monitoring.
References

1. Armstrong, W.W., Thomas, M.M.: Adaptive logic networks. In: Fieseler, E., Beale, R. (eds.) Handbook of Neural Computation. Institute of Physics Publishing and Oxford University Press, USA (1996)
2. Armstrong, W.W., Godbout, G.: Properties of binary trees of flexible elements useful in pattern recognition. In: IEEE International Conf. on Cybernetics and Society, San Francisco, USA. IEEE Cat. No. 75, CHO 997-7 SMC, pp. 447–449 (1975)
3. Armstrong, W.W., Stein, R.B., Kostov, A., Thomas, M., Baudin, P., Gervais, P., Popovic, D.: Application of adaptive logic networks and dynamics to study and control of human movement. In: 2nd International Symposium on 3D Analysis of Human Movement, Poitiers, pp. 81–84 (1993)
4. Guidelines Subcommittee, World Health Organization, International Society of Hypertension guidelines for the management of hypertension. J. Hypertens. 17, 151–183 (1999)
5. Perera, G.: Hypertensive vascular disease: description and natural history. J. Chronic Dis. 1, 33–42 (1955)
6. Swales, J.D.: Evidence-based medicine and hypertension. J. Hypertens. 17, 1511–1516 (1999)
7. Kastania, A.N., Bekakos, M.P.: An Integrated Intelligent Classification Engine for Biosignal Engineering. In: Zhang, D., Pal, S.K. (eds.) Neural Networks and Systolic Array Design, Series in Machine Perception and Artificial Intelligence, vol. 49, pp. 279–300. World Scientific, Singapore (2002)
8. Dendronic Decisions Limited, http://www.dendronic.com
9. VHDL Simili 3.1 Sonata Professional Edition by Symphony EDA, http://www.symphonyeda.com/
10. Lindsey, C.S.: Neural Networks in Hardware: Architectures, Products and Applications, http://www.particle.kth.se/~lindsey/HardwareNNWCourse/home.html
11. Omondi, A.R., Rajapakse, J. (eds.): FPGA Implementations of Neural Networks. Springer, Heidelberg (2006)
12. Armstrong, W.W.: Hardware requirements for fast evaluation of functions learned by Adaptive Logic Networks. In: Higuchi, T., Iwata, M., Weixin, L. (eds.) ICES 1996. LNCS, vol. 1259, pp. 17–22. Springer, Heidelberg (1997)
13. ALTERA, http://www.altera.com/
14. IEEE, IEEE-754-1985 Standard for binary floating-point arithmetic (1985)
15. Usselman, R.: Documentation for Floating Point Unit, http://www.opencores.org
16. Yalamanchi, M.S., Koltur, R.: Single Precision Floating Point Unit, http://web.cecs.pdx.edu/~mperkows/CLASS_VHDL/VHDL_CLASS_2001/Floating/projreport.html
Virtual Intelligent Agents to Train Abilities of Diagnosis in Psychology and Psychiatry José Gutiérrez-Maldonado, Ivan Alsina-Jurnet, María Virginia Rangel-Gómez, Angel Aguilar-Alonso, Adolfo José Jarne-Esparcia, Antonio Andrés-Pueyo, and Antoni Talarn-Caparrós Facultad de Psicología, Campus Vall d’Hebrón, Passeig de la Vall d’Hebrón, 171, 08035 Barcelona, Spain {josegutierrezm,ivanalsina,mrangego7,aaguilarpsi, ajarne,andrespueyo,atalarn}@ub.edu
Abstract. The diagnostic interview in the mental health sciences involves a series of abilities that require sound training. This training should be provided under the guidance of a professor, in controlled settings that mimic real-life situations as closely as possible, but in the initial stages interaction with real patients should be avoided. The objective of this study was precisely to develop a system, constructed with artificial intelligence and 3D design applications, that creates an environment in which the trainee can interact with a group of simulated patients. These virtual patients are realistic objects that can interact in real time with the user by means of a series of parameters that define their verbal, emotional and motor responses. From them the trainee must obtain the data needed to make an accurate diagnosis. The high level of flexibility and interactivity increases the trainee's sensation of participating in the simulated situation, leading to an improvement in the learning of the required skills.

Keywords: Virtual Reality, Artificial Intelligence, Training, Diagnostic Interview, Psychology, Psychiatry.
1 Introduction

Since the development of electronic methods of communication, healthcare professionals have used information and communication technologies (ICTs) in the field of healthcare: the telegraph, the telephone, the radio, the television, etc. have been used in medicine since the mid-nineteenth century. In recent years, however, great advances in the development of new technologies have taken place. These advances are changing the ways in which people relate, communicate and live. Thus, technologies that were hardly used 10 years ago, such as the internet, e-mail or video teleconferencing, are becoming familiar methods for diagnosis, therapy, education and training. All this is leading to the appearance of a new field, e-health, whose main objective is the use of ICTs to improve all the processes related to healthcare. Recent advances in educational technology are offering an increasing number of innovative learning tools that are having a significant impact on the structure of healthcare professionals' education in many ways. Among these, Virtual Reality (VR)
and Artificial Intelligence (AI) are nowadays gaining importance in education and professional training. Virtual Reality integrates real-time computer graphics, body tracking devices, visual displays and other sensory inputs to immerse individuals in computer-generated virtual environments (1). From this definition the two basic properties of a virtual reality system can be derived: immersion and interaction. The term immersion refers to the stimulation of the user's different sensory channels, usually achieved by means of visual, auditory or haptic devices. But virtual reality is also interactive: it does not imply a passive visualization of a virtual world; the user can interact with it and, more importantly, the virtual world responds in real time to those actions. Through these two basic properties, VR creates in the user the illusion of being physically inside the virtual world. Precisely this sense of presence can have positive effects on task performance (figure 1).
Fig. 1. Properties of a Virtual Reality system
On the other hand, Artificial Intelligence permits the design of intelligent agents that can engage in natural dialogue with humans (2), understanding the motives and emotional states of the user. One of the most striking features of this human-computer interaction is that, when anthropomorphic characters are used, people readily attribute human qualities to the machine (such as personality traits, moods, etc.) (3). In the field of mental health care, in particular, it is essential for good human-computer interaction that an intelligent machine can display emotions. Using AI, healthcare students can engage in a natural interaction with a virtual patient (which should display a wide range of moods), understanding more deeply its beliefs, motives and emotions, within a realistic setting in which they can learn specific skills such as those involved in the diagnostic interview. This should make it easier to generalize this learning to real-world situations. In summary, VR and AI make it possible to experience the learning situation as a real context, which in turn promotes experiential learning. As suggested by several authors (4, 5, 6), these new technologies represent a promising area with high potential for enhancing and
modifying the learning experience. It is important to take into account the advantages of these tools' characteristics (7):

• Immateriality, because their raw material is information.
• Interactivity, which allows subjects to be not passive receivers of information but active and conscious processors of it.
• High standards of image and sound quality.
• Instantaneity, because they facilitate rapid access to and interchange of information, breaking spatio-temporal barriers.
• A greater influence on processes than on products.
• The possibility of interconnection.

A growing body of research in education demonstrates that the use of e-health tools can enhance learning in numerous domains (8). These emerging technologies can provide a rich, interactive, engaging educational context, supporting experiential learning. VR and AI allow the student to learn by doing, through first-person experience (9, 10). This active process enhances learning, since students seem to assimilate concepts more accurately when they have the freedom to navigate and engage in self-directed activities within their learning context. These tools also provide high similarity with real-life situations without exposing the student to situations for which s/he is not yet prepared. Furthermore, in the case of mental health care, it must be noted that practice with real patients during the studies is very difficult, so a good alternative is to train students with simulated patients, which comes closer to reality than traditional methods (such as textbooks). Knowledge obtained by training abilities by means of e-health tools has to be generalized to real situations in order to be successful. As pointed out by Thorndike (11), it is reasonable to think that the more similar to reality the simulation is, the higher the probability of transferring the knowledge acquired to the real situation. Another important characteristic to highlight is the possibility of self-learning and over-learning provided by these tools, since students can repeat the situation as many times as they want. It is also an activity almost totally guided by the student, which promotes the development of operational and formal thinking, because it facilitates the exploration of different possibilities. This kind of educational method also adapts to the student's pace, timetable and needs. These tools also make it possible to grade the difficulty of the problems to be solved, facilitating learning by bringing subjects progressively closer to the solution. From a constructivist perspective it is assumed that students are not only active processors of information, but also significant constructors of it. This medium allows students to advance in the acquisition of knowledge at their own rhythm, according to their previous knowledge and attitudes. VR and AI provide a tool for developing instruction along constructivist lines and an environment in which learners can actively pursue their knowledge needs. As pointed out by McGuire (12), this active learning process allows the user to understand the world through an "ongoing process of making sense out of new information by creating their own version of reality instead of simply receiving the author's view". Besides these advantages, it must be noted that any new method of education automatically implies an increase in students' attention towards it. This higher
motivation has positive effects on the concentration, interest, and effort employed by the student (13).

1.1 Simulations and Healthcare

One of the applications of these new tools is the possibility to create simulations (14). As commented before, these simulations facilitate practice in environments that are easy for professors and students to control. They provide the opportunity of making first-person, non-symbolic experiences, since immersive environments allow knowledge to be constructed from direct experience by giving participants the illusion of "being there" in a mediated environment. VR and AI technology provides learners with the possibility to reflect and to gain a deeper understanding of the process through which a person can reach knowledge of the world. Furthermore, these flexible and open tools can be used by professors in different contexts and designed learning situations. Today, in healthcare, virtual simulations are frequently used for professional training of many kinds (15). Current applications are mainly related to medical and surgical training, such as simulators for temporal bone dissection (16), a virtual endoscopy simulator (17), a simulator for training esophageal intubation (18), orthopaedic surgery (19), mastoidectomy simulation (20), VR training and assessment of laparoscopic skills (21), and so on. The objective of these applications is to train a single set of skills within a simulation that is highly realistic and anatomically correct. Despite this, surprisingly, to date only one application exists for training mental health professionals. This application, called "The Bus Ride", was presented by Janssen Pharmaceutica Products LP at the 155th Annual Meeting of the American Psychiatric Association. The realistic virtual-reality experience is directed at educating healthcare professionals about the symptoms experienced by a psychotic patient: it puts learners on a city bus and surrounds them with the same visual and auditory hallucinations that a patient with schizophrenia experiences. Thus, due to the lack of applications directed at training mental health professionals, the objective of this study was to develop an application with the aim of training healthcare professionals in the skills implicated in the clinical diagnostic interview.
2 Materials and Methods

2.1 Instruments

2.1.1 Software

To develop the virtual environments, tools of the following kinds were used:
• Modelling and animation tools: the scenario, virtual elements and animated 3D objects were constructed with 3D Studio Max 6. The People Putty program was used to design and animate the virtual characters. Adobe Photoshop 6.0 was used to create the textures and images.
• Interactive development applications: Virtools Dev 2.5 Educational Version was used to combine the objects and characters created with the different graphic design tools, and to integrate them with textures and sound. It was also used to make the environments interactive and to facilitate browsing.
• Artificial Intelligence: to create the interactive agent, the Vhuman program and the Artificial Intelligence Markup Language (AIML) were used.
2.1.2 Hardware

These computer interviews can be presented on a computer screen, on a more immersive system comprising a Head Mounted Display (HMD) plus tracking devices, or on a projection screen (figure 2). It must be noted that several studies have reported improved learning with more immersive systems such as HMDs or projection screens.
Fig. 2. View of the HMD and the projection screen
2.2 Procedure

2.2.1 Linguistic Corpus

The first stage of the development involved the compilation of a linguistic corpus from which the contents of the simulated interviews were later extracted (that is, the virtual patients' questions and answers). The DSM-IV-TR diagnostic trees (23) were used as basic sources of information. A linguistic corpus was produced that corresponded to the main diagnostic groups on axes I and II of the APA classificatory system. In the following stage the corpus was used to generate the agents that would simulate the answers of patients corresponding to different, specific diagnostic categories from each of the main diagnostic groups: anxiety disorders, psychotic disorders, mood disorders and personality disorders (table 1).

2.2.2 Simulated Interviews

A virtual office was created in which the learner can conduct a clinical interview with different virtual patients via videoconference (figure 3).
Table 1. Main diagnostic groups and specific disorders represented in the simulated interviews

Diagnostic Group | Specific Disorders
Psychotic disorders | Schizophrenia; Schizoaffective disorder; Schizophreniform disorder; Delusional disorder
Mood disorders | Bipolar I disorder; Bipolar II disorder; Cyclothymic disorder
Anxiety disorders | Generalized anxiety disorder; Obsessive compulsive disorder
Eating disorders | Bulimia Nervosa
Somatoform disorders | Hypochondriasis
Personality disorders | Borderline personality disorder 1; Borderline personality disorder 2
Fig. 3. View of the virtual office
Fig. 4. View of the chat bot
From the linguistic corpus, 13 virtual reality patients were represented. Furthermore, a chat bot was created; this robot can respond in real-time to a wide range of student questions (figure 4). As commented before, each of these patients has a specific mental disorder that corresponds to one of the most prevalent diagnostic groups. Skills of
psychopathological examination are taught via a series of diagnostic interviews conducted with these virtual patients.
3 Results

On entering "Simulated Interviews 3.1" the user enters a virtual office in which she can move around freely. When the student wants to start the clinical interview, she can sit in front of two screens and conduct a videoconference. On the left screen appears the virtual patient, whereas on the right screen appear the questions, the diagnostic hypotheses and, in general, all the feedback that the system provides. It is important to note that the clinical interview is the starting point of the psychopathological exploration, and one of its principal goals is to generate diagnostic hypotheses. From these hypotheses the psychologist starts an investigation process to corroborate or refute them by applying tests and designing specific strategies to obtain information in each case. Precisely, the objective of the simulated interviews is to obtain enough data to formulate a diagnosis. To do so, the student selects the most suitable question at each stage of the interview; then the system informs him/her how accurate the choice is, and the virtual patient responds to his/her questions. Each list of alternative questions contains a button called "HYPOTHESIS"; when the student presses it a list of possible diagnoses appears, from which s/he selects the one s/he considers best for the case in question. The student decides at each stage whether to continue asking questions or whether s/he has enough information to formulate a diagnostic hypothesis. If s/he selects the correct diagnosis at any given time during the interview, the system will only accept it if the patient has been adequately examined. Once the student gives the correct diagnosis, at a suitable moment of the interview, s/he will be able to formulate a prognosis. A bot named Alex was also created. This virtual agent was an extensive modification of the ALICE bot (http://www.alicebot.org) developed by Richard Wallace. This bot consists of a set of AIML content files. The original ALICE bot was a female robot, so we edited the AIML files in order to change the bot's personal characteristics to those of a patient with a Generalized Anxiety Disorder. This was accomplished by deleting all references to being a robot; instead, Alex was programmed to simulate a real patient. In this way, healthcare students can freely interact with this bot in order to investigate his mental disorder.
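To give a flavour of the mechanism, the fragment below is a hypothetical AIML category of the kind that such an edit produces: the bot replies in the persona of an anxious patient instead of admitting that it is a program. The pattern and the wording of the reply are illustrative assumptions, not text taken from the actual Alex files.

<!-- Hypothetical category: the robot persona is replaced by a patient persona -->
<category>
  <pattern>ARE YOU A ROBOT</pattern>
  <template>No... I am just someone who worries about everything, all the
  time. Lately I can barely sleep or concentrate.</template>
</category>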
4 Future Projects

In the future we will compare the efficacy of virtual reality as an educational tool in psychology students. In this study we will evaluate the students' acceptance of this resource by measuring its usability and utility. Moreover, we will compare subjects' performance on a test designed to measure the acquired knowledge and abilities related to differential psychopathology in two groups: virtual reality, and a more traditional approach based on role-playing techniques.
In the future version “Simulated Interviews 4.0” we will also integrate both virtual reality and artificial intelligence in a single application.
References

1. Emmelkamp, P.M.G.: Technological innovations in clinical assessment and psychotherapy. Psychotherapy & Psychosomatics 74, 336–343 (2005)
2. Knott, A., Vlugter, P.: Multi-agent human-machine dialogue: issues in dialogue management and referring expression semantics. Artificial Intelligence 172, 69–102 (2008)
3. Lee, K.M., Nass, C.: The multiple source effect and synthesized speech: doubly disembodied language as a conceptual framework. Human Communication Research 30, 182–207 (2004)
4. Winn, W.: A Conceptual Basis for Educational Applications of Virtual Reality. Technical Report TR-93-9 (1993)
5. Roussos, M., Johnson, A., Moher, T., Leigh, J., Vasilakis, C., Barnes, C.: Learning and building together in an immersive virtual world. Presence 8, 247–263 (1999)
6. Stansfield, S., Shawver, D., Sobel, A., Prasad, M., Tapia, L.: Design and Implementation of a Virtual Reality System and Its Application to Training Medical First Responders. Presence 9, 524–556 (2000)
7. Cabero, J., Barroso, J.: En el umbral del 2000. Formación ocupacional y nuevas tecnologías de la información: encuentros y desencuentros. In: Bermejo, B., et al. (eds.) Formación ocupacional. Perspectivas de futuro inmediato, pp. 245–261. GID-FETE, Sevilla (1996)
8. Lillehaug, S.-I., Lajoie, S.: AI in medical education: another grand challenge for medical informatics. Artificial Intelligence in Medicine 12(3), 197–225 (1998)
9. Bruner, J.: Towards a theory of instruction. WW Norton, New York (1966)
10. Moreno, R., Mayer, R.: Learning Science in Virtual Reality Multimedia Environments: Role of Methods and Media. Journal of Educational Psychology 94, 598–610 (2002)
11. Thorndike, E.L.: Human learning. Appleton-Century (1931)
12. McGuire, E.G.: Knowledge representation and construction in hypermedia environments. Telematics and Informatics 13, 251–260 (1996)
13. Campbell, B., Collins, P., Hadaway, H., Hedley, N., Stoermer, M.: Web3D in Ocean Science Learning Environments: Virtual Big Beef Creek. In: Web3D 2002, pp. 24–28 (2002)
14. Gutiérrez, J.: Aprendizaje asistido por ordenador a través de Internet. In: Gutiérrez, J., Andrés, A., Bados, A., Jarne, A. (eds.) Psicología Hoy, pp. 103–126. Centro Asociado de Tortosa, Universidad Nacional de Educación a Distancia (1998)
15. Mantovani, F., Castelnuovo, G., Gaggioli, A., Riva, G.: Virtual reality training for healthcare professionals. CyberPsychology & Behavior 6(4), 389–395 (2003)
16. Kuppersmith, R., Johnston, R., Moreau, D., Loftin, R., Jenkins, H.: Building a Virtual Reality Temporal Bone Dissection Simulator. In: Proc. Medicine Meets Virtual Reality, pp. 180–186 (1997)
17. Robb, R.: Virtual endoscopy: evaluation using the visible human datasets and comparison with real endoscopy in patients. In: Medicine Meets Virtual Reality / Studies in Health Technology and Informatics (1997)
18. Kesawadas, T., Joshi, D., Mayrose, J., Chugh, K.: A virtual environment for esophageal intubation training. In: Westwood, J.D., Hoffman, H.M., Robb, R.A., et al. (eds.) Medicine Meets Virtual Reality 02/10, pp. 221–227. IOS Press, Amsterdam (2002)
19. Tsai, M.D., Hsieh, M.S., Jou, S.B.: Virtual reality orthopedic surgery simulator. Computers in Biology and Medicine 31, 333–351 (2001)
20. Agus, M., Giachetti, A., Gobbetti, E., Zanetti, G., Zorcolo, A., John, N., Stone, R.: Mastoidectomy simulation with combined visual and haptic feedback. In: Westwood, J.D., Hoffman, H.M., Robb, R.A., et al. (eds.) Medicine Meets Virtual Reality 02/10, pp. 17–23. IOS Press, Amsterdam (2002)
21. Rolfsson, G., Nordgren, A.S.B., Bindzau, S., Hagström, J.P., McLaughlin, J., Thurfjell, L.: Training and assessment of laparoscopic skills using a haptic simulator. In: Westwood, J.D., Hoffman, H.M., Robb, R.A., et al. (eds.) Medicine Meets Virtual Reality 02/10, pp. 409–411. IOS Press, Amsterdam (2002)
22. Robb, R.: Virtual endoscopy: evaluation using the visible human datasets and comparison with real endoscopy in patients. In: Medicine Meets Virtual Reality / Studies in Health Technology and Informatics (1997)
23. American Psychiatric Association: Diagnostic and statistical manual of mental disorders (Text revision). American Psychiatric Association, Washington, DC (2002)
The Role of Neural Networks in Biosignals Classification

Stelios Zimeras1 and Anastasia Kastania2

1 Department of Statistics and Actuarial-Financial Mathematics, University of Aegean, Greece
2 Biomedical Research Foundation of the Academy of Athens, Greece
[email protected], [email protected]
Abstract. Neural networks (NNs) are regularly employed in biosignal processing because of their effectiveness as pattern classifiers. This study presents an overview of the application of neural networks in the field of biosignal classification (especially in anomaly detection problems) and, in addition, presents results of adaptations of conventional neural classifiers. Statistical techniques based on pattern recognition analysis (such as Principal Components Analysis and clustering) might be used to evaluate the proposed methodology. Finally, we illustrate the advantages and drawbacks of neural systems in biosignal analysis and catch a glimpse of forthcoming developments in machine learning models for the real clinical environment.

Keywords: neural networks, biosignal, statistical measure, clustering.
1 Introduction

Biosignal processing and analysis is a field of great importance in current medical practice. In recent years, biomedical engineers have developed many algorithms and processing techniques in order to help doctors in the examination of many different biosignals, and to find new information embedded in them that is not easily observable in the raw data. The most widely used biosignal is the electrocardiogram (ECG) [1]. Traditional use of the ECG focused on the identification of anomalies in the QRS complex associated with each heartbeat: each beat is compared to the normal pattern. In order to develop a comprehensive decision model, methods are needed to combine the results. Neural networks provide an effective framework not only for combining multiple signal analysis results but also for including other clinical information [2].
2 ECG Methodology

The ECG is a bioelectric signal that records the heart's electrical activity versus time; it is therefore important for diagnosing heart function. The electrical current due to the depolarization of the sinoatrial node stimulates the surrounding myocardium and spreads into the heart tissues. By applying electrodes on the skin at selected points, the electrical potential generated by this current can be recorded as an ECG signal. In some specific cases, sophisticated ECG analyzers achieve higher accuracy than doctors, but there are various groups of ECG signals that are too difficult for computers to identify.
The ECG is a graphical representation of the electrical activity of the heart's conduction system recorded over a period of time. Under normal conditions, ECG tracings have a very predictable direction, duration and amplitude. Because of this, components of the ECG tracing can be identified, assessed and interpreted as normal or abnormal function [3, 4]. The ECG signal usually displays a series of temporary changes in the ST-segment and T wave. These signal parts can be observed after detecting the R peak, which is the most striking point in the ECG (Fig. 1). Before R-peak detection, the raw ECG data is preprocessed to reduce artifacts: a 50 Hz notch filter is used to eliminate the power-line interference, and a band-pass filter (1–100 Hz) is employed to limit the bandwidth of the ECG signal. The ECG features can be extracted in the time domain, in the frequency domain, by multi-scale decomposition, or represented as statistical measures [5].
Fig. 1. Electrocardiogram
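The preprocessing chain just described is straightforward to prototype. The following Python sketch, built on SciPy, applies the 50 Hz notch and the 1–100 Hz band-pass filter and then locates candidate R peaks; the sampling rate, filter order and detection thresholds are illustrative assumptions, not values taken from this study.

import numpy as np
from scipy import signal

def preprocess_ecg(ecg, fs=500.0):
    # 50 Hz notch filter to suppress the power-line interference
    b_n, a_n = signal.iirnotch(50.0, Q=30.0, fs=fs)
    ecg = signal.filtfilt(b_n, a_n, ecg)
    # 1-100 Hz band-pass filter to limit the ECG bandwidth
    b_bp, a_bp = signal.butter(4, [1.0, 100.0], btype="bandpass", fs=fs)
    return signal.filtfilt(b_bp, a_bp, ecg)

def detect_r_peaks(ecg, fs=500.0):
    # R peaks: prominent maxima separated by a 200 ms refractory period
    peaks, _ = signal.find_peaks(ecg, height=0.6 * np.max(ecg),
                                 distance=int(0.2 * fs))
    return peaks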
3 Neural Networks in Biosignal Analysis

A neural network is a powerful tool for modeling complex data and the relationships within them based on statistical techniques. It contains no domain knowledge in the beginning, but it can be trained to make decisions by mapping pairs of input data into output vectors, adjusting its weights so that each input exemplar vector is mapped approximately into the corresponding output vector [1]. A knowledge base pertaining to the internal representations (i.e. the weight values) is automatically constructed from the data presented to train the network, where a learning algorithm is used to modify the knowledge base from a set of given representative cases. Formulas relating the independent variables (i.e. input variables) to the dependent variables (i.e. output variables) need not be imposed in the neural network model, and rules with logical conditions need not be built by developers, since neural networks investigate the empirical distribution among the variables and determine the weight values of the trained network. A neural network is an appropriate method when
it is difficult to define the rules clearly, as is the case in misuse detection or anomaly detection. In order to measure the performance of an intrusion detection system, two types of rates are identified, the false positive rate and the true positive rate (detection rate), according to the threshold value of the neural network. The system reaches its best performance for a high detection rate and a low false positive rate; a good detection system must establish a compromise between the two. There are various proposed architectures, such as the perceptron [6], back propagation (BP) [6], perceptron-back propagation [7], Fuzzy ARTMAP [8] and radial basis function (RBF) networks [6]. A common neural network model is the multilayer perceptron [9]. This type of model is known as a supervised network because it requires a desired output in order to learn. The main task is to build a correct input-output model based on historical data, so that results can then be produced for new cases where the output is not known. The proposed flow diagram of the process is given by the following diagram (Fig. 2):
Fig. 2. Neural network approach
The particular neural net for anomaly detection uses radial basis functions to form a statistical model of the dependent variables' values. As new data enter the system they are compared with this model; if the data do not fit the model they are flagged as anomalous, otherwise they are considered nominal. The radial basis function network is essentially a nearest-neighbor classifier. A generic form of a neural network intrusion detector is presented in Fig. 3.
Fig. 3. Standard RBF Neural Net Architecture
The system uses the input labeled data (normal and attack samples) to train a neural network model. The resulting model is then applied to new samples of the testing data to determine the corresponding class of each one, and so to detect the existing attacks. Using the label information of the testing data, the system can compute the detection performance measures given by the false alarm rate and the detection rate. A classification rate can also be computed if the system is designed to perform multi-classification of attacks [10].
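As a concrete illustration of this scheme, the following minimal Python sketch models the "normal" training data with Gaussian basis units and flags a new sample when no unit responds strongly to it. The number of units, the width sigma and the decision threshold are illustrative assumptions, since the text does not fix them.

import numpy as np

class RBFAnomalyDetector:
    def __init__(self, normal_data, n_units=20, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(normal_data), size=n_units, replace=False)
        self.centers = normal_data[idx]   # (k, d) basis-unit centres
        self.sigma = sigma                # common Gaussian width

    def activations(self, X):
        # squared Euclidean distance of every sample to every centre
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def is_anomaly(self, X, threshold=0.1):
        # anomalous if even the best-matching basis unit is a poor match
        return self.activations(X).max(axis=1) < threshold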
4 Anomaly Detection

Abrupt changes often contain critically important information from various perspectives, and hence the problem of discovering time points where changes occur, called change points, has received much attention in statistics and data mining [11, 12]. The nearest-neighbor technique can be applied, where input data are checked against a radial basis function; if the input data do not satisfy the statistical conditions of the proposed function, an anomaly is detected. For change-point detection in time series, there are a number of useful methodologies for multivariate data [13, 14]. For example, Yamanishi and Takeuchi propose a framework in which an autoregressive (AR) model is learned recursively, thereby handling the nonstationarity of the time series [15, 16]. Also, change-point detection algorithms based on singular-spectrum analysis (SSA), a classical nonparametric methodology for time-series analysis used mainly in signal processing, have been proposed [16, 17]. A simple method is analyzed in [18], where the authors introduce an evaluation function of a sequence of events together with predefined thresholds: if the value of the function is greater than the threshold, the point is defined as normal, otherwise as an anomaly. Brotherton and Johnson [10] propose an RBF neural network approach using hidden-layer basis unit functions for the input data. Clustering of the data can be performed with a linear vector quantization algorithm. For anomaly detection, the k-means algorithm is proposed as the appropriate method for classifying the output data. For the estimation of the RBF neural nets, the Gaussian or the Rayleigh function shape is proposed, depending on the magnitude of the data.
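A drastically simplified version of the recursive AR idea can be sketched as follows: fit a one-step AR(1) predictor on a sliding window and score each point by its squared prediction residual, flagging points whose score exceeds a threshold. The window length and the AR order are illustrative choices, not those of the cited frameworks.

import numpy as np

def ar1_anomaly_scores(x, window=200):
    scores = np.zeros(len(x))
    for t in range(window, len(x)):
        past = x[t - window:t]
        # least-squares AR(1) coefficient fitted on the recent window
        a = np.dot(past[1:], past[:-1]) / (np.dot(past[:-1], past[:-1]) + 1e-12)
        scores[t] = (x[t] - a * x[t - 1]) ** 2   # squared one-step residual
    return scores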
5 Data Clustering and k-Means Method

Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data. A clustering algorithm attempts to find natural groups of components (or data) based on some similarity, and it also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids together with the number of components in each cluster. The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster. The k-means algorithm used in this work is one of the most common non-hierarchical methods for data clustering, and one of the simplest unsupervised learning algorithms that solve the clustering problem. The goal is to partition a set of n data points into k clusters in such a way that the total distance of the points to the centroids of their clusters is minimized, based on a metric (such as the Euclidean or nearest-neighborhood distance). The initial centroids should be placed as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the k new centroids of the clusters resulting from the previous step are re-calculated, and a new binding is made between the same data set points and the nearest new centroid. As a result of this loop, the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, the algorithm aims at minimizing an objective function, in this case a squared error function:

$\mathrm{Error} = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i - c_j \right\|^{2}$  (1)
where $x_i$ is the i-th data point and $c_j$ is the geometric centroid of cluster j. Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the global minimum. The algorithm is also significantly sensitive to the initial, randomly selected cluster centers; it can be run multiple times to reduce this effect. Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion, but care is needed, because increasing k results in smaller error function values by definition, but also in an increasing risk of over-fitting.
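The procedure described above corresponds to Lloyd's algorithm. A compact Python version with random restarts, which mitigates the sensitivity to the initial centers noted above, might look as follows; the restart count, iteration cap and seed are illustrative defaults.

import numpy as np

def kmeans(X, k, n_restarts=10, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    best_err, best = np.inf, None
    for _ in range(n_restarts):
        # initialise centroids on k distinct, randomly chosen data points
        c = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(max_iter):
            # assignment step: bind each point to its nearest centroid
            d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: move each centroid to the mean of its cluster
            new_c = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else c[j]
                              for j in range(k)])
            if np.allclose(new_c, c):   # centroids stopped moving
                break
            c = new_c
        err = ((X - c[labels]) ** 2).sum()   # objective of Eq. (1)
        if err < best_err:
            best_err, best = err, (c, labels)
    return best[0], best[1], best_err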
6 Conclusion

Neural networks (NNs) are a powerful tool for analyzing biosignals because of their effectiveness as pattern classifiers. This study presented an overview of the application of neural networks in the field of biosignal classification (especially in
anomaly detection problems) and, in addition, presented results of adaptations of conventional neural classifiers. Statistical techniques based on pattern recognition analysis (clustering and the k-means method) might be used to evaluate the proposed methodology.
References

1. Hecht-Nielsen, R.: Applications of counterpropagation networks. Neural Networks 1, 131–139 (1988)
2. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
3. Des Jardins, T.: Cardiopulmonary Anatomy & Physiology, 4th edn. (2002)
4. Bronzino, D.J.: The Biomedical Engineering Handbook, 2nd edn. IEEE Press, Los Alamitos (2000)
5. Pryor, T.A., et al.: Computer Systems for the processing of diagnostic electrocardiograms. IEEE Computer Society Press, Los Alamitos (1980)
6. Haykin, S.: Neural Network: A comprehensive foundation. Macmillan College Publishing, Basingstoke (1994)
7. Dillon, R.M., Manikopoulos, C.N.: Neural Nets non-linear prediction for speech data. IEE Electronics Letters 27(10), 824–826 (1991)
8. Carpenter, G.A., et al.: Fuzzy ARTMAP: An adaptive resonance architecture for incremental learning of analog maps. In: International Joint Conference on Neural Networks (1992)
9. Fausett, L.: Fundamentals of Neural Networks. Prentice Hall, New Jersey (1994)
10. Brotherton, T., Johnson, T.: Anomaly detection for advanced military aircraft using neural networks. In: Proceedings of the 2001 IEEE Aerospace Conference, Big Sky, Montana (2001)
11. Basseville, M., Nikiforov, I.V.: Detection of Abrupt Changes: Theory and Applications. Prentice-Hall, Inc., Simon & Schuster Company, Englewood Cliffs (1993)
12. Gustafsson, F.: Adaptive Filtering and Change Detection. John Wiley & Sons Inc., Chichester (2000)
13. Markou, M., Singh, S.: Novelty detection: A review - part 1: Statistical approaches. Signal Processing 83(12), 2481–2497 (2003)
14. Markou, M., Singh, S.: Novelty detection: A review - part 2: Neural network based approaches. Signal Processing 83(12), 2499–2521 (2003)
15. Takeuchi, J., Yamanishi, K.: A unifying framework for detecting outliers and change points from time series. IEEE Trans. on Knowledge and Data Engineering 18(4), 482–489 (2006)
16. Yamanishi, K., Takeuchi, J.: A unifying framework for detecting outliers and change points from non-stationary time series data. In: Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 676–681 (2002)
17. Ide, T., Kashima, H.: Eigenspace-based anomaly detection in computer systems. In: Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 440–449 (2004)
18. San-Jun, H., Sung-Bae, C.: Evolutionary Neural Networks for anomaly detection based on the behavior of a program. IEEE Trans. on Systems and Cybernetics, Part B 36(3), 559–568 (2006)
Medical Informatics in the Web 2.0 Era

Iraklis Varlamis1 and Ioannis Apostolakis2

1 University of Peloponnese, Tripoli, Greece
[email protected]
2 Technical University of Crete, Chania, Greece
[email protected]
Abstract. The main role of medical and healthcare informatics is the manipulation of medical information and the dissemination of knowledge. The advent of the Web increased the pervasiveness of medical information and attracted the interest of both practitioners and patients. Web 2.0, in its turn, brings people together in a more dynamic, interactive space. With new services, applications and devices, it promises to enrich our web experience and to establish an environment where virtual medical communities may flourish, away from private interests and financial expectations. This article takes a bird's eye view of Web 2.0 novelties, portrays the structure of a medical community and describes how medical information can be exploited in favor of the community. It discusses the merits and necessities emanating from various approaches and tools and places emphasis on intelligent information management inside the medical community.

Keywords: Web 2.0, Medical Communities, Services.
1 Introduction

Medical informatics has been defined as the study and implementation of structures to improve communication, understanding and management of medical information. The objective is the extraction, storage and manipulation of data and information, and the development of tools and platforms that apply knowledge in the decision-making process, at the time and place that a decision needs to be made. The advent of the Internet introduced the idea of tele-application of medical practices. Tele-medicine, tele-education of practitioners and nurses, tele-healthcare and tele-consultation are rapidly developing applications of clinical medicine, where medical information is transferred via telephone, the Internet or other networks for the purpose of consulting, and sometimes for remote medical procedures or examinations. The Internet has broadened the scope of medical information systems and led to the development of distributed and interoperable information sources and services. At the same time, the need for standards became crucial. Federated medical libraries, biomedical knowledge bases and global healthcare systems offer a rich information sink and facilitate the mobility of patients and practitioners. The Web attracted more patients and increased the popularity of freely available medical advice and knowledge. The abundance of web sites that offer medical content affected the way patients face their doctors, gave them a second opinion and increased their awareness.
Its successor, Web 2.0, was built on the same technologies and concepts but added a layer of semantic abstraction, offered a "network as a platform" sensation and gave a social networking aspect to medical information systems. Patients, instead of seeking medical information and requesting medical advice on their issues, are supplied with useful news when medical advances take place. Patients are able to discuss their issues with other patients and collectively develop a medical knowledge base with easy-to-use tools. The plethora of available tools and platforms enhances the inventory of medical practitioners and can be of value to them and their patients if properly exploited. This paper gives an overview of these tools and discusses the merits of their use and the potential hazards that should be avoided. In the following section we list the major technological novelties of Web 2.0 under the prism of certain applications. In section 3 we examine a community-based approach, which combines the aforementioned novelties, under the prism of intelligent information management, and in section 4 we discuss the potential merits of this approach and the issues that should be considered.
2 Web, Web 2.0 and Medical Applications

The Internet and its services had a major impact on health care and medical information. First, they opened public access to medical information, which was previously restricted to health care providers. Of all searches on the Internet, 4.5% have been calculated to be health-related [7]. Patients feel empowered before reaching their doctors [1], since they have found or asked for information on the web. They get an idea about their diagnosis and treatment options and want to participate actively in therapeutic decisions. As a result, the way of interaction between the patient and the doctor has changed. Similarly, the way people perceive medical information has changed.

2.1 Medical Information and Web-Based Applications

The search for direct medical consultation gained ground over the searching and browsing of medical information, and this is another aspect of the change in the way doctors and patients communicate. "Ask the doctor" services [17], initially deployed through e-mail, kept records of questions and of the replies by expert physicians and published the results on the web for further reference [18]. Web sites have also been created in order to alert or support patients [8]; they offer informative content, provide directions for prevention, cure and symptom handling and, of course, sample questions and feedback from physicians. Electronic assessment is another healthcare application which has gained great attention. Online questionnaires, symptom checklists etc. were used in order to increase the interactivity of web-based medical applications. Short screening tests [9], [14] helped people to detect and overcome their addictions, alerted them and reminded them to visit their doctor. Mailing lists were also a solution for supporting patients in a constant manner. Table 1 summarizes medical applications and services delivered over the web.
Table 1. Web based medical services

Web application | Purpose | Services
Ask the doctor | Offer medical consultation on demand | E-mail
Medical chats | Offer medical consultation on demand, group therapy | Chat
Medical forum | Offer medical consultation on demand, retain archive | Forum
Ask the doctor website | Archive medical consultation | Dynamic Web site
Patient support websites and mailing lists for alerts | Provide informative content, support and prevention guidelines | Email, Static Web site
Online assessment | Prevent maladies, detect addictions | Active Web Site
Tele-healthcare, medicine, homecare etc. | Remotely provide clinical care, diagnosis, medical education | Tele-conference, Voice and Video over IP
Tele-healthcare, tele-medicine [10], tele-homecare and other applications make use of the Web and the whole Internet infrastructure in order to offer clinical and non-clinical services (medical education, information and administrative services). The main aim of these applications is the transfer of medical information and advice between the hospital and the remote patient, or the remote care provider, thus removing geographical and time-zone barriers. In the same direction, interoperable medical information systems have been developed to support the exchange and utilization of medical information across different hospitals, different healthcare providers or even different countries [3], [2]. The Web is mainly employed to achieve better coordination of all the participants in the medical process. All the above applications created a critical mass of people (practitioners, patients, care providers and care givers) that requests medical information and advice on health-related issues on an everyday basis. People assembled virtual communities around their issues and started seeking more flexible and collaborative solutions. The interest in ubiquitous medical information and pervasive solutions [4] created new web applications that facilitate people in sending, processing and receiving medical information. At the same time, a lot of Web 2.0 applications and standards emerged.
3 Collaborative Services and Medical Communities

Despite the achievements of the web and its services, it was still necessary for patients to seek medical information and to contact their doctors for consultation, diagnosis or treatment. The advent of Web 2.0 changed this uni-directional flow of information. Now all community members, even patients, are able to feed the community with news, advice and personal experiences. Moreover, the request-serve model has changed towards a push-pull model, where information is accumulated by community members and is made available to them through intelligent services (see Fig. 1). Patients receive useful alerts and doctors get notifications on medical advances, new medicines and therapies. In [13], the term Web 2.0 is perceived to encompass a set of services which extend Web 1.0 capabilities and emphasize the community and collaboration
Fig. 1. How the medical mesh transforms into a community
aspects. In [15] and [16] the authors present how medical communities can be used in favor of patients and how communication and collaboration between members of the healthcare community can be hosted in a community platform.

3.1 Web 2.0 Services

There is already a considerable number of Web 2.0 applications [5], [6], [11], [12] related to medical issues; blogs, wikis, folksonomies, podcasts and vidcasts are among them. In the following we give details on these applications, on the way information is published, annotated and consumed, and on their potential use in favour of the medical community. The main characteristics of all the Web 2.0 services presented in Table 2 are: a) contribution is communal, b) publishing has been replaced by participation, and c) access is public or at least is granted to the members of the medical community. Anonymity and identity issues are solved with the use of virtual identities. The services are mainly asynchronous, since it is infeasible for all community members to be concurrently online.

Blogs

Blogs (WeBLOGs) are Web sites that function as online journals. They present published content in reverse publication date order. One or more persons may contribute articles (posts), comments, links to other Web sites and multimedia content. Blog participants form virtual groups based on their common interest in the blog's topic. Easy and immediate publishing made blogs very popular: the posting of a clinical photo from a digital camera or a mobile phone directly to a blog, after optimisation and commenting, can be made at the touch of a button. Medical blog examples include Clinical Cases and Images, Family Medicine Notes etc.

RSS

RSS stands for "Really Simple Syndication". It is a standard format used to share content on the Internet. Many websites provide RSS "feeds" that describe their latest news and updates. Feeds play the role of newsletters but offer information in pieces, at
the moment it is created, and can be accessed by various devices and systems thanks to the standard format. The Doctors' Lounge, RSS for Medics and Medical News Today are a few medical RSS news syndication services. Most blogging services offer the ability to create RSS feeds, and an RSS reader is the only tool needed to process a feed, as the short sketch after Table 2 illustrates.

Table 2. Web 2.0 applications. Examples and open source solutions

Web 2.0 apps | Purpose | Online examples/tools
Blogs, photo blogs | Provide medical consultation, news, announcements, photos, allow comments | www.docnotes.net; http://casesblog.blogspot.com; www.wordpress.com; www.flickr.com
RSS feeds and news syndications | Instantly receive medical information right after it is published | http://www.doctorslounge.com/rss; http://www.rss4medics.com; www.medicalnewstoday.com; http://www.feedforall.com
Podcast and Vidcast | Provide consults, courses and information in audio and video stream format | http://conversations.acc.org/; http://www.annals.org/podcast/index.shtml; http://www.clevelandclinic.org; http://video.google.com/; http://www.archive.org/details/movies
Wiki | Collaboratively construct an archive of medical knowledge | http://askdrwiki.com/mediawiki/; http://www.radiopaedia.org; http://www.mediawiki.org/; http://www.splitbrain.org/go/dokuwiki
Collaborative Tagging and Social bookmarking | Link to informative content, evaluate sources and organize knowledge | http://www.bibsonomy.org/; http://www.citeulike.org/; http://www.flickr.com/; http://www.connotea.org/
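The sketch promised above shows how little code such an RSS reader needs; it uses the third-party Python library feedparser, and the feed URL is purely illustrative, not a verified address.

import feedparser  # third-party library: pip install feedparser

# Hypothetical feed URL, for illustration only
feed = feedparser.parse("https://www.medicalnewstoday.com/rss")
for entry in feed.entries[:5]:
    print(entry.title, "->", entry.link)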
Audio and video podcasts

Podcasts can be employed similarly to RSS for providing medical information on emerging issues. Moreover, the power of image and the ease of listening instead of reading make them ideal for the dissemination of medical information and for online courses. Examples are the Annals of Internal Medicine podcasts, the podcasts of the American College of Cardiology (Conversations with Experts) and the vidcasts of the Cleveland Clinic.

Wikis

Wikis are considered to replace content management applications by allowing users to easily publish articles, images and video. They can start to cover the lack of free online medical information and function as a repository of medical information that can be readily accessed for reference. They are built and populated collaboratively by domain experts and are accessible to patients, doctors or trainees and the public. In a medical wiki, the group of editors creates and contributes article reviews, disease definitions (symptoms, cure etc.), clinical notes, medical images or video.
Editors have the ability to alter content published by other editors and to have their own articles edited by others, hoping that the wiki will finally converge to a widely accepted final version.

Social bookmarking

Medical bookmarking aims to promote the sharing of medical references, mainly amongst practitioners and researchers. Scientists can share information on academic papers and are able to collaboratively catalog medical images with specific tools developed for that purpose. Article readers can organize their libraries, which can comprise Medline articles, with freely chosen tags. The result is a multi-faceted taxonomy, called a folksonomy, of tags (topics) and associated sources. Many medical information sources support tagging by users (i.e. JSTOR, PLoS, PubMed, and ScienceDirect). Human knowledge, captured in the categorization and characterization of articles, or of web sources in general, can be exploited by intelligent agents in order to provide recommendations about related sources or tags. It is obvious that all the services presented above differ from typical web services in the multitude and nature of the information sources they cover and in the way this information is enhanced and exploited.

3.2 The Medical Community Structure

A medical community that encompasses all the people interested in medical issues should be open to new members. Trustfulness is critical in medical issues and specifically in medical consultation, so the identity of consultants has to be valid and accessible to the community members. At the same time, anonymity is necessary (or at least helpful) for patients that seek consultation. Information/content providers and information consumers are the two main types of users. The former should necessarily use their real identities, whereas the latter can remain anonymous or use virtual personas. Information consumers (i.e. patients, people asking about medical issues etc.) can potentially become providers, since their questions, remarks and bookmarks are made available to the community; however, the quality of this content is questionable. Moderators, administrators and facilitators stand in-between the two types of users and are responsible for the smooth operation of the community. They control the registration process and guarantee the validity of expert members' identities. The community members are able to form groups inside the community based on common needs and interests. The needs of each group are different and sometimes contradictory, so it is necessary for the community to allow members to communicate their similarities and join forces, whilst protecting their individuality. A healthcare community can attract scientists and researchers, doctors and nurses, patients and people with personal interests in medicine and healthcare, and companies. More specifically:
• Scientists and researchers join the community in order to exchange knowledge and promote their science. They communicate with patients, analyze surveys' results and population statistics and get useful feedback on patient needs, on medical issues that arise etc. They co-operate with other scientists on their experiments and disseminate their findings to companies and individuals. They also give useful directions to medical associations concerning public health.
• Medical associations provide the professionals with guidelines on patient treatment and inform patients on topics such as prevention, self-protection etc. They issue specifications for companies that produce medical devices and medications.
• Healthcare companies advertise their products (devices, therapies, medical applications) to doctors, nurses and patients.
• Healthcare practitioners get informed on new findings, emerging therapies and medical approaches and sometimes get online training. In parallel, they guide nurses and patients' families on patient care and provide researchers and associations with useful feedback on emerging patient needs.
• Patients are receivers of support, treatment, care, information and advertisement from all other participants. They contribute to the community as end users of the community outcomes and as specimens of surveys.
Fig. 2. The medical community structure
As depicted in Fig. 2, the medical community portal is accessible to every web visitor or bot that wishes to browse or process the published content. Community members should register once and log in every time they want to join the community. The registration of new members should be controlled by the administrators. The
identity of expert members (doctors, company officials, scientists etc.) should be checked and certified by the administrators, whereas simple members can join by giving a contact e-mail address. Inside the community, registered users are able to participate in the various activities (i.e. chat with doctors or other members, hold public discussions, attend a video podcast or register to news feeds). The community experts create and publish new content and are charged with the moderation of group discussions and the filtering of content uploaded by non-experts. They use the wiki and tagging services to accumulate and organize the knowledge base of the community and announce new findings using the news feeds.
4 Discussion

The merits that arise from the community approach are many. First of all, human knowledge is captured, enriched with semantics (i.e. tags) and organized collaboratively (i.e. folksonomies, wikis) in a machine-readable way. Instead of a multitude of distinct applications that do not cooperate, the community platform is the World Wide Web, and the community activities and services can be developed using commonly agreed standards and common terminology. The Web offers ubiquitous access to the community services, since Web 2.0 applications are light and can be accessed by mobiles, PDAs, or even TV sets. New content (i.e. video blogging or podcasting), requests for advice, patient-related information or input to surveys can be submitted using the same devices (e.g. patients can select their symptoms from a list and communicate them to the community experts). The personalization of the community content to the specific needs of each member can be done by selecting the mini-applications (widgets) that fit each patient's needs. Smart alert systems can be developed that will remind patients of their scheduled treatment or inform doctors of their patients' health status. All community transactions and communications must be secure, and various access levels can be used. Trust inside the community can be guaranteed by a strong administrator organization through the use of proper technologies, validation mechanisms and security structures. Trust can also be developed by using an evaluation and reputation system: expert users will be able to validate content, and all community members will be able to judge, vote and tag content in order to make it useful for others. As is the case with all communities, the administrators should be careful to avoid several dangers. Most of the efforts we mentioned in section 2 are made by individuals or by a single institute or university and are not supported by a big organization or a medical forum. A centrally co-ordinated effort is necessary for a successful and effective community, and administration should be performed in cooperation with companies and associations. When the community serves for patients or doctors to support other associates, the advice and information exchanged between individuals should be validated. Group moderators need monitoring tools in order to proactively coordinate groups, and would be pleased to have collaborative platforms to support their groups. Validity can be achieved through monitoring, although it is preferable to replace monitoring with an authorization mechanism. Advice, comments or opinions that are not signed are considered of low quality and
consequently invalid. Valid information and services are issued by authorized community members only and are always signed. The diversity of Web 2.0 tools can be confusing to the community members, especially when all novelties are introduced in one step. Changes and new services should be added slowly, and training, facilitation and user feedback are appreciated. Another issue that must be considered in a medical community relates to the amount and quality of the information offered. The flood of information can be confusing both to patients and to doctors; as a consequence, information must be filtered and organized. Since anyone is able to publish information, and since it is not always easy to see the origin of the information, users could be making decisions on the basis of a source that might not be quality assured. A certification authority is necessary to guarantee the expertise level of every user, control the quality of the published information and build trust among the community members. Even when the information is of high quality, users are not always capable of making their own judgments and need support from the experts. Other issues relate to the expertise of all members in handling virtual discussions or providing diagnosis remotely. These issues should be considered in the design phase in order to increase members' participation and improve the quality of the community services.
5 Conclusions

This paper performed an overview of Web 2.0 applications and compared their features to traditional web services under the prism of the medical community and its needs. Current attempts at using Web 2.0 applications in favor of the medical community are disconnected, so we present a structure that will allow their interconnection. The community will bring together doctors, nurses and volunteers around patients and will provide the tools for requesting and providing medical information, advice and psychological support. Healthcare associations, companies and researchers will be able to join the community, disseminate their instructions, products and findings respectively, and undertake crucial tasks such as the quality control of services and information. The use of community services will load the community database with valuable information concerning user feedback, patient needs, treatment suggestions, patient profiles and medical record history. The stockpiled information can be analyzed by the community administrators who want to improve services, by scientists who perform medical research, and by future patients who seek quick advice from a fellow-sufferer. The knowledge produced inside the community will be continuously filtered and managed in order to maintain quality.
References

1. Akerkar, S.M., Bichile, L.S.: Doctor patient relationship: Changing dynamics in the information age. Journal of Postgraduate Medicine 50, 120–122 (2004)
2. Apostolakis, I., Kastania, A.N.: Distant teaching in telemedicine: Why and Who we do it. Journal of Management and Health 1(1), 66–73 (2000)
3. Apostolakis, I., Valsamos, P., Varlamis, I.: Decentralization of the Greek National Telemedicine System through the upgrading of the Regional Telemedicine Centers. In: Tan, J. (ed.) Healthcare Information Systems and Informatics: Research and Practices, IGI Global (to appear) (April 2008)
4. Bang, M., Timpka, T.: Ubiquitous computing to support co-located clinical teams: Using the semiotics of physical objects in system design. International Journal of Medical Informatics, 58–64 (2006)
5. Barsky, E.: Introducing Web 2.0: weblogs and podcasting for health librarians. JCHLA/JABSC 27, 33–34 (2006), http://pubs.nrc-cnrc.gc.ca/jchla/jchla27/c06-013.pdf
6. Boulos, M.N., Maramba, I., Wheeler, S.: Wikis, blogs and podcasts: a new generation of Web-based tools for virtual collaborative clinical practice and education. BMC Med. Educ. 6, 41 (2006)
7. Eysenbach, G., Kohler, C.: Health-related searches on the Internet. JAMA 291, 2946–2947 (2004)
8. Hartmann, C.W., Sciamanna, C.N., Blanch, D.C., Mui, S., Lawless, H., Manocchia, M., Rosen, R.K., Pietropaoli, A.: A Website to Improve Asthma Care by Suggesting Patient Questions for Physicians: Qualitative Analysis of User Experiences. J. Med. Internet Res. 9(1), e3 (2007)
9. Linke, S., Murray, E., Butler, C., Wallace, P.: Internet-Based Interactive Health Intervention for the Promotion of Sensible Drinking: Patterns of Use and Potential Impact on Members of the General Public. J. Med. Internet Res. 9(2), e10 (2007)
10. Linkous, J.D.: Telemedicine: an overview. Journal of Medical Practice Management 18(1), 24–27 (2002)
11. Mathieu, J.: Blogs, Podcasts, and Wikis: The New Names in Information Dissemination. Journal of the American Dietetic Association 107(4), 553–555 (2007)
12. Mclean, R., Richards, B.H., Wardman, J.I.: The effect of web 2.0 on the future of medical practice and education: Darwikinian evolution or folksonomic revolution? Medical Journal of Australia 187, 174–177 (2007)
13. O'Reilly, T.: What Is Web 2.0. Design Patterns and Business Models for the Next Generation of Software, O'Reilly Network (2005) (Retrieved on 1-2-2008), http://www.oreillynet.com/lpt/a/6228
14. Vallejo, M.A., Jordán, C.M., Díaz, M.I., Comeche, M.I., Ortega, J.: Psychological Assessment via the Internet: A Reliability and Validity Study of Online (vs Paper-and-Pencil) Versions of the General Health Questionnaire-28 (GHQ-28) and the Symptoms Check-List-90-Revised (SCL-90-R). J. Med. Internet Res. 9(1), e2 (2007)
15. Varlamis, I., Apostolakis, I.: Self supportive web communities in the service of patients. In: Proceedings of IADIS International Conference on Web Based Communities 2007, Salamanca, Spain, February 18-20 (2007)
16. Varlamis, I., Apostolakis, I.: Use of virtual communities for the welfare of groups with particular needs. Journal on Information Technology in Healthcare (JITH) 4(6), 384–392 (2006)
17. Umefjord, G., Sandström, H., Malker, H., Petersson, G.: Medical text-based consultations on the Internet: A 4-year study. International Journal of Medical Informatics 77(2), 114–121 (2008)
18. Umefjord, G., Hamberg, K., Malker, H., Petersson, G.: The use of an Internet-based Ask the Doctor Service involving family physicians: evaluation by a web survey. Family Practice 23(2), 159–166 (2006)
Affective Reasoning Based on Bi-modal Interaction and User Stereotypes

Efthymios Alepis1, Maria Virvou1, and Katerina Kabassi2

1 Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou St., 18534 Piraeus, Greece
{talepis,mvirvou,kkabassi}@unipi.gr
2 Department of Ecology and the Environment, Technological Educational Institute of the Ionian Islands, 2 Kalvou Sq., 29100 Zakynthos, Greece
Abstract. This paper describes a novel research approach for affective reasoning that aims at recognizing user emotions within an educational application. The approach is based on information about users that arises from two modalities (keyboard and microphone) and is processed using a combination of the user stereotypes theory and a decision making theory. The resulting system is called EAT (Educational Affective Tutor), an educational system that helps students learn geography and supports bi-modal interaction. The main focus of this paper is to show how affect recognition is designed, based on an empirical study aimed at finding common user reactions that express user feelings while users interact with computers.
1 Introduction

Affective computing has recently become a very important field of research because it focuses on recognizing and reproducing human feelings within human-computer interaction. Human feelings are considered very important, but only recently have they started being taken into account in user interfaces. Thus, the area of affective computing is not yet well understood and needs a lot more research to reach maturity. As Picard claims in [1], one of the major challenges in affective computing is to try to improve the accuracy of recognizing people's emotions. In [2] and [3] it is suggested that the incorporation of stereotypes in emotion-recognition systems improves the systems' accuracy. However, it is acknowledged by the same researchers that, until the time they had completed their review, there were few studies based on emotion stereotypes. Today, there is still a shortage of such studies despite the fact that research interest in affective computing has expanded. Another suggestion for the improvement of accuracy in affect recognition is the combination of more than one mode of interaction in user interfaces. It is hoped that the multimodal approach may provide not only better performance, but also more robustness [4]. In view of the above, in this paper we have combined two modes of interaction, namely keyboard and audio input devices, so that the system may draw inferences about the users' emotional state while they interact with an educational application. The reasoning mechanism that draws inferences about users' emotions is based on monitoring users' actions recorded by the two modes of interaction and by using user
stereotypes that have been constructed for user modeling in the affective educational application. The educational application that performs affect recognition and has been used as a test-bed for our approach is called EAT (Educational Affective Tutor). EAT is an educational system that helps students learn geography and supports bi-modal interaction. Input is given by the users through two different modes, the microphone and the keyboard. The information collected through bi-modal interaction is used for making inferences about the user's emotional state and for providing affective interaction. Considering the important problem of which mode gives better results, or to what extent the evidence from each mode should be taken into account, we propose a novel approach for calculating the weight of significance of each mode based on stereotypes and a multi-criteria theory. Stereotype-based reasoning takes an initial impression of the user and uses this to build a user model based on default assumptions [5]. Stereotypes constitute a powerful mechanism for building user models [5]. This is due to the fact that stereotypes represent information that enables the system to make a large number of plausible inferences on the basis of a substantially smaller number of observations [6]. The stereotype inferences are used in combination with a decision theory, namely Simple Additive Weighting (SAW) ([7], [8]), for estimating the weight of significance of each mode in the affective reasoning of the system for a particular user.
2 Overview of the System

In this section, we describe the main functionality of EAT (Educational Affective Tutor). EAT is an affective multi-modal education system about geography. More specifically, the system has the ability to provide courses in geography as well as different tests for knowledge evaluation. The application works both as a web-based and as a standalone application, depending on the existence of an Internet connection on the user's computer. Figure 1 illustrates the form of the educational application where users may view parts of the theory of a lesson about European geography. While using EAT from a desktop computer, users are able to retrieve information about a particular course. The information is given in text form while, at the same time, an animated agent reads it out loud using a speech engine. The student can choose a specific European country and all the available information will be retrieved from the system's database. The cartoon-teacher is an animated agent who can move around the tutoring text and can point at parts of the theory that a student should read. It has also incorporated features of human body language: it shows patience while users read the theory, boredom if they are not responding to the system, wonder if they make an unexpected move, etc. Similarly, users are able to take tests that include questions and answers, multiple choice, etc., concerning specific parts of the theory. The animated agent is present in this mode too, in order to make the interaction more human-like and encourage users to interact freely and affectively with the system. EAT is also accessible from a palmtop PC. In this case, users can connect to the Internet pages of the application and, after the authentication process, they may take courses or tests.
Fig. 1. The theory form of the application concerning a geography lesson
On both a desktop PC and a palmtop PC, user authentication is necessary for the creation of the user model of the user interacting with the system. The information in the user model is used in combination with the collected evidence of each specific user's input actions (through the microphone and through the keyboard) in order to improve the affective capability of the application. For this purpose, the system uses a combination of user stereotypes with a multi-criteria theory.
3 Empirical Study

As a first step in designing user stereotypes for emotion recognition, we conducted an empirical study based on affective interaction through the audio mode and the keyboard. The oral mode of interaction uses the microphone as an input device, while the keyboard interaction uses the personal computer's keyboard. The study aimed at finding out common user reactions that express user feelings while users interact with computers. As soon as these reactions were identified, they were associated with particular feelings within the context of certain user groups, depending on their age, level of computer familiarity, educational level and gender. Stereotypes are used in order to improve the accuracy of an emotion-recognition system by indicating the most common reactions to certain feelings. The empirical study also gave important results about the strengths and weaknesses of emotion recognition based on each modality. The empirical study involved 200 users (male and female) of varying educational backgrounds, ages and levels of computer experience. Analyzing the distribution of
participants in the empirical study in terms of their age, 14% of participants were under the age of 18, approximately 20% of participants were between the ages of 18 and 30, and a considerable percentage of our participants were over the age of 40. Considering the participants' gender, users were nearly evenly divided. The subjects' allocation in terms of their computer knowledge level was as follows:

• 20% inexperienced users,
• 30% novice users,
• 32% intermediate users and
• 18% expert computer users.
The empirical study was conducted in two phases. In the first phase, the participants were given questionnaires concerning their emotional reactions to several situations of computer use, in terms of their actions while using the keyboard and what they say. Specifically, the subjects had to imagine or recall an interaction with a computer while they were using an educational application and the effect on their voice and on the way they type. In the second phase of the study, the participants were asked to use an application which incorporated a user monitoring component. This component recorded the actions of users from the keyboard and the microphone and finally interpreted them in terms of emotions. The basic function of this component was to capture all the data inserted by the user, either orally or by using the keyboard of the computer. The data was stored in a database and then analysed by human experts. It is interesting to note that a very high percentage (85%) of young people (below 30 years old) who are also inexperienced with computers reported, or were recorded, to have preferred expressing themselves through the oral mode rather than the keyboard. On the contrary, participants who were computer experts did not give us considerable data for affect perception during oral communication with their computer. Concerning the keyboard, most of the participants agree that when they are nervous the possibility of making mistakes increases rapidly. This is also the case when they have negative feelings. Mistakes in typing are followed by many backspace-key strokes and concurrent changes in the emotional state of the user in a percentage of 82%. Furthermore, users under 20 years old, and users over 20 years old who also have a low educational background, seem to be more prone to making even more mistakes as a consequence of an initial mistake and to losing their concentration while interacting with an application (67%). Users of the same category also admit that when they are angry the rate of mistakes increases and their typing becomes slower (62%), whereas when they are happy they type faster (70%). Similar effects on the keyboard were reported when the emotion is boredom instead of anger. One conclusion concerning the combination of the two modes in terms of emotion recognition is that the two modes are complementary to each other to a high extent. In many cases the system can generate a hypothesis about the emotional state of the user with a higher degree of certainty if it takes into account evidence from the combination of the two modes rather than from one mode alone.
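To make the kind of monitoring described above concrete, the following Python sketch shows how a user monitoring component might derive the two keyboard features the study highlights, namely typing rate and backspace frequency. The class name, the sliding-window length and the feature names are our own illustrative choices, not part of the original system.

```python
import time
from typing import Optional

class KeyboardMonitor:
    """Collects timestamped key events and derives the two features the
    empirical study singles out: typing rate and backspace frequency."""

    def __init__(self, window_secs: float = 30.0):
        self.window_secs = window_secs   # sliding-window length (assumed)
        self.events = []                 # list of (timestamp, key) pairs

    def record(self, key: str, timestamp: Optional[float] = None) -> None:
        self.events.append((timestamp if timestamp is not None else time.time(), key))

    def features(self) -> dict:
        now = time.time()
        recent = [(t, k) for t, k in self.events if now - t <= self.window_secs]
        if not recent:
            return {"keys_per_min": 0.0, "backspace_ratio": 0.0}
        backspaces = sum(1 for _, k in recent if k == "backspace")
        return {
            "keys_per_min": 60.0 * len(recent) / self.window_secs,
            "backspace_ratio": backspaces / len(recent),
        }
```

Following the findings above, a rising backspace ratio combined with a falling typing rate would be treated as evidence pointing towards anger or nervousness rather than happiness.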
4 Design of User Stereotypes

The main reason for using stereotypes is that they allow a system to make a large number of inferences based on a smaller number of observations. According to [5], [6], a stereotype includes:

• A set of trigger conditions, which are boolean expressions that activate a specific stereotype.
• A set of stereotypic inferences, which serve as default assumptions for the user, once the user is assigned to a stereotype.
In our study, we have classified users into stereotypes concerning their age, their educational level, their computer knowledge level and their gender. For each user there is a value that corresponds to a four-dimensional stereotypic vector of the form: (User_Name, Stereotypic Characteristic1, Stereotypic Characteristic2, Stereotypic Characteristic3, Stereotypic Characteristic4). Stereotypic Characteristic1 refers to the user's age and is an element of the following set of age ranges: [(10-16), (16-24), (24-36), (36-50), (over 50)]. Stereotypic Characteristic2 refers to the user's computer knowledge level and is an element of the following set concerning the user's computer experience in months (using a personal computer): [(less than 1), (1-6), (6-12), (over 12)]. Similarly, we have defined Stereotypic Characteristics 3 and 4, which refer to the user's educational level and to the user's gender, respectively. The inferences of this stereotype vector provide information about the weights of importance of each mode for the users belonging to that stereotype. For example, a user who belongs to the stereotype [(16-24)] is inferred to have a tendency to express his/her feelings through the oral mode of interaction. Stereotypes can provide inferences concerning hypotheses about users' feelings and about which modality should be considered more important for providing evidence about those feelings. More specifically, in many cases, data from the vocal interaction or the interaction through the keyboard gives evidence of different emotions with quite similar degrees of certainty. For example, the system may have evidence that a user is either angry while saying or typing something, or stressed, or even confused. The incorporation of stereotypes in the system provides inferences concerning people belonging to the same category as the user, which may help in recognizing the emotion that is most common for users of this category, or even in distinguishing between emotions. Evidence about the character or the personality of a particular user may raise the degree of certainty for a particular recognized emotion.
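As an illustration of how such a stereotype, with its trigger conditions and inferences, might be represented in software, consider the following Python sketch. The triggers are encoded as predicates over a user profile, and each stereotype carries mode-weight inferences; all numeric weight values here are invented placeholders, since the paper derives the real ones from the empirical study.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Stereotype:
    name: str
    trigger: Callable[[dict], bool]   # boolean trigger condition
    inferences: Dict[str, float]      # default assumptions, e.g. mode weights

# Illustrative stereotypes for the age dimension; the weights (oral vs.
# keyboard importance) are hypothetical placeholders, not the study's values.
AGE_STEREOTYPES = [
    Stereotype("ag1", lambda u: 10 <= u["age"] < 16, {"oral": 0.7, "keyboard": 0.3}),
    Stereotype("ag2", lambda u: 16 <= u["age"] < 24, {"oral": 0.6, "keyboard": 0.4}),
    Stereotype("ag3", lambda u: 24 <= u["age"] < 36, {"oral": 0.5, "keyboard": 0.5}),
]

def activate(user: dict, stereotypes=AGE_STEREOTYPES) -> Stereotype:
    """Return the first stereotype whose trigger fires for this user."""
    return next(s for s in stereotypes if s.trigger(user))

# Example: a 20-year-old user activates ag2 and inherits its inferences.
profile = {"age": 20}
print(activate(profile).inferences)   # {'oral': 0.6, 'keyboard': 0.4}
```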
5 Multi-criteria Decision Making Using Stereotypes

In many cases, data from the vocal interaction or the interaction through the keyboard gives the system evidence of emotions with quite similar likelihood degrees. For example, we may have evidence that a user is either angry while saying or typing something, or stressed, or even confused. The incorporation of stereotypes provides the system with a reasoning mechanism that may recognize one emotion among others, or give evidence for discriminating between emotions. Evidence for the
character or the personality of a certain user may raise the percentage of likelihood for certain emotions. Moreover, having evidence for basic characteristics of users may lead the system to focus its emotion recognition algorithms on a certain mode of interaction. If the system knows that it is more likely to recognize the emotional state of a user by analyzing data from a specific mode (e.g. his/her voice through the microphone), then it should pay more attention to that mode. The system combines the outcome of the two modes in order to determine the emotional state of the user. More specifically, the system uses the multi-criteria decision making method SAW to combine evidence from the two modes and conclude on one single emotion. The SAW approach consists of translating a decision problem into the optimization of some multi-criteria utility function $U$ defined on a set of alternatives $A$. The decision maker estimates the value of the function $U(X_j)$ for every alternative $X_j$ and selects the one with the highest value. The multi-criteria utility function $U$ can be calculated in the SAW method as a linear combination of the values of the $n$ criteria:

$U(X_j) = \sum_{i=1}^{n} w_i x_{ij}$  (1)

where $X_j$ is one alternative and $x_{ij}$ is the value of the $i$-th criterion for the alternative $X_j$. In our proposed approach, we investigate emotion recognition as a multi-criteria problem. The values of the criteria are the outcomes of each mode of the system, and the weight shows how important each mode is in collecting evidence from the user. In this way, the multi-criteria decision making approach provides the system with suitable weights that determine the percentage of participation of each mode of interaction in the final detection of a user's emotional state. As a result, the multi-criteria utility function $U$ takes the form:

$U(E_j) = \sum_{i=1}^{2} w_{m_i} e_{ij}$  (2)

where $E_j$ is one alternative emotion and $e_{ij}$ is the value of the $i$-th mode for the $j$-th alternative emotion. The way $e_{ij}$ is calculated has been presented thoroughly in ([9], [10]). In the proposed approach, we present how the weights of each mode in formula 2 can be calculated using stereotypic knowledge in combination with SAW. More specifically, the stereotypic categories constitute the criteria that will help us select the prevailing mode that will lead the system to a more accurate recognition of an emotion.

5.1 Stereotypes Inference for Weight Calculation

For the creation of the stereotypic characteristics, we have classified all users into four categories concerning their age, their educational level, their computer knowledge level and their gender. Each category is represented by a set of two or more elements, as illustrated in Table 1.
Table 1. Categories of users

Categories                    Sets
Age (years)                   10-16 | 16-24 | 24-36 | 36-50 | Over 50
Computer Knowledge (months)   0-1 | 1-6 | 6-12 | Over 12
Educational Level             University level, Science | University level, Arts | Postgraduate level | High School level
Gender                        Female | Male
For each user, one stereotype is activated for each of the four categories. Table 2 results from Table 1 and presents the available stereotypes of the four different categories. For example, ag1 is the stereotype that corresponds to users aged 10-16 years old, ck3 is the stereotype that corresponds to users that have used a computer for 6 to 12 months, etc.

Table 2. Categories of users with stereotypic variables
Categories                    Variables
Age (years)                   ag1 | ag2 | ag3 | ag4 | ag5
Computer Knowledge (months)   ck1 | ck2 | ck3 | ck4
Educational Level             ed1 | ed2 | ed3 | ed4
Gender                        gd1 | gd2
For example, the stereotypic variables that are activated (value equal to 1) for a female user, 20 years old, with a low level of computer knowledge (1 month of training) and with a high-school educational level, will be "ag2", "ck1", "ed4" and "gd1". In order to find out the weight that is proposed by the stereotypic mechanism for the particular user, we estimate the weight that is proposed by each different stereotype that has been activated for that user. More specifically, we have formed the following equations that produce the expected weights for each mode of interaction, namely the oral mode of interaction and the interaction through the keyboard. For the oral mode these formulas are:
$AG_O = ag_1 \cdot W_{oag1} + ag_2 \cdot W_{oag2} + ag_3 \cdot W_{oag3} + ag_4 \cdot W_{oag4} + ag_5 \cdot W_{oag5}$  (Formula 3)

$CK_O = ck_1 \cdot W_{ock1} + ck_2 \cdot W_{ock2} + ck_3 \cdot W_{ock3} + ck_4 \cdot W_{ock4}$  (Formula 4)

$ED_O = ed_1 \cdot W_{oed1} + ed_2 \cdot W_{oed2} + ed_3 \cdot W_{oed3} + ed_4 \cdot W_{oed4}$  (Formula 5)

$GD_O = gd_1 \cdot W_{ogd1} + gd_2 \cdot W_{ogd2}$  (Formula 6)

For the interaction of users through the keyboard we have the following four formulas (7-10), respectively:

$AG_K = ag_1 \cdot W_{kag1} + ag_2 \cdot W_{kag2} + ag_3 \cdot W_{kag3} + ag_4 \cdot W_{kag4} + ag_5 \cdot W_{kag5}$  (Formula 7)

$CK_K = ck_1 \cdot W_{kck1} + ck_2 \cdot W_{kck2} + ck_3 \cdot W_{kck3} + ck_4 \cdot W_{kck4}$  (Formula 8)

$ED_K = ed_1 \cdot W_{ked1} + ed_2 \cdot W_{ked2} + ed_3 \cdot W_{ked3} + ed_4 \cdot W_{ked4}$  (Formula 9)

$GD_K = gd_1 \cdot W_{kgd1} + gd_2 \cdot W_{kgd2}$  (Formula 10)
In each of the above formulas, the capitalized letters on the left-hand side correspond to the four mentioned categories of user characteristics, while their subscript corresponds to the mode of interaction. The weights for each category and for each mode result from the statistical analysis of the empirical study and represent the percentage by which each category of users provides characteristic evidence, in each mode, for the recognition of emotions. The values for these weights are given as inferences of the stereotypes. The above formulas reveal only the weights that are suitable for the particular user. This is due to the fact that the variables $ag_i$, $ck_i$, $ed_i$ and $gd_i$ of the activated stereotypes are equal to 1 and are multiplied by the weights, whereas the rest of the weights are eliminated, since they are multiplied by zero. For example, the empirical study revealed that younger users have a tendency to use the oral mode of interaction more than older users. This results in a higher weight for the variable $W_{oag1}$. This specific weight is related to the age categorization of users and the oral mode of interaction. Subsequent to the increase of the value of the weight $W_{oag1}$, we have corresponding decreases in the values of the weights $W_{oag2}$, $W_{oag3}$, $W_{oag4}$ and $W_{oag5}$.

5.2 SAW for Calculating Weights

In order to find out the weight of each mode that is more probable to give more evidence about the user's emotional state, we use SAW for combining evidence from the activated stereotypes. As a next step, we finally need to define the formulas that will
give the exact ratio for each of the two modes towards the correlation of the multi-modal data for successful emotion recognition. However, we may note that the stereotypic analysis showed that evidence from the different stereotypic categories of users is not equally important for emotion recognition. The age of a particular user, as well as his/her knowledge in using computers, seems to affect the way s/he uses a specific mode of interaction with a personal computer more than his/her gender does. Thus, for the two modes of interaction, the following two formulas are generated.

For the oral mode:

$W_{M1} = W_{AG_O} \cdot AG_O + W_{CK_O} \cdot CK_O + W_{ED_O} \cdot ED_O + W_{GD_O} \cdot GD_O$  (Formula 11)

For the interaction through the keyboard:

$W_{M2} = W_{AG_K} \cdot AG_K + W_{CK_K} \cdot CK_K + W_{ED_K} \cdot ED_K + W_{GD_K} \cdot GD_K$  (Formula 12)

The values of $AG_O$, $CK_O$, $ED_O$, $GD_O$ and $AG_K$, $CK_K$, $ED_K$, $GD_K$ have been calculated by formulae 3-10. The weights $W_{AG_O}$, $W_{CK_O}$, $W_{ED_O}$, $W_{GD_O}$ and $W_{AG_K}$, $W_{CK_K}$, $W_{ED_K}$, $W_{GD_K}$ have prefixed values, irrespective of the stereotypes. Finally, the values of $W_{M1}$ and $W_{M2}$ are used in formula 2 mentioned above.
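To make the computation concrete, the sketch below strings formulas 3-12 and formula 2 together in Python. Every numeric weight in it is an invented placeholder — the paper's actual values come from the empirical study's statistical analysis — so the code only illustrates the mechanics, not the calibrated model.

```python
# Activated stereotype indicators for one user (1 = active), e.g. the
# 20-year-old female of the running example: ag2, ck1, ed4, gd1.
user = {"ag": [0, 1, 0, 0, 0], "ck": [1, 0, 0, 0], "ed": [0, 0, 0, 1], "gd": [1, 0]}

# Per-category, per-mode stereotype weights (Formulas 3-10). All values
# are hypothetical placeholders standing in for the empirical-study results.
W = {
    "oral":     {"ag": [.8, .7, .5, .4, .3], "ck": [.7, .6, .5, .4],
                 "ed": [.5, .5, .5, .6],     "gd": [.6, .5]},
    "keyboard": {"ag": [.2, .3, .5, .6, .7], "ck": [.3, .4, .5, .6],
                 "ed": [.5, .5, .5, .4],     "gd": [.4, .5]},
}

# Fixed importance of each category (Formulas 11-12); also illustrative.
CAT_W = {"ag": 0.4, "ck": 0.3, "ed": 0.2, "gd": 0.1}

def mode_weight(mode: str) -> float:
    """W_M for one mode: category scores (Formulas 3-10) combined by SAW
    into the mode weight (Formulas 11-12)."""
    return sum(
        CAT_W[cat] * sum(a * w for a, w in zip(user[cat], W[mode][cat]))
        for cat in CAT_W
    )

w_m = {"oral": mode_weight("oral"), "keyboard": mode_weight("keyboard")}

# Formula 2: utility of each candidate emotion given per-mode evidence e_ij,
# which the system derives as in [9], [10]; the values below are assumed.
evidence = {"anger": {"oral": 0.6, "keyboard": 0.8},
            "happiness": {"oral": 0.3, "keyboard": 0.1}}
utility = {emo: sum(w_m[m] * e[m] for m in w_m) for emo, e in evidence.items()}
print(max(utility, key=utility.get))   # emotion with the highest utility
```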
6 Conclusions

In this paper, we described an affective educational application that recognizes users' emotions based on two modes of interaction, namely the microphone and the keyboard. The system uses an innovative approach that combines evidence from the two modes of interaction based on user stereotypes and a multi-criteria decision making theory to improve the system's accuracy in recognizing emotions. For the creation of user stereotypes, an empirical study was conducted among users of different ages, computer knowledge levels and educational backgrounds. Having evidence about the basic characteristics of users may lead emotion recognition approaches to focus on a certain mode of interaction. For example, if the system knows that it is more likely to recognize the emotional state of a user by analyzing data from his/her voice through the microphone, then it should take mainly the audio mode into consideration. Therefore, the stereotype inferences are used in combination with a multi-criteria decision theory in order to find out the significance of each mode of interaction in selecting the emotion of the user. In this way, emotion recognition is dynamic and adapted to each user's characteristics. In future work, we plan to improve our system by exploiting a third, visual mode of interaction ([11]) to add information to the system's database and complement the inferences of the user modelling component about users' emotions.
References 1. Picard, R.W.: Affective Computing: Challenges. Int. Journal of Human-Computer Studies 59(1-2), 55–64 (2003) 2. Moriyama, T., Ozawa, S.: Measurement of Human Vocal Emotion Using Fuzzy Control. Systems and Computers in Japan 32(4) (2001) 3. Moriyama, T., Saito, H., Ozawa, S.: Evaluation of the Relation between Emotional Concepts and Emotional Parameters in Speech. Systems and Computers in Japan 32(3) (2001) 4. Pantic, M., Rothkrantz, L.J.M.: Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE 91, 1370–1390 (2003) 5. Kay, J.: Stereotypes, student models and scrutability. In: Gauthier, G., Frasson, C., VanLehn, K. (eds.) ITS 2000. LNCS, vol. 1839, pp. 19–30. Springer, Heidelberg (2000) 6. Rich, E.: Users are individuals: individualizing user models. International Journal of ManMachine Studies 18, 199–214 (1983) 7. Fishburn, P.C.: Additive Utilities with Incomplete Product Set: Applications to Priorities and Assignments. Operations Research (1967) 8. Hwang, C.L., Yoon, K.: Multiple Attribute Decision Making: Methods and Applications. Lecture Notes in Economics and Mathematical Systems, vol. 186. Springer, Heidelberg (1981) 9. Alepis, E., Virvou, M., Kabassi, K.: Knowledge Engineering for Affective Bi-modal Human-Computer Interaction, Sigmap (2007) 10. Alepis, E., Virvou, M.: Emotional Intelligence: Constructing user stereotypes for affective bi-modal interaction. Lecture notes in Computer Science, pp. 435–442. KES (2006) 11. Stathopoulou, I.O., Tsihrintzis, G.A.: Detection and Expression Classification System for Face Images (FADECS). In: IEEE Workshop on Signal Processing Systems, Athens, Greece (2005)
General-Purpose Emotion Assessment Testbed Based on Biometric Information

Jorge Teixeira1,2, Vasco Vinhas1,2,3, Eugenio Oliveira1,2,3, and Luis Paulo Reis1,2,3

1 Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias s/n, Porto, Portugal
2 Departamento de Engenharia Informatica, Rua Dr. Roberto Frias s/n, Porto, Portugal
3 LIACC - Artificial Intelligence and Computer Science Laboratory, Rua Campo Alegre 823, Porto, Portugal
{teixeira.jorge,vasco.vinhas,eco,lpreis}@fe.up.pt
Abstract. While affective computing and the entertainment industry still maintain a substantial gap between themselves, biosignals have become subject to digital acquisition through low-budget technological solutions at negligible invasiveness levels. The integration of electroencephalography, galvanic skin response and oximetry in a multichannel framework constitutes an effort in the path to identifying emotional states via biosignal expression. To induce and detect specific emotions, gender-specific sessions were defined based on the International Affective Picture System and performed in a controlled environment. Data was collected, visualized in real-time by the session instructor, and stored for processing and analysis. Results granted by distinct analysis techniques showed that high-frequency EEG waves are strongly related to emotions and are a solid ground on which to perform accurate emotion classification. They have also given strong indications that females are more sensitive to emotion induction. One might conclude that the attained success levels in relating emotions to biosignals are extremely encouraging, not only for this research topic but also for its application in domains such as multimedia entertainment, advertising and medical treatments.
1 Introduction

Affective computing is consistently becoming a confirmed scientific domain with practical applications, while the entertainment industry as a whole, and especially its cinematographic and videogame branches, which have been closing the semantic gap between them, constitutes an economic giant. Having this macro-contextualization in mind, the authors have already engaged in a research project with the main intention of using emotion assessment through biosignals to promote both subconscious interaction and the delivery of content appropriate to the specific individual. The presented study is integrated in this scope, as it falls squarely within the emotion assessment research module. The proposed system constitutes a solid technological framework that intends to enable biological information acquisition in a controlled environment, having as initial hypothesis the existence of human physical expressions of emotional states that can be objectively measured by relatively inexpensive equipment. The multichannel structure was defined by exploiting known techniques, namely electroencephalography, galvanic skin response and heart rate monitoring, and emotional states were induced using third-party catalogued pictures. The first goal was to effectively define, build and test an experimental session framework where subjects followed
a given strict protocol in order to visualize and/or interact with multimedia content. The second objective was, using the given platform, to identify specific, controlled and extractable biological signals that could be used as emotion index factors applicable to all subjects or to a characteristic group of equals. The confirmation of the initial hypothesis and project goals would enable its immediate application in developing a fully or semi-automatic emotion classification engine that would be able to apply the identified signal patterns to identify, in real-time, the subject's emotional states with high accuracy. In order to better detail the presented study, this document is structured as follows: the domain's state of the art is described in the next section; in section 3, the multichannel emotion assessment framework is presented with special emphasis on the most significant decisions, and the experimental sessions' protocol and controlled conditions are exposed. In section 4, results are presented; related conclusions are extracted in section 5, where future work areas are also identified and practical domains of application are suggested.
2 State of the Art

The emotional state of human beings is a complex theme, since its definition is not unique and its essence is not consensual. An overview of emotion assessment is presented in the next subsection, as well as a brief description of the most common approaches to emotional induction and, finally, a reference to equipment solutions.

2.1 Emotion Assessment
The emotion itself can be seen as a consequence of an action or an environmental cause, so that the induction of a specific emotional state is tightly connected with an arousal procedure. In order to identify and assess an emotion, patterns are used; they constitute different approaches to emotional induction, which will be discussed in the next subsection. Apart from the induction, classification is essential, and can be accomplished based on a coincidence of values on a strategic number of dimensions [10]. Based on this study, emotion assessment can be analyzed through three distinct dimensions. The two primary dimensions are valence and arousal, and the secondary one is dominance, which has a weaker relationship with the others [13][1]. In order to best analyze the assessment of the pictures, an affective space is generally used. This is a standardized method to graphically display the emotional assessment results of the pictures. According to the valence and arousal mean values, a bidimensional graph is plotted where the horizontal axis represents the arousal and the vertical axis the valence, both scaled from 1 to 9.
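As an illustration, the following Python fragment plots such an affective space with matplotlib; the three picture names and their (arousal, valence) coordinates are invented examples, not actual IAPS ratings.

```python
import matplotlib.pyplot as plt

# Hypothetical (arousal, valence) mean ratings on the 1-9 scales; real
# values would come from the IAPS technical manual.
pictures = {"puppy": (4.5, 8.0), "mushroom": (3.0, 5.2), "accident": (6.4, 1.8)}

fig, ax = plt.subplots()
for name, (arousal, valence) in pictures.items():
    ax.scatter(arousal, valence)
    ax.annotate(name, (arousal, valence))
ax.set(xlim=(1, 9), ylim=(1, 9), xlabel="Arousal", ylabel="Valence",
       title="Affective space")
plt.show()
```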
2.2 Generic Approach to Emotion Induction
There is no single emotional induction process that is perfectly suitable for all cases, but rather a group of different approaches to achieve the same objective. A prevalent method
to induce emotional processes consists of asking an actor to feel or express a particular mood. This strategy has been widely used for emotion assessment from facial expressions and, to some extent, from physiological signals [4]. However, even for expert actors, for whom the capacity to achieve a specific emotional state is obvious, it is hard to guarantee that the physiological responses are consistent and reproducible by other, non-actor people. An alternative approach to emotional induction is based on multimedia stimuli. Music, images, videos and video-games belong to a category of stimuli that has significant advantages compared with induction through actors, since no actors are needed and the quality of the induced emotions is greater, as they are more realistic. As the most suitable method for this study, the IAPS library was used. The IAPS - International Affective Picture System - library was developed to provide ratings of affect for a large set of emotionally-evocative, internationally-accessible, colour photographs that include contents across a wide range of semantic categories.

2.3 Equipment Solutions
Emotion assessment needs reliable and accurate communication with the subject, so that the results are conclusive and the emotions correctly classified. This communication can occur through several channels and is supported by specific equipment. BCI - Brain Computer Interface - research is directly connected to this area and uses two different approaches: invasive and non-invasive methods. The invasive methods are clearly more precise, but also more dangerous, and will not be considered for this study. On the other hand, non-invasive methods such as EEG, fMRI, GSR, the oximeter and others have shortened the distance between utopia and the reality of understanding human brain behaviour, combining the advantages of inexpensive equipment and non-medical environments. Due to the medical community's skepticism, EEG in clinical use is considered a gross correlate of brain activity [3]. In spite of this reality, recent medical research studies [2] have been trying to revert this scenario by suggesting that increased cortical dynamics, up to a certain level, are probably necessary for emotion functioning, and by relating EEG activity and heart rate during recall of emotional events. Similar efforts, but using invasive technology like Electrocorticography (ECoG), have enabled complex BCIs such as playing a videogame or operating a robot [9]. Some more recent studies have successfully used EEG information alone for emotion assessment [5]. These approaches have the great advantage of being based on non-invasive solutions, enabling their usage on the general population in a non-medical environment. Encouraged by these results, the current research direction seems to be the addition of other inexpensive, non-invasive hardware to the equation. Practical examples of this are the introduction of GSR and oximeters by Takahashi [15] and Chanel et al. [4]. In this study, three non-invasive devices are used in parallel, so that the reliability of all the procedures is guaranteed: a Neurobit Lite EEG device with one active electrode and two references, a Thoughtstream biofeedback Galvanic Skin Response system with two dry electrodes, and an oximeter with a finger sensor.
3 Project Description

This section gives an overview of the whole project development, from the definition of the experimental parameters to the data analysis techniques that were used.

3.1 Procedures and Methods
The procedures and methods involved in this study were grouped into two separate, yet completely integrated, parts. The first deals with the emotional induction approach used in this study and is described in the following subsection. The second covers the experimental conditions used during the experimental sessions and some of the sample choices. As described in the previous section, emotions can be induced through several different approaches. In order to guarantee control of the induced emotions and optimize the available devices, emotion induction through visual stimuli is the most suitable method for this study, since its quality is greater and more realistic than other kinds of approaches [4]. Good experimental control and an easy, yet efficient, method for results comparison are key factors that demand an effective set of visual stimuli. The IAPS library is thus the most indicated emotional induction method, as it has been widely used throughout the research community and all its pictures are classified according to valence, arousal and dominance [6][4][11]. The picture selection was based on the principle that detection, post-analysis and interpretation of the biosignals become more accessible when the pictures are stratified according to their valence value [8][15]. Due to the complexity of emotional response, as well as the resolution of the devices used, the valence spectrum was reduced to two discrete emotional states, happiness and sadness. Each of these subsets constitutes a group of twenty pictures that belong to the same emotional state, ordered as previously described; the corresponding affective space is represented in Figure 1.
Fig. 1. Affective space used in the experimental sessions
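A minimal sketch of this valence-based stratification is given below: it splits a rated picture set into a "happiness" and a "sadness" subset of twenty pictures each. The ratings dictionary is a synthetic stand-in for IAPS valence scores (1-9 scale), not real data.

```python
# Split a rated picture set into the two twenty-picture emotional subsets.
def build_subsets(ratings: dict, size: int = 20):
    by_valence = sorted(ratings.items(), key=lambda kv: kv[1])
    sadness = [pic for pic, _ in by_valence[:size]]      # lowest valence
    happiness = [pic for pic, _ in by_valence[-size:]]   # highest valence
    return happiness, sadness

ratings = {f"pic{i}": 1 + (i * 7.9) % 8 for i in range(120)}  # fake scores
happy, sad = build_subsets(ratings)
```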
The experimental conditions are an essential issue for the validation and acceptance of the results obtained. Special attention was given to this theme, and the procedures were carefully followed during the experimental sessions. In order to create a solid basis for all the choices made, several studies were used as references. A sample choice that fits this study is as crucial as setting the experimental conditions. For this
reason, the exclusion criteria were implemented as a questionnaire to be filled in by the subject before each experimental session in order to avoid possible barriers; it included questions about mental diseases and skull deformations, and reminded the subject not to drink alcohol for the twelve hours prior to the experimental session [14]. Age is directly related to changes observed in the brain waves and memory performance, as concluded in a recent study [14]. Young people have a lower value for the frequency of brain activity, and from thirty years old towards older ages that frequency starts to decrease once again. The optimal age range, for which the frequency values of brain activity are highest, is between eighteen and the thirties [7][8][4][12][15]. A total of twenty-eight subjects, seventeen males and eleven females, all right-handed and aged eighteen to thirty years old, took part in this study. According to the IAPS report [13] and a recent study [4] that takes advantage of the IAPS library, the number of pictures presented should be sixty, displayed to the subject in frames of six seconds each and beginning with a black screen with a centered white cross that calls for the subject's attention. The preferential EEG electrode localization is a controversial theme that has been widely discussed, owing to the fact that the activity of the human brain cannot be easily cartographied. The difficult task of finding a specific area of the skull where the brain activity is sufficiently high to detect oscillations according to the emotional state of the subject was partially overcome with the recent publication of three research activities [6][7][8]. These studies dealt exactly with this issue, with the advantage of a 64-channel EEG, gathering precise results in what concerns the localization of the brain activity directly responsible for the emotional states. Taking into account the two opposite emotional states, happiness and sadness, as well as the physical limitation of the EEG - one active electrode plus two reference electrodes - the most appropriate area of the skull to locate the active electrode was the middle line, between the Central and the Frontal areas. During experimental sessions, subjects were confronted with a sixty-picture set according to gender. The devices comprised an EEG with one active electrode and two reference electrodes, a GSR with two dry electrodes attached to the subject's hand with a strip, and an oximeter connected to the finger. Both light and room temperature were set to a comfortable level, and outside noise was minimized. The biosignal capture was executed with the support of an application for each of the devices, namely BioExplorer for the EEG and Office Medic for the oximeter. Due to the lack of software for the GSR, and motivated to create from scratch an application capable of displaying data in real-time as well as recording it in a normalized format, the GSR server was developed, which is also ready to be connected to a network and work remotely.

3.2 Global Architecture
The system's global architecture is an essential feature for best understanding the entire project. Integrated in a PhD project and readapted from the architecture of a previous prototype study in this field [16], this architecture has two main stages, as shown in Figure 2. The first is directly related to the capture and diffusion of the biosignals; the second is responsible for the pre-processing and analysis.
Fig. 2. System's Global Architecture
The primordial objective of the first stage is to prepare the data so it can be accessible and manageable to any receiver connected to one or several broadcast servers; each of the device drivers - there may be an arbitrary number and diversity of devices - is encapsulated in a particular software tool responsible for the signal diffusion. In what concerns the second stage, beyond the real-time monitoring, each of the software tools is able to create a .dat file containing all the biosignal information provided by the devices. Afterwards, the data is pre-processed according to the techniques described in the next subsection. The results of the pre-processing lead to a statistical analysis and to the final conclusions of this study.
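The paper does not publish the wire format used by its broadcast servers, so the following Python sketch only illustrates the first-stage idea under assumptions of our own: a line-based "timestamp;value" sample format and an arbitrary port number.

```python
import socket
import threading

clients = []  # receivers currently subscribed to this device's signal

def serve(port: int = 9100) -> None:
    """Accept receivers; each one simply connects and reads samples."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        clients.append(conn)

def broadcast(timestamp: float, value: float) -> None:
    """Push one biosignal sample to every connected receiver."""
    line = f"{timestamp};{value}\n".encode()
    for conn in list(clients):
        try:
            conn.sendall(line)
        except OSError:
            clients.remove(conn)  # drop receivers that disconnected

# The device driver would call broadcast() for each acquired sample.
threading.Thread(target=serve, daemon=True).start()
```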
3.3 Data Analysis
The first step in the temporal analysis of the data was to decimate it. Concerning the EEG data, it was sent in single epochs of two-point-five seconds each, encapsulated and pre-filtered by BioExplorer with five distinct sixth-order Butterworth IIR band-pass filters with frequencies 0.5-4 Hz for the Delta band, 4-8 Hz for the Theta band, 8-12 Hz for the Alpha band, 12-26 Hz for the Beta band and 26-45 Hz for the Gamma band. In order to best understand and interpret the received data, decimation was used and epochs with larger temporal amplitude were created, which results in a smoother variation and a better understanding of its behaviour with the emotional arousal. Apart from that, a three-step graph was created that included the mean variation of the brain wave amplitudes for each of the emotional stages: happiness, neutral and sadness. A similar approach was used to analyze the oximeter and GSR data, although no filters were applied. The oximeter encapsulates the data into 5-second epochs, while the GSR server was developed to obtain a higher resolution, each epoch having a 0.5-second length. One of the biggest concerns during the procedure was peaks, directly related to body movements or external noise during the experimental sessions. This data overflow is harmful to the analysis since it introduces variations that are not a
consequence of the emotional arousal. To overcome this problem, these peaks were removed by smoothing them in accordance with their immediately preceding and following values.
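The pre-processing just described can be reproduced with standard tools. The Python sketch below, using SciPy, applies the five band-pass filters mentioned above and smooths isolated peaks by replacing them with the mean of their neighbours; the sampling rate and the outlier threshold are assumptions, since the paper does not state them.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 256  # sampling rate in Hz (assumed; not stated in the paper)
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 26), "gamma": (26, 45)}

def band_filter(signal: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Sixth-order Butterworth band-pass, applied zero-phase."""
    sos = butter(6, [lo, hi], btype="bandpass", fs=FS, output="sos")
    return sosfiltfilt(sos, signal)

def smooth_peaks(signal: np.ndarray, z_thresh: float = 4.0) -> np.ndarray:
    """Replace outlier samples with the mean of their two neighbours."""
    out = signal.copy()
    z = np.abs(out - out.mean()) / out.std()
    for i in np.where(z > z_thresh)[0]:
        if 0 < i < len(out) - 1:
            out[i] = 0.5 * (out[i - 1] + out[i + 1])
    return out

eeg = np.random.randn(FS * 10)  # ten seconds of fake EEG for demonstration
bands = {name: band_filter(smooth_peaks(eeg), lo, hi)
         for name, (lo, hi) in BANDS.items()}
```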
4 Results

The experimental results of this study are presented following a logical structure.

4.1 Expected Behaviour
According to the hypothesis previously described, the expected behaviour of the brain activity and of the skin resistance under emotion arousal follows a specific temporal distribution and a pattern that is observed in the majority of the experimental sessions. Figure 3-A represents the evolution of the Beta and Gamma amplitudes over the entire session.
Fig. 3. Expected behaviour of the brain activity and skin resistance without decimation and with decimation
The chart on the left shows the EEG data of the experimental session without decimation for the high-frequency waves, namely Beta and Gamma. The chart on the right depicts the same session after data processing and decimation for the same waves. It is divided into three discrete steps, each one representing one of the emotional states considered for this study - happiness and sadness - or the neutral state. The observed differences in each step represent the emotional induction effect of the pictures on the subject. Due to the good behaviour of the brain wave data in this experimental session, and because it follows the initial hypothesis, it is considered the expected behaviour for all the other sessions. Considering the skin response, different emotional states cause changes in the subject's body that are reflected in the skin resistance value, as depicted in Figure 3-B. This variance indicates that the value of the resistance is lowered every time the emotional state of the subject is not neutral but induced. The slope of the transition between the three stages may differ from session to session, according to the effectiveness of the induction on the subject. Similarly to the EEG expected behaviour described previously, this session was considered to be the expected behaviour for all the remaining sessions. The heart rate is expected to change according to the emotional arousal, so that an increase in the heart rate corresponds to an induced emotional state.
From these expected behaviours it is possible to analyze the statistical distribution of the subjects' behavioural responses to the emotional arousal.

4.2 Discard Policy
Even though the discard policy is a last resort, it is frequently needed among experimental sessions, and so the definition of the discard criteria is essential. External noise, electrical interference from either the subject - static electricity - or the devices, spontaneous malfunction of a particular device and exaggerated body movements from the subject are situations that lead to a useless experimental session and undoubtedly need to be discarded. The first experimental session was discarded due to the incorrect set of pictures displayed to the subject, as mentioned in subsection 2.1. The results attained were inconclusive and the emotional state arousal too heterogeneous. Too many body movements from one subject caused an exaggerated number of spikes in the brain activity and led to the discard of that specific experimental session. Later on, a statement was included in the introductory text clarifying the need for the subject to remain motionless during the entire session. Electrical interference of unknown cause resulted in the discard of three distinct sessions, forasmuch as the behaviour of the brain waves and the skin resistance values was confusing and misleading.

4.3 Statistical Analysis
The statistical analysis module constitutes a very powerful tool for analyzing large amounts of data, and this capacity is explicit in Figure 4-A, which presents the variation of the mean amplitude across all valid experimental sessions for the high-frequency brain waves.
Fig. 4. Beta wave and Gamma wave comparison and GSR, oximeter data analysis
Both cases indicate that men's behaviour differs from women's, since for men the slope of the last transition, between the neutral and the sadness emotional states, is positive or close to zero. On the other hand, the behaviour of the women's brain waves indicates that this precise slope is negative, and the mean value is generally higher than for men. Considering the GSR data, the left graph of Figure 4-B presents the slope variation between the three stages - happiness, neutral and sadness - and analyzes its behaviour along the complete session. The heart rate analysis, represented in the right chart of Figure 4-B, indicates an unexpected and almost undetectable variation of its value.
5 Conclusions

One ought to assert that the study's initial main goals were fully accomplished. The described experiments managed to develop and test a solid framework for conducting controlled emotionally-evocative experiments, enabling flexible recording and monitoring of biosignals in real-time. It was possible to define dynamic, gender-designed emotional sessions, with fine-tuning capabilities, in order to trigger specific emotional states. The system's architecture proved able to support sessions with real-time biosignal monitoring and storage for physically and temporally independent processing. The achieved results demonstrated that the post-session data processing techniques were effective in identifying correlations between emotional states and biosignals. Regarding the initially enunciated hypotheses, the most important ones were confirmed and validated. With the study's results as solid ground, it is possible to confirm that basic emotional states have biological manifestations capable of being captured and recorded by the selected equipment, especially with the EEG and GSR techniques. Concerning subject variables, it is plausible to state that females react more strongly to the presented pictures, triggering the desired emotional states with more expression and effectiveness. These objective data were also corroborated by the interviews conducted at the end of each session, where females consistently stated that they felt happy and in a good mood in the first stage of the session and sad at the end. In the same interviews, a high percentage of male subjects affirmed that they did not feel a deep emotional commitment during the picture presentation. The fact that gender is a key factor in what concerns emotional state triggering through multimedia content constitutes the first major contribution of the present study. Exploiting the evidence that EEG signals were strongly influenced by the subject's emotional states, a more detailed data analysis was performed with special focus on high-frequency signals, namely beta and gamma waves. The data provided strongly suggests that, especially in female subjects, high-frequency relative EEG values are directly correlated to valence, independently of their initial standard level. In other words, beta and gamma waves strongly seem to vary directly with valence, enabling, indirectly and in conjunction with other inputs, emotional state detection. Despite the described positive outcome, some features were identified that, although they did not match the initial assumptions, have already been subject to turnaround strategy definition. The first is related to the inexistence of generically significant changes in heart rate values along the experimental sessions, and the second resides in the fact that the expected GSR readings curve - high conductivity in high-arousal situations - was not recorded as often as predicted. The authors believe that these issues are grounded in the fact that the designed emotionally-evocative sessions, based only on pictures, do not trigger emotions strong enough to significantly influence biosignals such as heart rate. It is believed that much stronger multimedia content, provided in a more immersive environment, would be necessary so that subjects could be more deeply involved.
5.1 Future Work
The main future work topics are related not only to this particular study, since it is a spin-off module of a major one, but also to the main global project. With this in mind, the following areas were identified:

– More Sophisticated Equipment Reinforcement: it is intended to acquire more sophisticated equipment, especially and specifically a multi-channel EEG and a more sensitive and reliable GSR;
– Equipment Diversity: it would be useful to integrate into the developed framework new equipment capable of reading and extracting more biosignals, namely pupil dilation, voice analysis and facial expression recognition;
– More Detailed Emotion Classification: using the depicted key factors in conjunction with others provided by the study's continuation, and with the data volume and diversity enhancement brought by new equipment acquisition, it would be plausible to perform automatic subject emotion classification at deeper detail levels;
– Software Control: the accomplishment of the previous items would enable both conscious and subconscious control of several tools and/or multimedia contents.

Considering the studied problem as a whole, several practical domain applications are not only feasible but also attractive. Most of the immediate technology adaptations shall reside in the entertainment industry, in both the audiovisual and videogame branches, through multimedia content adaptability to users' emotional states. Other possible application areas are user interface enhancement, direct advertising and medical applications, namely in phobia treatments and psychological evaluations.
References 1. Mehrabian, A., Russell, J.A.: An approach to environmental psychology. The MIT Press, Cambridge (1974) 2. Aftanas, L.: Nonlinear forecasting measurements of the human eeg during evoked emotions. Brain Topography 10, 155–162 (1997) 3. Ebersole, J.: Current Practice of Clinical Electroencephalography. Lippincott Williams & Wilkins (2002) 4. Chanel, G., Kronegg, J., Granjean, D.: Emotion assessment: Arousal evalutation using eegs and peripheral physiological signals. In Technical Report (2005) 5. Ishino, K., Hagiwara, M.: A feeling estimation system using a simple electroencephalograph. In: Proceedings of 2003 IEEE International Conference on Systems, Man, and Cybernetics, pp. 4204–4209 (2003) 6. Aftanas, L., et al.: Time-dependent cortical asymmetries induced by emotional arousal: Eeg analysis of event-related synchronization and desynchronization in individually defined frequency bands. International Journal of Psychophysiology 44, 67–82 (2001) 7. Aftanas, L., et al.: Analysis of evoked eeg synchronization and desynchronization in conditions of emotional activation in humans: Temporal and topographic characteristics. Neuroscience and Behavioral Physiolog 34 (2004) 8. Aftanas, L., et al.: Neurophysiological correlates of induced discrete emotions in humans. Neuroscience and Behavioral Physiology 36 (2006) 9. Leuthardt, E.: A braincomputer interface using electrocorticographic signals in humans. Journal of Neural Engineering, 63–71 (2004) 10. Logothetis, N.: The measurement of meaning. University of Illinois Press (1957)
11. Muller, M., Keil, A.: Processing of affective pictures modulates right-hemispheric gamma band eeg activity. Clinical Neurophysiology 110, 1913–1920 (1999) 12. Rusalova, M., Kostyunina, M., Kulikov, M.: Spatial distribution of coefficients of asymmetry of brain bioelectrical activity during the experiencing of negative emotions. Neuroscience and Behavioral Physiology 33 (2003) 13. Lang, P.J., Bradley, M., Cuthbert, B.: International affective picture system (iaps): Affective ratings of pictures and instruction manual. In Technical Report (2005) 14. Paul, R., et al.: Age-dependent change in executive function and gamma 40 hz phase synchrony. Journal of Integrative Neuroscience 4, 63–76 (2005) 15. Takahashi, K.: Remarks on emotion recognition from bio-potential signals. In: The second International Conference on Autonomous Robots and Agents (2004) 16. Vinhas, V., Gomes, A.: Mouse control through electromyography: Using biosignals towards new user interface paradigms. Biosignals 33 (2008)
Realtime Dynamic Multimedia Storyline Based on Online Audience Biometric Information

Vasco Vinhas1,2,3, Eugenio Oliveira1,2,3, and Luis Paulo Reis1,2,3

1 Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias s/n, Porto, Portugal
2 Departamento de Engenharia Informatica, Rua Dr. Roberto Frias s/n, Porto, Portugal
3 LIACC - Artificial Intelligence and Computer Science Laboratory, Rua Campo Alegre 823, Porto, Portugal
{vasco.vinhas,eco,lpreis}@fe.up.pt
Abstract. Complete audience immersion in the action is still the ultimate goal of the multimedia industry. In spite of the significant technical audiovisual advances that enable more realistic contents, coping with individual audience needs and desires is still an incomplete achievement. The proposed project intends to contribute to solving this issue by enabling realtime dynamic multimedia storylines with emotional subconscious audience interaction. Individual emotional state assessment is accomplished by direct access to online biometric information. Recent technological breakthroughs have enabled the usage of minimally invasive biometric hardware devices that no longer interfere with the audience's immersion feeling. Another key module of the project is the conceptualization of a dynamic storyline multimedia content system with emotional metadata, responsible for enabling discrete or continuous storyline route options. The unifying component is the definition of the full-duplex communication protocol. The current stage of research has already produced a spin-off product capable of providing computer mouse control through electromyography, and has identified key factors in human emotions through experiments conducted on the developed system's architecture that have enabled semi-automatic emotion assessment.
1 Introduction

Although multimedia contents are becoming constantly more complex and closer to reality, enabling a greater sensation of immersion in the action, the primitive, absolute need of achieving a perfect match between audiovisual contents and audience desires is still present and constitutes the main key to the industry's success. The immediate solution of giving the audience the direct power to explicitly interact with the multimedia system is in many cases impossible, and in the others completely or partially shreds the desired immersion. Simultaneously, there have been recent serious efforts in the biometry field that have led to the economic democratization of hardware, enabling several R&D projects in the area. A substantial number of these projects focus their attention on affect and emotional state detection. The alliance between the possibility of multimedia content choice, which enables audience members to individually watch what they desire, and accurate emotional state detection systems leads to subconscious individual interaction between the audience and the multimedia control system, potentiating the perfect match between content and individual audience desires.
The proposed project intends to use minimally invasive devices capable of extracting biometric readings from individuals, in order to detect and classify human emotional states, with the primary objective of enabling subconscious interaction with multimedia systems and thereby increasing the match between audiovisual contents and audience desires. As described and justified in the next subsection, several other objectives are also identified and presented. This project definition constitutes a rich primary field of application and, by its nature, opens several secondary research opportunities. With this in mind, one may decompose the project's objectives into two distinct categories:

– Direct Objective Goals: these are the solutions to the main issues for which the project was designed. These objectives are to be accomplished by the development of the presented project;
– Potential Collateral Effects: this category comprises all identified potential secondary applications. These optional extra goals are reachable with no or minor modifications to the original project.

The main project goal consists of developing a fully functional real-time dynamic storyline multimedia system in which user interaction – specifically the storyline flow decision – takes place at a subconscious level through emotional assessment. This major objective breaks down into smaller manageable units, each closely related to a particular project module. The first minor objective, in terms of dimension, is the realization of the emotion assessment module. This system unit must be able to perform real-time biometric data analysis and thereby classify the current user emotion. The realization of this key module requires accessing, processing and storing data from a great diversity of biometric equipment – in order to give the module high levels of flexibility – as well as the adoption of a standard universal data format. Also within this module, the definition and implementation of a complete yet efficient database design, together with the respective load/extraction abstraction layer, is of great importance. Another key project module concerns the conceptualization and implementation of dynamic storyline multimedia contents. This system unit is responsible for enabling discrete or continuous storyline route options – in a fully computer-generated environment the options are virtually endless, whereas storyline routes are limited in real-life footage. Another objective of this module is the implementation of an emotion-related multimedia metadata layer in order to choose the most appropriate contents more effectively. The final key goal is the definition of an efficient, reliable, standard, flexible and universal communication protocol between the two main modules referred to previously. Considering the second objective category, one must refer to the project's identified secondary applications. One of these is the possible usage of the emotion assessment subsystem in a broad variety of medical and psychiatric procedures. An accurate emotion classifier would enable more rigorous psychiatric diagnosis and would be an efficient tool for monitoring patient evolution. If the virtual environment engine is coupled to the emotion assessment subsystem, the whole software can also be used as a treatment option. A more generalized medical application might be continuous emotional monitoring of patients.
This approach would enable the identification of critical environmental emotion variables, which would lead to their efficient manipulation in order to promote positive emotional responses from the patients. Another secondary goal is the
adaptation of the emotion assessment subsystem to extremely minimally invasive hardware solutions in order to enable its usage in the direct publicity field. The main objective would be to adapt publicity content and presentation to the individual target's emotional state. A global application of this approach is rather complex due to the variety of places where people are confronted with publicity contents. For instance, in a theater it would be possible to apply the original solution with no or minimal changes, but for an outdoor electronic billboard individual emotional assessment would be much more difficult and would have to be done by video image analysis alone. A third alternative application is the system's adaptation to the videogame and virtual environment industry. This adaptation has a lower complexity level than the previous one, due to the controlled environment and intrinsic culture of the system's users. Videogame users usually accept the introduction of technological hardware add-ons that enhance gameplay. The key concept of this approach is the reuse of the emotion assessment subsystem – with a few minor adaptations so as to accommodate distinct hardware inputs – to feed the game engine with the detected user's emotions, so that the game action can evolve accordingly, allowing a greater immersion sensation. Optionally, the virtual environment module could be used to autonomously generate virtual scenarios, once again in harmony with the user's emotional state. In conclusion, the project has two subsets of objectives that completely specify its domain and, simultaneously, allow it to be generalized into diverse secondary fields of application. The current document is organised as follows: in the next section the state of the art is described, the project's description and architecture are depicted in section 3, the project's results are detailed further on, and the final section is reserved for drawing conclusions and presenting the expected outcome.
2 State of the Art

Given the project's nature, this section is structured around two broad components, namely emotional state classification based on biological data and dynamic interactive multimedia systems. Considering the first theme, there have been efforts to correlate biological signals with emotional states since the beginning of the last century [1]. The most traditional approaches are based on the standard polygraph, where physiological variables such as blood pressure, pulse, respiration and skin conductivity are recorded in order to detect different levels of anxiety. Although the polygraph's lie detection accuracy is arguable, the fact that it is an efficient tool for detecting basic emotional states – especially individual-related anxiety levels – is not. The human brain has always exerted an almost hypnotic attraction on several research fields; in 1912, the Russian physiologist Vladimir Vladimirovich Pravdich-Neminsky published the first EEG [2] and the evoked potential of a mammal. This discovery was only possible due to the earlier studies of Richard Caton, who thirty years before had presented his findings on the electrical phenomena of the exposed cerebral hemispheres of rabbits and monkeys. In the 1950s, the English physician William Grey Walter developed an adjunct to EEG called EEG topography, which allowed the mapping of electrical activity across the surface of the brain. This enjoyed a brief
period of popularity in the 1980s and seemed especially promising for psychiatry. It was never accepted by neurologists, however, and remains primarily a research tool. Owing to the medical community's skepticism, EEG in clinical use is considered a gross correlate of brain activity [3]. In spite of this reality, recent medical research studies [4][5] have been trying to revert this scenario by suggesting that increased cortical dynamics, up to a certain level, are probably necessary for emotional functioning, and by relating EEG activity and heart rate during the recall of emotional events. Similar efforts, but using invasive technology such as ECoG, have enabled complex BCIs such as playing a videogame or operating a robot. Some more recent studies have successfully used EEG information alone for emotion assessment [6]. These approaches have the great advantage of being based on non-invasive solutions, enabling their usage with the general population in a non-medical environment. Encouraged by these results, the current research direction seems to be the addition of other inexpensive, non-invasive hardware to the equation. Practical examples of this are the introduction of GSR sensors and oximeters by Takahashi [7] and Chanel et al. [8]. The sensorial fusion enabled by the conjugation of different equipment has made it possible to achieve 40% accuracy in detecting six distinct emotional states, and levels of about 90% in distinguishing positive from negative feelings. These results indicate that using multi-modal bio-potential signals is feasible for emotion recognition [7]. There have also been serious commercial initiatives regarding automatic minimally invasive emotion assessment. One of the most promising is being developed by NeuroSky, a startup company headquartered in Silicon Valley, which has already raised five million dollars from diverse business angels to perform research activities [9]. There are two cornerstone modules, still in the prototyping phase yet already on the market. The first is the ThinkGear module with the Dry-Active sensor, which is basically the product's hardware component. Its main particularity resides in the usage of dry active sensors that do not require contact gels. Despite the intrinsic value of this module, the most innovative distinguishing factor is the eSense Algorithm Library, a powerful signal processing unit that runs proprietary interpretation software to translate biosignals into useful logic commands. As previously mentioned, this is still a cutting-edge technology in a development stage; nevertheless it has proven its fundamental worth through participation in several game conferences [10]. More recently, a different approach was presented by Hitachi [11], likewise with promising results. Hitachi has tested a brain-machine interface that allows users to manipulate switches with their mind. The system is based on optical topography – near-infrared light is used to measure changes in blood hemoglobin concentrations in the brain – to recognize the changes in brain blood flow associated with mental activity and translate those changes into voltage signals for controlling external devices. This technology has been used for two years to enable severely disabled people to communicate. As an intermediate project subject, one must refer to the definition of a biological data format.
This topic is particularly important to this project due to the absolute necessity of accessing, recording and processing – eventually in a distributed system – data whose origin may vary across multiple hardware solutions. The European Data Format – EDF – is a simple digital format supporting the technical aspects of the exchange and storage of
polygraphic signals. This format dates from 1992 and is nowadays a widely accepted standard for the exchange of electroencephalogram and polysomnogram data between different equipment and laboratories [12]. This data format's implementation is simple and independent of hardware or software environments, and it has the peculiarity of enabling both XML and raw text definitions. This duality is especially important if there are computing power limitations and/or interoperability is a project requirement. Despite the unquestionable positive points of EDF, it hardly accommodates other investigation topics. In order to overcome this critical hurdle, EDF+ was presented in 2003 as a more flexible but still simple format, compatible with EDF, which can store not only annotations but also electromyography, evoked potentials, electroneurography, electrocardiography and many more types of investigations. Its authors believe that EDF+ offers a format for a wide range of neurophysiological investigations which can become a standard within a few years [10]. Considering the second project subject – dynamic interactive multimedia systems – one must, regarding the whole project's expected outcome, refer to the necessity of an emotional multimedia metadata layer. This is also a global concern, as it is a very active W3C research topic. Recent studies point in distinct directions, but almost all use, directly or indirectly, the MPEG-7 standard. Most of these projects intend to use MPEG-7, with or without extensions, to embody the semantic representation of multimedia contents, especially audiovisual ones [13][14][15]. Another research lead points to the replication of MPEG-7 into ontologies in order to improve application interoperability [16], and there are also projects that bet on developing new ontologies for this purpose [17]. Some efforts – still fragile – to automatically register and classify audiovisual content according to its emotional correlation have also been identified [18]. Summarizing this topic, one must say that in spite of the diversity of research leads, the strongest ones lie in using MPEG-7 to register multimedia content semantics and, from there, emotional state relations. In the domain of purely interactive multimedia systems, one must refer to the growing immersion sensation provided to the audience by several factors in diverse fields. As an example of this statement, one may consider the success of new-generation videogame consoles, which have boosted audiovisual quality and brought new interaction paradigms. In spite of these advances, the mainstream entertainment industry has not yet changed the linearity of storylines, but some promising research projects are trying to alter this reality. In this domain, one must refer to Glorianna Davenport's MIT Interactive Cinema Group [19], which has been focusing its efforts on the formal structures, construction methods and social impact of highly distributed motion video stories. Another recent interesting project is the apartment drama, a 15-minute interactive story called Façade [20], where two virtual characters powered by artificial intelligence techniques can change their emotional state in fairly complicated ways in response to the conversational English being typed in by the human player. MAgentA [21] is another interesting project in the interactive multimedia systems domain.
It is, however, a rather domain-centered project, as it consists of an agent that automatically composes background music in real time using the emotional state of the environment within which it is embedded. The authors proclaim that the goal is to generate film-like music output for a virtual environment, emphasizing the dramatic aspects of such an
environment. In spite of the promising results, the agent architecture is rather simplistic, as it merely chooses the music composition among several stored in a given database. In summary, the results of the current cutting-edge projects are extremely promising with respect to the project's domain and expected outcome.
3 Project’s Description In this section, the project methodology, techniques and architecture are presented and justified. Whenever it was felt necessary, distinct project modules were decomposed into smaller units and their purposes explained. Due to the essence of the described project problem and the author’s scientific backgrounds, considering the investigation motivation, it is clearly problem-oriented, were the domain hurdles are firstly presented and suitable techniques for solving them are used and conjugated. This project is also characterized by a strong prescriptive component by not being contained by simple phenomenon description and prediction but by suggesting and applying formats, protocols and processes to specific conditions. Summarizing, the described project is robustly anchored in an engineering investigation approach. As previously detailed, the whole project has three major steps. While the last two are eminently technical, the first one is extremely multifaceted, involving software and hardware questions but especially human related factors, social acceptance issues and medical knowledge, tools and procedures. For these reasons, the author chose two distinct investigation techniques, in order to cope with the project dynamics. For the most technical modules, engineering based investigation techniques will be used, while in the first step it is predictable a sensible blend between this and a positivist approach. These decisions are to better understood along the next paragraphs where the project work plan is described and its stages and modules presented. • Project Prototype: As detailed in subsection 4, a domain limited project prototype was developed with two MSc students. The prototype’s main goal is to be an effective proof of concept in what concerns to efficient and reliable online biologic signal processing; • Module A: It constitutes the project prototype’s natural extension. Its main objective is to achieve real-time effective human emotional state assessment based on biologic signals; • Module B: The second project module consists in developing both continuous and discrete content interactive environments. A collateral effect is the definition of a flexible emotional metadata layer in order to anticipate module integration; • Module C: The final module reflects the concerns related to full-scale practical system integration and usage. Issues like communication protocol, physical network infrastructure and content delivery to individual spectator are treated; As visible in figure 1, the system’s global architecture contemplates the implementation of the enunciated modules in a closed information loop. As multimedia content is presented to the user, a flexible set of biometric data is collected in real-time – in the current stage of research, the prototype contemplates the integration of information
Fig. 1. System’s Global Architecture
produced by an EEG, an oximeter and a GSR sensor. This data is collected by one or several computers equipped with specifically developed drivers that enable remote communication with geographically dispersed machines. In the present configuration, several client applications, one per equipment, are hosted on a central server that is responsible for accessing the described data tools. Once this access is performed, the collected data is pre-processed and used for several purposes. One of these is real-time monitoring of the biological signals; after deeper processing and analysis, the collected data is stored for posterior access and used as input for the emotion assessment unit. According to the emotion classification, the next multimedia content, picked from the emotion-layered multimedia catalogue, is presented to the initial user, hence completing the system's loop.
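To make the closed loop concrete, the following minimal sketch shows how such an acquisition-classification-selection cycle might look. It is only an illustration of the architecture described above, not the project's code: every name used (read_window, assess, best_match, and so on) is a hypothetical placeholder.

```python
import time

def acquisition_loop(equipment_clients, classifier, catalogue, player, period=1.0):
    """One pass of the loop: collect biometric samples, classify, pick content."""
    while player.is_active():
        # 1. Collect a window of samples from every equipment client
        #    (EEG, oximeter, GSR), each exposed by its own application.
        window = {name: client.read_window(period)
                  for name, client in equipment_clients.items()}
        # 2. Pre-process and store for monitoring and posterior access.
        features = classifier.preprocess(window)
        classifier.store(features)
        # 3. Assess the current emotional state from the fused features.
        emotion = classifier.assess(features)
        # 4. Pick the next clip from the emotion-layered catalogue,
        #    closing the loop between audience state and storyline.
        player.queue(catalogue.best_match(emotion))
        time.sleep(period)
```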
4 Results

In order to anticipate the project's full implementation, a subset of the original project was designed as a prototype and proposed as a thesis for the Masters in Electronic and Computer Engineering, and two students were selected. Both are under the supervision of the authors and have started their research. The project, intended to finish in the next couple of months, is designed to be segmented into the following stages:

• Biometric equipment procurement;
• Adoption, adaptation or complete definition of a universal intermediate biometric data format;
• Development, adaptation or reuse of the selected equipment's drivers;
• Definition and development of a database capable of acquiring, cleaning and processing data from the equipment in real time;
• Design and implementation of a very simple free-theme multimedia application that has at least one functionality partially or totally controlled through the reading and analysis of one or more of the user's biometric variables;
• Integration, testing and vertical component validation, with the performance of full-scale experiments.
This prototype is a small subset of the global project; it embodies some redundant tasks and foresees some others that cannot be reused, but one must keep in mind that this prototype is simultaneously and clearly an effort to anticipate full-scale issues and to provide a solid test bed for hypotheses and approaches. The project prototype is already at a fairly advanced stage, since all predefined tasks have been successfully completed. For instance, the biometric hardware procurement was completed and three distinct pieces of equipment were purchased: Neurobit Lite, OxiCard and the ThoughtStream GSR. While the first enables the measurement of brain electrical activity (EEG), the second is a truly portable pulse oximeter capable of measuring both pulse rate and blood oxygen saturation. Also during the prototype execution, a universal biometric data format was defined. This data format was translated both into an XML Schema and into the corresponding relational database schema. Although this topic is still a work in progress, it constituted a major breakthrough by enabling flexible data collection and efficient storage. Despite the importance of the referred items, one must say that the most important step for the prototype's success resided in the definition of a completely flexible module communication protocol. By adapting the equipment's distributed C++ drivers into contained, standardized TCP/IP servers, not only was hardware integration made possible, but it was done with extremely high levels of flexibility, and all third-party code was encapsulated and contained at a specific physical and logical location. This architecture enables the easy addition of future equipment, complete programming language independence, and efficient distribution of the measured data. One major achievement of the project is the implementation of the application that has at least one functionality partially or totally controlled through the reading and analysis of one or more of the user's biometric variables. At present, this proof-of-concept application is already capable of triggering both the right mouse click and the drag action by detecting the blink of a user's eye. The hit ratio of blink detection is near 90% for a trained user and about 60% for others. For the latter, the hit ratio declines due to false positives derived from rapid head movements or other muscular contractions [22]. This technology is already protected as intellectual and industrial property and awaits full patent recognition. In parallel, there have been some promising contacts with an international company regarding the commercialization of this product for disabled people. Another successful practical outcome resides in the results of a study that used the project's architecture as a test bed to perform discovery tests of the key factors in emotion assessment [23]. The conducted sessions not only demonstrated that gender and high-frequency EEG signals are key factors in this area but also, collaterally, demonstrated the reliability and efficiency of the global system and its architecture. Summarizing this section, one must say that, at the submission date, a fully functional project prototype has been designed and implemented; one spin-off technology has been developed, recognized and protected, and in the near future it shall be commercialized; and, through the usage of the defined prototype, some of the project's theses have been tested and demonstrated.
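The driver-encapsulation idea described above can be illustrated with a small sketch. This is an assumption-laden illustration, not the project's implementation: the driver object and its read_sample method are hypothetical, and only the pattern – a third-party driver hidden behind a standardized TCP server streaming measurements – follows the text.

```python
import socketserver

class EquipmentHandler(socketserver.StreamRequestHandler):
    def handle(self):
        driver = self.server.driver           # encapsulated third-party driver
        while True:
            sample = driver.read_sample()     # blocking read from the hardware
            if sample is None:                # device closed or disconnected
                break
            # one timestamped, line-delimited record per measurement
            record = f"{sample.t} {sample.channel} {sample.value}\n"
            self.wfile.write(record.encode("ascii"))

def serve(driver, host="0.0.0.0", port=9100):
    server = socketserver.TCPServer((host, port), EquipmentHandler)
    server.driver = driver                    # make the driver visible to handlers
    server.serve_forever()
```

Because the measurements travel as plain text over TCP, any client language on any remote machine can consume them, which is what makes the architecture language-independent.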
5 Conclusions

Keeping in mind the ideas depicted in section 3 and the results presented in the previous section, one may state that, despite the project's novelty, it is already possible to see clear, thrilling practical results. Due to the project's broad spectrum, the current existence of a spin-off technology is not a complete surprise, and it suggests that the odds of more products appearing as research side-effects are more than encouraging. Another extremely positive feature is the complete definition and implementation of the system's architecture under the cloak of the project's prototype. This has proved reliable not only for the project's intentions but also as a test bed for related experiments. Summarizing, the project's global expected outcome is to enable an individual spectator/user to influence the multimedia content storyline – whether continuous or discrete – at a subconscious level, based on real-time emotional assessment through online biological signal processing. Multiple application domains have been identified, including some interesting collateral applications of the presented original concept that also constitute, per se, an expected outcome. In this category, one may consider system adaptations to accommodate psychiatric diagnosis and treatment procedures, either by simple emotional state assessment or by complementing this feature with adequate audiovisual contents. Direct marketing with emotional publicity content is another possible field for generalizing the project domain. The videogame-related entertainment industry is also a potential target, with the introduction of emotional state information as an extra variable for gameplay enhancement. Another expected adaptation, related to the one first presented, consists of studying different human activities from the emotional point of view, which is believed to be an important contribution to diverse social sciences. Although the project's expected outcome is perfectly defined, multiple fields of application that extend the project's domain have been identified. This apparent paradox enables its classification as a realistic broadband project with wide horizons.
References 1. Marston, W.M.: Systolic Blood Pressure Changes in Deception. Journal of Experimental Psychology 2, 117–163 (1917) 2. Pravdich-Neminsky, V.V.: Ein Versuch der Registrierung der elektrischen Gehirnerscheinungen. Zbl Physiol. 27, 951–960 (1913) 3. Ebersole, J.S.: Current Practice of Clinical Electroencephalography. Lippincott Williams & Wilkins (2002) ISBN 0-7817-1694-2 4. De Pascalis, V., Ray, W.J., Tranquillo, I., D’Amico, D.: EEG activity and heart rate during recall of emotional events in hypnosis: relationships with hypnotizability and suggestibility. International Journal of Psychophysiology 29(3), 255–275 (1998) 5. Aftanas, L.I., Lotova, N.V., Koshkarov, V.I., Popov, S.A., Makhnev, V.P.: Nonlinear Forecasting Measurements of the Human EEG During Evoked Emotions. Brain Topography 10(2), 155–162 (1997)
6. Ishino, K., Hagiwara, M.: A Feeling Estimation System Using a Simple Electroencephalograph. In: Proceedings of 2003 IEEE International Conference on Systems, Man, and Cybernetics, pp. 4204–4209 (2003) 7. Takahashi, K.: Remarks on Emotion Recognition from Bio-Potential Signals. In: The Second International Conference on Autonomous Robots and Agents (December 2004) 8. Chanel, G., Kronegg, J., Grandjean, D., Pun, T.: Emotion Assessment: Arousal Evaluation Using EEGs and Peripheral Physiological Signals. Technical Report (December 2005) 9. Konrad, R.: Associated Press: Next-Generation Toys Read Brain Waves, http://www.usatoday.com/tech/products/games/2007-04-29-mindreadingtoys_N.htm (Consulted June 2007) 10. Various Authors: NeuroSky, Do You "Mind"? Fact Sheet. Technical Report (January 2007) 11. Johnson, M.: Branchez-Vous.com: Controler un train electrique avec son cerveau, http://techno.branchez-vous.com/actualite/2007/06/controleruntrainelectrique.html (Consulted October 2007) 12. Kemp, B., Värri, A., Rosa, A.C., Nielsen, K.D., Gade, J.: A Simple Format For Exchange of Digitized Polygraphic Recordings. Electroencephalography and Clinical Neurophysiology 82, 391–393 (1992) 13. Bailer, W., Schallauer, P.: The Detailed Audiovisual Profile: Enabling Interoperability between MPEG-7 Based Systems. In: Proceedings of the 12th International Multi-Media Modeling Conference, Beijing, CN (2006) 14. Troncy, R.: Integrating Structure and Semantics into Audio-visual Documents. In: Fensel, D., Sycara, K.P., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 566–581. Springer, Heidelberg (2003) 15. Bloehdorn, S., Petridis, K., Saathoff, S.C.N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, I., Staab, S., Strintzis, M.G.: Semantic Annotation of Images and Videos for Multimedia Analysis. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532. Springer, Heidelberg (2005) 16. Hunter, J.: Adding Multimedia to the Semantic Web – Building an MPEG-7 Ontology. In: Proceedings of the 1st International Semantic Web Working Symposium (SWWS 2001), Stanford, USA (July 30 – August 2001) 17. Isaac, A., Troncy, R.: Designing and Using an Audio-Visual Description Core Ontology. In: Motta, E., Shadbolt, N.R., Stutt, A., Gibbins, N. (eds.) EKAW 2004. LNCS (LNAI), vol. 3257. Springer, Heidelberg (2004) 18. Money, A.G., Agius, H.: Automating the Extraction of Emotion-Related Multimedia Semantics. In: IEEE International Workshop on Human-Computer Interaction, China (2005) 19. MIT Interactive Cinema Group, http://ic.media.mit.edu/ (Consulted June 2007) 20. Façade, a one-act interactive drama, http://www.interactivestory.net/ (Consulted June 2007) 21. Casella, P., Paiva, A.: MAgentA: An Architecture for Real Time Automatic Composition of Background Music. In: Proceedings of the Third International Workshop on Intelligent Virtual Agents, pp. 224–232 (2001) 22. Vinhas, V., Gomes, A.: Mouse Control through Electromyography: Using Biosignals towards New User Interface Paradigms. In: Biosignals, vol. 33 (2008) 23. Teixeira, J., Vinhas, V., Oliveira, E., Reis, L.P.: General-Purpose Emotion Assessment Testbed Based on Biometric Information. In: Review Process in KES Intelligent Interactive Multimedia Systems and Services (2008)
Assessing Separation of Duty Policies through the Interpretation of Sampled Video Sequences: A Pair Programming Case Study

Marco Anisetti, Valerio Bellandi, Ernesto Damiani, and Gabriele Gianini

Dept. of Information Technology – University of Milan, via Bramante 65, I-26013 Crema (CR)
{anisetti,bellandi,damiani,gianini}@dti.unimi.it

Abstract. In this paper we present a non-invasive technique that can be used to interpret single-camera, undersampled videos in order to extract the role alternation patterns of Pair Programming (PP) developers. The method uses a scene interpretation procedure exploiting 3D face models to account for movement-related illumination changes, facial expression changes and occlusions; in order to extract the PP-relevant information, the scenes are sampled and their interpretations are then connected to one another, also on the basis of domain knowledge. The overall video interpretation performed in this way is robust to high frame-misclassification rates. Since the actors' identities are not relevant by themselves – only the alternation times are important – the method can be used on lower-quality videos, e.g. where quality has been purposely degraded to protect privacy.
1 Introduction

Pair Programming (PP) is one of the key practices of several agile software development methodologies, including eXtreme Programming: it is a collaborative development method in which two people work simultaneously on the same programming task, alternating in the use of the IDE, so that while one is editing a software artefact, the other is committed to assuring quality by trying to understand, asking questions, suggesting alternative approaches and helping to avoid defects [1, 2]. In the standard version of this practice, the two developers work on the same machine: while one developer plays the role of actuator, the other plays the role of supervisor. From time to time the two developers switch their roles according to some pre-specified policy. PP is practiced in different variants, differing mainly in the switching time policy – specifying the amount of time each developer is supposed to spend in each role – and in the number of developers involved, participating in different pairs. PP has been claimed to yield, as part of the extreme programming process, higher-quality software products in less time. The claim is supported by anecdotal evidence and by empirical studies [3, 4, 5, 6]. However, a more systematic study of the practice would be desirable: one based on real development settings, linking the degree of adherence to the practice to the quality level of the software. To this purpose one should collect statistically relevant data sets in
real development environments. Another important reason why one could set up some form of data collection about the role exchange of developers is to enforce the PP practice. Switching regularly at relatively short time intervals is not a natural action for an untrained couple of developers: data reporting the switching times could be used to assess the degree of conformance to some specific PP policy, for instance in the context of outsourcing, when an agreement over the process methodology to be applied has been made. One of the main problems in this respect, however, is that the collection of data about PP practice is usually based on rather invasive procedures: all the studies carried out so far would require either a person playing the role of experiment controller and taking note of the switching times, or the developers taking note of their own switching times, either manually or by the equivalent of alternating log-ons into some purposely designed and developed log-on system. In the present work we propose a methodology that, given a possibly blurred video sequence, performs a segmentation procedure which – exploiting previously and automatically acquired knowledge of the individual programmers' characteristics – assigns each segment of the sequence to one of the two programmers with some degree of confidence. The methodology is based on two key elements. The first one consists of a method for associating a given image to a given possible situation of interest, i.e. a state (e.g. Alan playing the actuator and Bob playing the supervisor), with some probability. The second element consists of modeling the role alternation within a developers' pair as a state transition of a Markov Chain. Notice that the two levels are nested. Furthermore, the high-level states (representing the identity of the programmer actually acting) cannot be seen directly (with certainty) in the frames: therefore the higher-level chain is hidden. The overall model can therefore be categorized, as we will see briefly, as a Hierarchical Hidden Markov Model. In the next sections we will explain the video interpretation method adopted (Section 2) and detail the procedure for assessing the degree of conformance with each policy (Section 3). A discussion (Section 4) and an outline of possible developments will close the paper.
2 Video-Frame Interpretation

The abilities to detect, track and recognize facial motion are useful in applications such as face-based people identification, expression analysis, and several surveillance and safety applications. In this work we are interested in the different states portrayed by the camera, where the states mainly differ in the number of actors portrayed, in the identity of the actors and in their distance from the camera. To this aim we perform video interpretation by classifying each video-frame as portraying the different states with different probabilities, as briefly described hereafter, and subsequently connect the information related to each individual frame – to obtain an interpretation of the video sequence – as explained in the next section. Our face detection system consists essentially of three interdependent components integrated by means of the approach proposed in
Fig. 1. A schematic view of the frame classification process
[7, 8]: i) the face detector (the Full Controllable Detector System, FCDS), ii) the tracking component, charged with maintaining control over the face positions, and iii) the facial identification component. One of the main issues in face detection and tracking is the high variability arising from changes in pose and facial expression deformations, as well as from illumination modifications and changes in local illumination conditions due to muscular motions. The system we developed consists of a completely tunable hybrid method for accurate face localization based on a quick-and-dirty preliminary detection followed by a 3D refinement; for expression recognition we developed a 3D facial model able to morph in accordance with face shape and expression. This model is suitable for continuous facial tracking even in complex situations such as illumination changes and occlusion, and can work on the data of a monocular camera generating a streaming video.
3 Frame Sequence Interpretation

After classifying the individual frames as relating to one or the other state with some probability, one has the problem of associating sub-sequences of frames to one or the other programmer of a PP pair. This problem can be recast, in a Markov Model set-up, as the problem of segmenting a low-level observable sequence of events into sub-sequences and of attributing each sub-sequence to one of the states (i.e. programmers' configurations). Sequence segmentation based on Hidden Markov Models is a consolidated practice in audio segmentation [9], video segmentation [10] and DNA sequence segmentation [11, 12], where Viterbi's algorithm is usually used. However, for the purpose of PP practice assessment we are not interested in the full reconstruction of a state sequence based on the observable symbol sequence. We rather aim at a more modest goal, related to the general switching characteristics of the high-level state sequence; the specific goal depends on the policy we are trying to check for. Without prior assumptions it could be hard to perform a reconstruction of the underlying state (configuration) sequence from the observed sequence; however,
thanks to the use of domain knowledge, we can assume the presence of only a limited number of switches between programmers and a relative persistence of each state, of the order of a few minutes, i.e. for long frame sequences. This allows us to adopt a reconstruction method based on sampling. We sample from the main frame sequence a number of short subsequences – we will refer to them as spots – separated from one another by a few minutes, in order to determine how many actors are present and who is in the foreground/background at that time: if the configuration of a spot and the next is the same, we assume it has not changed during the inter-time; if it is not the same, we sample some extra spots so as to achieve the desired resolution in locating the timing of the configuration change. The spots should be just long enough to allow a confident classification of the observed configuration, or a reasonably confident identification of the presence of a transition between two configurations: thanks to the limited time-span of the spot, there cannot be more than one transition within it, which greatly simplifies the interpretation of the observed sequence. Now, after introducing the notation used in the remainder of the paper, we will give the details of the procedure – based on the application of Bayesian methods – for the interpretation of the individual spots, and of the procedure for the subsequent reconstruction of the overall sequence to some desired level of accuracy. We will indicate the configuration at frame $k$, which is a state of a Markov Machine, by $s_k$, and the corresponding observed symbol by $o_k$; the high-level sequence of states captured by a spot will be indicated by $S = (s_1, \cdots, s_k, \cdots, s_n)$, whereas $O = (o_1, \cdots, o_k, \cdots, o_n)$ will be the low-level, observed, sequence of symbols. Each observed symbol $o_k$ has some probability of corresponding to one of the possible configurations. For the sake of simplicity we will mention only two of the possible configurations, and indicate them by A and B respectively (they could correspond, for instance, to Alan being in the foreground while Bob is in the background, and to Bob being in the foreground while Alan is in the background). With these conventions, $P(A|o_k)$ is the probability that the configuration observed is A and $P(B|o_k)$ the probability that the configuration observed is B, whereas $P(o_k|A)$ is the likelihood that state A produces the symbol $o_k$ and $P(o_k|B)$ is the likelihood that state B produces the symbol $o_k$. We assume that A is more likely to produce symbols of a given class $\mathcal{A} \equiv \{a_1, \cdots, a_l, \cdots, a_m\}$, whereas B is more likely to produce symbols of another class $\mathcal{B} \equiv \{b_1, \cdots, b_l, \cdots, b_m\}$. For the sake of clarity, we adopt the temporary assumption that – when present – a state transition takes place instantaneously between one symbol and the next, which can be reasonable if the frames are conveniently paced.
3.1 Spot Sequence Interpretation
The procedure for decoding individual spot sequences – later to be used to reconstruct the overall sequence – can be carried out in a natural way through a Bayesian approach. We need to compute the probability that, given the observed sequence, the underlying sequence corresponds to one of the following four mutually exclusive descriptions:
all A) it consists of a sequence of A states only;
all B) it consists of a sequence of B states only;
AB) the leading part of the sequence, up to some index, consists of A states only, the remainder of B states only;
BA) the leading part of the sequence, up to some index, consists of B states only, the remainder of A states only.

Thanks to the relative shortness of the spot, no other cases are allowed. The first two items correspond, together, to the case in which there is no state transition, conventionally indicated by $(\# = 0)$; the last two items correspond, together, to the case in which there is exactly one state transition, conventionally indicated by $(\# = 1)$.

Formal statement. In Bayesian terms:
– we need to compute the posterior probabilities $P(\text{all }A \mid O)$, $P(\text{all }B \mid O)$, $P(AB \mid O)$ and $P(BA \mid O)$ of the four underlying sequences, given the observed sequence $O$;
– to this end we will combine the four corresponding prior probabilities $P(\text{all }A)$, $P(\text{all }B)$, $P(AB)$ and $P(BA)$
– with the likelihoods $P(O \mid \text{all }A)$, $P(O \mid \text{all }B)$, $P(O \mid AB)$ and $P(O \mid BA)$ respectively, using Bayes' Theorem;
– $P(O)$ denotes the total probability and is given by

$P(O) = P(O \mid \text{all }A)P(\text{all }A) + P(O \mid \text{all }B)P(\text{all }B) + P(O \mid AB)P(AB) + P(O \mid BA)P(BA)$

The transition likelihoods $P(O \mid AB)$ and $P(O \mid BA)$ are given by the contribution of mutually exclusive fine-grained subcases, each corresponding to the event of the transition taking place at a different index $i$:

$P(O \mid AB) = \sum_{i=1}^{n-1} P(O \mid i, AB) \quad \text{and} \quad P(O \mid BA) = \sum_{i=1}^{n-1} P(O \mid i, BA)$

where the sum runs from the transition taking place exactly after the first index to the one taking place after the penultimate index in the sequence.

Priors. We attribute values to the priors according to the following probability and symmetry considerations: $P(\# = 0)$ is the predominant probability, since the swaps between developers are far from one another; $P(\# = 1)$ depends on the length of the spot relative to the swap frequency; we indicate the value of
this probability by $s$; with this convention $P(\# = 0) = (1 - s)$. We label the developers arbitrarily and consider states of equivalent persistence, so that $P(\text{all }A) = P(\text{all }B) = (1 - s)/2$ and $P(AB) = P(BA) = s/2$; we assume the sampling is randomly distributed, so that when there is a state transition within the time span covered by the spot, it can take place at any index with the same probability $P(i \mid AB) = 1/n$.

Likelihoods. After these premises, and due to the independence of the observed symbols from one another, we have

$P(O \mid \text{all }A) = \prod_{k=1}^{n} P(o_k \mid A) \quad \text{and} \quad P(O \mid \text{all }B) = \prod_{k=1}^{n} P(o_k \mid B)$

whereas

$P(O \mid AB, i) = \prod_{k=1}^{i} P(o_k \mid A) \cdot \prod_{k=i+1}^{n} P(o_k \mid B)$

and

$P(O \mid BA, i) = \prod_{k=1}^{i} P(o_k \mid B) \cdot \prod_{k=i+1}^{n} P(o_k \mid A)$

Notice that $P(O \mid \text{all }A) = P(O \mid AB, n)$ whereas $P(O \mid \text{all }B) = P(O \mid BA, n)$.

Evidence. We can now rewrite the evidence in an expanded form

$P(O) = P(O \mid \text{all }A)P(\text{all }A) + P(O \mid \text{all }B)P(\text{all }B) + \sum_{i=1}^{n-1} P(O \mid i, AB)P(i, AB) + \sum_{i=1}^{n-1} P(O \mid i, BA)P(i, BA)$
Taking into account that $P(i, AB) = P(i \mid AB)P(AB) = s/2n$ and that $P(i, BA) = P(i \mid BA)P(BA) = s/2n$, and combining all the previous expressions, the evidence can be computed from the experimental data, i.e. the individual frame likelihoods $P(o_k \mid A)$ and $P(o_k \mid B)$, as follows:

$P(O) = \prod_{k=1}^{n} P(o_k \mid A) \cdot \frac{1-s}{2} + \prod_{k=1}^{n} P(o_k \mid B) \cdot \frac{1-s}{2} + \sum_{i=1}^{n-1} \prod_{k=1}^{i} P(o_k \mid A) \cdot \prod_{k=i+1}^{n} P(o_k \mid B) \cdot \frac{s}{2n} + \sum_{i=1}^{n-1} \prod_{k=1}^{i} P(o_k \mid B) \cdot \prod_{k=i+1}^{n} P(o_k \mid A) \cdot \frac{s}{2n}$
Posteriors. Finally, the posteriors can be obtained from the experimental data, i.e. the individual frame likelihoods $P(o_k \mid A)$ and $P(o_k \mid B)$, as follows:
$P(\text{all }A \mid O) = \prod_{k=1}^{n} P(o_k \mid A) \cdot \frac{(1-s)/2}{P(O)}$

$P(\text{all }B \mid O) = \prod_{k=1}^{n} P(o_k \mid B) \cdot \frac{(1-s)/2}{P(O)}$

$P(AB \mid O) = \sum_{i=1}^{n-1} \prod_{k=1}^{i} P(o_k \mid A) \cdot \prod_{k=i+1}^{n} P(o_k \mid B) \cdot \frac{s/2n}{P(O)}$

$P(BA \mid O) = \sum_{i=1}^{n-1} \prod_{k=1}^{i} P(o_k \mid B) \cdot \prod_{k=i+1}^{n} P(o_k \mid A) \cdot \frac{s/2n}{P(O)}$
For a sufficiently high frame rate, in practice, these probabilities indicate the preferred interpretation very neatly: in practical cases only one of the four values is close to one, whereas all the others are orders of magnitude lower. Assigning the spot to the interpretation with the highest posterior value, $\arg\max_{h} P(h \mid O)$ with $h \in \{\text{all }A, \text{all }B, AB, BA\}$, produces a classification performance with an essentially diagonal confusion matrix; an analytical explanation of this performance is given in the Appendix.
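The posterior computation above translates directly into code. The sketch below is our own illustration (log-space arithmetic is used to avoid numerical underflow on long spots); the per-frame likelihoods lA[k] = P(o_k|A) and lB[k] = P(o_k|B) are assumed to come from the frame classifier of Section 2.

```python
import math

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def spot_posteriors(lA, lB, s):
    """lA[k] = P(o_k|A), lB[k] = P(o_k|B); s = prior switch probability."""
    n = len(lA)
    logA = [math.log(p) for p in lA]
    logB = [math.log(p) for p in lB]
    # prefix sums: prefA[i] = log prod_{k=1..i} P(o_k|A)
    prefA, prefB = [0.0], [0.0]
    for a, b in zip(logA, logB):
        prefA.append(prefA[-1] + a)
        prefB.append(prefB[-1] + b)
    log_p0 = math.log((1 - s) / 2)   # prior of "all A" / "all B"
    log_p1 = math.log(s / (2 * n))   # prior of a switch at a given index i
    hyp = {
        "all A": prefA[n] + log_p0,
        "all B": prefB[n] + log_p0,
        "AB": _logsumexp([prefA[i] + prefB[n] - prefB[i] + log_p1
                          for i in range(1, n)]),
        "BA": _logsumexp([prefB[i] + prefA[n] - prefA[i] + log_p1
                          for i in range(1, n)]),
    }
    z = _logsumexp(list(hyp.values()))         # log P(O), the evidence
    return {h: math.exp(v - z) for h, v in hyp.items()}
```

A spot is then assigned to max(posteriors, key=posteriors.get), i.e. the interpretation with the highest posterior.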
3.2 Sequence Reconstruction
Thanks to the sharpness of judgment allowed by the posteriors, one can classify each spot almost always correctly, and the occasional incorrect classification has little effect on the overall statistical properties of the reconstructed sequence. The sequence is reconstructed by sampling the starting frames of a sufficiently high number of spots, so as to examine spots located a few minutes apart:
– if the configuration of a spot and the next is the same, we assume it has not changed during the inter-time;
– if the configuration of a spot and the next is not the same, we sample some extra spots so as to achieve the desired resolution in locating the timing of the configuration change (see the sketch below).
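A sketch of this two-stage reconstruction follows. It assumes a function classify_spot(t) (a hypothetical name) returning the winning interpretation of a spot starting at frame t, and – as justified by the persistence argument above – that at most one switch occurs between two disagreeing samples.

```python
def reconstruct(classify_spot, start, end, step, resolution):
    """Coarsely sample spots, then bisect disagreeing intervals."""
    times = list(range(start, end, step))
    states = [classify_spot(t) for t in times]
    switches = []
    for (t0, s0), (t1, s1) in zip(zip(times, states),
                                  zip(times[1:], states[1:])):
        if s0 == s1:
            continue                      # assume no change in the inter-time
        lo, hi = t0, t1                   # sample extra spots by bisection
        while hi - lo > resolution:
            mid = (lo + hi) // 2
            if classify_spot(mid) == s0:
                lo = mid
            else:
                hi = mid
        switches.append((hi, s0, s1))     # switch located near frame hi
    return switches
```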
3.3 Reconstructed Sequence Analysis
At this point one can work on the reconstructed high-level sequence:
– extract from the high-level state sequence the metrics relevant to the policy (e.g. the number of switches in the high-level sequence);
– assess the respect of the policy based on the metrics distribution (e.g. say OK if there have been at least n switches), as in the sketch below.
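As an illustration of the simplest such policy check – the switch-count metric just mentioned – consider the following sketch, where the threshold min_switches is a hypothetical policy parameter:

```python
def assess_policy(state_sequence, min_switches):
    """Count role switches in the reconstructed sequence and check the policy."""
    switches = sum(1 for prev, cur in zip(state_sequence, state_sequence[1:])
                   if prev != cur)
    return switches >= min_switches, switches
```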
4 Conclusions and Outlook

One of the problems in the study of the PP practice is that it has so far been based on rather invasive procedures: either a dedicated person had to play the role of experiment controller and take note of the switching times, or the developers had to take note of their own switching times, either manually or by the equivalent of alternating log-ons into some purposely designed and developed log-on system. Those methods are intrinsically imprecise, invasive or both (for instance, it has been reported [13] that, even given a very lightweight one-click log-on procedure, the developers would fail to log themselves on most of the times they switched) and are therefore impractical in real development settings: in order to acquire data in real contexts, one has to rely on a less demanding procedure. A method of PP data collection – based on the characterization of each developer through the personal time pattern detected during her navigation within the Integrated Development Environment (IDE) – has been developed in [14, 15]; although of reduced invasiveness, it still involves installing a software probe on the developer's system to capture the relevant IDE events, and performing a rather long training phase to achieve a sufficient separation between the behaviors of the developers (furthermore, due to the elusiveness of personal time patterns, the length of the training cannot be estimated in advance). In this paper we presented a method of minimal invasiveness, based on video capture and interpretation, which, as such, does not need to interact with the developer's system, and which can be trained quickly to classify the different developers based on their facial characteristics. The method relies on single-camera video interpretation and is based on an integrated procedure consisting of face detection, face tracking and face identification, which is robust with respect to several critical factors such as illumination change, facial expression change, head movement and occlusion; this robustness is obtained through the exploitation of 3D face models, which are used to perform a normalization of the face mask prior to a characterization in terms of eigenfaces and subsequent classification. Thanks to a reconstruction of the video flow based on spot sampling and the exploitation of domain knowledge, the overall video interpretation is robust to high misclassification rates. Furthermore, since the actors' identities are not relevant by themselves – only the alternation times are important – the method can be used on lower-quality videos or on videos whose quality has been purposely degraded to protect privacy. We applied this methodology to a sample of real PP data with satisfactory results. The method could be used not only to assess the degree of conformance with respect to predefined switch-time policies, but also to capture the characteristics of a given programmer pair's switching process, in the context of PP effectiveness studies. We plan to extend the present work by exploring the limits of effectiveness of this methodology and by refining the procedure of sequence reconstruction, for instance by introducing the consideration of the speed of image elements in frame interpretation. Furthermore, we plan to apply the method to a larger sample of PP programming data and to study its capabilities with respect to the assessment of different policies.
References 1. Beck, K.: Extreme programming explained: Embrace change. Addison Wesley Longman, Inc., Reading (2000) 2. Nawrocki, J., Wojciechowski, A.: Experimental Evaluation of Pair Programming. In: European Software Control and Metrics (ESCOM) (2001) 3. Lindvall, M., Basili, V., Boehm, B., Costa, P., Dangle, K., et al.: Empirical findings in agile methods. In: XP/Agile Universe 2002, Chicago, USA (2002) 4. Melnik, G., Williams, L., Geras, A.: Empirical Evaluation of Agile Processes. In: XP/Agile Universe 2002, Chicago, USA (2002) 5. Abrahamsson, P., Warsta, J., Siponen, M.T., Ronkainen, J.: New directions on agile methods: A comparative analysis. In: International Conference on Software Engineering (ICSE25), Portland, Oregon, USA (2003) 6. Abrahamsson, P.: Extreme Programming: First Results from a Controlled Case Study. In: Proceedings EUROMICRO 2003 (2003) 7. Damiani, E., Anisetti, M., Bellandi, V., Beverina, F.: Facial identification problem: A tracking based approach. In: IEEE (SITIS 2005) (2005) 8. Beverina, F., Palmas, G., Anisetti, M., Bellandi, V.: Tracking based face identification. In: Proc. of IEE 2nd Int. Conf. on Intelligent Environments IE 2006, Athens (2006) 9. Gish, H., Siu, M., Rohlicek, R.: Segmentation of Speakers for Speech Recognition and Speaker Identification. In: Proc. Int. Conf. Acoustics, Speech, and Signal Processing, May 1991, vol. 2, pp. 873–876. IEEE, Toronto, Canada (1991) 10. Phillips, M., Wolf, W.: Video Segmentation Techniques for News. In: Multimedia Storage and Archiving Systems, SPIE, vol. 2916, pp. 243–251 (1996) 11. Krogh, A., et al.: Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235, 1501–1531 (1994) 12. Du, J., Rozowsky, J.S., Korbel, J.O., Zhang, Z.D., Royce, T.E., Schultz, M.H., Snyder, M., Gerstein, M.: Systematically incorporating validated biological knowledge: an efficient hidden Markov model framework for segmenting tiling array data in transcriptional and ChIP-chip experiments. Bioinformatics 22(24), 3016–3024 (2006) 13. Panel of the workshop QUTE-SWAP @ FSE 2004, Newport (CA) (November 2004) 14. Colombo, A., Damiani, E., Gianini, G.: Discovering the software process by means of stochastic workflow analysis. Journal of Systems Architecture 52(11), 684–692 (2006) 15. Damiani, E., Gianini, G.: Navigation dynamics as a biometrics for authentication. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part II. LNCS (LNAI), vol. 4693, pp. 832–839. Springer, Heidelberg (2007)
Appendix

In this appendix we outline a simplified model of the spot classification problem in order to give an analytical explanation of the high classification efficiency for the video-frame sequence of a spot, despite the possible inefficiency of the individual frame classification. We are interested in computing the confusion matrix among the four interpretations all A, all B, AB and BA – the macro confusion matrix – starting from the confusion matrix at frame level – the micro confusion matrix. To this aim we will state some simplifying assumptions allowing a compact representation of the microscopic level, and then compute the probability
$P(\text{interpretation all }B \mid \text{all }A)$; the other elements of the macro confusion matrix can be qualitatively estimated by analogy. We assumed that A is more likely to produce symbols of a given class $\mathcal{A}$, whereas B is more likely to produce symbols of another class $\mathcal{B}$. Now we adopt a further assumption: we assume that each class contains a single symbol – the symbol $\alpha$ for $\mathcal{A}$ and the symbol $\beta$ for $\mathcal{B}$ – and indicate the corresponding likelihoods by $a$ and $b$ respectively; therefore
– $a \equiv P(\alpha \mid A)$ is the likelihood that A produces the symbol $\alpha$, whereas
– $b \equiv P(\beta \mid B)$ is the likelihood that B produces the symbol $\beta$.
These conventions are used to capture the typical behavior of the states A and B. As a consequence, all the likelihoods simplify as follows ($t$ and $c$ count the number of symbols $\alpha$ and $\beta$, respectively, in the sequence):

$P(O \mid \text{all }A) = a^t (1-a)^c \quad \text{and} \quad P(O \mid \text{all }B) = b^c (1-b)^t$

furthermore

$P(O \mid AB, i) = a^{\overleftarrow{t}_i} (1-a)^{\overleftarrow{c}_i} \cdot b^{\overrightarrow{c}_{i+1}} (1-b)^{\overrightarrow{t}_{i+1}}$

$P(O \mid BA, i) = b^{\overleftarrow{c}_i} (1-b)^{\overleftarrow{t}_i} \cdot a^{\overrightarrow{t}_{i+1}} (1-a)^{\overrightarrow{c}_{i+1}}$

where
– $\overleftarrow{t}_i$ counts the occurrences of $\alpha$ up to index $i$,
– $\overleftarrow{c}_i$ counts the occurrences of $\beta$ up to index $i$,
– $\overrightarrow{c}_{i+1}$ counts the occurrences of $\beta$ from index $i+1$ to the end,
– $\overrightarrow{t}_{i+1}$ counts the occurrences of $\alpha$ from index $i+1$ to the end;
clearly $t = \overleftarrow{t}_n$ and $c = \overleftarrow{c}_n$. To simplify the calculations further, we assume the micro confusion matrix is symmetric:

$a = b = (1 - \epsilon) \quad \text{and} \quad (1-a) = (1-b) = \epsilon$

Since we adopt the choice of keeping the interpretation which has the highest a posteriori probability, the confusion between all A and all B can take place only when – although the state sequence is all A – it happens that $b^c (1-b)^t > a^t (1-a)^c$; with the additional symmetry assumption about the micro confusion matrix, this is equivalent to $c > t$, i.e. $t < n/2$. If the observed symbols are independent from one another and the state sequence is all A, then the number $t$ is distributed according to a Binomial $\mathrm{Bin}(t \mid a, n)$, which can – for $n$ high enough (in practice already for $n > 30$) – be conveniently approximated by a Normal density $\mathrm{Norm}(t \mid \mu = na, \sigma^2 = na(1-a))$: there will be confusion when the lower end of the left tail extends perceivably below $n/2$. By the $3\sigma$ property of the Normal, there will be a confusion probability of the order of $3/1000$ when the mean is located $3\sigma$ above $n/2$, i.e. when $\mu - 3\sigma = n/2$, that is, when $na - 3\sqrt{na(1-a)} = n/2$; rewriting $a$ in the more convenient form $a = \frac{1}{2} + \delta$, we find after some algebra that the condition is fulfilled for $\delta = \frac{1}{2\sqrt{n/9 + 1}}$. Therefore, already for $n = 72$, even if $a = 0.66$ – i.e. even if the micro classification is wrong one time out of three – the macro classification is wrong only about three times every one thousand.
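The closing claim can be checked numerically. Computing the exact Binomial left tail (instead of the Normal approximation used above) for n = 72 and a = 0.66 gives a spot-level confusion probability of about 0.003, in agreement with the three-per-thousand estimate:

```python
from math import comb

def spot_confusion(n=72, a=0.66):
    # P(t <= n/2) under Binomial(n, a): the region where "all B" wins or ties
    return sum(comb(n, t) * a**t * (1 - a)**(n - t) for t in range(n // 2 + 1))

print(spot_confusion())  # ~0.003
```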
Trellis Based Real-Time Depth Perception Chip Using Interline Constraint

Sungchan Park and Hong Jeong

Pohang University of Science and Technology, Electronic and Electrical Engineering, Pohang, Kyungbuk, 790-784, South Korea
[email protected] http://isp.postech.ac.kr
Abstract. As a step towards real-time stereo, we present a fast and efficient VLSI architecture and implementation of a stereo matching algorithm with a low error rate. The architecture has the form of a linear systolic array of simple processing elements (PEs) that are connected only to neighboring PEs. Due to this simple, fully parallel structure, it has a lower time complexity than other methods. Thus our structure is well suited to high-resolution, real-time applications like 3D video conferencing, Z-keying, and virtual reality. Our chip can process 320 by 240 images with 128 disparity levels at 30 frames/s.
1 Introduction
Stereo vision is the process of recovering depth or distance information from a pair of images of the same scene. If we can obtain the 3D depth map at high speed, it is possible to merge the real and the virtual world in real time. This can be used in many applications such as 3D augmented reality [2], 3D video conferencing [5], [7], Z-keying, and virtual reality [10]. Stereo methods fall into two broad categories [1]. One is the local method, which uses constraints on small window pixels, as in block matching or feature matching techniques. The other is the global method, which uses global constraints on scanlines or on the whole image, as in dynamic programming and graph cuts. Many real-time systems [10], [13], [9], [4] use local methods. Although these have low complexity, there are local regions where matching fails due to occlusion, uniform texture, ambiguity of low texture, etc. Also, block matching, the most popular local matching method, blurs disparity data at object boundaries. Global methods can solve these local problems but suffer from huge processing times. Recently, a few real-time global methods have been implemented on the GPU of a graphics card or with the MMX instructions of a CPU. Some systems [6], [3] output good results in near real time, while others [8] support real-time processing with somewhat worse results, so the main performance factor in the real-time domain is the computational complexity. GPUs and MMX are not special-purpose processors with a fully parallel architecture, so there is a restriction on the number of parallel processors. Jeong and Park [12] built a global-matching stereo VLSI chip that can output a 1280 by 1000 disparity image of 208 levels in real time using 208 processors. The high speed is possible due to a fully parallel dynamic programming search on
a trellis solution space, which is suitable for highly parallel implementation. However, it is a line-by-line matching method; due to the independence between lines, streak noise occurs. In this paper, by adding a smoothness cost on the trellis path using the disparity data of the previous line, and by adopting a modified measure which is robust to the sampling effect and pixel data noise, we eliminate the streak noise with a small additional computational complexity and produce a high-quality disparity image. We introduce an efficient linear systolic array architecture that is appropriate for VLSI implementation. The array is highly regular, consisting of identical and simple processing elements (PEs), and external communication occurs only with the end PEs. Hence, it is possible to construct a real-time stereo vision chip of small size and high resolution. This paper is organized as follows: a brief review of the matching algorithm based on a trellis structure is presented in Sec. 2. Sec. 3 describes the systolic array of PEs realizing this algorithm. The test results are discussed in Sec. 4 and, finally, conclusions are given in Sec. 5.
2 Trellis Based Stereo Matching System
We briefly review the trellis-based stereo matching introduced in [11]. Given the observations g^l and g^r of an M by N image and D disparity levels, we wish to obtain a maximum a posteriori (MAP) estimate of the disparity, d̂ = arg max_d P(d|g^l, g^r).
Fig. 1. The disparity trellis model: (a) projection model for the stereo camera; (b) disparity trellis for M = 5.
The solution space of d̂ is based on the discrete center-referenced inverse projection shown in Fig. 1(a), as detailed in [11]. It can be described in graph form by the trellis in Fig. 1(b). The energy function of this solution space is as follows:

U(d) = Σ_{i=1}^{2M} [Δg(d_i) o(i + d_i) + γ|d_i − d_{i−1}|],   (1)

d = {d_0, d_1, ..., d_{2M}},   Δg(d_i) = (g^l_{(i−d_i+1)/2} − g^r_{(i+d_i+1)/2})²,   (2)
which must be minimized for each scan line. Here the first term is the matching cost, the second term is the occlusion cost decided by the parameter γ, and o(·) is a function that is 1 when its argument is odd and 0 otherwise. By applying (2) to the trellis shown in Fig. 1(b), the MAP estimate is reduced to finding the best (i.e., lowest cost) path through the trellis, and the Viterbi algorithm can be used to do so efficiently. The optimal disparity d̂ is the one that minimizes (1).
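As a concrete illustration, the following is a minimal per-scanline sketch of this Viterbi search in Python/NumPy. The simplified node handling (match nodes keep their disparity level; occlusion nodes may move one level at cost γ) and all names are our assumptions, not the authors' chip logic:

```python
import numpy as np

def scanline_dp(gl, gr, D, gamma):
    """Viterbi search over the disparity trellis for one scanline pair,
    minimizing an energy of the form (1); a simplified sketch."""
    M = len(gl)
    U = np.full(D, np.inf)
    U[0] = 0.0                                   # start at disparity 0
    back = np.zeros((2 * M, D), dtype=np.int8)   # chosen transition p per node
    for i in range(1, 2 * M + 1):
        Unew = np.full(D, np.inf)
        for j in range(D):
            if (i + j) % 2 == 1:                 # match node: add Delta g of eq. (2)
                il = (i - j + 1) // 2 - 1
                ir = (i + j + 1) // 2 - 1
                if 0 <= il < M and 0 <= ir < M:
                    Unew[j] = U[j] + (float(gl[il]) - float(gr[ir])) ** 2
            else:                                # occlusion node: best neighbor + gamma
                best, arg = U[j], 0
                for p in (-1, 1):
                    if 0 <= j + p < D and U[j + p] + gamma < best:
                        best, arg = U[j + p] + gamma, p
                Unew[j], back[i - 1, j] = best, arg
        U = Unew
    d = np.zeros(2 * M, dtype=int)
    j = int(np.argmin(U))                        # trace back the lowest-cost path
    for i in range(2 * M - 1, -1, -1):
        d[i] = j
        j += back[i, j]
    return d
```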
3 New Trellis Based Stereo Matching Algorithm
The previous line-by-line matching suffers from sampling error, camera thermal noise, and the ambiguity of textureless regions. The stereo matching model is usually characterized by a 2D random field; that is, the attributes of one line must be related to its neighbors in the vertical direction. We apply the simple truncated linear model V(a, b) as the vertical line dependence function. Given the weight η and the smoothness parameter λ, we can represent surface smoothness as the inlier case |a − b| < λ and its discontinuity as the outlier case:

V(a, b) = η min(λ, |a − b|).   (3)
Given the disparity data d(i, n−1) of the previous scanline and the matching cost Δg(d_i), our energy model can be expressed as follows:

U(d) = Σ_{i=1}^{2M} [Δg(d_i) o(i + d_i) + γ|d_i − d_{i−1}| + V(d_i, d(i, n−1))].   (4)
3.1 New Trellis VLSI Sequence and Architecture

In this algorithm, we use subscripts to represent the usage of the memory resources. Given the pixel data g^l and g^r, the accumulated register U_j(i), the stack V_{i,j}(n), the activation flag bit a_j(i, n), and the disparity d̂ are computed at each step i and scanline n, as follows. At each site i = 1, ..., 2M and disparity level j in the trellis, forward recursion and backtracking proceed concurrently. Let

f(x) = 1 if x = 0, and 0 elsewhere.   (5)
1. Initialization: At site i = 0, all of the node costs are set to infinity except for disparity j = 0:

U(0, j) = 0 if j = 0, ∞ otherwise;   a_j(0, n) = 1 if j = 0, 0 otherwise.   (6)

2. Forward recursion: Find the local best path for each node, i = 0, ..., 2M:

CostV(i, j, n−1) = Σ_{p∈[−λ,λ]} η|p| · a_{j+p}(i, n−1) + ηλ · f(Σ_{p∈[−λ,λ]} a_{j+p}(i, n−1)),   (7)–(8)

a) If i + j is odd:
i. If n is even:
U_j(i) = U_j(i−1) + Δg(i, j, n) + CostV(i, j, n−1),   (9)
ii. If n is odd:
U_j(i) = U_j(i−1) + Δg(2M−i, j, n) + CostV(i, j, n−1),   (10)
V_{i,j}(n) = 0.   (11)

b) If i + j is even:
U_j(i) = min_{p∈[−1,1], j+p∈[0,D−1]} [U_{j+p}(i−1) + γ|p|] + CostV(i, j, n−1),   (12)
V_{i,j}(n) = arg min_{p∈[−1,1], j+p∈[0,D−1]} [U_{j+p}(i−1) + γ|p|].   (13)

3. Backtracking: Find the optimal disparity by tracing back the path, i = 0, ..., 2M:

a_j(i+1, n−1) = Σ_{p∈[−1,1]} a_{j+p}(i, n−1) · f(p + V_{i,j+p}(n−1)),   (14)

a) If n is even:
d̂(2M−(i+1), n−1) = d̂(2M−i, n−1) + Σ_{j∈[0,D−1]} a_j(i, n−1) · V_{i,j}(n−1),   (16)

b) If n is odd:
d̂(i+1, n−1) = d̂(i, n−1) + Σ_{j∈[0,D−1]} a_j(i, n−1) · V_{i,j}(n−1).   (18)
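The interline smoothness term of eq. (7)–(8) reduces, per node, to a small lookup over the previous line's activation flags. The following is a minimal sketch of that computation, with our own (hypothetical) function and variable names:

```python
def cost_v(a_prev, j, eta, lam):
    """Interline smoothness cost of eq. (7)-(8) for disparity level j,
    given the previous line's activation flags a_prev (0/1 per level);
    a sketch, not the hardware smooth-cost module itself."""
    lo, hi = max(0, j - lam), min(len(a_prev) - 1, j + lam)
    window = range(lo, hi + 1)
    # sum of eta*|p| over active flags within distance lam of level j
    linear = sum(eta * abs(p - j) * a_prev[p] for p in window)
    # f(sum of flags) = 1 when no flag in the window is active:
    # the cost is then truncated to the outlier value eta*lam
    none_active = all(a_prev[p] == 0 for p in window)
    return linear + (eta * lam if none_active else 0)
```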
Fig. 2. Forward processor j at time step i

Fig. 3. Backward processor j at time step i
The main engine consists of two processors: one for the forward process and the other for the backward process. As shown in Fig. 2 and in the forward recursion part of the algorithm, each forward processor accepts the pixel data and calculates the matching cost. After receiving the signals a_{j+p}(i, n−1), p ∈ [−λ, λ], which were calculated in the backward processors of the previous line, the smooth-cost module outputs η|p| if there exists an activated signal among them, and ηλ otherwise. That is, the cost V(j, d(i, n−1)) is calculated by considering the distance between the activated signal index j+p and the current forward processor index j, and is used to represent the interline smoothness constraint. The pixel matching cost and the output cost of the smooth-cost module are added to the accumulated cost for the matching path.
Fig. 4. Sequence's direction at each line
A comparison is then made between this cost and the cost of the paths coming from the neighboring processors above and below. The chosen path is pushed into a stack. The cost contained in the register is added to an occlusion cost (γ) and output to the neighboring forward processors. During the backward phase, the decisions stored in the stack are output to the backward processors. As shown in Fig. 3 and in the backtracking part of the algorithm, only one processor is active at a time, as indicated by the register containing the activation flag. This is the PE that is on the optimal disparity path. Each processor collects the activation flags of the neighboring processors and itself; if any of them are high, then a_j(i+1, n−1) becomes active, as follows:

self-input flag: a_j(i) f(0 + V_{i,j}(n−1)),
input flag from upper PE: a_{j+1}(i) f(1 + V_{i,j+1}(n−1)),
input flag from lower PE: a_{j−1}(i) f(−1 + V_{i,j−1}(n−1)),
combined input flag of neighbor PEs: a_j(i+1, n−1) = Σ_{p∈[−1,1]} a_{j+p}(i, n−1) f(p + V_{i,j+p}(n−1)).   (19)
In this manner, the activation flag is passed from processor to processor according to the input local path decision V_{i,j}(n−1). If a backward processor is active, its local decision is accumulated in the register, which outputs the disparity to a bus to communicate with the external host. The forward and backward processing steps proceed in opposite directions on the trellis graph. But if we want the forward processor of the current line and the backward processor of the previous line to be executed in the same direction at the same time, we have to reverse the direction of the steps at each line as in Fig. 4, so that a_j(i, n−1) can be used directly by the forward processor of line n. This can be done if the pixel data is loaded in reverse order using a buffer of one scan-line size, as shown in (9) and (10). Hence, we can reduce the time complexity by half.

3.2 Systolic Array Architecture
The overall architecture is a linear systolic array of PEs, as shown in Fig. 5. Communication extends only to neighboring PEs, and the array is completely regular in structure, making the actual hardware design relatively simple.
Fig. 5. PE array structure
For D disparity levels and an M by N image, the area complexity is O(D), owing to the D PEs, except for the best-path decision registers V, whose complexity is O(2MD). The forward processor fp_j calculates the matching cost from the pixel data inputs l_n and r_n, which have to include a sufficient number of pixels for our dissimilarity measure. The processor computes the local decision and stores it into the stack. The forward processor is followed by the backward processor, which reads the stack and determines the optimal path.

3.3 Computational Complexity
This algorithm has a total time complexity of O(4MND) if the forward and backward parts are processed sequentially over the 2M steps and D disparity levels of each of the N lines. Its complexity reduces to O(2MN) with the D parallel forward and backward processors, so that the chip can process in real time.
4 Experimental Results
First, we verify the algorithm quantitatively and qualitatively on the Middlebury data set by software simulation. In this experiment, we used a 3 by 1 window for the dissimilarity measurement and set the smoothness parameter λ to 0. Fig. 6 shows the comparison between the single-line method [12], which exhibits streak noise, and our method, where the noise is effectively suppressed. We quantify our algorithm through the output error performance. The error measure is the rate of disparity errors greater than 1 over the unoccluded area P_m.
Fig. 6. Resulting depth maps from the Middlebury stereo data set: (a) Tsukuba, (b) Map, (c) Venus, and (d) Sawtooth left images; (e) single-line DP method; (f) our method; (g) true disparity.
error(%) = (100/N) Σ_{(p0,p1)∈P_m} (|d̂(p0, p1) − d_TRUE(p0, p1)| > 1),   (20)

N = Σ_{(p0,p1)∈P_m} 1.   (21)
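A direct NumPy transcription of this measure might look as follows (a sketch; the array and mask names are our assumptions):

```python
import numpy as np

def disparity_error_percent(d_hat, d_true, unoccluded):
    """d_hat, d_true: disparity maps; unoccluded: boolean mask for P_m."""
    bad = (np.abs(d_hat - d_true) > 1) & unoccluded   # per-pixel test of eq. (20)
    n = np.count_nonzero(unoccluded)                  # N of eq. (21)
    return 100.0 * np.count_nonzero(bad) / n
```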
Our method is better than single-line DP (dynamic programming) and real-time DP, but worse than real-time GPU and real-time BP. Real-time GPU and BP exploit a wider range of smoothness dependence in the image, and correspondingly need more computational complexity.
Table 1. Disparity error comparison of several methods (%)

Method               Tsukuba  Map   Venus  Sawtooth
Single-line DP [12]  5.57     4.12  8.72   5.22
Real-time DP [8]     2.85     6.45  6.42   6.25
Our method           2.63     0.91  3.44   1.88
Real-time GPU [3]    2.05     NA    1.92   NA
Real-time BP [6]     1.49     NA    0.77   NA
Fig. 7. Hardware system descriptions: (a) overall system (left and right cameras feed the rectification FPGA and the stereo matching FPGA, with disparity output to the grabber board through the Camlink connector); (b) hardware of the real-time stereo vision system.
In Fig. 7(a), our VLSI system is described. The images received from a pair of cameras are processed by the rectification logic, and the disparity data is then calculated by the stereo matching part from the rectified images. The resulting data is transferred from the FPGA to the grabber through the Camlink cable. A PC then reads the computed disparity, converts it to a gray-scale image and displays it on the screen. Our architecture was implemented on a Xilinx Virtex II Pro-100 FPGA, which incorporates 128 PEs. The implemented stereo matching board is shown in Fig. 7(b). Currently, we have not fully used the resources of the FPGA, because the attached camera has a small image size: the clock resource usage was 16% at 12.3 MHz, and the logic resources occupied 49%. Like the parallel architecture of Jeong and Park [12], we can easily extend the image size and the disparity level by increasing the number of processors and the clock rate if necessary. Table 2 shows the computation time performance. Single-line DP exhibits the highest speed because it has a fully parallel VLSI structure like ours; our method displays the second-best performance.

Table 2. Comparison of computation time

Spec                 Image       Levels  fps
Real-time BP [6]     320x240     16      16
Real-time GPU [3]    320x240     48      18.5
Real-time DP [8]     320x240     100     26.7
Our method           320x240     128     30
Single-line DP [12]  1920x1000   400     15
Fig. 8. Real disparity image outputs: (a, c) left images; (b, d) disparity images.
Although real-time BP and GPU produce lower error, they are computationally limited because they need complex calculations and do not have a fully parallel structure. That is, it is hard to increase the disparity resolution and processing speed with a general PC or graphics card. Fig. 8 shows the output results of our real-time system.
5 Conclusions
We have presented a real-time stereo matching chip based on a VLSI structure which effectively suppresses streak noise by considering the interline dependence in the image. For a pair of images with M by N pixels, only O(2MN) time is required. The design is highly scalable and fully exploits the concurrent and configurable nature of the algorithm. We implemented a stereo chip on a Xilinx FPGA with 128 processing elements (PEs) that can obtain a disparity range of 128 levels. Currently, it provides real-time stereo matching for 320 x 240 images at 30 frames/s. We plan to extend to higher image sizes and disparity levels in real time by increasing the FPGA resources.
References 1. Brown, M.Z., Hager, G.D.: Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(8), 993–1007 (2003) 2. Gordon, G., et al.: The use of dense stereo range data in augmented reality. In: Proceedings of the International Symposium on Mixed and Augmented Reality (2002) 3. Wang, L., et al.: High-quality real-time stereo using adaptive cost aggregation and dynamic programming. In: 3DPVT (2006) 4. Hariyama, M., et al.: Architecture of a stereo matching vlsi processor based on hierarchically parallel memory access. In: The 2004 47th Midwest Symposium on Circuits and Systems, vol. 2, pp. II245–II247 (2004) 5. Kauff, P., et al.: Fast hybrid block- and pixel recursive disparity analysis for real-time applications in immersive tele-conference scenarios. In: Proceedings of 9th International Conference in Central Europe on Computer Graphics Visualization and Computer Vision, pp. 198–205 (2001) 6. Yang, Q., et al.: Real-time global stereo matching using hierarchical belief propagation. In: The British Machine Vision Conference (2006) 7. Yang, R., et al.: Real-time consensus-based scene reconstruction using commodity graphics hardware. In: Proceedings of the 10th Pacific Conference on Computer Graphics and Applications, pp. 225–235 (2002)
8. Forstmann, S., et al.: Real-time stereo by using dynamic programming. In: CVPR, Workshop on real-time 3D sensors and their use (2004) 9. Kimura, S., et al.: A convolver-based real-time stereo machine (SAZAN). Proc. Computer Vision and Pattern Recognition 1, 457–463 (1999) 10. Kanade, T., et al.: A stereo machine for video-rate dense depth mapping and its new applications. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (1996) 11. Jeong, H., Oh, Y.: Fast stereo matching using constraints in discrete space. IEICE Transactions Info. (September 2000) 12. Jeong, H., Park, S.C.: Generalized trellis stereo matching with systolic array. In: Cao, J., Yang, L.T., Guo, M., Lau, F. (eds.) ISPA 2004. LNCS, vol. 3358, pp. 263–267. Springer, Heidelberg (2004) 13. Konolige, K.: Small vision systems: Hardware and implementation. In: Proc. Eighth Int'l Symp. Robotics Research (1997)
Simple Perceptually-Inspired Methods for Blob Extraction
Paolo Falcoz
Università degli Studi di Milano – Dipartimento di Tecnologie dell'Informazione
[email protected]
Abstract. Studies in Visual Attention (VA) and eye movements have shown that humans generally only attend a few regions of interest (ROIs) in an image. The idea is to perform image analysis only on a small neighborhood of each ROI. These regions can be thought of as the most informative parts of the image, and as such can be analyzed with respect to colors, textures, and shapes. In this paper we will focus on color-driven blob extraction. Inspiration from the human retina guides the definition of the neighborhood of a ROI, while psychological factors in human color perception are used to drive color selection. Keywords: blob extraction, regions of interest, color perception.
1 Introduction
Studies in Visual Attention (VA) and eye movements [5], [10] have shown that humans generally only attend a few regions of interest (ROIs) in an image, which are determined in part by their information content. Different methods to find ROIs are listed and tested in [7]; we will make use of Michaelson contrast. The idea is to perform image analysis only on a small neighborhood of each ROI: this should decrease processing costs without impacting important image features. The word should is used because our exploration is still at an early stage and we do not have performance figures yet. In this paper we will focus on blob extraction using Michaelson contrast to find ROIs; since the process can easily be extended from one ROI to many, here we will discuss the simpler case only. In particular, the most prominent region of interest will be considered. Related to the concept of ROI is the problem of finding a good definition of neighborhood: given a region of interest (from now on referred to as the focus point), how many pixels should be considered important for effective image processing? How should this neighborhood be shaped? There are of course many answers, but we wanted to push the analogy with the human perceptual system a step further. Inspiration comes from the human retina, the light-sensitive layer at the back of the eye that covers about 65 percent of its interior surface. Photosensitive cells called rods and cones in the retina convert incident light energy into signals that are carried to the brain by the
optic nerve. In the middle of the retina is a small dimple called the fovea or fovea centralis. It is the center of the eye's sharpest vision and the location of most color perception. In fact, while cones are concentrated in the fovea, rods are absent there but dense elsewhere. Measured density curves for the rods and cones on the retina show an enormous density of cones in the fovea; to them is attributed both color vision and the highest visual acuity [3]. We will try to formalize the idea of the different distributions of rods and cones in section 2.3. Section 3 deals with blob extraction, and specifically with color flattening (section 3.1) and histogram processing (section 3.3).
2 Region of Interest Extraction

2.1 Michaelson Contrast

Michaelson contrast C is most useful in identifying high-contrast elements, generally considered to be an important choice feature for human vision. Given L_M, the overall mean luminance of the image I, and L_m, the mean luminance within a 7 × 7 surrounding of the center pixel I_ij,

C_ij = (L_m − L_M)/(L_m + L_M)

The result of Michaelson contrast is a matrix which associates to each pixel of the original color image a measure of how much the luminance changes; the greater the value, the bigger the variation. Examples are given in figure 1.
2.2 Focus Point
The focus point could be chosen as the pixel with the greatest Michaelson contrast, but in general this could lead to bad choices due to noise; to overcome this limitation, a mean filter is used. In the mean filter the value of each pixel is the sum of the values of all its neighbors divided by the neighborhood's cardinality. This can easily be done by convolving C with a square mask mC of size k such that

mC_ij = 1/k², ∀i, j = 1 . . . k

and then choosing the focus point to be the maximum of the resulting matrix. The focus point f_p will be

f_p = max(C ⊗ mC)

where ⊗ is the convolution operator.
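A compact sketch of this ROI selection, assuming a 2-D grayscale luminance array and using SciPy's uniform filter both for the 7 × 7 local mean and for the k × k mask mC (the window sizes are the only free parameters here):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def focus_point(L, k=15):
    """L: 2-D float luminance image; returns the (row, col) of the focus point."""
    Lm = uniform_filter(L, size=7)                    # local mean luminance
    LM = L.mean()                                     # overall mean luminance
    C = (Lm - LM) / (Lm + LM + 1e-12)                 # Michaelson contrast map
    S = uniform_filter(C, size=k)                     # convolution with mC (all 1/k^2)
    return np.unravel_index(np.argmax(S), S.shape)    # fp = max(C (x) mC)
```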
2.3 Mask Construction
The next step is to build the focus matrix.
Fig. 1. Original images (a, b), corresponding Michaelson contrast (c, d), and focus point (red dot in e, f)

Fig. 2. Focus point (a, b) and resulting focus matrix (c, d)
The idea is to weight the blobs based on their distance from the focus point: the further they are from the focus point, the less they weigh. By the weight of a blob we mean the mass of the blob, that is, the number of pixels belonging to it. Usually each pixel is given a unitary weight, but in this case the weight depends on the position of the pixel relative to the focus point. If the focus point changes, the same pixel will likely have a different weight. A bidimensional Gaussian distribution G(x, y) was chosen as the distance function. The weight W_ij of a pixel I_ij is the value of G(i, j) scaled into [0 . . . 1]:

W_ij = G(i, j) = (1/(2πσ²)) e^{−(i² + j²)/(2σ²)}   (1)
Scaling is done via the rule of three. Note that the focus mask is first entirely generated, and then centered on the focus point. The parameters of G are chosen under the assumption that I coincides with the observer's field of view, that is, what we see is exactly what we can see; given k = max(I_width, I_height), and the fact that points at a distance of more than 3σ can be considered effectively 0 [9], then

σ = (k − m)/6   (2)

where m is the fovea size relative to the current field of view. Since the human field of view is about 180°, with the fovea covering about 15°, then
m = (15/180) k   (3)

The reason why we subtract m from k in (2) is that the region representing the fovea in the focus mask has perfect color vision, so its weight is 1. The process of focus matrix generation can be summarized in the following few points:
1. calculate macula size m;
2. generate a (k − m) × (k − m) weight matrix W;
3. uniformly scale W in the closed interval [0 . . . 1];
4. expand W by adding a square m × m matrix of ones at its center.
The focus matrix is then centered in the focus point, as shown in figure 2.
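The four steps above might be sketched as follows (our own implementation under the stated assumptions; the final centering on the focus point is left to the caller):

```python
import numpy as np

def focus_matrix(width, height):
    k = max(width, height)
    m = int(round(15 / 180 * k))                  # fovea size, eq. (3)
    sigma = (k - m) / 6.0                         # eq. (2)
    n = k - m
    ax = np.arange(n) - (n - 1) / 2.0
    x, y = np.meshgrid(ax, ax)
    W = np.exp(-0.5 * (x**2 + y**2) / sigma**2)   # Gaussian weights, eq. (1)
    W /= W.max()                                  # uniformly scale into [0, 1]
    # expand W: insert an m-wide band of ones in each direction at the center,
    # giving the central m x m fovea region a weight of exactly 1
    c = n // 2
    W = np.insert(W, [c] * m, 1.0, axis=0)
    W = np.insert(W, [c] * m, 1.0, axis=1)
    return W                                      # k x k focus matrix
```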
3 Blob Processing

Now that the focus mask has been created, we are able to use it for blob selection. This involves some additional steps, like color flattening, color quantization, and histogram processing.

3.1 Color Flattening
Of the huge number of digital images created each day, only a very limited amount is shot with professional cameras; the great majority is taken with low-cost, low-quality equipment, meaning non-uniform colors and evident noise. To cope with this, the first step is to flatten colors by performing several steps of bilateral filtering [8]. The essence of this approach is to combine a low-pass filter with a range filter:

h(x) = k^{−1}(x) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(ξ) c(ξ, x) s(f(ξ), f(x)) dξ

where

k(x) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} c(ξ, x) s(f(ξ), f(x)) dξ

Simple Gaussian filtering has been used, in which both the closeness function c(ξ, x) and the similarity function s(f(ξ), f(x)) are Gaussian functions of the Euclidean distance between their arguments. Closeness then becomes

c(ξ, x) = e^{−(1/2)(d(ξ,x)/σ_d)²}

where d(ξ, x) = d(ξ − x) = |ξ − x|, while similarity becomes

s(ξ, x) = e^{−(1/2)(δ(f(ξ),f(x))/σ_r)²}

where δ(φ, f) = δ(φ − f) = |φ − f|. To achieve perceptually uniform colors (colors where mathematical distance equals perceptual distance), other color spaces should be used instead of RGB and the strictly correlated HSV and HSL, such as CIE-Lab or CIE-Luv. Anyway, for simplicity HSV has been chosen here, and all the color processing has been done in this space.
3.2 Color Quantization
After color flattening a quantization step is performed: using a uniform quantization approach, the total number of colors is reduced to (at most) 512, or 3 bits per color plane.
3.3 Histogram Processing
Color histogram processing is a key step, and can be divided into two stages: histogram construction and histogram enhancement. Histogram construction means counting the number of pixels of each (quantized) color, giving each pixel an equal weight of 1. Usually the most represented colors are chosen for blob extraction. Here we will use the focus mask built in section 2.3 to give each pixel a different weight depending on its distance from the focus point. This means that a color globally poorly represented but locally close to the focus point has a chance to be selected over a globally well represented but locally distant (from the focus point) color. Typical examples are skies, seas, meadows, et cetera. Histogram enhancement takes its cue from some general agreement on the physical and emotional understanding of certain colors [1], [4], [11]. The color red is associated with fire and is considered an aggressive color, whereas blue is
associated with water and coolness. Warm colors in general call for action [6], and as such we tend to give them priority over cool colors; this idea is captured in the histogram enhancement step by multiplying warm colors by a fixed factor (a sort of bonus). We empirically define as warm those colors having the following HSV values:

0 ≤ x ≤ 60,   x ∈ H, H = {h ∈ R, 0 ≤ h ≤ 360}
0 ≤ y ≤ 0.12, y ∈ S, S = {s ∈ R, 0 ≤ s ≤ 1}
80 ≤ z ≤ 255, z ∈ V, V = {v ∈ N, 0 ≤ v ≤ 255}

Of course other ranges can be defined; an example could be skin detection. After histogram enhancement the n_best most represented colors are selected for blob extraction.
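A sketch of the focus-weighted histogram with the warm-color bonus might read as follows; the 8-level code packing and the bonus value are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def is_warm(h, s, v):
    # empirical warm-color ranges from the text: H in degrees, S in [0,1], V in 0..255
    return 0 <= h <= 60 and 0 <= s <= 0.12 and 80 <= v <= 255

def enhanced_histogram(hsv_q, Wf, bonus=1.5):
    """hsv_q: (H, W, 3) HSV image quantized to 8 levels per plane (values 0..7);
    Wf: focus matrix aligned with the image. Returns a 512-bin weighted histogram."""
    h8, s8, v8 = (hsv_q[..., c].astype(int) for c in range(3))
    codes = (h8 << 6) | (s8 << 3) | v8                  # one code per quantized color
    hist = np.bincount(codes.ravel(), weights=Wf.ravel(), minlength=512)
    for code in range(512):
        # representative color of this bin, mapped back to the original ranges
        h = (code >> 6) / 8.0 * 360
        s = ((code >> 3) & 7) / 8.0
        v = (code & 7) / 8.0 * 255
        if is_warm(h, s, v):
            hist[code] *= bonus                         # fixed "bonus" for warm colors
    return hist
```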
3.4 Blob Extraction
The blob extraction step takes the n_best selected colors and scans the input image to find the biggest n_blobs blobs. For each of the n_best colors, a binary mask representing all the pixels with that color is first computed, and all 8-connected objects are labeled. Labeling is achieved using the technique outlined in [2]. Each blob's mass is then calculated, using the focus matrix weights. The effect is to make bigger but distant blobs less important than smaller but closer blobs (figure 3). Blobs are ordered according to their mass, and only the biggest n_blobs are taken.
Fig. 3. Blob detection without (top figure) histogram enhancement and with (bottom figure) histogram enhancement. Yellow blobs take precedence over dark brown blobs. Focus point corresponds to image’s center. Settings: nbest = 8, nblobs = 4.
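A minimal sketch of this step, substituting SciPy's 8-connected labeling for the labeling technique of [2] (all names are ours, not the author's implementation):

```python
import numpy as np
from scipy.ndimage import label

EIGHT = np.ones((3, 3), dtype=int)               # 8-connectivity structure

def extract_blobs(codes, hist, Wf, n_best=8, n_blobs=4):
    """codes: (H, W) quantized color codes; hist: enhanced histogram;
    Wf: focus matrix. Returns the n_blobs heaviest blobs as boolean masks."""
    blobs = []
    for color in np.argsort(hist)[::-1][:n_best]:
        mask = (codes == color)                  # binary mask for this color
        labels, n = label(mask, structure=EIGHT)
        for lab in range(1, n + 1):
            region = (labels == lab)
            mass = Wf[region].sum()              # focus-weighted blob mass
            blobs.append((mass, region))
    blobs.sort(key=lambda b: b[0], reverse=True)
    return [region for _, region in blobs[:n_blobs]]
```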
4 Conclusions
The goal of this paper, and the ultimate sense of our exploration, is to show that making use of simple facts and observations from the fields of visual perception, visual attention, and the human eye's anatomy can help discriminate important information from "background" information. We applied a few simple concepts to blob processing, and showed how these could be used to select blobs with specific good characteristics over less desirable blobs. We are not able to give performance figures yet, so we cannot state how advantageous our approach is compared to full blob processing. We are aware of the many limitations of our work, but we are also convinced that if we want to successfully accomplish such a difficult task as reproducing human vision, then even simple and marginal facts should be considered.
References 1. Churchland, P.: The Engine of Reason: the Seat of the Soul. MIT Press, Cambridge (1995) 2. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision, vol. 1, pp. 28–48. Addison Wesley, Reading (1992) 3. Hecht, E.: Optics, 2nd edn. Addison-Wesley, Reading (1987) 4. Hundert, E.M.: Lessons from an optical illusion. Harvard University Press, Cambridge (1995) 5. Norton, D., Stark, L.W.: Eye movements and visual perception. Sci. Am. 224, 34–43 (1971) 6. Panchanathan, S., et al.: The Role of Color in Content-Based Image Retrieval. In: Proc. IEEE International Conference on Image Processing, vol. 1, pp. 517–520 (2000) 7. Privitera, C.M., Stark, L.W.: Algorithms for Defining Visual Regions-of-Interest: Comparison with Eye Fixations. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(9), 970–982 (2000) 8. Tomasi, C., Manduchi, R.: Bilateral Filtering for Gray and Color Images. In: Proc. IEEE International Conference on Computer Vision (1998) 9. Wikipedia, http://en.wikipedia.org/wiki/Gaussian_blur 10. Yarbus, A.L.: Eye movements and vision. Plenum, New York (1967) 11. Zeki, S.: A Vision of the Brain. Blackwell, Oxford (1994)
LOGOS: A Multimodal Dialogue System for Controlling Smart Appliances Theodoros Kostoulas, Iosif Mporas, Todor Ganchev, Nikos Katsaounos, Alexandros Lazaridis, Stavros Ntalampiras, and Nikos Fakotakis Artificial Intelligence Group, Wire Communications Laboratory Dept. of Electrical and Computer Engineering, University of Patras, 26500 Rion-Patras, Greece {tkost,imporas,tganchev,nkatsaounos,alaza,dallas, fakotaki}@wcl.ee.upatras.gr
Abstract. The present work details a multimodal dialogue system, which offers user-friendly access to information, entertainment devices and white good appliances. We focus on the speech interface and the spoken dialogue management, with extensive description of the system’s architecture and functionalities. The services supported are detailed, with comprehensive description of the scenarios implemented. Keywords: Smart Home, Dialogue, Spoken Dialogue System, Multimodal Dialogue System, Speech Interaction.
1 Introduction
The increasing use of multimodal dialogue systems, along with the progress of technology, raises the need for user-friendly human-machine interaction. Intelligent interfaces of home appliances provide the means for facilitating the operation of these devices within a dialogue system. Combining speech, which is the most natural means of communication between human beings, with other user inputs, like mouse or keyboard, would ensure successful interaction experiences. To this end, a number of multimodal dialogue systems have been reported. In [1], Lu et al. developed a dialogue system consisting of five main modules: sign-language recognition and synthesis, voice recognition and synthesis, and dialogue control-management. A pair of colored gloves and a stereo camera had been used in order to track the movements of the signer. Sign-language synthesis is achieved by regenerating the motion data obtained by an optical motion capture system. In [2], MiPad is presented, a personal digital assistant which gives users the ability to accomplish many common tasks by using a multimodal interface and wireless technology. This prototype has been based on plan-based dialogue management [3]. Hemsen [4] discussed design decisions for the combination of speech with other modalities, towards designing a multimodal dialogue system for mobile phones. Furui and Yamaguchi [5] introduced a paradigm for designing multimodal dialogue systems, through designing and evaluating a variety of dialogue strategies. The system presented accepts speech and screen touching as input and presents retrieved
information on a screen display. In [6], an architecture that combines finite-state multimodal language processing, a speech-act based multimodal dialogue manager, dynamic multimodal output generation and user-tailored text planning is described. The application provides access to restaurant and subway information. In the present work, a multimodal dialogue system that offers user-friendly access to information and intelligent appliances is reported. This paper is organized as follows: Section 2 describes the system's general architecture. In Section 3 the system's components and their function are detailed. Section 4 refers to the services supported by the LOGOS system, together with the scenarios implemented.
2 System Architecture
The LOGOS smart-home automation system offers user-friendly access to information and entertainment devices, and provides the means for controlling intelligent appliances installed in a smart-home environment. Fig. 1 illustrates the overall architecture of the LOGOS system and its interface to various appliances, such as a high-definition television (HDTV), DVD player, mobile phone, washing machine, refrigerator, etc. The multimodal user interface of the system allows home devices and appliances to be controlled via the usual infrared remote control device, PC keyboard, or spoken language. In the present study, we focus on the speech interface and the spoken dialogue interaction. In detail, the spoken dialogue interaction functions as follows: the speech is captured by a microphone array, which tracks the position of the user as s/he moves. Next, the preprocessed speech signal is fed to the speech recognition component, which utilizes domain-specific language models.
Fig. 1. Architecture of the LOGOS system
The speech understanding component provides the interpretation of the output of the speech recognition component in terms of a concept-command. The dialogue manager generates feedback to the user according to the commands received from the speech understanding component, the device status and other information. This feedback is delivered via the text-to-speech component, which is fed by the natural language generation component. All components of the spoken dialogue system are deployed on a personal computer, and the communication with the intelligent appliances is performed through a set-top box (STB), which provides the interface to the intelligent appliances. The STB component is responsible for synchronization and efficient data flow among the dialogue manager and the smart-home appliances. Moreover, it provides additional visual feedback to the user, which can, optionally, accompany the audio output. The smart-home system is designed with an open architecture, which allows new components to be added to extend the system's functionality. Such additions could be user modeling, speaker recognition, emotion recognition and language identification components (in Fig. 1 these are shown with dashed lines). These components could contribute to a significant improvement of the user-friendliness of the system, as well as of its overall performance. For instance, by utilizing feedback from the emotion recognition component, the system will be able to select an emotion-specific acoustic model to improve speech recognition performance, or to steer the dialogue flow accordingly when negative emotions are detected. The speaker recognition component would provide the means for implementing access authorization for specific commands, but also for intelligent user profiling or the use of user-specific acoustic models, etc. The language recognition component would enable multilingual access to the LOGOS system, i.e., using the acoustic and language models which correspond to the user's spoken language. The information from the aforementioned components feeds the user modeling component. User modeling is responsible for handling all user-related data; thus, user preferences, user-specific settings and interaction schemes can be specified. Therefore, the system is able to adapt to different user requirements.
3 System Components
The present section describes in detail the aforementioned system components, which constitute the LOGOS system architecture. Each of these components is independent of the rest, though interconnected, based on the system architecture.

3.1 User Interface

The user interface components are responsible for realizing the communication between the human and the machine. They consist of the acoustic input-output and the device input-output.

Acoustic Input. To facilitate acoustic input processing, speech enhancement methodologies are employed so as to remove background additive noise. Most of the work has focused on the improvement of synchronous communication devices which operate in noisy environments [7]. The particularity of the speech signal makes one
able to estimate the noise part alone, during speech pauses, a property which is exploited by almost every algorithm. In brief, noise reduction methodologies are divided into the following classes: (i) subspace algorithms, (ii) spectral subtractive algorithms, (iii) Wiener filter-based algorithms and (iv) statistical model based algorithms. The receiver's sensitivity can be reduced in the direction of interference and noise, while concentrating on the desired signals (spatial filtering), by utilizing beamforming algorithms. Subsequently, noise suppressing methods can be used to achieve effective denoising. There are two categories of beamforming techniques: adaptive, which change their parameters based on the received signals, and fixed, which keep them unchanged. A time delay in the time domain represents a negative phase shift in the frequency domain. This fact is utilized by a frequently used beamforming technique called delay-sum beamforming [8], where time delays are applied to the sensors for efficient steering of the area of interest (see the sketch at the end of this subsection). This method is fixed, while the generalized sidelobe canceller (GSC) [9] belongs to the adaptive methods. It encompasses two stages, in which one depends on the other: a fixed beamformer is employed first, while an adaptive part constantly filters the signal so as to assure noise elimination. In this particular task, noise is taken to be every sound whose source is not placed in our area of interest. In order to overcome a number of different limitations deriving from a realistic room environment (such as a generic speaking scenario and room layout, multiple moving speakers and low computational cost), we utilized a commercial microphone array with eight sensors. The device, called Voice Tracker™, achieves high signal-to-noise ratio (SNR) values by concentrating on a specific talker and filtering out background noise and reverberations in the acoustic environment [10].

Acoustic Output. The actions taken by the system are given to the user as the output of two conventional loudspeakers, which are set up outside the range of the microphone array to avoid any interference. They are placed in a symmetrical way, according to the room boundaries, and we tried to keep the acoustic echo level as low as possible.

Device Input-Output. Additional input from the user is retrieved from the IR remote control and the keyboard. These data feed the application manager / STB component, which can either directly execute the retrieved command or pass it to the dialogue manager. In the latter case, the dialogue manager will handle the request by modifying the dialogue flow. Device output consists of the TV screen, where notifications are displayed.
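The delay-and-sum idea mentioned above can be sketched for a linear far-field array as follows; the sensor spacing, sample rate and steering angle are illustrative defaults, not parameters of the Voice Tracker device:

```python
import numpy as np

def delay_and_sum(x, fs, spacing=0.04, angle_deg=0.0, c=343.0):
    """x: (n_sensors, n_samples) microphone signals; returns the beamformed
    signal steered towards angle_deg (plane-wave, far-field assumption)."""
    n_sensors, n_samples = x.shape
    # per-sensor arrival delay of a plane wave coming from angle_deg
    delays = np.arange(n_sensors) * spacing * np.sin(np.radians(angle_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # a time delay corresponds to a negative phase shift in the frequency domain
    X *= np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = np.fft.irfft(X, n=n_samples, axis=1)
    return aligned.mean(axis=0)          # sum (here: average) of the aligned signals
```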
3.2 Speech Recognition Component

The Speech Recognition Component, as well as the Speech Understanding Component, are based on the ©ScanSoft SpeechPearl [11] speech recognizer. The speech recognition process can be divided into two phases, namely preprocessing and decoding. In the preprocessing phase, the speech input is recorded and digitized. Afterwards, an acoustic analysis is performed. Specifically, using the fast Fourier transform (FFT), the front-end processor breaks down the recorded audio sequence into small units of frames, where each frame covers some milliseconds of speech. For each frame, a feature vector is created. Each feature vector contains frequency and energy information. The frames overlap each other, by using a sliding window. The feature vectors are the input of the decoding process. In the decoding phase, a word graph is built on the basis of an acoustic model, an application lexicon and a language model. The acoustic model is based on the widely used hidden Markov models (HMMs) and consists of one HMM model for each uniphone and triphone of the language. Furthermore, HMM models for whole words such as digits, yes/no and natural numbers exist. Using the Viterbi algorithm, the feature vector sequence is compared against the HMM models. The Viterbi algorithm searches for the most probable word observations in the application lexicon. The lexicon is organized as a tree and contains only the words that are part of the related task grammar. In addition to the acoustic score, a language model score is counted in. The language model consists of word bigrams, concept bigrams and rule probabilities. The Speech Recognition Component combines the acoustic and language scores to construct a word graph. The word graph contains nodes connected by edges, which are labeled with the recognized words. This structure represents a set of the most probable spoken word sequences. Finally, the n most probable paths through the word graph are sorted into an n-best list in order to be further used by the Speech Understanding Component.

3.3 Speech Understanding Component

When the speech recognition process is finished, the resulting word graph is passed to the speech understanding component for further processing. The speech understanding component searches for meaningful word sequences by parsing the incoming word graph and using the related grammar to find matching concepts. A concept is a non-terminal that defines a certain word or word sequence, forming a sense unit within a sentence. Concepts are computed according to the rules defined in the corresponding grammar. The defined grammars can have multiple concepts, which may occur in arbitrary order. Any parts of the sentence that are not covered by a concept are taken as so-called fillers. Fillers are not taken into account when processing the meaning of the sentence. From the found concepts and the remaining fillers the speech understanding component creates an acyclic directed graph called a concept graph. The concept graph represents the most probable spoken sentences found by the speech understanding module. Fillers are taken as meaningless gaps in the concept graph. The optimal path through the concept graph is determined by internal scores of the speech understanding module. The nodes of the concept graph are identical to those of the word graph; its edges are labeled not with words but with concepts. Concepts will not always be adjacent to one another. In any case, finding an instance of a particular concept does not necessarily mean that the corresponding words were actually said, i.e., their inclusion in the word graph may be a result of recognition errors. The sentence alternatives according to the concepts from the concept graph are presented in an n-best list. This list is sorted by internal scores and starts with the best matching sentence alternative.
In addition, semantic information can be included in a grammar. In this way, it is possible to interpret the recognition results at a high level of performance. Concepts can define attributes which contain the semantic information. The grammar then contains information on how to compute attribute-value pairs. The understanding result provides the computed attribute-value pairs. This greatly improves the interpretation of a recognition result.

3.4 Natural Language Generation Component

Natural Language Generation (NLG) is the research area investigating the ability of the machine to produce high-quality natural language text from a machine representation system, such as a knowledge base or a logical form. There are four major categories in NLG, concerning the degree of complexity and flexibility of each method [12]: the canned text method (simply printing a sequence of words from an existing list), the template-based method (use of pre-defined templates or scenarios), the phrase-based method (using generalized templates) and the feature-based method (each expression is represented by a feature; features are combined to create a sentence) [13]. In the present work we adopt the template-based method, since it suits the needs of the provided services.
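A minimal template-based generator in the spirit of this description might look as follows; the templates and slot names are illustrative, not the actual LOGOS inventory:

```python
# Hypothetical templates keyed by dialogue intent; slots are filled per turn.
TEMPLATES = {
    "device_list":   "The available devices are {devices}.",
    "device_status": "The current status of {device} is {status}.",
    "fridge_temp":   "{device} current temperature is {temp} Celsius.",
}

def generate(intent: str, **slots) -> str:
    return TEMPLATES[intent].format(**slots)

# e.g. generate("device_status", device="WashingMachine1", status="Ready")
# yields an assistant turn like those of Table 1 below.
```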
3.5 Text to Speech Component

Text to Speech (TtS) conversion is the process of converting a string into spoken language. The TtS of the LOGOS system is a diphone based residual excited linear predictive coding (LPC) synthesis technique based on the Festival Speech Synthesis system [14]. A diphone is a unit of speech starting in the middle of one phone and ending in the middle of the next one. The first phase of the realization of the Greek TtS component is to parameterize all the diphones. Each diphone is divided into pitch-synchronous frames and, for each frame, the LPC coefficients and the residual signal are extracted in order to be used during the synthesis stage. The second stage is interrelated with the prosody of the synthesized speech. In order to produce synthetic speech that is as natural as possible, emphasis must be given to the duration model of the system. The duration model copes with the task of determining the length of phones in speech, considering various levels of signal representation [15]. The classification and regression tree (CART) [16] machine learning method is used for the duration modeling. The feature set for training the model is composed of features which can be extracted only from text, such as phonological, morphological, linguistic and syntactical features. For some features, a window around the investigated phone was applied, exploiting knowledge about the characteristics of the neighboring phones [16].

3.6 Application Manager / STB

The application manager / STB component realizes the efficient synchronization of data among the dialogue manager and the smart-home appliances. This component keeps track of the current state of each device, providing this information to the dialogue manager when such a request arises. Moreover, the component provides visual feedback to the user, realizing the graphical user interface. Additionally, the component is fed by the dialogue manager, passing the appropriate commands to the connected intelligent appliances.

3.7 Dialogue Manager

The dialogue manager component determines the tasks that should be activated, taking into consideration the dialogue history and the current user input. At each stage of the dialogue flow, the dialogue manager provides the speech recognition component with the information necessary for loading the corresponding language models, in order to achieve high recognition accuracy and reliability. This language model is chosen with respect to the expected action-request of the user. The model relies on a text corpus built from all the possible ways the user can express her/his request. The dialogue manager retrieves information concerning the current status of all the devices from the application manager/STB. Moreover, it feeds the application manager/STB with the appropriate commands, so as to control the connected devices according to the user's request. The feedback to the user is adapted to the orientation of the current task, i.e., when needed, audio and video/text outputs coexist. As the human-machine interaction progresses within a session, the dialogue manager builds a tree whose nodes constitute the dialogue history. When needed, based on the scenario implemented, the user can abort the procedure or change the dialogue flow. The implemented dialogue relies on the mixed-initiative model. According to the mixed-initiative dialogue model, the dialogue steps and actions can be designed in a more flexible and dynamic way. Specifically, the system either prompts or waits for the user to make her/his request. The system decides its next step dynamically, according to the semantic information extracted from the user's utterance by the speech understanding module. The mixed-initiative dialogue model offers the ability of indirect confirmation, i.e., the dialogue corrects itself by using the user's requests, without explicitly prompting the user to clarify the last request.
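To make the flow concrete, here is a toy skeleton of such a manager (dialogue-history tree, per-state language model selection, concept-driven flow). Its structure and the app_manager.execute call are our assumptions for illustration, not LOGOS code:

```python
class DialogueNode:
    """One turn in the dialogue-history tree."""
    def __init__(self, state, parent=None):
        self.state, self.parent, self.children = state, parent, []

class DialogueManager:
    def __init__(self, app_manager):
        self.app = app_manager                 # application manager / STB proxy (hypothetical)
        self.current = DialogueNode("idle")    # root of the dialogue history tree

    def language_model_for(self):
        # tell the recognizer which domain-specific model to load for the
        # expected request at the current dialogue state
        return {"idle": "lm_commands", "sms_body": "lm_dictation"}.get(
            self.current.state, "lm_commands")

    def handle(self, concept):
        # decide the next step dynamically from the semantic concept; an
        # unexpected but well-formed request simply redirects the flow
        # (indirect confirmation) instead of prompting for clarification
        node = DialogueNode(concept["intent"], parent=self.current)
        self.current.children.append(node)
        self.current = node
        return self.app.execute(concept)       # forward command, get device status
```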
4 Dialogue Flow Scenarios
The LOGOS system supports two voice-controlled services: (a) home devices control and (b) SMS messaging. In the home devices control service, the user is able to control entertainment devices and white good appliances. The first scenario implemented provides access to the TV tuner via voice. The home user interacts with the dialogue system, issuing commands to check which of the available TV channels has a movie starting within the next hour. The LOGOS system asks the user whether s/he prefers a foreign or a Greek movie. The user makes a decision and the assistant informs her/him about the available movies, so that the user can choose the one s/he wishes to watch. Then, the system informs the user of the exact time the movie starts. The home user may decide to switch to the specific channel and can increase/decrease the volume of the TV. The second scenario provides stored video control. The home user subsequently issues commands to check the available stored movies. The system asks the user if
s/he prefers a Greek or non-Greek movie. The user makes a selection and the system informs her/him about the available movies. The user can decide to watch one movie, so s/he issues the relevant command to start watching it. Afterwards, s/he can control the volume of the projection. In the third scenario, the user has the ability to control and monitor the white good appliances connected to the application manager / STB component. The user can issue commands to retrieve a list of all the connected devices. The system informs the user which devices are available. Then, the user can issue commands to get informed about the current status of each device. Moreover, the user can control a selected device. An example dialogue of the aforementioned service is shown in Table 1.

Table 1. Example dialogue of the white good appliances control

Turn  Dialogue
1. [User] Assistant, please provide me the list of the available devices.
2. [Assistant] The available devices are "Fridge1" and "WashingMachine1".
3. [User] Assistant, which is the current status of "WashingMachine1"?
4. [Assistant] The current status of "WashingMachine1" is "Wash" or "Ready".
5. [User] Assistant, please "Start" or "Stop" the "WashingMachine1".
   The new status of "WashingMachine1" is displayed on the TV screen for verification.
6. [User] Assistant, which is the current temperature of "Fridge1"?
7. [Assistant] "Fridge1" current temperature is "x.x" Celsius.
8. [User] Assistant, please decrease "Fridge1" temperature.
   The application is programmed to decrease the temperature by 1 degree Celsius (default) at every call.
9. [User] Assistant, please start monitoring "Fridge1" temperature.
   No reply from the Assistant is needed, since the outcome will be displayed on the TV screen. When a change in the temperature of the fridge takes place, a notification will be displayed on the screen.
In the SMS messaging service the home user may read, compile and send an SMS. In the scenario implemented, the user decides to send an SMS to a target user in order to inform her/him of the movie that s/he prefers to watch and to ask the target user if s/he wishes to join. During the SMS compilation the user is prompted to utter the SMS. After the SMS compilation, the home assistant reads the content of the SMS back to the user. Afterwards, the home user is prompted to provide the name or the phone number of the target user (if the name of the target user is not listed in the list of known SMS recipients). After successful compilation, the home user prompts the home assistant to send the SMS to the target user. On successful reception, the target user replies to the SMS with a confirmation that s/he will join the home user to watch the movie. The LOGOS multimodal dialogue system informs the home user via audio that a new SMS has been received from the target user, and also by displaying a text notification on the TV display. When the home user observes this notification, s/he commands the home assistant to read/display and then delete the received SMS from the LOGOS system inbox. Table 2 illustrates an example dialogue for compiling an SMS.
Table 2. Example dialogue for compiling SMS

1. [User] Assistant, please compile an SMS.
2. [Assistant] SMS Body?
3. [User] "yyyy" will start at … . Do you wish to join?
4. [Assistant] Name?
5. [User] * … .
6. [Assistant] Unknown recipient. Please, provide phone number.
7. [User] ** 697xxxxxxx.
8. [Assistant] … , 697xxxxxxx. Do you confirm?
9. [User] *,** Yes.
10. [User] * Send SMS.
11. [Assistant] SMS has been successfully sent.

Alternative dialogues:
[User] *: Cancel SMS. Then, the user needs to issue the compile SMS command in order to start again.
[User] **: No. Then, the dialogue goes to turn 4.
5 Conclusion

In the present work, a multimodal dialogue system that provides the means for user-friendly access to information and smart appliances was presented. The smart-home system is designed with an open architecture, thus allowing the integration of additional components (e.g., speaker, language, or emotion recognition components) for extending the system's functionalities. We deem that these components would contribute towards enhancing the system's performance and lead to more successful interaction experiences.
One-Channel Separation and Recognition of Mixtures of Environmental Sounds: The Case of Bird-Song Classification in Composite Soundscenes

Ilyas Potamitis
Department of Music Technology and Acoustics, Technological Educational Institute of Crete, Greece
[email protected]
Abstract. In this paper, we address the problem of automatic separation and recognition of bird vocalizations in a mixture of other environmental sounds using a single microphone. We present a novel single-channel audio separation method that trains statistical models of the sources to perform the separation in a feature space of low dimensionality, and we lead the separated streams to automatic recognition engines that recognize general sound events. The experimental part tests and evaluates the system on mixtures of bird and insect songs as well as dog barks and other environmental sounds. Keywords: Acoustic signal processing, one-channel separation, automatic sound recognition.
1 Introduction

A sound source that emits consistent acoustic patterns has a very distinctive way of distributing its energy over its frequency content, which constitutes its so-called 'spectral signature'. The spectral signature comprises a unique pattern that can potentially be revealed and subsequently automatically identified by employing statistical pattern classification techniques¹. Automatic pattern recognition of general acoustic events is a suitable diagnostic tool for applications involving biodiversity assessment [1], environmental monitoring [2], as well as audio context categorization [3]. However, much of the reported research work is more or less laboratory-based, focusing on deriving suitable features [2], [3] and investigating classifiers [3], [4] on the problem of classifying sound events that are dominated by sounds belonging to a single acoustic class. The present work reports results towards extending generalized sound recognition to field applications by considering the case of composite sound scenes (e.g., a bird and an insect are 'singing' while a dog is barking). The ultimate goal is to develop a system for the automatic acoustic recognition of a specific sound event of interest in a composite sound mixture (e.g., bird vocalisations). The objective of that system is the automatic recognition of species solely from the remote acoustic signal of the bird, by employing suitable microphones and supervised classification algorithms.
¹ This work was sponsored by the Greek General Secretariat for Research & Technology under the call: "Joint research and technology programmes, 2006 – 2008, Novel time-frequency analysis tools for efficient bird-song classification".
Our implementation is based on extending recent advances in one-channel signal separation and combining them with recognition tactics from the speech/speaker recognition research area. Although single-channel separation has been proposed for the separation of speech signals and musical instruments [5-10], to the best knowledge of the author its applicability as a preprocessing phase for the generalized recognition of composite environmental sounds has not been reported yet. Single-channel sound separation is possible when models or representational bases are available for the sources that constitute the mixture of a sound scene. Several methods have been proposed for producing these bases: e.g., in [10], class-specific representational bases are derived from different sound classes, whereas in [5-9] a probabilistic framework based on statistical models for each class of sound events that appears in the sound mixture is proposed. The approach proposed in this paper offers several extensions to some state-of-the-art one-channel separation algorithms [7], [8], as well as to the recognition of environmental sound events in general, in the sense that:

• the separation process takes place on the Mel-filterbank spectrum (a stage of the recognition process) and not at the short-time Fourier transform (STFT) stage, as we aim at recognition and not at reconstruction of the original sources;
• the diagonal covariance matrices of the Mel-frequency cepstral coefficients (MFCCs), and not the large covariance matrices of the STFT, are used for the separation process.

To be able to apply one-channel separation using the MFCC models, we transform the mixture models from the cepstral domain to the linear spectral domain. The advantage of this approach is twofold. First, the recognition takes place in the MFCC feature space. These features are the most widely used features in speech recognition and have been proven to produce a much more compact statistical description of the sound sources than the STFT domain. Secondly, the separation takes place in a space of much lower dimensionality, which entails much lower computational demands. We propose a straightforward generalization of one-channel separation to more than two sources. The extension of single-channel separation based on GMM models does not scale well for more than two audio sources, mainly due to computational complexity. Our formulation takes place on the power of the STFT, leading to novel gain functions. We propose that the reduced dimensionality of the problem under our formulation, as well as the fact that environmental and animal vocalizations demonstrate smaller variability compared to musical instruments, vocals and speech, allows this constraint to relax. The separated audio channels of the audio mixture are led to a recognizer that follows the speech/speaker recognition paradigm. As speech processing has been a major research area for several decades, the accumulated experience provides grounds for a boost in the area of generalized recognition of sound events. We investigate the applicability of already successful automatic speech and speaker recognition techniques to the recognition of bird vocalisations. Concretely, the identification is based on the statistical comparison of the time-frequency characteristics of the recordings of the unknown species with the time-frequency characteristics that have been extracted from birds' recordings of known identity.
The final system produces a file with the recognized species of bird and a measure of the ambiguity of recognition.
2 Problem Formulation

Sound events, most of the time, do not appear in isolation but are part of an audio mixture. The automatic recognition of a particular sound event in a soundscene is not a trivial task, as the sounds other than the one(s) we are interested in recognizing are considered noise that degrades the recognition performance. The purpose of this work is to build an automatic recognition system for generalized sound events in composite soundscenes. We consider a soundscene to be an audio mixture that is produced by a process that switches in time between distinct sound events while combining a number of them (see Fig. 1). The particular sources that constitute the mixture are thought to be realizations of distinct sound classes, properly scaled as regards the gain. In this work, the statistical properties of each sound class are represented by a GMM trained on audio corpora of different instances of each sound class. The aim of the GMM is to represent, hopefully, all possible realizations of a particular class and, therefore, the particular realisation taking part in the audio mixture. Our approach first decomposes the audio mixture into its independent components and then leads the independent sound streams to the recognition engine. This work aims at automatic recognition of sound events and not at reconstruction of the original sources. We take advantage of this fact in order to formulate the separation stage on the power of the audio signal after it is filtered by a bank of triangular bandpass filters on a Mel-frequency scale. In this way, we incorporate one-channel separation techniques into the feature extraction process of the recognizer.

The recognition process follows the paradigm of speech recognition and, therefore, takes place in the feature space of MFCCs. The MFCC feature extraction is composed of several steps. First, the time-domain signal is segmented into overlapping chunks called frames. The magnitude or power of the STFT is derived from these frames and passed through a triangular Mel-scale filterbank. Subsequently, the log operator is applied and the features are decorrelated by applying the discrete cosine transform (DCT). The MFCC domain offers the advantage of building more compact models than the STFT for three reasons: a) the recognition relies on good discriminative performance that is based on a coarse description of the power spectrum, and the dimensionality of the STFT is greatly reduced through a filterbank (from 256 to 23 if we apply a 512-sample FFT and a filterbank of 23 bands), allowing the training of more efficient models; b) the log operator after the filterbank reduces the large variation of the power spectrum, allowing for more compact statistical models; c) finally, the decorrelating function of the DCT operator in the MFCC derivation process allows the use of diagonal covariance matrices, which allow for more efficient training.

As the separation process makes use of the GMMs of the trained models and takes place on the linear spectrum after the filterbank stage, while the recognition takes place at the MFCC level, we need to transform the mixture parameters (means and covariances) from the cepstral domain to the linear spectral domain. Subsequently, we apply the separation process and, finally, we apply the log operator and the DCT to the separated feature streams to transform them back to the cepstral domain in order to carry out the normal recognition processes (see Fig. 2 for an illustration of the separation-recognition process).
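To make this front-end concrete, the following minimal Python sketch (not the author's code; numpy/scipy are assumed, and the filterbank construction is a common simplification) computes MFCC frames with the parameter values reported later in the experimental section (8 kHz audio, 512-sample FFT, 256-sample Hamming window with 50% frame shift, 23 Mel bands, 23 cepstral coefficients):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # standard Mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands=23, n_fft=512, sr=8000):
    # triangular filters equally spaced on the Mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2))
    bins = np.floor(edges / (sr / 2.0) * (n_fft // 2)).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        if mid > lo:
            fb[b, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fb[b, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    return fb

def mfcc_frames(signal, sr=8000, frame_len=256, hop=128, n_fft=512, n_bands=23):
    fb = mel_filterbank(n_bands, n_fft, sr)
    win = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2  # |STFT|^2 of the frame
        mel_power = fb @ power                          # Mel-linear spectral domain
        log_mel = np.log(mel_power + 1e-10)             # log compression
        feats.append(dct(log_mel, norm='ortho'))        # DCT -> 23 MFCCs
    return np.array(feats)
```

The separation described in the following sections operates on the intermediate `mel_power` vector, i.e., before the log and DCT stages.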
Fig. 1. Composition of a soundscene. A soundscene is a mixture of sound events, each one thought to be a specific realisation of a sound class. [Block diagram: statistical models of distinct sound sources (GMM 1 ... GMM K); selection of models; sound scene, e.g., "dog barking and bird singing on a rainy day".]

Fig. 2. Structure of the separation-recognition process. [Block diagram: STFT → ( )² → Mel scale → single-channel source separation (driven by GMM 1 ... GMM K through the inverse mapping from the cepstral to the Mel-linear spectral domain) → Log( ) → DCT → recogniser.]
The role of the GMMs is twofold. First, they represent distinct classes of events incorporated into the recognition machine, where the separated streams are led after the audio separation stage. Secondly, the GMMs constitute the prior knowledge we have of the different audio classes. The central idea is that the spectral content of each frame of the mixture is composed of properly combined power spectral densities. These spectral densities are directly connected to the means and covariance matrices of the mixtures (see Section 3 for details) and, therefore, the covariances can be used to construct Wiener filters that are applied to the audio mixture to separate it into independent streams. To allow the transformation from the cepstral domain to the Mel-linear spectrum, we follow the initial stage of the parallel model combination technique [11], which offers a closed-form formulation for transforming the mixture parameters. The detailed procedure is given in Figure 2 and is achieved by using the following formulas. We describe the procedure for one source. Let $\{\mu_i^C, \Sigma_i^C\}$ be the mean vector and diagonal covariance matrix, respectively, for one model in the cepstral domain. The subscript 'i' denotes the mixture index, i = 1,..,k, where k is the number of mixture components, and the superscript 'C' indicates the cepstral domain.
The means and variances of the cepstral domain are transformed to the log-spectral domain (indicated by the superscript logS) through an inverse DCT transformation [11]:
$$\mu_i^{\log S} = C^{-1}\mu_i^C, \qquad \Sigma_i^{\log S} = C^{-1}\,\Sigma_i^C\,\left(C^{-1}\right)^T \qquad (1)$$

where $C^{-1}$ denotes the inverse DCT function. The following exponential transformation transforms the means and variances of the log-spectral domain to the Mel-linear spectral domain (indicated by the superscript S):

$$\mu_i^S = \exp\!\left(\mu_i^{\log S} + \tfrac{1}{2}\Sigma_{ii}^{\log S}\right), \qquad \Sigma_{ij}^S = \mu_i^S\,\mu_j^S\left(\exp\!\left(\Sigma_{ij}^{\log S}\right) - 1\right) \qquad (2)$$
The subscripts 'ij' indicate the components of a vector or a matrix. One should note that the diagonal cepstral-domain covariance matrices are transformed into full covariance matrices in the linear spectral domain. The mean vectors and covariance matrices of (2) are used in the one-channel separation algorithm described in Section 3.
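In code, this mapping could be sketched as follows (our illustration, not the paper's implementation; the orthonormal DCT-II convention of scipy is assumed, and `mu_c`/`sigma_c_diag` hold the cepstral-domain mean and diagonal covariance of one mixture component):

```python
import numpy as np
from scipy.fftpack import idct

def cepstral_to_mel_linear(mu_c, sigma_c_diag):
    # Eq. (1): mu^logS = C^{-1} mu^C,  Sigma^logS = C^{-1} Sigma^C (C^{-1})^T
    n = len(mu_c)
    C_inv = idct(np.eye(n), norm='ortho', axis=0)  # matrix form of C^{-1}
    mu_log = C_inv @ mu_c
    sigma_log = C_inv @ np.diag(sigma_c_diag) @ C_inv.T
    # Eq. (2): log-normal moment matching into the Mel-linear domain
    mu_s = np.exp(mu_log + 0.5 * np.diag(sigma_log))
    sigma_s = np.outer(mu_s, mu_s) * (np.exp(sigma_log) - 1.0)
    return mu_s, sigma_s  # the linear-domain covariance is full, not diagonal
```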
3 Mathematical Formulation

We present the mathematical formulation for two sources and show the straightforward generalization to any number. Let x(t,f) denote the STFT of the mixture, s1(t,f), s2(t,f) two independent signal sources, and n(t,f) Gaussian noise. Then

$$x(t,f) = s_1(t,f) + s_2(t,f) + n(t,f) \qquad (3)$$

From (3) we can move to the power spectrum, where the expected value of the cross terms is zero. If we denote by x, s1, s2, n the corresponding power vectors after the Mel filterbank, then

$$x = s_1 + s_2 + n \qquad (4)$$
We model s1, s2 as Gaussian mixtures. Therefore:
$$p(s_1) = \sum_i w_i^1\, G\!\left(s_1;\ \mu_{1,i}^S,\ \Sigma_{1,i}^S\right), \qquad p(s_2) = \sum_j w_j^2\, G\!\left(s_2;\ \mu_{2,j}^S,\ \Sigma_{2,j}^S\right)$$

where the weights satisfy $\sum_i w_i^1 = 1$ and $\sum_j w_j^2 = 1$. A power vector of the mixture is thought to be created by first selecting a component of $s_1$ with probability $w_i^1$ and then producing an observation following $G\!\left(s_1;\ \mu_{1,i}^S,\ \Sigma_{1,i}^S\right)$. The same procedure is followed for $s_2$, and the results are combined according to (4). Given the state i of $s_1$ and j of $s_2$, then
$$p(s_1 \mid i) \propto \exp\!\left[-\tfrac{1}{2}\left(s_1 - \mu_{1,i}^S\right)^T\left(\Sigma_{1,i}^S\right)^{-1}\left(s_1 - \mu_{1,i}^S\right)\right]$$

and

$$p(s_2 \mid j) \propto \exp\!\left[-\tfrac{1}{2}\left(s_2 - \mu_{2,j}^S\right)^T\left(\Sigma_{2,j}^S\right)^{-1}\left(s_2 - \mu_{2,j}^S\right)\right].$$

From (4), if we are given i, j, then

$$p(x \mid i, j) = G\!\left(x;\ \mu_{1,i}^S + \mu_{2,j}^S + \mu_n,\ \Sigma_{1,i}^S + \Sigma_{2,j}^S + \Sigma_n\right)$$

and

$$\gamma_{i,j}(x) = p(i, j \mid x) \propto p(x \mid i, j)\, p(i)\, p(j) = w_i^1 w_j^2\, p(x \mid i, j).$$
$\gamma_{i,j,..,\kappa}(x)$ denotes how probable it is that the combination i, j of mixture components has produced the power of the currently processed mixture frame. By calculating $\log p(s_1, s_2 \mid x, i, j)$, we have:
$$-2\log p(s_1, s_2 \mid x, i, j) = \left(x - s_1 - s_2 - \mu_n\right)^T\Sigma_n^{-1}\left(x - s_1 - s_2 - \mu_n\right) + \left(s_1 - \mu_{1,i}^S\right)^T\left(\Sigma_{1,i}^S\right)^{-1}\left(s_1 - \mu_{1,i}^S\right) + \left(s_2 - \mu_{2,j}^S\right)^T\left(\Sigma_{2,j}^S\right)^{-1}\left(s_2 - \mu_{2,j}^S\right) + C$$
where C denotes constant terms. Given i, j, by setting the derivative of the above expression with respect to $s_1$ and $s_2$ to zero, while setting the noise to zero, we get:
$$E(s_1 \mid x, i, j) = \Sigma_{1,i}^S\left[\Sigma_{1,i}^S + \Sigma_{2,j}^S\right]^{-1}\left(x - \mu_{2,j}^S\right) + \Sigma_{2,j}^S\left[\Sigma_{1,i}^S + \Sigma_{2,j}^S\right]^{-1}\mu_{1,i}^S$$

$$E(s_1 \mid x) = \int s_1\, p(s_1 \mid x)\, ds_1 = \int s_1 \sum_i \sum_j \gamma_{i,j}(x)\, p(s_1 \mid x, i, j)\, ds_1$$

$$E(s_1 \mid x) = \sum_i \sum_j \gamma_{i,j}(x)\left[\Sigma_{1,i}^S\left(\Sigma_{1,i}^S + \Sigma_{2,j}^S\right)^{-1}\left(x - \mu_{2,j}^S\right) + \Sigma_{2,j}^S\left(\Sigma_{1,i}^S + \Sigma_{2,j}^S\right)^{-1}\mu_{1,i}^S\right] \qquad (5)$$

and $\gamma_{i,j}(x) = w_i^1 w_j^2\, G\!\left(x;\ \mu_{1,i}^S + \mu_{2,j}^S,\ \Sigma_{1,i}^S + \Sigma_{2,j}^S\right)$ for diminishing noise.

The subscripts i, j are indices running over the mixture components of each source. After normalization, $\gamma_{i,j}$ has the dimensions of a probability and, in our formulation, (5) is applied only when $\gamma_{i,j}(x) > 0.0001$; that is, Wiener gains that belong to component pairs with small probability are not evaluated. This simple threshold results in a significant speed-up with an insignificant drop in the separation results. The extension to an arbitrary number M of sources is straightforward and, due to lack of space, we give here only the final formula:

$$E(s_1 \mid x) = \sum_i \sum_j \cdots \sum_{\kappa} \gamma_{i,j,..,\kappa}(x)\left[\Sigma_{1,i}^S\left(\sum_{m=1}^{M}\Sigma_{m,m_i}^S\right)^{-1}\left(x - \sum_{m=2}^{M}\mu_{m,m_i}^S\right) + \sum_{m=2}^{M}\Sigma_{m,m_i}^S\left(\sum_{m=1}^{M}\Sigma_{m,m_i}^S\right)^{-1}\mu_{1,i}^S\right]$$

$$\gamma_{i,j,..,\kappa}(x) = w_i^1 w_j^2 \cdots w_{\kappa}^{\kappa}\, G\!\left(x;\ \sum_{m=1}^{M}\mu_{m,m_i}^S,\ \sum_{m=1}^{M}\Sigma_{m,m_i}^S\right)$$

where m = 1,..,M and $m_i$ runs cyclically over the indices i, j,..,κ (i.e., for m = 1, $m_i$ = i; for m = 2, $m_i$ = j, etc.). One should note that our formulation is different from the work reported in [7-8]: in this work, the descriptive statistics of the GMMs include the mean centers, as the Gaussians are trained on the real-valued power spectrum, whereas in [7-8] the
formulation takes place in the complex STFT domain. This difference leads to a different concept of covariance, as well as to different gain functions.
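As a sketch (not the author's implementation), estimator (5) with the γ pruning described above could be coded as follows, assuming each source model is given as a list of (weight, mean, full covariance) triples already mapped to the Mel-linear domain, and diminishing noise:

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_source1(x, gmm1, gmm2, gamma_floor=1e-4):
    # gmm1, gmm2: lists of (w, mu, Sigma) in the Mel-linear spectral domain
    gammas, pairs = [], []
    for w1, m1, S1 in gmm1:
        for w2, m2, S2 in gmm2:
            # gamma_ij(x) = w1_i w2_j G(x; mu1_i + mu2_j, Sigma1_i + Sigma2_j)
            g = w1 * w2 * multivariate_normal.pdf(
                x, mean=m1 + m2, cov=S1 + S2, allow_singular=True)
            gammas.append(g)
            pairs.append((m1, S1, m2, S2))
    gammas = np.asarray(gammas)
    gammas /= gammas.sum() + 1e-300        # normalize to probabilities
    s1_hat = np.zeros_like(x)
    for g, (m1, S1, m2, S2) in zip(gammas, pairs):
        if g < gamma_floor:                # skip improbable Wiener gains
            continue
        inv = np.linalg.inv(S1 + S2)
        s1_hat += g * (S1 @ inv @ (x - m2) + S2 @ inv @ m1)  # Eq. (5)
    return s1_hat                          # E[s1 | x] for this frame
```

Applying the log operator and the DCT to `s1_hat` returns the separated stream to the cepstral domain for recognition.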
4 Experimental Results

We consider a soundscene to be a mixture of sound events, although we do not actually know which sources are combined to create the mixture. The number of possible combinations of k sources taken out of N classes is N!/(k!(N−k)!); one can observe that the number of possible combinations grows significantly as N increases. For the evaluation of the proposed approach, we considered mixtures created by up to four different sources; that is, we evaluated all cases of mixtures of two, three and four sources. (For the N = 4 classes used here, this amounts to 6 two-source, 4 three-source, and 1 four-source combinations, i.e., the 11 mixtures of Table 1.) Each mixture is processed to the Mel-linear spectral level, where one-channel separation is applied, and subsequently each feature stream, after being mapped to the MFCC domain, is led to the pool of the N GMM models for recognition. Testing the system implies finding the GMM that produces the highest probability D(x) for the data under test. Thus
$$D(x) = \arg\max_{i=1,..,N}\left\{\, p(k_i)\, p(x \mid k_i) \,\right\}$$

is applied to distinguish the class $k_i$ to which the input vector x belongs, where $p(k_i)$ is the a priori probability of class $k_i$. The sound classes are composed of (see the Appendix for characteristic spectrograms of the animals' acoustic emissions): Bird recordings: from the BBC collection of Seasonal Birdsong, BBC CD-846 (86 recordings, mean duration 17 seconds each). Insect recordings: sounds of cicadas from the SINA insect sound collection of crickets, katydids, and cicadas from North America, http://buzz.ifas.ufl.edu. Dog barks: available from recordings on the internet (40 recordings, mean duration 15 seconds each). Rain recordings: available from recordings on the internet (40 recordings, mean duration 15 seconds each). All recordings are mono, downsampled to 8000 Hz using 16 bits.

For each source combination, we perform 20 experiments with randomly selected recordings of each class and we average the results. For each of the 20 experiments, we evaluated three cases: recognition based on 10 frames, on 50 frames, and on the whole recording. The possible combinations are indicated by the first letter of each class; e.g., b_d stands for the birds & dogs combination. All combinations are gathered in Table 1, where we depict results illustrating the performance for mixtures of up to four sources. As the number of frames included in the recognition stage increases, the separation artifacts are smoothed out and the system achieves high recognition scores. As regards the signal pre-processing of the recordings, we used an FFT of 512 samples and a Hamming window of 256 samples with a frame shift of 50%. The STFT coefficients are passed through a Mel-scale filterbank that reduces the dimensionality to 23. The number of cepstral coefficients that are kept is also 23.
Table 1. Recognition results for all mixtures. b stands for birds, c for cicadas, d for barking dogs, r for rain.
Mixture     Recognition Accuracy (%)
            10 frames   50 frames   recording
b_c         95          97.5        100
b_d         100         100         100
b_r         85          95          97.5
c_d         90          90          95
c_r         72.5        85          85
d_r         85          87.5        88.5
b_c_d       92.5        96.67       97.5
b_c_r       91.67       95          97.5
b_d_r       98.33       100         100
c_d_r       91.2        95          96
b_c_d_r     95          96.67       97.5
Fig. 3. Separation results for a mixture of three sources in the Mel-spectral domain. The upper three sub-figures are the original sources (dog, cicada, bird). The middle is the mixture. The bottom three sub-figures are the corresponding separated streams.
No pre-emphasis is applied. We used a standard version of the Expectation-Maximization algorithm, performing 50 iterations with K-means initialization [12]. The number of Gaussian mixture components for all sources is 32. In the experimental set-up, we estimate the feature streams corresponding to the different sound classes in the separation step from the audio mixture, using the source models of each class as prior knowledge. At this point, we must stress the fact that the recordings under test were not included in the training set of the prior models. Although this is the typical procedure for recognition tasks, in one-channel separation applied to music excerpts or vocals the prior models are trained on the same instruments [7] or gender [6] in order to obtain a good separation result. In Fig. 3 we depict results for the separation of three sources in the Mel-spectral domain, where we can observe that the spectral cues of each source are clearly separated. It is proposed that one-channel separation and recognition of environmental sound sources can achieve good separation results for many sources, due to the fact that animal vocalizations and physical phenomena do not demonstrate the complexity and variability of speech and music recordings, but tend to occupy a rather limited area of the spectral space. This in turn allows more compact statistical models to be trained for each source, leading to efficient separation and recognition. The recognition results can be further enhanced if one utilizes larger audio corpora for each sound source and performs gain and model adaptation, as the training and test sets are not matched.
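For concreteness, the training and decision stages just described could be sketched as follows, using scikit-learn's EM implementation in place of the Netlab toolbox cited in [12] (names and data layout are our own illustration, not the paper's code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(class_feats, n_mix=32):
    # one diagonal-covariance GMM per sound class, EM with k-means init
    models = {}
    for name, feats in class_feats.items():  # feats: (n_frames, 23) MFCCs
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              max_iter=50, init_params='kmeans')
        models[name] = gmm.fit(feats)
    return models

def classify(x_frames, models, priors=None):
    # decision rule D(x): class maximizing p(k_i) p(x | k_i),
    # accumulated over the 10, 50, or all frames used
    best, best_score = None, -np.inf
    for name, gmm in models.items():
        score = gmm.score_samples(x_frames).sum()  # per-frame log-likelihoods
        if priors is not None:
            score += np.log(priors[name])
        if score > best_score:
            best, best_score = name, score
    return best
```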
5 Conclusion

A landscape is characterized by a distribution of sound across time and space, constituting its soundscape. This soundscape includes simultaneous sounds produced by human and physical activity, as well as a diversity of organisms that possess sound-producing mechanisms. We presented a novel one-channel audio recognition system that departs from the case of recognizing isolated sound events and moves towards the real case of recognizing co-existing environmental sound events. The recognition scores are very high and the results are practical for mixtures of up to four sources.
References 1. Riede, K.: Acoustic monitoring of Orthoptera and its potential for conservation. Journal of Insect Conservation 2(3-4), 217–223 (1998) 2. Goldhor, R.: Recognition of Environmental Sounds. In: Proceedings of the ICASSP 1993, vol. 1, pp. 149–152 (1993) 3. Eronen, A., et al.: Audio-Based Context Recognition. IEEE Transactions on Audio, Speech, and Language Processing 14(1), 321–329 (2006) 4. Cowling, M., Sitte, R.: Comparison of techniques for environmental sound recognition. Pattern Recognition Letters 24(15), 2895–2907 (2003) 5. Roweis, S.: One microphone source separation. In: Proc. NIPS, pp. 793–799 (2000) 6. Kristjansson, T., Attias, H., Hershey, J.: Single microphone source separation using high resolution signal reconstruction. In: Proc. of the ICASSP, pp. 817–820 (2004)
7. Benaroya, L., Bimbot, F., Gribonval, R.: Audio Source Separation With a Single Sensor. IEEE Trans. on Audio, Speech, and Lang. Proc. 14(1), 191–199 (2006) 8. Ozerov, A., Philippe, P., Bimbot, F., Gribonval, R.: Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Trans. on Audio, Speech, and Lang. Proc. 15(5), 1564–1578 (2007) 9. Ellis, D.: Model-Based Scene Analysis. In: Wang, D., Brown, G. (eds.) Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, ch. 4, pp. 115–146. Wiley/IEEE Press (2006) 10. Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. on Audio, Speech, and Lang. Proc. 15(3), 1066–1074 (2007) 11. Gales, M., Young, S.: Robust Continuous Speech Recognition using Parallel Model Combination. IEEE Trans. on Audio, Speech, and Lang. Proc. 4, 352–359 (1996) 12. Nabney, I.: Netlab: Algorithms for Pattern Recognition. Springer, UK (2002)
Appendix

We depict characteristic spectrograms of bird, insect and dog acoustic emissions. [Three spectrogram panels: "Ceryle alcyon", "Pterophylla camellifolia" and "Dog barking"; axes: Frequency vs. Time.]
Evaluating the Next Generation of Multimedia Software

Ray Adams
Centre Head, Collaborative International Research Centre for Universal Access (CIRCUA), School of Computing Science, Middlesex University, The Burroughs, Hendon, London NW4 4BT, UK
[email protected], [email protected]

Abstract. Developing new multimedia software applications is a particularly fascinating but challenging task, depending, as it does, on the thorough evaluation of new concepts and prototypes. As complexity and choice increase exponentially, future development will be increasingly challenging for emerging intelligent multimedia and virtual reality applications, including Web 3D. In the past, we have been able to rely on generic evaluative heuristics. But are they generalisable to the newer generations of software? In this study, participants saw demonstrations of two interactive virtual reality multimedia applications, namely Second Life and 3d mailbox. They used them to generate questionnaires to capture those key aspects of such applications that were important to them. They did not find it difficult to generate questions grounded in their own experience. The resulting items turned out to be validated by significant levels of internal consistency across dependent variables. Surprisingly, however, the new heuristics bore little resemblance to traditional or current cognitive items. The overwhelming influence was that of the immersive impact of such applications, rather than standard design issues or cognitive user factors. Clearly, new software innovations require equally new innovations in evaluation techniques in general and context-specific heuristics in particular, but we do not yet have a conceptual foundation upon which to base them.
1 Introduction

The effective design and evaluation of software, courseware and websites is notoriously difficult. The question is not "What can be done?" but rather "How can it best be done?" (Hamming, 1968). Primary restrictions in design are no longer due to technological limitations but rather reflect the limits of human imagination. Design success depends, crucially and in large part, upon the effective evaluation of prototypes and systems at critical times in their development. If discovering the best system design can be seen as a learning process, then evaluation provides the necessary feedback for human learning to be effective (Simon, Anderson, Hoyer and Su, 2004). There are many ways to evaluate an emerging system, and a full discussion would be beyond the scope of any one paper. Here, the focus is upon the evaluation of a system through design heuristics by users and experts. Given the time constraints of system design, the emergence of the design heuristic (from the Greek "heuriskein", meaning "to discover") has been seen as a particularly useful, timely and effective
approach to evaluation. Heuristics are defined in this context as principles that guide practitioners to discover the best designs for their prototypes and full systems. Heuristics are not without their critics (Jeffries and Desurvire, 1992), and heuristics are best seen as one important component of software testing, not a substitute for it. One of the best-known and most robust sets of design heuristics is the ten proposed by Nielsen (e.g., Nielsen and Mack, 1994). They are still in use today (e.g., Sloan, Gregor, Rowan, and Booth, 2000). Nielsen's ten rules of thumb (first set out in Molich and Nielsen, 1990) are set out here, as we will need to refer back to them later in this paper:

1. "Visibility of system status"
2. "Match between system and the real world"
3. "User control and freedom"
4. "Consistency and standards"
5. "Error prevention"
6. "Recognition rather than recall"
7. "Flexibility and efficiency of use"
8. "Aesthetic and minimalist design"
9. "Help users recognize, diagnose, and recover from errors"
10. "Help and documentation"

Other compilations are available. For example, Bach (2003) provides the following six: controllability, observability, availability, simplicity, stability and information. Other approaches to the use of heuristics include the use of specific heuristics for specific contexts, such as user-designed spreadsheets (Clermont, 2005). Gallant, Boone and Heap (2007) present a different type of heuristic. They conducted an analysis of two online communities, based on content analysis with focus groups, generating a set of five heuristics, namely:
o interactive creativity
o selection hierarchy
o identity construction
o rewards and costs, and
o artistic forms.
Clearly, the above results identify at least two themes. First, if sufficiently robust heuristics can be identified and validated, they can be used across the board; Nielsen's heuristics are an example of this perspective. Second, where strong contextual effects can be identified, it may be possible to develop context-specific heuristics. Adams and Smith (2005) explored how we treat different types of technologies: can they all be seen the same way, or should they be treated as distinct categories? Surprisingly, the answer is somewhere in the middle. Technologies can be seen as varying from each other on a continuous scale. If so, different technologies (hardware, software, etc.) cannot be lumped together for the purposes of evaluation. Adams, Langdon and Clarkson (2002) presented a conceptual framework from which context-specific heuristics could be generated and applied to specific contexts or user groups (Adams
2007). For example, Adams and Langdon (2003) used this framework to generate design principles and concepts for blind users of information systems. Recently, we have seen an amazing level of development of intelligent multimedia software and virtual realities. If the above arguments have any validity, then new innovations in software may require new innovations in evaluation. The present study explores the generation of new heuristics for the next generation of intelligent multimedia software.
2 Methods

The study was conducted in two phases. In the first phase, sixty-one participants were given a basic presentation of "Second Life", the well-known virtual reality multimedia application, and worked in groups of approximately fifteen per group to discuss the theory of the relevant psychological processes of the intended users. In the second phase, the participants worked in small groups of three to five. They were given a template for a questionnaire and were set the task of designing a set of questions / concepts that they would want to use to evaluate virtual reality multimedia applications. An example of a question would be "Is the colour scheme appropriate?"; an example of a concept would be "graphics". They were to do so without reference to any pre-existing heuristics. To aid the design process, the participants were shown video demonstrations of two virtual reality multimedia applications: (a) Second Life and (b) 3d mailbox.
3 Results

The data were evaluated as follows. First, the questions / concepts were counted, with any obvious synonyms combined into a single item. The data were summarised in frequency order, from the most frequent to the less frequent. The twenty most frequent items were selected for inclusion in the prototype questionnaire. These data are summarised in Table 1. (The full listing of items is available from the author.) Second, the items generated for virtual reality multimedia applications were compared against those included in Nielsen's listing. Strikingly, there was little or no overlap between the original and the new items. As shown in Table 2, only one item ("realism") from Nielsen's list was found in the shorter list (the 20 most frequent items) and two items in the longer list (all 120 items generated), i.e., "match between system and the real world" (one hit) and "aesthetic and minimalist design" (two hits). Third, a longer list of thirty-two items was compiled from the present data and ordered by frequency of citation by the participants, as shown in Table 3. Fourth, the same listing of thirty-two items was ordered by their average position in the list produced by the participants, as shown in Table 4.
Table 1. Factors by frequency order (number of times selected) (n = 20)
Rank  Factor                                   Frequency
1.    Sound / audio                            13
2.    Graphics / pictures                      11
3.    Accessibility / availability             10
4.    Attractiveness                           10
5.    Colour scheme                            9
6.    Languages                                9
7.    Registration                             9
8.    Speed / high speed link                  9
9.    Creativity                               8
10.   Interaction / interactivity              8
11.   Layout                                   8
12.   Privacy                                  8
13.   Downloads / uploads / quality / size     7
14.   Safety / Security                        7
15.   Services                                 7
16.   Animation                                6
17.   Cost / value for money / price           6
18.   user interface (Graphical)               6
19.   3d effectiveness                         5
20.   Design quality                           5
Table 2. Overlap of Nielsen's heuristics and the new heuristics

Nielsen's heuristic                                          No. in short list (n = 20)   No. in long list
"Visibility of system status"                                0                            0
"Match between system and the real world"                    0                            1
"User control and freedom"                                   0                            0
"Consistency and standards"                                  0                            0
"Error prevention"                                           0                            0
"Recognition rather than recall"                             0                            0
"Flexibility and efficiency of use"                          0                            0
"Aesthetic and minimalist design"                            1                            1
"Help users recognize, diagnose, and recover from errors"    0                            0
"Help and documentation"                                     0                            0
If these two sets of orderings (Tables 3 and 4) are valid measures of the importance and relevance of these items to the current participants and to applications like Second Life, then they should produce correlated data. If, on the other hand, the selection of such items is very specific to the way in which important items are identified, then orderings based on frequency and order of selection may not be related at all. Surprisingly, the two sets of orderings showed a very high degree of concordance, as shown in Table 5: thirty out of thirty-two items were found in both lists. This is very significantly different from chance, as evaluated by a non-parametric binomial test (z = 4.77, x = 2, N = 32, P = Q = 0.5, p < 0.001, two-tailed test); this procedure is given in Siegel (1956). This analysis supports the view that both independent measures of selection (frequency of choice and order of choice) are in agreement to a considerable extent, a view that supports the validity of these data. If so, then it is acceptable to combine these two,
Table 3. Factors by frequency in selections; longer list (n = 32)

Rank  Factor                          Frequency
1.    Sound / audio                   13
2.    Graphics / pictures             11
3.    Accessibility / availability    10
4.    Attractiveness                  10
5.    Registration                    9
6.    Colour scheme                   9
7.    Languages                       9
8.    Speed / high speed link         9
9.    Layout                          8
10.   Creativity                      8
17.   Animation                       6
18.   Cost / value for money          6
19.   Design quality                  5
20.   Links                           5
21.   Educational                     5
22.   Personalisation                 5
23.   Usability / user friendly       5
24.   3d effectiveness                5
25.   Marketing                       5
26.   Commercialism                   4
Table 4. Factors by average positions in selections; longer list
Rank  Factor                                  Av. position
1.    Design quality                          1.40
2.    Accessibility / availability            4.10
3.    Commercialism                           4.25
4.    Graphics / pictures                     4.64
5.    Avatar quality                          4.75
6.    user interface (Graphical)              5.00
7.    Layout                                  5.13
8.    Creativity                              5.38
9.    Links                                   5.60
10.   Educational                             6.00
11.   Entertainment value                     7.00
12.   Personalisation                         7.60
13.   Sound / audio                           8.46
14.   Usability / user friendly               8.80
15.   Advertising                             9.00
16.   Animation                               9.17
17.   Registration                            9.22
18.   Attractiveness                          9.60
19.   Colour scheme                           9.67
20.   Interaction / interactivity             10.13
21.   Downloads / uploads / quality / size    12.00
22.   Cost / value for money                  12.00
23.   Social engineering                      12.00
24.   Services                                12.14
25.   Privacy                                 13.50
26.   Languages                               13.89
27.   3d effectiveness                        14.00
28.   Marketing                               14.60
29.   Safety / Security                       15.00
30.   Contacts                                15.00
31.   Speed / high speed link                 15.67
32.   Life                                    18.00
Table 5. Overlap of membership of both orderings of items (overlap high)

Table 3 items         Items                                                                   Freq.
In Table 4            1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,    30
                      21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32
Not in Table 4        8, 29                                                                   2
correlated sets of orderings to derive an overall measure of item choice. To do so, the average positions were converted into points (i.e., 32 points for first place, 31 points for second place, etc.). The product of frequency of selection and points was computed for each item, as shown in Table 6. The contents and resulting order of the items are discussed in the Discussion section below. This 32-item questionnaire replaced the initial prototype of 20 items. The resulting thirty-two items were used to conduct a brief evaluation of the Second Life application, as shown in Table 7 below. Ideally, the best scores should occur for the most important items, but, in this case, the correlation between them was not statistically significant (r = 0.1587, p = 0.4291). This shows that, although Second Life scores well overall, it did not score best on those aspects considered most important by our participants. The average rating was high (6.48, with a range from 4.5 to 8.2 on a ten-point scale).
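For reference, both reported statistics can be reproduced in outline with scipy (our sketch; the paper applies the normal-approximation sign test of Siegel (1956), while `binomtest` below is the exact version, and `importance`/`rating` are hypothetical arrays standing in for the item scores of Tables 6 and 7):

```python
from scipy.stats import binomtest
# from scipy.stats import pearsonr

# Concordance of the two orderings: 30 of the 32 items appear in both lists
# (x = 2 misses).  The normal approximation gives
# z = (|2x - N| - 1) / sqrt(N) = 27 / sqrt(32) = 4.77; the exact two-tailed
# binomial test with P = Q = 0.5 reaches the same conclusion (p << 0.001).
print(binomtest(k=30, n=32, p=0.5, alternative='two-sided').pvalue)

# Importance vs. Second Life rating: with paired arrays `importance` and
# `rating` for the 27 appraised items, the reported r = 0.1587 (p = 0.4291)
# would be obtained as:  r, p = pearsonr(importance, rating)
```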
Table 6. Combining frequency and position data
Rank  Factor                                  Freq. × points
1.    Graphics / pictures                     319
2.    Accessibility / availability            310
3.    Sound / audio                           260
4.    Layout                                  208
5.    Creativity                              200
6.    user interface (Graphical)              162
7.    Design quality                          160
8.    Attractiveness                          150
9.    Registration                            144
10.   Colour scheme                           126
11.   Commercialism                           120
12.   Links                                   120
13.   Educational                             115
14.   Avatar quality                          112
15.   Personalisation                         105
16.   Interaction / interactivity             104
17.   Animation                               102
18.   Usability / user friendly               95
19.   Downloads / uploads / quality / size    77
20.   Advertising                             72
21.   Cost / value for money                  72
22.   Entertainment value                     66
23.   Privacy                                 64
24.   Services                                63
25.   Languages                               63
26.   3d effectiveness                        30
27.   Marketing                               25
28.   Safety / Security                       21
29.   Social engineering                      20
30.   Speed / high speed link                 18
31.   Contacts                                12
32.   Life                                    4
Finally, the overlap evaluation of current heuristics and the presently generated items was repeated for the methods reported by Adams (2007), Adams, Langdon and Clarkson (2002), Adams and Langdon (2003) and Bach (2003). In each case, little
Table 7. Evaluating 2nd Life. Highest scores (frequencies n = 1 and n = 2 removed) (based on number of evaluations > 1). Only items of perceived relevance were appraised. Five items did not attract any ratings and so are not included in this analysis.

Rank  Factor                                  Av. score, 2nd Life
1.    Creativity                              8.20
2.    3d effectiveness                        8.00
3.    Animation                               8.00
4.    Attractiveness                          7.71
5.    Sound / audio                           7.61
6.    Personalisation                         7.50
7.    Interaction / interactivity             7.01
8.    Usability / user friendly               7.00
9.    user interface (Graphical)              7.00
10.   Graphics / pictures                     7.00
11.   Colour scheme                           6.80
12.   Design quality                          6.75
13.   Accessibility / availability            6.71
14.   Avatar quality                          6.50
15.   Life                                    6.50
16.   Educational                             6.50
17.   Layout                                  6.40
18.   Links                                   6.33
19.   Marketing                               6.00
20.   Registration                            5.81
21.   Languages                               5.81
22.   Speed / high speed link                 5.80
23.   Downloads / uploads / quality / size    5.01
24.   Services                                5.00
25.   Privacy                                 4.71
26.   Cost / value for money / price          4.67
27.   Advertising                             4.50
4 Discussion As hypothesized above, virtual reality, multimedia applications can be the focus of innovative new design heuristics. In the present study, Second Life 5 and 3D Mailbox provided the basis for our 61 participants to generate heuristics that reflected what was important for them in implementation of virtual reality, multimedia. They found it relatively easy to produce egocentric heuristics that were clearly grounded in their own experience. In addition, it was possible to validate the resulting heuristic choices by the use of two dependent variables (frequency of choice and order of choice) that were functionality independent of each other and yet proved to be highly correlated. A striking feature of the present data was there lack of substantial overlap with currently used heuristics. If so, this argues for the need for different heuristics for different applications, a result that parallels the conclusion of Adams and Smith (2005) for hardware systems, where that different categories of hardware should not be evaluated for usability and accessibility by the same sets of heuristics. On this basis, different categories of software applications are likely to require different sets of evaluative heuristics. Adams (2006) argued that the need for different heuristics for hardware or software evaluations could be solved by using a common conceptual framework that could be used in each context to generate a set of context-specific heuristics.
However, the conceptual framework he proposed did not match those generated here for multimedia applications. It may be the case that new conceptual frameworks are needed to underpin context-dependent heuristics, but, at the moment, the conclusion is that virtual reality, intelligent multimedia systems are not well served by current, generic heuristics and will require the development of evaluation methods that are sufficiently focussed on them for the future.

Consider the nature of the heuristics chosen by our present participants. They appeared to have no problems in generating heuristics that were grounded in their own experience and internally consistent (order versus frequency, as discussed above). Sensory / perceptual factors were highly valued (graphics first; sound quality third). Surprisingly, accessibility was consistently rated highly (second or third throughout). Factors relating to the overall impact of the application were also rated highly (e.g., layout, creativity, user interface, design quality, colour scheme and attractiveness). Unexpectedly, the rigours of the registration processes raised some concerns, and this was rated as an item of at least middle importance. "Avatar quality", "personalisation", "interactivity", "animation" and "usability" were all in the middle of the participants' measurable preferences. Conversely, factors like "font", "clarity", "performance", "content", "focus" and "learning resources" were mentioned only infrequently. More surprisingly, traditional heuristics did not merely come low in the preferences of our users; they were simply absent. Furthermore, the cognitive factors outlined by Adams (2006) did not appear either. It is as if modern multimedia users focus not on their own psychological processes but on the immersive experience of the multimedia themselves. If so, then we will need a new generation of heuristics and evaluation methods.

Finally, the current evaluation of Second Life produced some interesting results. Ideally, an application should produce the best ratings for the most important items; if so, the two sets of scores should be significantly positively correlated. In fact, the correlation between them was not statistically significant. Second Life performed well on some relatively unimportant factors such as 3d effectiveness, animation, usability and interactivity. Conversely, it was rated less well on some user-defined important factors such as graphics, accessibility and layout. Despite that low correlation, the overall ratings were moderately good (6.48, with a range from 4.5 to 8.2 on a 10-point scale). This result suggests that the performance of this application could be fine-tuned to meet our users' preferences more closely. In conclusion, the development and application of context-specific heuristics can provide valid and useful evaluations of the next generation of intelligent multimedia applications, but the conceptual basis for such heuristics remains to be identified.
References 1. Adams, R.: Universal access through client-centred cognitive assessment and personality. In: Stary, C., Stephanidis, C. (eds.) UI4ALL 2004. LNCS, vol. 3196, pp. 3–15. Springer, Heidelberg (2004) 2. Adams, R.: Decision and stress: cognition and e-accessibility the information workplace. Univ. Access Inf. Soc. 5, 363–379 (2007)
3. Adams, R., Langdon, P., Clarkson, P.J.: A systematic basis for developing cognitive assessment methods for assistive technology. In: Keates, S., Langdon, P., Clarkson, P.J., Robinson, P. (eds.) Universal Access and Assistive Technology, pp. 53–62. Springer, London (2002) 4. Adams, R., Langdon, P.: Principles and concepts for information and communication technology design. Journal of Visual Impairment and Blindness 97, 602–611 (2003) 5. Adams, R., Smith, S.: Paradigms, technology and demand characteristics for universal access: can we treat different technologies the same? In: HCII 2005 UAHCI Proceedings (2005) 6. Bach, J.: Heuristics of Software Testability Viewed 31/01/08 (2003), http://www.satisfice.com/tools/testable.pdf 7. Clermont, M.: Heuristics for the automatic identification of irregularities in spreadsheets. ACM SIGSOFT Software Engineering Notes 30, 1–6 (2005) 8. Gallant, L.M., Boone, G.M., Heap, A.: Five heuristics for designing and evaluating webbased communities. First Monday, 3, 1–12 (2007) 9. Hamming, R.V.: One man’s view of computer science. 1968 ACM Turing Lecture. Journal of the ACM (JACM) 16, 3–12 (1969) 10. Jeffries, R., Desurvire, H.: Usability testing vs. heuristic evaluation: was there a contest? ACM SIGCHI Bulletin 24, 39–41 (1992) 11. Nielsen, J., Mack, R.L.: Usability Inspection Methods. John Wiley & Sons, New York (1994) 12. Molich, R., Nielsen, J.: Improving a human-computer dialogue. Communications of the ACM 33(3), 338–348 (1990) 13. Siegel, S.: Nonparametric statistics for the behavioral sciences. McGraw-Hill, New York (1956) 14. Simon, B., Anderson, R., Hoyer, C., Su, J.: Preliminary Experiences with a Tablet PC Based System to Support Active Learning in Computer Science Courses. In: ITiCSE 2004: Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science education (2004)
Evaluation Process and Results of a Middleware System for Accessing Digital Music LIbraries in MObile Services

P.S. Lampropoulou, A.S. Lampropoulos, and G.A. Tsihrintzis
University of Piraeus, Department of Informatics, 80 Karaoli and Dimitriou St., Piraeus 18534, Greece
{vlamp,arislamp,geoatsi}@unipi.gr
Abstract. In this paper, we present and discuss the evaluation process of a middleware system that we have developed to facilitate access to digital music libraries in push technology-based mobile services. Our system combines mobile technologies and content-based retrieval techniques to provide a semi-automatic interface that allows users to interact with a digital music library and find/retrieve music files in a flexible way. To evaluate our system, we followed a three-stage process, namely user background information collection, system performance evaluation, and overall system assessment. The evaluation results are quite positive in terms of both system performance and overall system assessment.
1 Introduction

In recent years, there has been a tremendous increase in the use of mobile technologies. Both mobile networks and devices have become quite sophisticated and friendly to their users. This has led to the development of a large number of mobile services, which attempt to match users' needs and requirements [1, 2]. The first mobile services were about voice calls, such as call forwarding, call waiting, caller identification, conference calls, or missed call notification [3]. Besides voice services, other services have been developed, which include location-based services, transaction services, and mobile banking. An expanding area of mobile services over the last few years is the area of multimedia services. Mobile multimedia services provide means for delivering multimedia content and information between two mobile stations or between a mobile station and a mobile operator. Mobile multimedia services give users the ability to use their device for entertainment purposes more than for telecommunication purposes. Mobile technology offers its users commercial opportunities for the creation and sharing of multimedia. The first approach to multimedia services provided over a mobile environment was the creation of complex messages with multimedia content, such as text, audio, image and video. The users could send not only a short message (SMS), but also an image or an audio or video file
or a combination of them. With the progress of mobile technology, multimedia services became more complex in an attempt to cover user needs. Many services for handling, searching, and sharing images, audio, and video were first created to cover internet users' needs. The need for mobility has caused several internet-based multimedia services to become available through a mobile environment. Another aspect that boosts the expansion of mobile multimedia services is the evolution of mobile devices. Mobile devices are no longer devices that only process voice calls or messages, but have become multimedia stations equipped with a camera, a radio receiver, music players and other utilities to create and handle multimedia files. Mobile users can now use their mobile device as a single entertainment station as well. A serious limitation of mobile devices, namely the limited storage memory, has been eliminated in recent years, and this gave a significant boost to storing multimedia data on the mobile devices themselves [2]. Besides entertainment purposes, mobile multimedia services are important for communication purposes as well, as they give their users the ability to place video calls or to exchange multimedia messages.

Moreover, the tremendous growth of data-enabled handsets leads to new opportunities for push-enabled services with a potential that is only beginning to be realized. Data-enabled mobile phones provide a new medium that allows people not only to place or receive voice calls, but also to communicate in new ways and to access new and old applications and information anywhere and anytime. The success of the Short Message Service (SMS) for person-to-person communication is a prime example of a new way to communicate using push-enabled services. Yet the potential of push-enabled services is much greater, and technologies are evolving to build upon the early success of SMS. The extensive abilities of mobile devices to create and handle multimedia items cause a rapid growth of personal multimedia collections. This fact, in conjunction with the online availability of large multimedia databases, brings the need to develop ways for efficient management of, and accessibility to, all these multimedia archives [4].

Taking into consideration the trend in mobile device technologies and the availability of digital music libraries, our aim was to develop a mobile service that would allow users to interact with a digital music library in a flexible way and locate and retrieve music files on the basis of a combination of mobile technologies and content-based retrieval techniques. For this purpose, we have developed a software tool, called ALIMOS [5]. With this tool, a mobile user has the ability to submit a query to a digital music library and retrieve music files that belong to the same musical genre ("are similar to" the query) by sending an example music file from his/her mobile device. In the present paper, we describe and discuss the evaluation process of ALIMOS and its results. Specifically, this paper is organized as follows: Section 2 outlines previous related work. Section 3 describes the general architecture and operation of ALIMOS. Section 4 presents the evaluation process and results of ALIMOS. Finally, conclusions are drawn in Section 5, along with indications for related future research.
2 Previous Related Work

A number of attempts have begun to explore mobile music sharing systems, which build upon ideas from mobile social software. Two systems, tunA [6, 7] and SoundPryer [8], look at how the usually private listening of a mobile music device can be turned into a social experience by synchronizing the listening to music between nearby devices. Another mobile music listening and sharing system is Push!Music, where users automatically receive songs that have autonomously recommended themselves from nearby players depending on similar listening behavior and music history. Push!Music also enables users to wirelessly exchange songs in the form of personal recommendations [9, 10]. A mobile music retrieval system, called M-MUSICS [11], is the mobile version of a previous content-based music retrieval system by the same authors. This system is based on the typical client-server architecture and provides flexible user interfaces for querying and browsing the query results on mobile devices like PDAs. However, all the above systems constitute alternative ways through which users can share music with their mobile devices and do not constitute services. Moreover, these systems do not retrieve music pieces similar to those stored in a user's mobile device, or to pieces preferred by a user, without a specific client on the mobile device. ALIMOS was developed to alleviate the limitations of existing mobile CBMR systems.

Our previous work, on which ALIMOS was based, led to a middleware tool for web-based digital music libraries [12], which presented a flexible semi-automatic interface combining audio data preprocessing (format normalization and extraction of features), retrieval, and visualization/browsing facilities. The tool was based on a content-based music retrieval system, first developed as a desktop [13] and Web application, and automatically organizes a collection of music files according to such objective audio signal features as musical surface characteristics (e.g., zero crossings, short-time energy, etc.) and tempo. A process of relevance feedback complements this middleware with semantic meta-data that offer the user the ability to check the clustering result, correct existing textual meta-information (such as genre and album) and add additional semantic information (about lyrics, theme, other artists who perform similar music, musical instruments, etc.), and imposes consistency and reliability between the content and semantic information of the musical database. Furthermore, a new approach has been incorporated into the system for music genre classification based on the features extracted from signals that correspond to distinct musical instrument sources [14, 15]. Additionally, the problem of modeling the subjective perception of similarity between music files has been addressed by incorporating into the system a second relevance feedback process, which allows the importation of user models capable of evolving and of using different music similarity measures for different users [16]. Finally, besides hierarchical and fuzzy c-means clustering, artificial immune system-based clustering and classification algorithms are used that
provide a clearer and more compact revelation of the intrinsic similarities in the music database.
3 ALIMOS Architecture and Operation

The ALIMOS architecture is characterized as middleware. The term middleware has acquired numerous meanings, under which it could be just about any piece of software that sits between systems [17]. In the strict sense, middleware is transport software that is used to move information from one program to one or more other programs, shielding the developers from dependencies on communication protocols, operating systems, and hardware platforms. In our application, the ALIMOS architecture is a multi-tier application with a front-end and a back-end level. Specifically, the front-end level includes all those modules that implement the communication between the user, i.e., the mobile network, and the application, while the back-end level includes all those modules that implement the content-based retrieval mechanism. Even though the music databases may be distributed and separate from the CBMR server in the back-end and from the Push Proxy Gateway (PPG), the Push Initiator (PI), and the Wireless Application Protocol (WAP) servers in the front-end, ALIMOS is designed so that these various modules are not visible to the users and all communication among the modules is handled automatically by ALIMOS.

Specifically, there are three separate modules in the front-end level. The first module is a server that receives the user's MMS, implemented as a Multimedia Messaging Service Center (MMSC). This module is also responsible for extracting the audio clip from the user's MMS and sending it to the back-end level for the query to be processed. The second module is a PI; it receives the retrieval results and submits them via the push framework to the third module, a PPG, whose main task is to push contents (i.e., search results) from the Internet to the mobile network.

The back-end level of ALIMOS consists of four modules. The first module is a feature extractor that captures low-level audio features [13]. The second module is an id3 tag extractor, together with an interface for the user to update the extracted semantic meta-data and add other semantic keywords. In ALIMOS, we follow the framework developed for image retrieval in [18], which unifies semantics and content-based features. The third module realizes various clustering methodologies, including fuzzy c-means, spectral, hierarchical, and artificial immune network-based clustering [19]. Clustering is used for the initialization of ALIMOS and is applied to low-level content-based features to reveal the predominant classes in the music database. Finally, the fourth module realizes the retrieval process. The application makes a query by submitting an example that has been sent by the user in an MMS. The search for music pieces similar to the user example is implemented in the restricted format sense, in which either the Euclidean or the cosine distance metric is used to find data that are globally similar to the query. No attention is paid to the query scale. In
the end, a result list of the relevant music files in the library is returned, found on the basis of their low-level (content-based) similarity to the query.

The ALIMOS operation is illustrated in Fig. 1 and can be described by the following steps:

1. The user submits a Multimedia Messaging Service (MMS) message containing the query audio file.
2. The server extracts the audio file and stores the necessary user information (e.g., the sender's telephone number).
3. Next, the server submits the file to the Content-Based Retrieval Sub-system (CBRS) of the back-end level. The CBRS returns to the Push Proxy Gateway (PPG) a list of the four most relevant results.
4. The PPG forwards this list to the user in the form of a push message. The results are not the actual audio files but their titles and artist names, that is, links to the WAP server.
5. The user may download from the WAP server the audio file that he/she selects.

Fig. 1. ALIMOS Operation

As an example of the procedure, suppose the user sends an MMS containing the audio file 'Led Zeppelin - All My Love' from the music folder of his/her mobile device. The user expects an answer with results of the genre relevant to the song, and receives a message with the results of the system in the form of links to the music files on the WAP server. The name of each file consists of two parts: the first part is the name of the artist or group and the second part is the title of the song. The ALIMOS results for the input file 'Led Zeppelin - All My Love' are 'Led Zeppelin - In The Evening', 'Pink Floyd - Money', 'Pink Floyd - The Dogs Of War', and 'U2 - Whose Gonna Ride Your Wild Horses', which all belong to the same genre of rock music.
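For concreteness, the following minimal Python sketch illustrates the kind of distance-based ranking described above, returning the four globally nearest library entries under either the Euclidean or the cosine distance. The track list and the three-dimensional feature values are made up for illustration; the actual low-level features of ALIMOS are those of [13], and this sketch is not the system's implementation.

```python
import math

def euclidean(a, b):
    # Global Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def retrieve(query_vec, library, metric=euclidean, k=4):
    """Return the k library entries globally most similar to the query.

    library: list of (title, feature_vector) pairs; the vector would hold
    low-level descriptors such as zero crossings, short-time energy, and
    tempo (the numbers below are invented for the example).
    """
    ranked = sorted(library, key=lambda item: metric(query_vec, item[1]))
    return [title for title, _ in ranked[:k]]

# Hypothetical library with made-up 3-dimensional feature vectors.
library = [
    ("Led Zeppelin - In The Evening",              [0.62, 0.35, 0.71]),
    ("Pink Floyd - Money",                         [0.58, 0.40, 0.66]),
    ("Pink Floyd - The Dogs Of War",               [0.60, 0.33, 0.69]),
    ("U2 - Whose Gonna Ride Your Wild Horses",     [0.55, 0.38, 0.64]),
    ("Some Jazz Piece",                            [0.10, 0.90, 0.20]),
]
query = [0.61, 0.36, 0.70]       # features of 'Led Zeppelin - All My Love'
print(retrieve(query, library))  # the four nearest (rock) tracks
```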
4 ALIMOS Evaluation Procedure

In this section, we describe the procedure and the results of the evaluation of ALIMOS. The goals of the evaluation procedure included not only assessing the performance of ALIMOS, but also collecting feedback from users regarding the usability and the quality of the service [20]. For this purpose, the evaluation procedure was divided into three complementary stages. A total of 100 users participated in the evaluation; these participants belonged to various age groups and had various music preferences and varying levels of music education. In the first evaluation stage, background information was collected from the participants. In the second stage, qualitative and quantitative performance indices were measured from the interaction of the participants with ALIMOS. Finally, in the third stage of the evaluation process, we collected subjective comments regarding the participants' overall impression of ALIMOS.

For the evaluation, we developed a web-based application. The application contained two electronic questionnaires, corresponding to the first and third evaluation stages. Additionally, the application included a demonstration of the ALIMOS operation to familiarize the participants with the service.
4.1 First Evaluation Stage - User Background Information
At first, each participant was asked to fill out a questionnaire in which user background information was collected for statistical purposes. In this part of the evaluation, information regarding mobile usage was collected. The use of mobile devices depends on age. Younger users use their mobiles to send SMS, listen to music, take photos, or chat with friends. Older users use their mobile devices more for communication than for entertainment purposes; as a result, the most common mobile services for people over 30 years old are searching the internet and location-based services. Mobile services such as downloading and sharing music, video, images, and games are more popular among younger people: users in the combined age group of 12 to 25 use these mobile services at percentages exceeding 50% (see Table 1).
4.2 Second Evaluation Stage - System Evaluation
In the second stage of the evaluation, we examined the main characteristics of ALIMOS as a software tool, considering both its qualitative and its quantitative features. Many quality evaluation criteria have been proposed for the evaluation of mobile services. These criteria fall into several categories, the main ones being evaluation criteria for security, usability, modifiability, and interoperability.
Fig. 2. ALIMOS Evaluation Website - 1

Table 1. Mobile Usage and Services

Age                                        12-18   18-25   25-30   30-45   45+
Mobile Usage
  Send SMS                                  83%     75%     45%     22%    12%
  Listen to music                           72%     80%     45%     11%    16%
  Take photos                               91%     64%     45%     16%    12%
  Browse the web                            42%     75%     45%     10%     6%
  Chat                                      45%     75%     31%     13%    12%
Mobile Services
  Searching the internet                    62%     75%     45%     12%    16%
  Download, share music, images, video      52%     76%     42%     11%     3%
  Download, share, play games               69%     16%     12%      9%     9%
  Locate a street, a place of interest      11%     42%     62%     56%    32%
  Other                                     42%     75%     45%     10%    16%
Security. Only the sender and the intended destination party are able to know the content of a message, and data is not altered during transmission. Usability. The user of the mobile service does not need any particular training in order to use ALIMOS.
Table 2. ALIMOS Software Parameters

Software Evaluation Parameters                      Simulation Values
Processing time of front end (in sec)                       38
Average time of CBRS of back-end level (in sec)             35
Overall transaction response time (in sec)                  73
Max requests per minute                                      9
The use of this service requires only basic knowledge of mobile multimedia services. Modifiability. In ALIMOS, there is the possibility of adding more servers to achieve better scalability; moreover, there is the capability to support new audio types. Interoperability. Different types of mobile devices, supporting different technologies, can initiate the mobile service implemented in ALIMOS. Reliability. ALIMOS stores answer messages in the push server until the mobile device is available to receive them; this means that whenever a user makes a request, he/she receives an answer.

For the quantitative evaluation, we measured a number of software parameters, such as the processing time of the front-end level of the application, the overall transaction response time, and the maximum number of requests per minute, as presented in Table 2. As mentioned, the main implementation of our system is based on software simulation mechanisms; as a result, many of the measured values are larger than they would be if ALIMOS were implemented with a hardware solution. For the front-end processing time, we measured the time required for the server to receive the MMS, extract the audio file and the necessary user information, and submit the query to the CBRS. For the average time of the CBRS of the back-end level, we measured the time the back-end level needs to perform the content-based retrieval and return the results with the names of the songs that belong to the same genre. The overall transaction response time is the time needed to complete the whole transaction, from the moment the user sends the MMS until he/she receives the answer. The maximum number of requests per minute is the number of MMS messages the system can receive in one minute.
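To make the relationship between these measurements explicit, a minimal Python sketch follows. The variable names are ours, not part of ALIMOS, and the concurrency remark is our own inference from the figures in Table 2 rather than a claim made in the evaluation itself.

```python
front_end_time = 38  # sec: receive the MMS, extract audio and user info, submit to CBRS
back_end_time = 35   # sec: content-based retrieval in the CBRS of the back-end level

# The overall transaction response time is the sum of the two stages,
# matching the 73 seconds reported in Table 2.
overall_response_time = front_end_time + back_end_time
print(overall_response_time)  # 73

# A strictly serial server could sustain at most 60 / 73 < 1 request per
# minute, so the measured capacity of 9 requests per minute implies that
# incoming MMS queries are processed concurrently.
print(60 / overall_response_time)  # about 0.82
```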
4.3 Third Evaluation Stage - Performance Assessment Evaluation
In the third stage of the evaluation, the participants used a simulation mode of ALIMOS and were asked to provide feedback about the service. When asked to rank the overall ALIMOS performance, the participants responded as in Table 3. Clearly, ALIMOS was ranked as "good", "very good", or "excellent" cumulatively by 81% of the participants (22% + 32% + 27%), which is a clear indication of its success. On the other hand, 17% of the participants did not find ALIMOS helpful and 2% found it misleading. The percentage of this latter assessment is low and may be due to reasons such as the specific evaluators being in a hurry or lacking motivation and/or familiarity with computer use.
Table 3. Overall ALIMOS Performance Assessment

Ranking      5 - Excellent   4 - Very Good   3 - Good   2 - Not Helpful   1 - Misleading
Percentage        22%             32%           27%           17%               2%
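As a quick check of the cumulative figure quoted above, the percentages in Table 3 can be summed directly; a trivial sketch, where the grouping into "good or better" follows the text:

```python
ratings = {"Excellent": 22, "Very Good": 32, "Good": 27,
           "Not Helpful": 17, "Misleading": 2}   # percentages from Table 3

# Participants who ranked ALIMOS 'good', 'very good', or 'excellent'.
positive = ratings["Excellent"] + ratings["Very Good"] + ratings["Good"]
print(positive)               # 81 (percent), the figure cited in the text
print(sum(ratings.values()))  # 100 -- the five options cover all participants
```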
Another reason could lie with the content of the music database rather than with ALIMOS (that is, the CBMR tool) itself: the music database may not contain a sufficient number of music pieces relevant to the specific evaluators' interests.
5 Conclusions - Future Work

We presented and discussed the evaluation process of a middleware system that we have developed to facilitate access to digital music libraries through push technology-based mobile services. Our system provides a semi-automatic interface that allows users to interact with a digital music library and find/retrieve music files in a flexible way, based on a combination of mobile technologies and content-based retrieval techniques. The evaluation process followed three stages, namely user background information collection, system performance evaluation, and overall system assessment. The evaluation results are quite positive in terms of both system performance and overall system assessment. In the future, our system will be enhanced with more accurate and efficient content-based music retrieval algorithms in its back end, as well as with the ability to retrieve more general multimedia data. This and other related research is currently in progress and will be reported shortly.
References

1. Poikselkä, M., Mayer, G., Khartabil, H., Niemi, A.: The IMS: IP Multimedia Concepts and Services in the Mobile Domain. John Wiley, Chichester (2004)
2. Le Bodic, G.: Mobile Messaging Technologies and Services: SMS, EMS and MMS. John Wiley and Sons, Chichester (2005)
3. Koivukoski, U., Räisänen, V.: Managing Mobile Services: Technologies and Business Practices. John Wiley and Sons, Chichester (2005)
4. Tzanetakis, G.: Musescape: A tool for changing music collections into libraries. In: Koch, T., Sølvberg, I.T. (eds.) ECDL 2003. LNCS, vol. 2769, pp. 412–421. Springer, Heidelberg (2003)
5. Lampropoulou, P.S., Lampropoulos, A.S., Tsihrintzis, G.A.: ALIMOS: A middleware system for accessing digital music libraries in mobile services. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds.) KES 2006. LNCS (LNAI), vol. 4253. Springer, Heidelberg (2006)
6. Bassoli, A., Cullinan, C., Moore, J., Agamanolis, S.: tunA: a mobile music experience to foster local interactions. In: Dey, A.K., Schmidt, A., McCarthy, J.F. (eds.) UbiComp 2003. LNCS, vol. 2864. Springer, Heidelberg (2003)
7. Bassoli, A., Baumann, S.: BluetunA: music sharing through mobile phones. In: Third International Workshop on Mobile Music Technology, Brighton, UK (March 2006)
8. Östergren, M., Juhlin, O.: Sound Pryer: truly mobile joint music listening. In: First International Workshop on Mobile Music Technology (2004)
9. Jacobsson, M., Rost, M., Håkansson, M., Holmquist, L.E.: Push!Music: Intelligent music sharing on mobile devices. In: Beigl, M., Intille, S.S., Rekimoto, J., Tokuda, H. (eds.) UbiComp 2005. LNCS, vol. 3660. Springer, Heidelberg (2005)
10. Håkansson, M., Rost, M., Jacobsson, M., Holmquist, L.E.: Facilitating mobile music sharing and social interaction with Push!Music. In: HICSS. IEEE Computer Society, Los Alamitos (2007)
11. Han, B.J., Hwang, E., Rho, S., Kim, M.: M-MUSICS: mobile content-based music retrieval system. In: MULTIMEDIA 2007: Proceedings of the 15th International Conference on Multimedia, pp. 469–470 (2007)
12. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: A middleware system for web-based digital music libraries. In: Proc. 2005 IEEE/WIC/ACM International Joint Conference on Web Intelligence, Compiègne University of Technology, France (2005)
13. Lampropoulos, A.S., Tsihrintzis, G.A.: Semantically meaningful music retrieval with content-based features and fuzzy clustering. In: Proc. 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisboa, Portugal (2004)
14. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification of audio data using source separation techniques. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic (2005)
15. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification enhanced by source separation techniques. In: Proc. 6th International Conference on Music Information Retrieval, London, UK, pp. 576–581 (2005)
16. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Individualization of music similarity perception via feature subset selection. In: Proc. IEEE International Conference on Systems, Man, and Cybernetics 2004, The Hague, The Netherlands (2004)
17. Rao, K., Bojkovic, Z., Milovanovic, D.: Introduction to Multimedia Communications: Applications, Middleware, Networking. Wiley-Interscience, Chichester (2006)
18. Lu, Y., Hu, C., Zhu, X., Zhang, H., Yang, Q.: A unified framework for semantics and feature based relevance feedback in image retrieval systems. In: Proc. ACM MULTIMEDIA, Los Angeles, California, pp. 31–38 (2000)
19. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Artificial immune system-based music piece similarity measures and database organization. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic (2005)
20. Juzgado, N.J., Morant, J.L.: Common framework for the evaluation process of KBS and conventional software. Knowledge-Based Systems 11(2), 145–159 (1998)
Interactive Systems, Design and Heuristic Evaluation: The Importance of the Diachronic Vision

Francisco V. Cipolla Ficarra 1,2 and Miguel Cipolla Ficarra 2

HCI Lab. - F&F Multimedia Communic@tions Corp.
1 ALAIPO: Asociación Latina de Interacción Persona-Ordenador
2 AINCI: Asociación Internacional de Comunicación Interactiva
Via Pascoli, S. 15 - CP 7, 24121 Bergamo, Italy
[email protected], [email protected]
Abstract. In the current work, a methodology is presented to assess the quality of interactive systems throughout time. A set of techniques has been developed for the assessment of the communicability of contents at a given moment (synchronism) or throughout time (diachronism). A quality metric called 'diachronic vision' has been generated and applied in the assessment of an interactive system aimed at entertainment which has evolved for decades. An analysis of concepts has been carried out to stress the importance of the use of a timeless and common language among the different participants in the process of design, production, and fruition of interactive systems that have a world-wide distribution.

Keywords: Human-Computer Interaction, Design, Quality, Communicability, Usability, Evaluation, Methods, Techniques, Semiotics, Linguistics, Videogame.
1 Introduction

Computer science involves people solving problems, so computer scientists must perform empirical studies that involve developers and users alike. They must understand products, processes, and the relationships among them [1]. There is currently a tendency to negatively assess the content of interactive systems that do not work with current personal computers. Indirectly, many notions and guidelines for the correct design of multimedia/hypermedia systems are ruled out in new hypermedia products, whose purpose may be distribution and international sale. The current research has as its main goal the introduction of the notion of 'diachronic vision' at the moment of assessing the design quality of interactive systems through the passing of time. We focus on videogames because they are the most dynamic sector of the multimedia industry in Europe; however, the conclusions of the present work can be transferred to the rest of multimedia content, regardless of whether it has an educational or an entertainment purpose, for instance.

In the first part of the work, the notions of synchronism and diachronism are defined. Then the notions of hypertext, multimedia, and hypermedia are examined, in order to observe how, in the literature and in real life, these and derived notions are sometimes mistakenly used as synonyms. Besides, this evolution allows the constant two-way relationship between software and hardware in interactive systems to be stressed. It is important to differentiate in this process the aspects
related to the hardware from those belonging to the software. In the third section, we carry out the diachronic heuristic analysis of a multimedia project aimed at entertainment, with international circulation, resulting in a heuristic assessment table. In the fourth section, the results and conclusions of the research work are presented.
2 Synchronism and Diachronism

Ferdinand de Saussure, the founder of modern linguistics and (along with Charles S. Peirce) the co-founder of contemporary semiotics, started a field of inquiry which he dubbed semiology. Saussure's approach to the study of linguistic and other signs is based on a series of dichotomies, for example: signifier/signified; langue/parole; syntagmatic/paradigmatic; synchronic/diachronic [2]. Synchronic: pertaining to that which is co-present or simultaneous, or that for which the passage of time is considered irrelevant. Diachronic: pertaining to that which changes over time. A diachronic study examines its object in light of its history, as something moving across or through time. In contrast, a synchronic investigation considers its object as a system operating or functioning in the present. A diachronic examination draws us toward history, the process in which differences unfold successively; a synchronic consideration draws us away from history, toward some system in which differences are simultaneously at work.

Every science should be interested in pointing out the axes on which its most important objects of study are based. We should make the following distinction: firstly, the axis of simultaneities (AB), which concerns relationships among coexisting things, excluding the interference of time; and secondly, the axis of successions (CD), where one can never consider more than one thing at a time, but where all the things of the first axis are found with their respective changes [2]. These concepts are illustrated in Fig. 1:
Fig. 1. Synchronism (AB) and diachronism (CD)
For the sciences working with values, this distinction is a practical and sometimes an absolute necessity. In this field of investigation, we cannot carry out a rigorous analysis unless we consider the two axes, distinguishing between the system of values considered in itself and these same values considered in relation to time. The more complex and organized a system of values is, the more necessary it is, because of this very complexity, to study it along both its axes. And no system reaches the complexity of a language: nowhere else is there an equivalent precision of terms depending on one another so strictly. The variety of signs, already mentioned to explain the continuity of language, absolutely prevents us from studying simultaneously its relationships in time and its relationships in the system. In
order to better point out this opposition and crossing of the two orders of phenomena relating to the same object, we can talk about synchronic linguistics and diachronic linguistics. Everything that refers to the static aspect of our science is synchronic, while everything related to evolution is diachronic. In the same way, synchrony and diachrony designate, respectively, a state of the language and a stage of its evolution.

PC games, mobile telephony, palmtops, the iPod, etc. have followed the same process of evolution in interfaces, i.e., 2D, the intersection of 2D and 3D, and 3D. Consequently, the combination of colours to catch attention and increase the motivation of potential users has facilitated the creation of guidelines through the diachronic analysis enclosed here. A comparison of the icons used in platforms for PCs (IBM compatible) and Macintosh (from 1990 up to now) with various models of palmtops and mobile phones was useful to create a table of isotopies [3]. The isotopies make it possible to see those elements of the design that are maintained throughout time, regardless of the hardware that is used. A good example is to analyze the evolution of interactive systems from the point of view of structure and navigation. As a rule, the main components of hypertext have been maintained, and their use has been boosted at the international level, from the computers of its origins to the current mobile phones, also called multimedia phones.
3 Hypertext, Multimedia and Hypermedia: Concepts Evolution

The study of the evolution of the terms hypertext, multimedia, and hypermedia allows us to fix a relationship between their signified and their signifier, thereby establishing a necessary condition for the elaboration of a method for evaluating and analyzing multimedia/hypermedia systems: the prior definition of concepts. For example, in the literature there are definitions of multimedia in which there is no differentiation between the technical aspects of communication and of information [4-7]. This is the main reason why people sometimes resort to the etymological analysis of terms, which also helps eradicate linguistic ambiguities found in the literature. In the literature about interface design, we observe that there is an intersection between software and hardware; this intersection is studied in Human-Computer Interaction. We now briefly define the concepts of hypertext, interactive multimedia, and hypermedia.

3.1 Hypertext

In 1945, Vannevar Bush published the article As We May Think [8]. The problem discussed was the difficulty of access presented by the information management media of that period, hence the need for a medium that better 'fits' the way the mind works [8]. That is, the associative character of ideas stands out: the greater the association of ideas, the greater the speed of fruition and empathy with the content, because implicitly there is a prediction of the functioning of the game. The success of current videogames lies in the combination of real movements, even using real objects, with those that are virtually depicted on the computer screen or projected on the wall, e.g., Nintendo Wii Sports. Such a notion breaks with the sequential classification of ideas that was the paradigm of that period. Thus the two main notions of hypertext, pages and links, are presented. There is also the need for a system or device that
works as a support for intellectual work. It had to combine the aspects of a personal archive and of a library, that is, a system with the ability to store large quantities of information in a related way, together with quick and flexible consultation of all that information. This system was called "memex" (memory extender). The organization of information in the memex was characterized by: direct access to information through indexes; transmission of information through links; elaboration of lists of links (trails of links); association of related concepts; and addition of information to a same document, through notes. These features cause a change in the concept of reading: in hypertext, reading is an active process that implies writing. There is the need to conceive the text in a more 'virtual' than physical way. Within this virtuality, the movement through the pages of the different texts was bidirectional and, moreover, a text could consist of more than one page. The ideas proposed by Bush remained theory until, in the 60s, Douglas C. Engelbart and Theodor H. Nelson independently developed the first hypertextual systems [6] [9]. Some of these concepts were already present in the first versions of the videogames played on computers in the eighties.

3.2 Multimedia

From a linguistic perspective, the word multimedia carries a kind of redundancy. Etymologically speaking, the prefix 'multi', deriving from the Latin word multi, means 'many'. The word 'media' derives from the Latin word medius, and among its wide range of meanings there is also the one standing for something that can be used as a specific medium, as in the case of communication. The term multimedia therefore turns out to be a redundant notion: the prefix 'multi' added to the concept of 'media' does not increase its semantic value, as they actually mean the same. As for the origin of the word multimedia, there is no exact date, but Kaprow [5] quotes the example of Tchaikovsky, whose '1812' prelude was a 'multimedia' piece combining music with fireworks, i.e., the visual and the auditory. The concept of multimedia refers to the combination of two or more media, which are in turn carried by different supports. Nevertheless, some maintain that multimedia is an intersection of media [10] [11] and others that it is a union [4] [12]. Such statements would not be wrong if we specify that in the case of intersection we are talking about the communication process, and in the case of union about the technological aspect. Conklin differentiates the period before the 80s from the following decade by adding the word 'interactive' to the word 'multimedia', since in the communication process the user enters into constant feedback with the system through navigation [13]. In the 70s, Nicholas Negroponte and Richard Bolt developed a set of tools to increase the interaction possibilities of computers [14]. These tools helped create the link between hypertext and interactive multimedia. They created a work space called Dataland, whose main means were: a cursor, a touch system for interacting manually with screens, a joystick, zooming into images, and the use of voice for the execution of commands. In the first years of the nineties, videogames for PCs allowed the participation of several players through joysticks, with which the number of simultaneous users increased the multimedia interaction with the computer, using means other than the keyboard.
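Before turning to hypermedia, it is worth noting that the memex organization described in Section 3.1 (pages, bidirectional links, trails of links, and notes) maps naturally onto a small graph structure. The Python sketch below is purely our own illustration of those four notions; the class and function names are invented, not taken from any historical or existing system.

```python
class Page:
    def __init__(self, title, text=""):
        self.title = title
        self.text = text
        self.links = []   # pages reachable from this one
        self.notes = []   # annotations added to the same document

def link(a, b):
    # Movement through pages is bidirectional, so link both ways.
    a.links.append(b)
    b.links.append(a)

def trail(*pages):
    # A memex 'trail of links': an ordered chain of related pages.
    for first, second in zip(pages, pages[1:]):
        link(first, second)
    return list(pages)

home = Page("Index")
art = Page("Hypertext", "Associative organization of ideas.")
art.notes.append("cf. Bush 1945, 'As We May Think'")
reading = trail(home, art, Page("Multimedia"), Page("Hypermedia"))
print([p.title for p in reading])
```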
3.3 Hypermedia

The word hypermedia is a contraction of hypertext and multimedia: the advantages of the two technologies come together in the multimedia communication process. In the origins of hypertext, the textual aspect of the first systems prevails (including static graphics, in a wide sense), and the associative character is established in the structure of the information; this aspect gives rise to a denomination less frequent for hypermedia systems, that of the electronic book. In multimedia, through the intersection of media, the content of the information is dynamic: video, computer animation, and audio. This dynamic aspect brings in the factor of time, or synchronization, among the different media. Consequently, hypermedia allows [15]: selective access to those parts determined by the user in advance, and a higher degree of detail in the structure (resorting to the richness of the content, according to the different media used in the transmission of the message, and reinforcing the communication process). Hypertext had already established the importance of storing information on electromagnetic media and the importance of the speed of access to the database. However, the notion of synchronism cannot be assessed in a hypermedia system because it does not exist in itself. Today, the ambiguity has been extrapolated to the so-called Web 2.0. Although numerous valuable design models for hypertext, multimedia, and hypermedia systems exist, throughout the years it has not been possible to establish a single model or dictionary of terms that does not generate ambiguities. Some of these concepts have been maintained throughout time and have not needed any anchoring operation. The current problem is that not all designers of these interactive systems have academic training in computer science and a diachronic view of interactive design in hypermedia.
4 SimCity: Excellent Design Evolution

The human factors and the accessibility of information within human-computer interaction have also been evolving with the passage of time [16]: for example, from both hands on the computer keyboard to the joystick, from the joystick to a single hand (in the case of the mobile phone), and to a special pencil for the PDA. The tendency to reduce interface space in the new devices must not sideline the communication and design notions established in the guidelines for the interface of the computer screen. For such reasons, we regard it as important to analyze a videogame that, since its origins, has always been developed by the same manufacturer but which has introduced all the novelties of interactive design as the potential of software and hardware in intersection and/or union. In a videogame there is a triadic relationship between the human factors linked to the context, the human-computer interaction, and technological evolution [16-19]. It is not easy to maintain quality in this triadic relationship throughout time in the era of globalization. However, an excellent example in the history of the evolution of design and international circulation of videogames is SimCity. The key to its success, from the point of view of human factors, lies in the build-up and personalized management of a community or autonomous virtual society. The heuristic analysis starts from the origins of the current videogame,
which ran under the MS-DOS operating system, and extends to the current Microsoft Windows Vista. Below, the screens of the different versions are shown, and an assessment table is presented for each of the variables that make up the quality metrics. SimCity is a videogame that has known how to exploit to the fullest extent the potential of the operating system on which it ran. The first versions were aimed at urban simulation, but in the later versions a move towards social aspects can be seen in the content. For example, the measurement elements of these social aspects have been increased and perfected throughout time: news in the mass media, politicians' views, population statistics, etc. In the first version of the system, the interface is 2D but with a tendency to emulate 3D, through the use of perspectives and shadows; for instance, the different constructions had an aerial view with a small angle of inclination. Perhaps the early design was related to classical Mercator cartography [20]. In the second version, the emulation of reality is greater because of the emulation of the 'z' coordinate in the constructions, through a series of previously rendered images; however, there are also 3D objects that make up the graphical context, such as flora and fauna. The problem in combining 2D and 3D is the scarcity of angles of vision, 4 in all, as can be seen in the SimCity 4 version. In the latest version there is an excellent simulation of reality, thanks to the introduction of more 3D graphical elements, high-resolution pictures, and buildings of American, Asian, European, and Oceanian architecture.
Fig. 2. First SimCity, floppy (1989)
Fig. 3. SimCity 2000, floppies (1993)
Fig. 4. SimCity 3000, CD-ROM (1997)
Fig. 5. SimCity 4, CD-ROMs (2003)
Fig. 6. SimCity Societies, DVD (2007)
From the graphical point of view, an evolution from emulation towards simulation of reality has taken place. Moreover, if we compare the storage media of the successive versions, we observe that the first version was stored on a single floppy disk, the second on several floppy disks, the third on a CD-ROM, and the last version on a DVD. Consequently, this is an interesting example of diachronic vision, in which the possibility of storing and accessing a greater amount of information has led an interactive product, in two decades, from a 2D realisation to excellent quality in the simulated fruition of 2D/3D. It can also be observed how synchronism has gained quality in the last two versions of these hypermedia systems aimed at entertainment, especially between the audio and the animations of special effects. The measurement of this increase in quality needs heuristic techniques and assessment metrics.
5 Heuristic Evaluation: Methodology for a Diachronic Vision

The research work has been aimed at the creation of a set of techniques and methods for the assessment of the diachronic vision of the contents of multimedia/hypermedia systems. In principle, we have worked with a videogame, but the methodology is valid for other interactive systems aimed at E-learning, E-commerce, E-books, etc. The investigation carried out can be divided as follows:

- First, the theoretical framework for the present investigation was established. The techniques and methods used derive mainly from human-computer interaction (human factors and context), software engineering (metrics), user-centered design (quality criteria), design models for hypermedia systems (eradication of ambiguity), and communicability and usability engineering. Nelson's and Vannevar Bush's concepts [9] [21] have been mainly used for the design category called structure of an interactive system.

- Some heuristic features influencing the quality of interactive systems have been adapted and generated. They are the first quality features applied to design categories, and they are related to one another in a bidirectional way. These heuristic features have been related to the usability features defined by Nielsen [22] and to communicability [23] [24]. The heuristic quality features have been decomposed into measurable factors that we call 'metrics' (see Annex #1). In these metrics we
applied descriptive statistics in order to quantify different aspects of the heuristic evaluation, such as the total number of components of the system or the total number of mistakes found. A metric called 'binary presence' has been proposed. This metric has been completed through the table for heuristic evaluation oriented to the design of an interactive system. The table created in the present investigation is the first one in the heuristic evaluation of diachronic vision; among its main qualities, it assisted in finding the first usability and communicability mistakes and helped to establish an approximate dimension of the system to be analysed, through the totals registered in some components. Therefore, with the table it was possible to immediately form a preliminary estimate of the cost of the evaluation. During the investigation, this table changed constantly until it included the largest possible number of components of a present-day hypermedia system. Moreover, the components in the table have been classified by design category. With this classification, more detailed results have been obtained; this kind of result, at the design stage, has defined the responsibilities for mistakes among the people participating in the elaboration of interactive systems.

- The profession of heuristic evaluator, or analyst, of interactive systems with international distribution has been introduced. It is a new profession that requires specialists with knowledge and experience in the context of multimedia/hypermedia systems and communicability, and it is located at the intersection between the formal and the factual sciences. The evaluation work of this professional may be carried out alone or in a team; team work allows the results obtained in previous evaluations to be verified by other experts in the evaluation field.

- The results obtained and the lessons learned through the metrics of diachronic vision underline the existing problem in the fruition of interactive systems, from the point of view of both hardware and software, in commercial products aimed at entertainment, aside from the appearance of falsely new design concepts for merchandising purposes. Perhaps it is necessary to create diachronic computers, that is, computers with interchangeable peripheral input/output devices, processor cards, graphics cards, etc., which would allow the selection of the operating system, for instance. Solutions can also be studied and analyzed from the software point of view, that is, generating a system which enables the execution of legacy software on current computers.

Our future work aims at continuing to assess diachronically interactive systems aimed at adult and child audiences, with educational contents such as language learning, science, tourism, entertainment, etc., with the purpose of wiping out ambiguities and enabling access to these contents through a new hardware device, such as a diachronic computer. Starting from this investigation, the bases have been established so that new interactive systems rest on quality criteria necessary for their usability and communicability, criteria that can be extended in the future. These quality criteria have been demonstrated to improve the interaction between users and hypermedia. Moreover, a communicability and heuristic evaluation professional for interactive systems can guarantee that these quality criteria are respected before mass production.
Therefore, with the present methodology, production costs are reduced, as possible mistakes are eliminated during the design stage; it is also a cheap methodology, as it is not necessary to have a laboratory or a dedicated staff in order to execute it.
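As an illustration of the 'binary presence' metric over a table like the one in Annex #1, consider the following minimal Python sketch: each element either exhibits a design category (1) or not (0), and per-category totals give an approximate dimension of the system to be analysed. The element names and category flags follow Annex #1, but the sketch itself, including the function names, is our own illustration, not the authors' actual scoring procedure.

```python
CATEGORIES = ["CO", "DY", "PA", "PR", "ST"]  # Content, Dynamic, Panchronic,
                                             # Presentation, Structure

# Binary presence: an element either involves a category or it does not.
# Rows follow Annex #1 (only a few elements shown for brevity).
presence = {
    "Accessibility": {"ST", "DY", "PA"},
    "Assistance":    {"CO", "PR", "DY", "ST"},
    "Behaviorism":   {"CO", "PR"},
    "Realism":       {"CO", "DY", "PA", "PR", "ST"},
}

def category_totals(table):
    # Descriptive statistic: how many elements touch each design category.
    return {c: sum(1 for cats in table.values() if c in cats)
            for c in CATEGORIES}

print(category_totals(presence))
# {'CO': 3, 'DY': 3, 'PA': 2, 'PR': 3, 'ST': 3}
```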
6 Conclusions

The diachronic vision of hypertext, multimedia, and hypermedia systems, whose contents may be related to entertainment or pastimes, education, database consultation, etc., increases the understanding and fruition of contents in communicative design strategies, on off-line and on-line supports, through time. Through the diachronic vision, one can know the evolution of the terms used at the design stage. The purpose is to eradicate ambiguities and to promote the use of notions that are identical for all the people who participate in the design, ranging from production to the realisation of the hypermedia system. This kind of evaluation makes it possible to re-evaluate already existing concepts, even when the hardware support is no longer in use. Many notions of interactive design which were used and examined with success in the last decade to improve user-centered design are starting to be used again. Besides, the methodology used introduces a new standard to maintain and increase the quality of design in interactive systems. The analysed system is an excellent example in which many notions of communicative quality have been maintained and boosted throughout the years. It is a videogame in which the users have the freedom to create the content and to manage the passing of time, in that it is possible to speed up or slow down the diachronism. The synchronism is mainly depicted through the statistical data that have been maintained in their main elements since the first version, with basically only the layout modified.

Acknowledgments. A special thanks to Emma Nicol (Strathclyde University), Electronic Arts (Spain), and Marco Fredianelli (Alaipo & Ainci) for their help.
References

[1] Basili, V., Zelkowitz, M.: Empirical Studies to Build a Science of Computer Science. Communications of the ACM 50(11), 33–37 (2007)
[2] Saussure, F.: Course in General Linguistics. McGraw-Hill, New York (1990)
[3] Marcus, A.: Icons, Symbols, and Signs: Visible Languages to Facilitate Communication. Interactions of the ACM 10, 37–43 (2003)
[4] Grimes, J., Potel, M.: What is Multimedia? Computer Graphics 1, 49–52 (1991)
[5] Kaprow, A.: New Media Applications in Art and Design. ACM Siggraph, New York (1991)
[6] Gray, J.: Evolution of Data Management. Computer 29, 47–58 (1996)
[7] Nielsen, J.: Hypertext and Hypermedia. Academic Press, San Diego (1990)
[8] Bush, V.: As We May Think, pp. 101–108. Endless Horizons, Washington (1946)
[9] Berners-Lee, T.: WWW: Past, Present, and Future. ACM Computer 29, 79–85 (1996)
[10] Kjelldahl, L.: Collected Conclusions: Multimedia Systems, Interactions and Applications. In: Proceed. First Eurographics Workshop, pp. 347–353. Springer, Stockholm (1991)
[11] Väänänen, K.: Interfaces to hypermedia: Communicating the structure and interaction possibilities to the users. Computers & Graphics 17(3), 219–228 (1993)
[12] Meyer-Wegener, K.: Database Management for Multimedia Applications. In: Multimedia, pp. 105–119. Springer, Berlin (1994)
[13] Conklin, J.: Hypertext: An Introduction and Survey. Computer 20, 17–41 (1987)
[14] Mitchell, W., McCullough, M.: Digital Design Media. Van Nostrand Reinhold, New York (1995)
[15] Chen, P.: The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems 1, 9–36 (1976)
[16] Pargman, D., Jakobsson, P.: Five Perspectives on Computer Game History. Interactions of the ACM 14, 26–29 (2007)
[17] Cutumisu, M., et al.: Generating Ambient Behaviors in Computer Role-Playing Games. IEEE Computer 21, 19–27 (2006)
[18] Kelly, H., et al.: How to Build Serious Games. Communications 50, 45–49 (2007)
[19] Berens, K., Howard, G.: Videogaming. Rough Guides, London (2002)
[20] Robinson, A.H., et al.: Elements of Cartography. John Wiley and Sons, New York (1995)
[21] Nelson, T.: Literary Machines. Mindful Press, Sausalito (1993)
[22] Nielsen, J.: Usability Engineering. Academic Press, London (1993)
[23] Cipolla-Ficarra, F.: Evaluation and communication techniques in multimedia product design for on the net university education. In: Multimedia on the Net, pp. 151–165. Springer, Vienna (1996)
[24] Cipolla-Ficarra, F.: A Study of Acteme on Users Unexpert of Videogames. LNCS, pp. 215–224. Springer, Berlin (2007)
Annex #1: Diachronic Vision

Table 1. Elements, components/attributes and design categories: Content (CO), Dynamic (DY), Panchronic (PA), Presentation (PR) and Structure (ST)

Element                    Components and/or attributes                                   Design categories
Accessibility              Database or hyperbase: different access                        ST, DY, PA
Assistance                 Index, tutorial, virtual agents, etc.                          CO, PR, DY, ST
Behaviorism                Empathic communication and disambiguation                      CO, PR
Context                    Local contents: people, transports, sports, etc.               PR, CO
Custom                     Objects, characters, manners, traditions, etc.                 CO, PR
Dynamic and static media   Video and audio (i.e., old effects), computer animation,       CO, PR, DY, PA
                           controls for the navigation, etc.
Eidetic and perception     Maps structure, inference, mono- and polychrome images, etc.   PR, CO
Evolutionism               Interactive system: hypertext, multimedia, etc.                ST, DY
Globalization              International contents: cultures, architectures, etc.          PR, CO
Iconography                Functions and topology on the screen                           DY, CO, PR
Interaction                I/O: keyboards, mouse, screen touch, voice, etc.               DY, ST
Isomorphism                Conventional & natural symbols, typography, etc.               CO, PR
Realism                    Emulation and simulation                                       CO, DY, PA, PR, ST
Time-control               Analepsis and prolepsis                                        DY, ST
Time-lapse                 Computer graphics and FX                                       DY, PR, PA
Vision angle               Angular movements, free and restricted, etc.                   PR, DY, ST
Author Index
Abe, Jair Minoro 249, 265, 285
Adams, Ray 605
Aguilar-Alonso, Angel 497
Aimé, Xavier 201
Akama, Seiki 249, 265, 285
Alepis, Efthymios 523
Alexandris, Nikolaos 67
Alsina-Jurnet, Ivan 497
Andrés-Pueyo, Antonio 497
Anisetti, Marco 555
Apostolakis, Ioannis 513
Apostolou, Dimitris 239
Bellandi, Valerio 555
Belsis, Petros 211
Brna, Paul 33
Cardaci, Maurizio 417
Chalvantzis, Constantinos 439
Chasanis, Vasileios 45
Christel, Michael G. 21
Cigale, Boris 95
Cipolla Ficarra, Francisco V. 461, 625
Cipolla Ficarra, Miguel 461, 625
Czyzewski, Andrzej 75
Dalka, Piotr 75
Damiani, Ernesto 555
Doulgeris, Panagiotis 137
Ebisawa, Ryu 363
Echizen, Isao 331, 363
Encheva, Sylvia 221
Fakotakis, Nikos 147, 585
Falcoz, Paolo 577
Fogarolli, Angela 395
Fujii, Yasuhiro 363
Galatsanos, Nikolaos 45
Ganchev, Todor 585
Gianini, Gabriele 555
Gigliotta, Onofrio 417
Grgic, Tomislav 293
Guo, Leiyong 185
Gutiérrez-Maldonado, José 497
Hadjidimitriou, Stelios 137
Hadjileontiadis, Leontios 137
Hattori, Takashi 255
Heliades, Georgios P. 451
Huleihil, Huriya 405
Huleihil, Mahmoud 405
Huskic, Vedran 293
Isahara, Hitoshi 231
Jang, Hyoung J. 165, 175
Jarne-Esparcia, Adolfo José 497
Jeong, Hong 565
Kabassi, Katerina 451, 523
Karapiperis, Stelios 239
Kastania, Anastasia 489, 507
Katai, Osamu 255
Katsanevas, Theodore 427
Katsaounos, Nikos 585
Kawakami, Hiroshi 255
Kim, Chang Soo 321
Kobsa, Alfred 31
Konstantopoulos, Charalampos 211
Kossida, Sophia 489
Kostoulas, Theodoros 585
Kountchev, Roumen 275
Kountcheva, Roumiana 275
Lampropoulos, A.S. 127, 191, 615
Lampropoulou, P.S. 127, 615
Lazaridis, Alexandros 585
Lee, Sang-Hong 165, 175
Lee, Seok Cheol 321
Lehto, Paula 481
Lenič, Mitja 95, 107
Likas, Aristidis 45
Lim, Joon S. 165, 175
Mamalis, Basilis 211
Matijasevic, Maja 293
Miglino, Orazio 417
Moon, Il-Young 313
Mouchtaris, Athanasios 155
Mporas, Iosif 585
Nakamatsu, Kazumi 249, 265, 285
Ntalampiras, Stavros 147, 585
Oliveira, Eugenio 533, 545
Pagliarini, Luigi 417
Panas, Stavros 137
Panoulas, Konstantinos 137
Pantziou, Grammati 211
Papadakis, Ioannis 303
Park, Su-Wan 351
Park, Sungchan 565
Patsakis, Constantinos 67
Planinšič, Peter 117
Potamitis, Ilyas 147, 595
Potočnik, Božidar 95, 107
Raij, Katariina 481
Rangel-Gómez, María Virginia 497
Reis, Luis Paulo 533, 545
Resconi, Germano 1
Romih, Tomaž 117
Ronchetti, Marco 395
Savvopoulos, Anastasios 471
Shin, Sang-Uk 351
Shiose, Takayuki 255
Šinjur, Smiljan 85
Skourlas, Christos 211
Sonoda, Kotaro 341
Sotiropoulos, D.N. 191
Stathopoulou, I.-O. 55
Stefanidakis, Michalis 303
Stojanovic, Nenad 239
Takahashi, Yoshiyasu 331, 363
Takizawa, Osamu 341
Talarn-Caparrós, Antoni 497
Tan, Hong-Zhou 185
Teixeira, Jorge 533
Tong, Jianhua 185
Tourtoglou, Kalliopi 385
Trichet, Francky 201
Tsakalides, Panagiotis 155
Tsihrintzis, G.A. 55, 127, 191, 615
Tzagkarakis, Christos 155
Varlamis, Iraklis 513
Vinhas, Vasco 533, 545
Virvou, Maria 385, 439, 471, 523
Yamada, Takaaki 331, 363
Yamamoto, Eiko 231
Yoo, Kee-Young 373
Yoon, Eun-Jun 373
Yoshiura, Hiroshi 331, 363
Zazula, Damjan 85, 95
Zimeras, Stelios 489, 507