Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6524
Kuo-Tien Lee, Wen-Hsiang Tsai, Hong-Yuan Mark Liao, Tsuhan Chen, Jun-Wei Hsieh, Chien-Cheng Tseng (Eds.)
Advances in Multimedia Modeling 17th International Multimedia Modeling Conference, MMM 2011 Taipei, Taiwan, January 5-7, 2011 Proceedings, Part II
Volume Editors

Kuo-Tien Lee, Jun-Wei Hsieh
National Taiwan Ocean University, Keelung, Taiwan
E-mail: {po,shieh}@mail.ntou.edu.tw

Wen-Hsiang Tsai
National Chiao Tung University, Hsinchu, Taiwan
E-mail: [email protected]

Hong-Yuan Mark Liao
Academia Sinica, Taipei, Taiwan
E-mail: [email protected]

Tsuhan Chen
Cornell University, Ithaca, NY, USA
E-mail: [email protected]

Chien-Cheng Tseng
National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
E-mail: [email protected]
Library of Congress Control Number: 2010940989
CR Subject Classification (1998): H.5.1, I.5, H.3, H.4, I.4, H.2.8
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-17828-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17828-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Welcome to the proceedings of the 17th Multimedia Modeling Conference (MMM 2011), held in Taipei, Taiwan, during January 5–7, 2011. Following the success of the 16 preceding conferences, the 17th MMM brought together researchers, developers, practitioners, and educators in the field of multimedia. Both practical systems and theories were presented at this conference, thanks to the support of Microsoft Research Asia, the Industrial Technology Research Institute, the Institute for Information Industry, the National Museum of Natural Science, and the Image Processing and Pattern Recognition Society of Taiwan.

MMM 2011 featured a comprehensive program including keynote speeches, regular paper presentations, posters, and special sessions. We received 450 papers in total. Among these submissions, we accepted 75 oral presentations and 21 poster presentations. Six special sessions were organized by world-leading researchers. We sincerely acknowledge the Program Committee members who contributed much of their precious time during the paper reviewing process.

We would also like to sincerely thank our strong Organizing Committee and Advisory Committee for their support. Special thanks go to Jun-Wei Hsieh, Tun-Wen Pai, Shyi-Chyi Cheng, Hui-Huang Hsu, Tao Mei, Meng Wang, Chih-Min Chao, Chun-Chao Yeh, Shu-Hsin Liao, and Chin-Chun Chang. This conference would never have happened without their help.
January 2011
Wen-Hsiang Tsai Mark Liao Kuo-Tien Lee
Organization
MMM 2011 was hosted and organized by the Department of Computer Science and Engineering, National Taiwan Ocean University, Taiwan. The conference was held at the National Taiwan Science Education Center, Taipei, during January 5–7, 2011.
Conference Committee Steering Committee
Conference Co-chairs
Program Co-chairs
Special Session Co-chairs Demo Co-chair Local Organizing Co-chairs
Publication Chair Publicity Chair
Yi-Ping Phoebe Chen (La Trobe University, Australia) Tat-Seng Chua (National University of Singapore, Singapore) Tosiyasu L. Kunii (University of Tokyo, Japan) Wei-Ying Ma (Microsoft Research Asia, China) Nadia Magnenat-Thalmann (University of Geneva, Switzerland) Patrick Senac (ENSICA, France) Kuo-Tien Lee (National Taiwan Ocean University, Taiwan) Wen-Hsiang Tsai (National Chiao Tung University, Taiwan) Hong-Yuan Mark Liao (Academia Sinica, Taiwan) Tsuhan Chen (Cornell University, USA) Jun-Wei Hsieh (National Taiwan Ocean University, Taiwan) Chien-Cheng Tseng (National Kaohsiung First University of Science and Technology, Taiwan) Hui-Huang Hsu (Tamkang University, Taiwan) Tao Mei (Microsoft Research Asia, China) Meng Wang (Microsoft Research Asia, China) Tun-Wen Pai (National Taiwan Ocean University, Taiwan) Shyi-Chyi Cheng (National Taiwan Ocean University, Taiwan) Chih-Min Chao (National Taiwan Ocean University, Taiwan) Shu-Hsin Liao (National Taiwan Ocean University, Taiwan)
US Liaison Asian Liaison European Liaison Webmaster
Qi Tian (University of Texas at San Antonio, USA) Tat-Seng Chua (National University of Singapore, Singapore) Susanne Boll (University of Oldenburg, Germany) Chun-Chao Yeh (National Taiwan Ocean University, Taiwan)
Program Committee Allan Hanbury Andreas Henrich Bernard Merialdo Brigitte Kerherve Cathal Gurrin Cees Snoek Cha Zhang Chabane Djeraba Changhu Wang Changsheng Xu Chia-Wen Lin Chong-Wah Ngo Christian Timmerer Colum Foley Daniel Thalmann David Vallet Duy-Dinh Le Fernando Pereira Francisco Jose Silva Mata Georg Thallinger Guntur Ravindra Guo-Jun Qi Harald Kosch Hui-Huang Hsu Jen-Chin Jiang Jia-hung Ye Jianmin Li Jianping Fan Jiebo Luo Jing-Ming Guo Jinhui Tang
Vienna University of Technology, Austria University of Bamberg, Germany EURECOM, France University of Quebec, Canada Dublin City University, Ireland University of Amsterdam, The Netherlands Microsoft Research University of Sciences and Technologies of Lille, France University of Science and Technology of China NLPR, Chinese Academy of Science, China National Tsing Hua University, Taiwan City University of Hong Kong, Hong Kong University of Klagenfurt, Austria Dublin City University, Ireland EPFL, Swiss Universidad Aut´ onoma de Madrid, Spain National Institute of Informatics, Japan Technical University of Lisbon, Portugal Centro de Aplicaciones de Tecnologias de Avanzada, Cuba Joanneum Research, Austria Applied Research & Technology Center, Motorola, Bangalore University of Science and Technology of China Passau University, Germany Tamkang University, Taiwan National Dong Hwa University, Taiwan National Sun Yat-sen University, Taiwan Tsinghua University, China University of North Carolina, USA Kodak Research, USA National Taiwan University of Science and Technology, Taiwan University of Science and Technology of China
Jinjun Wang Jiro Katto Joemon Jose Jonathon Hare Joo Hwee Lim Jose Martinez Keiji Yanai Koichi Shinoda Lap-Pui Chau Laura Hollink Laurent Amsaleg Lekha Chaisorn Liang-Tien Chia Marcel Worring Marco Bertini Marco Paleari Markus Koskela Masashi Inoue Matthew Cooper Matthias Rauterberg Michael Lew Michel Crucianu Michel Kieffer Ming-Huei Jin Mohan Kankanhalli Neil O’Hare Nicholas Evans Noel O’Connor Nouha Bouteldja Ola Stockfelt Paul Ferguson Qi Tian Raphael Troncy Roger Zimmermann Selim Balcisoy Sengamedu Srinivasan Seon Ho Kim Shen-wen Shr Shingo Uchihashi Shin’ichi Satoh
NEC Laboratories America, Inc., USA Waseda University, Japan University of Glasgow, UK University of Southampton, UK Institute for Infocomm Research, Singapore UAM, Spain University of Electro-Communications, Japan Tokyo Institute of Technology, Japan Nanyang Technological University, Singapore Vrije Universiteit Amsterdam, The Netherlands CNRS-IRISA, France Institute for Infocomm Research, Singapore Nanyang Technological University, Singapore University of Amsterdam, The Netherlands University of Florence, Italy EURECOM, France Helsinki University of Technology, Finland Yamagata University, Japan FX Palo Alto Lab, Inc., Germany Technical University Eindhoven, The Netherlands Leiden University, The Netherlands Conservatoire National des Arts et M´etiers, France Laboratoire des Signaux et Syst`emes, CNRS-Sup´elec, France Institute for Information Industry, Taiwan National University of Singapore Dublin City University, Ireland EURECOM, France Dublin City University, Ireland Conservatoire National des Arts et M´etiers, France Gothenburg University, Sweden Dublin City University, Ireland University of Texas at San Antonio, USA CWI, The Netherlands University of Southern California, USA Sabanci University, Turkey Yahoo! India University of Denver, USA National Chi Nan University, Taiwan Fuji Xerox Co., Ltd., Japan National Institute of Informatics, Japan
Shiuan-Ting Jang Shuicheng Yan Shu-Yuan Chen Sid-Ahmed Berrani Stefano Bocconi Susu Yao Suzanne Little Tao Mei Taro Tezuka Tat-Seng Chua Thierry Pun Tong Zhang Valerie Gouet-Brunet Vincent Charvillat Vincent Oria Wai-tian Tan Wei Cheng Weiqi Yan Weisi Lin Wen-Hung Liau Werner Bailer William Grosky Winston Hsu Wolfgang H¨ urst Xin-Jing Wang Yannick Pri´e Yan-Tao Zheng Yea-Shuan Huang Yiannis Kompatsiaris Yijuan Lu Yongwei Zhu Yun Fu Zha Zhengjun Zheng-Jun Zha Zhongfei Zhang Zhu Li
National Yunlin University of Science and Technology, Taiwan National University of Singapore Yuan Ze University, Taiwan Orange Labs - France Telecom Universit`a degli studi di Torino, Italy Institute for Infocomm Research, Singapore Open University, UK Microsoft Research Asia, China Ritsumeikan University, Japan National University of Singapore University of Geneva, Switzerland HP Labs Conservatoire National des Arts et Metiers, France University of Toulouse, France NJIT, USA Hewlett-Packard, USA University of Michigan, USA Queen’s University Belfast, UK Nanyang Technological University, Singapore National Chengchi University, Taiwan Joanneum Research, Austria University of Michigan, USA National Taiwan University, Taiwan Utrecht University, The Netherlands Microsoft Research Asia, China LIRIS, France National University of Singapore, Singapore Chung-Hua University, Taiwan Informatics and Telematics Institute Centre for Research and Technology Hellas, Greece Texas State University, USA Institute for Infocomm Research Asia, Singapore University at Buffalo (SUNY), USA National University of Singapore, Singapore National University of Singapore, Singapore State University of New York at Binghamton, USA Hong Kong Polytechnic University, Hong Kong
Sponsors Microsoft Research Industrial Technology Research Institute Institute For Information Industry National Taiwan Science Education Center National Taiwan Ocean University Bureau of Foreign Trade National Science Council
Table of Contents – Part II
Special Session Papers Content Analysis for Human-Centered Multimedia Applications Generative Group Activity Analysis with Quaternion Descriptor . . . . . . . Guangyu Zhu, Shuicheng Yan, Tony X. Han, and Changsheng Xu
1
Grid-Based Retargeting with Transformation Consistency Smoothing . . . Bing Li, Ling-Yu Duan, Jinqiao Wang, Jie Chen, Rongrong Ji, and Wen Gao
12
Understanding Video Sequences through Super-Resolution . . . . . . . . . . . . Yu Peng, Jesse S. Jin, Suhuai Luo, and Mira Park
25
Facial Expression Recognition on Hexagonal Structure Using LBP-Based Histogram Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Wang, Xiangjian He, Ruo Du, Wenjing Jia, Qiang Wu, and Wei-chang Yeh
35
Mining Social Relationship from Media Collections Towards More Precise Social Image-Tag Alignment . . . . . . . . . . . . . . . . . . . Ning Zhou, Jinye Peng, Xiaoyi Feng, and Jianping Fan Social Community Detection from Photo Collections Using Bayesian Overlapping Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Wu, Qiang Fu, and Feng Tang Dynamic Estimation of Family Relations from Photos . . . . . . . . . . . . . . . . Tong Zhang, Hui Chao, and Dan Tretter
46
57
65
Large Scale Rich Media Data Management Semi-automatic Flickr Group Suggestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junjie Cai, Zheng-Jun Zha, Qi Tian, and Zengfu Wang A Visualized Communication System Using Cross-Media Semantic Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinming Zhang, Yang Liu, Chao Liang, and Changsheng Xu
77
88
Effective Large Scale Text Retrieval via Learning Risk-Minimization and Dependency-Embedded Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sheng Gao and Haizhou Li
99
Efficient Large-Scale Image Data Set Exploration: Visual Concept Network and Image Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chunlei Yang, Xiaoyi Feng, Jinye Peng, and Jianping Fan
111
Multimedia Understanding for Consumer Electronics A Study in User-Centered Design and Evaluation of Mental Tasks for BCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Danny Plass-Oude Bos, Mannes Poel, and Anton Nijholt Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Doman, Cheng Ying Kuai, Tomokazu Takahashi, Ichiro Ide, and Hiroshi Murase
122
135
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Ming Hsu, Yen-Liang Lin, Winston H. Hsu, and Brian Wang
146
Multimodal Interaction Concepts for Mobile Augmented Reality Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang Hürst and Casper van Wezel
157
Image Object Recognition and Compression Morphology-Based Shape Adaptive Compression . . . . . . . . . . . . . . . . . . . . . Jian-Jiun Ding, Pao-Yen Lin, Jiun-De Huang, Tzu-Heng Lee, and Hsin-Hui Chen People Tracking in a Building Using Color Histogram Classifiers and Gaussian Weighted Individual Separation Approaches . . . . . . . . . . . . . . . . Che-Hung Lin, Sheng-Luen Chung, and Jing-Ming Guo Human-Centered Fingertip Mandarin Input System Using Single Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chih-Chang Yu, Hsu-Yung Cheng, Bor-Shenn Jeng, Chien-Cheng Lee, and Wei-Tyng Hong Automatic Container Code Recognition Using Compressed Sensing Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chien-Cheng Tseng and Su-Ling Lee
168
177
187
196
Combining Histograms of Oriented Gradients with Global Feature for Human Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Shinh Huang, Hsin-Ming Tsai, Pei-Yung Hsiao, Meng-Qui Tu, and Er-Liang Jian
208
Interactive Image and Video Search Video Browsing Using Object Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . Felix Lee and Werner Bailer Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang H¨ urst, Cees G.M. Snoek, Willem-Jan Spoel, and Mate Tomin An Information Foraging Theory Based User Study of an Adaptive User Interaction Framework for Content-Based Image Retrieval . . . . . . . . Haiming Liu, Paul Mulholland, Dawei Song, Victoria Uren, and Stefan R¨ uger
219
230
241
Poster Session Papers Generalized Zigzag Scanning Algorithm for Non-square Blocks . . . . . . . . . Jian-Jiun Ding, Pao-Yen Lin, and Hsin-Hui Chen
252
The Interaction Ontology Model: Supporting the Virtual Director Orchestrating Real-Time Group Interaction . . . . . . . . . . . . . . . . . . . . . . . . . Rene Kaiser, Claudia Wagner, Martin Hoeffernig, and Harald Mayer
263
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhuhua Liao, Jing Yang, Chuan Fu, and Guoqing Zhang
274
Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Zhang, Sabu Emmanuel, and Mohan Kankanhalli
285
Motion Analysis via Feature Point Tracking Technology . . . . . . . . . . . . . . . Yu-Shin Lin, Shih-Ming Chang, Joseph C. Tsai, Timothy K. Shih, and Hui-Huang Hsu Traffic Monitoring and Event Analysis at Intersection Based on Integrated Multi-video and Petri Net Process . . . . . . . . . . . . . . . . . . . . . . . . Chang-Lung Tsai and Shih-Chao Tai Baseball Event Semantic Exploring System Using HMM . . . . . . . . . . . . . . Wei-Chin Tsai, Hua-Tsung Chen, Hui-Zhen Gu, Suh-Yin Lee, and Jen-Yu Yu
296
304 315
Robust Face Recognition under Different Facial Expressions, Illumination Variations and Partial Occlusions . . . . . . . . . . . . . . . . . . . . . . . Shih-Ming Huang and Jar-Ferr Yang Localization and Recognition of the Scoreboard in Sports Video Based on SIFT Point Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinlin Guo, Cathal Gurrin, Songyang Lao, Colum Foley, and Alan F. Smeaton 3D Model Search Using Stochastic Attributed Relational Tree Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naoto Nakamura, Shigeru Takano, and Yoshihiro Okada A Novel Horror Scene Detection Scheme on Revised Multiple Instance Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Wu, Xinghao Jiang, Tanfeng Sun, Shanfeng Zhang, Xiqing Chu, Chuxiong Shen, and Jingwen Fan
326
337
348
359
Randomly Projected KD-Trees with Distance Metric Learning for Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pengcheng Wu, Steven C.H. Hoi, Duc Dung Nguyen, and Ying He
371
A SAQD-Domain Source Model Unified Rate Control Algorithm for H.264 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingjing Ai and Lili Zhao
383
A Bi-objective Optimization Model for Interactive Face Retrieval . . . . . . Yuchun Fang, Qiyun Cai, Jie Luo, Wang Dai, and Chengsheng Lou
393
Multi-symbology and Multiple 1D/2D Barcodes Extraction Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daw-Tung Lin and Chin-Lin Lin
401
Wikipedia Based News Video Topic Modeling for Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sujoy Roy, Mun-Thye Mak, and Kong Wah Wan
411
Advertisement Image Recognition for a Location-Based Reminder System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Siying Liu, Yiqun Li, Aiyuan Guo, and Joo Hwee Lim
421
Flow of Qi: System of Real-Time Multimedia Interactive Application of Calligraphy Controlled by Breathing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kuang-I Chang, Mu-Yu Tsai, Yu-Jen Su, Jyun-Long Chen, and Shu-Min Wu Measuring Bitrate and Quality Trade-Off in a Fast Region-of-Interest Based Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Salahuddin Azad, Wei Song, and Dian Tjondronegoro
432
442
Image Annotation with Concept Level Feature Using PLSA+CCA . . . . . Yu Zheng, Tetsuya Takiguchi, and Yasuo Ariki Multi-actor Emotion Recognition in Movies Using a Bimodal Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ruchir Srivastava, Sujoy Roy, Shuicheng Yan, and Terence Sim
454
465
Demo Session Papers RoboGene: An Image Retrieval System with Multi-Level Log-Based Relevance Feedback Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huanchen Zhang, Haojie Li, Shichao Dong, and Weifeng Sun Query Difficulty Guided Image Retrieval System . . . . . . . . . . . . . . . . . . . . . Yangxi Li, Yong Luo, Dacheng Tao, and Chao Xu HeartPlayer: A Smart Music Player Involving Emotion Recognition, Expression and Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songchun Fan, Cheng Tan, Xin Fan, Han Su, and Jinyu Zhang Immersive Video Conferencing Architecture Using Game Engine Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Poppe, Charles-Frederik Hollemeersch, Sarah De Bruyne, Peter Lambert, and Rik Van de Walle Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
476 479
483
486
489
Table of Contents – Part I
Regular Papers Audio, Image, Video Processing, Coding and Compression A Generalized Coding Artifacts and Noise Removal Algorithm for Digitally Compressed Video Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ling Shao, Hui Zhang, and Yan Liu
1
Efficient Mode Selection with BMA Based Pre-processing Algorithms for H.264/AVC Fast Intra Mode Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . Chen-Hsien Miao and Chih-Peng Fan
10
Perceptual Motivated Coding Strategy for Quality Consistency . . . . . . . . Like Yu, Feng Dai, Yongdong Zhang, and Shouxun Lin Compressed-Domain Shot Boundary Detection for H.264/AVC Using Intra Partitioning Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sarah De Bruyne, Jan De Cock, Chris Poppe, Charles-Frederik Hollemeersch, Peter Lambert, and Rik Van de Walle
21
29
Adaptive Orthogonal Transform for Motion Compensation Residual in Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhouye Gu, Weisi Lin, Bu-sung Lee, and Chiew Tong Lau
40
Parallel Deblocking Filter for H.264/AVC on the TILERA Many-Core Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chenggang Yan, Feng Dai, and Yongdong Zhang
51
Image Distortion Estimation by Hash Comparison . . . . . . . . . . . . . . . . . . . Li Weng and Bart Preneel
62
Media Content Browsing and Retrieval Sewing Photos: Smooth Transition between Photos . . . . . . . . . . . . . . . . . . . Tzu-Hao Kuo, Chun-Yu Tsai, Kai-Yin Cheng, and Bing-Yu Chen
73
Employing Aesthetic Principles for Automatic Photo Book Layout . . . . . Philipp Sandhaus, Mohammad Rabbath, and Susanne Boll
84
Video Event Retrieval from a Small Number of Examples Using Rough Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kimiaki Shirahama, Yuta Matsuoka, and Kuniaki Uehara
96
Community Discovery from Movie and Its Application to Poster Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Wang, Tao Mei, and Xian-Sheng Hua
107
A BOVW Based Query Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . Reede Ren, John Collomosse, and Joemon Jose
118
Video Sequence Identification in TV Broadcasts . . . . . . . . . . . . . . . . . . . . . Klaus Schoeffmann and Laszlo Boeszoermenyi
129
Content-Based Multimedia Retrieval in the Presence of Unknown User Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Beecks, Ira Assent, and Thomas Seidl
140
Multi-Camera, Multi-View, and 3D Systems People Localization in a Camera Network Combining Background Subtraction and Scene-Aware Human Detection . . . . . . . . . . . . . . . . . . . . . Tung-Ying Lee, Tsung-Yu Lin, Szu-Hao Huang, Shang-Hong Lai, and Shang-Chih Hung
151
A Novel Depth-Image Based View Synthesis Scheme for Multiview and 3DTV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xun He, Xin Jin, Minghui Wang, and Satoshi Goto
161
Egocentric View Transition for Video Monitoring in a Distributed Camera Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kuan-Wen Chen, Pei-Jyun Lee, and Yi-Ping Hung
171
A Multiple Camera System with Real-Time Volume Reconstruction for Articulated Skeleton Pose Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zheng Zhang, Hock Soon Seah, Chee Kwang Quah, Alex Ong, and Khalid Jabbar A New Two-Omni-Camera System with a Console Table for Versatile 3D Vision Applications and Its Automatic Adaptation to Imprecise Camera Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shen-En Shih and Wen-Hsiang Tsai 3D Face Recognition Based on Local Shape Patterns and Sparse Representation Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Di Huang, Karima Ouji, Mohsen Ardabilian, Yunhong Wang, and Liming Chen An Effective Approach to Pose Invariant 3D Face Recognition . . . . . . . . . Dayong Wang, Steven C.H. Hoi, and Ying He
182
193
206
217
Multimedia Indexing and Mining Score Following and Retrieval Based on Chroma and Octave Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei-Ta Chu and Meng-Luen Li
229
Incremental Multiple Classifier Active Learning for Concept Indexing in Images and Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bahjat Safadi, Yubing Tong, and Georges Quénot
240
A Semantic Higher-Level Visual Representation for Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ismail El Sayad, Jean Martinet, Thierry Urruty, and Chabane Dejraba Mining Travel Patterns from GPS-Tagged Photos . . . . . . . . . . . . . . . . . . . . Yan-Tao Zheng, Yiqun Li, Zheng-Jun Zha, and Tat-Seng Chua Augmenting Image Processing with Social Tag Mining for Landmark Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amogh Mahapatra, Xin Wan, Yonghong Tian, and Jaideep Srivastava News Shot Cloud: Ranking TV News Shots by Cross TV-Channel Filtering for Efficient Browsing of Large-Scale News Video Archives . . . . Norio Katayama, Hiroshi Mo, and Shin’ichi Satoh
251
262
273
284
Multimedia Content Analysis (I) Speaker Change Detection Using Variable Segments for Video Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . King Yiu Tam, Jose Lay, and David Levy Correlated PLSA for Image Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Li, Jian Cheng, Zechao Li, and Hanqing Lu Genre Classification and the Invariance of MFCC Features to Key and Tempo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tom L.H. Li and Antoni B. Chan Combination of Local and Global Features for Near-Duplicate Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yue Wang, ZuJun Hou, Karianto Leman, Nam Trung Pham, TeckWee Chua, and Richard Chang Audio Tag Annotation and Retrieval Using Tag Count Information . . . . . Hung-Yi Lo, Shou-De Lin, and Hsin-Min Wang
296
307
317
328
339
Similarity Measurement for Animation Movies . . . . . . . . . . . . . . . . . . . . . . . Alexandre Benoit, Madalina Ciobotaru, Patrick Lambert, and Bogdan Ionescu
350
Multimedia Content Analysis (II) A Feature Sequence Kernel for Video Concept Classification . . . . . . . . . . . Werner Bailer
359
Bottom-Up Saliency Detection Model Based on Amplitude Spectrum . . . Yuming Fang, Weisi Lin, Bu-Sung Lee, Chiew Tong Lau, and Chia-Wen Lin
370
L2 -Signature Quadratic Form Distance for Efficient Query Processing in Very Large Multimedia Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Beecks, Merih Seran Uysal, and Thomas Seidl Generating Representative Views of Landmarks via Scenic Theme Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi-Liang Zhao, Yan-Tao Zheng, Xiangdong Zhou, and Tat-Seng Chua Regularized Semi-supervised Latent Dirichlet Allocation for Visual Concept Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liansheng Zhuang, Lanbo She, Jingjing Huang, Jiebo Luo, and Nenghai Yu Boosted Scene Categorization Approach by Adjusting Inner Structures and Outer Weights of Weak Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xueming Qian, Zhe Yan, and Kaiyu Hang A User-Centric System for Home Movie Summarisation . . . . . . . . . . . . . . . Saman H. Cooray, Hyowon Lee, and Noel E. O’Connor
381
392
403
413
424
Multimedia Signal Processing and Communications Image Super-Resolution by Vectorizing Edges . . . . . . . . . . . . . . . . . . . . . . . . Chia-Jung Hung, Chun-Kai Huang, and Bing-Yu Chen
435
Vehicle Counting without Background Modeling . . . . . . . . . . . . . . . . . . . . . Cheng-Chang Lien, Ya-Ting Tsai, Ming-Hsiu Tsai, and Lih-Guong Jang
446
Effective Color-Difference-Based Interpolation Algorithm for CFA Image Demosaicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yea-Shuan Huang and Sheng-Yi Cheng
457
Utility Max-Min Fair Rate Allocation for Multiuser Multimedia Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qing Zhang, Guizhong Liu, and Fan Li
470
Multimedia Applications Adaptive Model for Robust Pedestrian Counting . . . . . . . . . . . . . . . . . . . . . Jingjing Liu, Jinqiao Wang, and Hanqing Lu
481
Multi Objective Optimization Based Fast Motion Detector . . . . . . . . . . . . Jia Su, Xin Wei, Xiaocong Jin, and Takeshi Ikenaga
492
Narrative Generation by Repurposing Digital Videos . . . . . . . . . . . . . . . . . Nick C. Tang, Hsiao-Rong Tyan, Chiou-Ting Hsu, and Hong-Yuan Mark Liao
503
A Coordinate Transformation System Based on the Human Feature Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Ming Chang, Joseph Tsai, Timothy K. Shih, and Hui-Huang Hsu An Effective Illumination Compensation Method for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yea-Shuan Huang and Chu-Yung Li Shape Stylized Face Caricatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nguyen Kim Hai Le, Yong Peng Why, and Golam Ashraf i-m-Breath: The Effect of Multimedia Biofeedback on Learning Abdominal Breath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meng-Chieh Yu, Jin-Shing Chen, King-Jen Chang, Su-Chu Hsu, Ming-Sui Lee, and Yi-Ping Hung Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
514
525 536
548
559
Generative Group Activity Analysis with Quaternion Descriptor

Guangyu Zhu 1, Shuicheng Yan 1, Tony X. Han 2, and Changsheng Xu 3

1 Electrical and Computer Engineering, National University of Singapore, Singapore
2 Electrical and Computer Engineering, University of Missouri, USA
3 Institute of Automation, Chinese Academy of Sciences, China
{elezhug,eleyans}@nus.edu.sg,
[email protected],
[email protected]

Abstract. Activity understanding plays an essential role in video content analysis and remains a challenging open problem. Most previous research is limited either by the use of excessively localized features that do not sufficiently encapsulate the interaction context, or by a focus on purely discriminative models that ignore the interaction patterns altogether. In this paper, a new approach is proposed to recognize human group activities. Firstly, we design a new quaternion descriptor that describes the interactive insight of activities in terms of appearance, dynamics, causality and feedback, respectively. The designed descriptor is capable of delineating the individual and pairwise interactions in the activities. Secondly, considering both activity category and interaction variety, we propose an extended pLSA (probabilistic Latent Semantic Analysis) model with two hidden variables. This extended probabilistic graphical paradigm, constructed on the quaternion descriptors, facilitates the effective inference of activity categories as well as the exploration of activity interaction patterns. Experiments on realistic movie and human activity databases validate that the proposed approach outperforms the state-of-the-art results.

Keywords: Activity analysis, generative modeling, video description.
1 Introduction
Video-based human activity analysis is one of the most promising applications of computer vision and pattern recognition. In [1], Turaga et al. presented a recent survey of the major approaches pursued over the last two decades. A large amount of the existing work on this problem has focused on the relatively simple activities of a single person [10,9,5,12,4], e.g., sitting, walking and handwaving, and has achieved particular success. In recent years, recognition of group activities with multiple participants (e.g., fighting and gathering) has been gaining an increasing amount of interest [19,18,15,17]. Under the definition given in [1], where an activity refers to a complex sequence of actions performed by several objects who could be interacting with each other, the interactions among the participants reflect the elementary characteristics of different activities. An effective interaction descriptor is therefore essential for developing sophisticated approaches to activity recognition.
Most previous research stems from local representations in image processing. Although the widely used local descriptors have been shown to allow the recognition of activities in scenes with occlusions and dynamic cluttered backgrounds, they are solely representations of appearance and motion patterns. An effective feature descriptor for activity recognition should have the capacity to describe the video in terms of object appearance, dynamic motion, and interactive properties. Given the activity descriptor, how to classify the activity category from the corresponding feature representation is another key issue for activity recognition. Two types of approaches are widely used: approaches based on generative models [5,4] and ones based on discriminative models [10,9,19,18,15,12,17]. Considering the mechanism of human perception for group activity, the interactions between objects are first distinguished and then synthesized into the activity recognition result. Although discriminative models have been extensively employed because they are much easier to build, their construction essentially focuses on the differences among the activity classes and ignores the interactive properties involved. Therefore, discriminative models cannot facilitate interaction analysis or uncover the insight of the interactive relations in the activities. In this paper, we first investigate how to effectively represent video activities in the interaction context. A new feature descriptor, namely the quaternion descriptor, consists of four components describing the appearance, individual dynamics, pairwise causalities and feedbacks of the active objects in the video, respectively. The components of the descriptor describe the appearance and motion patterns as well as encode the interaction properties in the activities. Resorting to the bag-of-words method, the video is represented as a compact bag-of-quaternion feature vector. To recognize the activity category and facilitate interaction pattern exploration, we then propose to model and classify the activities in a generative framework based on an extended pLSA model. Interactions are modeled explicitly in the generative framework, which is able to infer the activity patterns.
2 Quaternion Descriptor for Activity Representation
We propose to construct the quaternion descriptor by extracting trajectory atoms and then modeling the spatio-temporal interaction information within these trajectories.

2.1 Appearance Component of Quaternion
It has been demonstrated that the appearance information encoded in image frames can provide critical implications about the semantic categories [21,11]. For video activity recognition over a frame sequence, this source of information is also very useful in describing the semantics of activities.
In recent years, the well-known SIFT feature [6] has been acknowledged as one of the most powerful appearance descriptors and has achieved overwhelming success in object categorization and recognition. In our approach, the appearance component of the quaternion descriptor is measured as the average of all the SIFT features extracted at the salient points residing on the trajectory. For a motion trajectory with temporal length k, the SIFT average descriptor S is computed from all the SIFT descriptors {S1, S2, . . . , Sk} along the trajectory. The essential idea of this appearance representation is two-fold. First, the tracking process ensures that the local image patches on the same trajectory are relatively stable, and the resultant SIFT average descriptor therefore provides a robust representation for certain aspects of the visual content in the activity footage. Second, the SIFT average descriptor can also partially encode the temporal context information, which contributes to the recognition task [13].
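As a concrete illustration (not the authors' code), the appearance component could be computed roughly as follows; the (k, 128) array layout and the final L2 normalization are assumptions made for the sketch.

```python
# Sketch: averaging the SIFT descriptors sampled along one trajectory to
# obtain the appearance component of the quaternion descriptor.
import numpy as np

def appearance_component(sift_descriptors):
    """Mean of the k SIFT descriptors collected along a trajectory."""
    S = np.asarray(sift_descriptors, dtype=np.float64)  # shape (k, 128)
    s_bar = S.mean(axis=0)                              # SIFT average descriptor
    # L2-normalize so trajectories of different lengths are comparable (assumption).
    norm = np.linalg.norm(s_bar)
    return s_bar / norm if norm > 0 else s_bar
```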
2.2 Dynamic Component of Quaternion
We propose to calculate the Markov stationary distribution [23] as the dynamic representation in the quaternion. The Markov chain is a powerful tool for modeling the dynamic properties of a system as a compact representation. We consider each trajectory as a dynamic system and extract such a compact representation to measure the spatio-temporal interactions in the activities. Fig. 1 shows the extraction procedure of dynamic component.
Fig. 1. The procedure for dynamic component extraction. (a) Displacement vector quantization; (b) State transition diagram; (c) Occurrence matrix; (d) Markov stationary distribution.
The existing work [20] has demonstrated that a trajectory can be encoded by the Markov stationary distribution π if it can be converted into an ergodic finite-state Markov chain. To facilitate this conversion, a finite number of states are chosen for quantization. Given points P and P′ within two consecutive frames on the same trajectory, D = \vec{PP'} denotes the displacement vector of the two points. To perform a comprehensive quantization of D, both the magnitude and the orientation are considered, as shown in Fig. 1(a). We translate the sequential relations between the displacement vectors into a directed graph, which is similar to the state diagram of a Markov chain (Fig. 1(b)). Further, we establish the equivalent matrix representation of the graph and perform row-normalization on the matrix to obtain a valid transition matrix P for a certain Markov chain (Fig. 1(c)). Finally, we use the iterative algorithm in [20] to compute the Markov stationary distribution π (Fig. 1(d)), which is

A_n = \frac{1}{n+1}\left(I + P + \cdots + P^n\right),   (1)

where I is an identity matrix and n = 100 in our experiments. To further reduce the approximation error from using a finite n, π is calculated as the column average of A_n. More details about the extraction of the Markov stationary distribution can be found in [20].
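For concreteness, the computation of the dynamic component could be sketched as follows. This is illustrative only, not the authors' implementation; the 16-state quantization (four magnitude bins times four orientation bins) and the bin edges are assumptions, since the paper does not specify the exact quantization levels.

```python
# Sketch: dynamic component as the Markov stationary distribution of
# quantized displacement vectors (Sec. 2.2, Eq. (1)).
import numpy as np

def dynamic_component(points, n_iter=100, mag_bins=(2.0, 5.0, 10.0)):
    pts = np.asarray(points, dtype=np.float64)           # (T, 2) trajectory positions
    disp = np.diff(pts, axis=0)                           # displacement vectors D
    mag = np.linalg.norm(disp, axis=1)
    ang = np.arctan2(disp[:, 1], disp[:, 0])              # orientation in (-pi, pi]
    m_idx = np.digitize(mag, mag_bins)                    # 4 magnitude levels
    a_idx = ((ang + np.pi) / (2 * np.pi) * 4).astype(int) % 4
    states = m_idx * 4 + a_idx                            # 16 quantized states
    n_states = 16
    # Occurrence matrix of consecutive state transitions (Fig. 1(c)).
    C = np.zeros((n_states, n_states))
    for s, t in zip(states[:-1], states[1:]):
        C[s, t] += 1
    # Row-normalize to a valid transition matrix P (uniform rows where no data).
    row = C.sum(axis=1, keepdims=True)
    P = np.where(row > 0, C / np.maximum(row, 1e-12), 1.0 / n_states)
    # A_n = (I + P + ... + P^n) / (n + 1); pi is the column average of A_n.
    A = np.eye(n_states)
    acc = np.eye(n_states)
    for _ in range(n_iter):
        acc = acc @ P
        A += acc
    A /= (n_iter + 1)
    return A.mean(axis=0)                                 # stationary distribution pi
```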
2.3 Causality and Feedback Components
To describe the causality and feedback properties, we propose a representation scheme based on the Granger causality test (GCT) [8] and a time-to-frequency transform [24]. Given a concurrent motion trajectory pair T_a = [T_a(1), . . . , T_a(n), . . .] and T_b = [T_b(1), . . . , T_b(n), . . .], we assume that the interaction between the two trajectories is a stationary process, i.e., the prediction functions P(T_a(n)|T_a(1:n−l), T_b(1:n−l)) and P(T_b(n)|T_a(1:n−l), T_b(1:n−l)) do not change within a short time period, where T_a(1:n−l) = [T_a(1), . . . , T_a(n−l)], and similarly for T_b(1:n−l); l is a time lag that avoids the overfitting issue in prediction. To model P(T_a(n)|T_a(1:n−l), T_b(1:n−l)), we can use a kth-order linear predictor:

T_a(n) = \sum_{i=1}^{k} \big( \beta(i) T_a(n-i-l) + \gamma(i) T_b(n-i-l) \big) + \epsilon_a(n),   (2)

where β(i) and γ(i) are the regression coefficients and ε_a(n) is Gaussian noise with standard deviation σ(T_a(n)|T_a(1:n−l), T_b(1:n−l)). We use the same form of linear predictor to model P(T_a(n)|T_a(1:n−l)), and the standard deviation of the corresponding noise signal is denoted as σ(T_a(n)|T_a(1:n−l)). According to the GCT theory, we can calculate two measurements, namely the causality ratio r_c,

r_c = \frac{\sigma(T_a(n)|T_a(1:n-l))}{\sigma(T_a(n)|T_a(1:n-l), T_b(1:n-l))},   (3)

which measures the relative strength of the causality, and the feedback ratio r_f,

r_f = \frac{\sigma(T_b(n)|T_b(1:n-l))}{\sigma(T_b(n)|T_a(1:n-l), T_b(1:n-l))},   (4)

which measures the relative strength of the feedback. We then calculate the z-transforms of both sides of Eq. (2). Afterwards, the magnitudes and phases of the z-transform function at a set of evenly sampled frequencies are employed to describe the digital filter that characterizes the style of the pairwise causality/feedback. In our approach, we employ the magnitudes of the frequency response at {0, π/4, π/2, 3π/4, π} and the phases of the frequency response at {π/4, π/2, 3π/4} to form the feature vector f_{ba}. Similarly, we can define the feature vector f_{ab} by considering T_a as the input and T_b as the output of the digital filter, which characterizes how the object with trajectory T_a affects the motion of the object with trajectory T_b. The causality ratio and feedback ratio characterize the strength of one object affecting another, while the extracted frequency responses f_{ab} and f_{ba} convey
how one object affects another one. These mutually complementary features are hence combined to form the causality and feedback components of quaternion descriptor in the pairwise interaction context.
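A rough sketch of how the causality and feedback components could be computed from a pair of one-dimensional trajectory signals is given below. It is illustrative only: the predictor order k, the lag l, the treatment of each coordinate as a separate signal, and the simplified FIR view of the cross filter (the paper derives the filter from the z-transform of Eq. (2)) are all assumptions.

```python
# Sketch: Granger-style causality/feedback ratios and a sampled frequency
# response from least-squares fitting of the linear predictor in Eq. (2).
import numpy as np

def _ar_residual_std(target, predictors, k, l):
    """Residual std when predicting target(t) from lags t-l-1 .. t-l-k of each predictor."""
    n = len(target)
    t0 = k + l
    y = target[t0:]
    X_cols = []
    for series in predictors:
        for i in range(1, k + 1):
            X_cols.append(series[t0 - i - l: n - i - l])
    X = np.column_stack(X_cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (y - X @ coef).std(), coef

def causality_feedback(Ta, Tb, k=3, l=1):
    Ta, Tb = np.asarray(Ta, float), np.asarray(Tb, float)
    s_a_own, _        = _ar_residual_std(Ta, [Ta], k, l)      # sigma(Ta | Ta)
    s_a_joint, coef_a = _ar_residual_std(Ta, [Ta, Tb], k, l)  # sigma(Ta | Ta, Tb)
    s_b_own, _        = _ar_residual_std(Tb, [Tb], k, l)      # sigma(Tb | Tb)
    s_b_joint, _      = _ar_residual_std(Tb, [Ta, Tb], k, l)  # sigma(Tb | Ta, Tb)
    r_c = s_a_own / max(s_a_joint, 1e-12)                     # causality ratio, Eq. (3)
    r_f = s_b_own / max(s_b_joint, 1e-12)                     # feedback ratio,  Eq. (4)
    # Simplified FIR frequency response of the cross filter Tb -> Ta, sampled
    # at {0, pi/4, pi/2, 3pi/4, pi}; gamma are the cross-regression coefficients.
    gamma = coef_a[k:2 * k]
    w = np.array([0, np.pi / 4, np.pi / 2, 3 * np.pi / 4, np.pi])
    H = np.array([np.sum(gamma * np.exp(-1j * wi * np.arange(1, k + 1))) for wi in w])
    f_ba = np.concatenate([np.abs(H), np.angle(H[1:4])])      # 5 magnitudes + 3 phases
    return r_c, r_f, f_ba
```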
3 Generative Activity Analysis
Given a collection of unlabeled video sequences, we would like to discover a set of classes from them. Each of these classes would correspond to an activity category. Additionally, we would like to be able to understand activities that are composed of a mixture of interaction varieties. This resembles the problem of automatic topic discovery, which can be addressed by latent topic analysis. In the following, we introduce a new generative method based on pLSA modeling [22], which is able to both infer the activity categories and discover the interaction patterns.

3.1 Generative Activity Modeling
Fig. 2 shows the extended pLSA graphical model, which is employed with the consideration of both activity category and interaction variety. Compared with the traditional philosophy, the interaction distribution is modeled in our method and integrated into the graphical framework as a new hidden variable.
Fig. 2. The extended pLSA model with two hidden variables. Nodes are random variables. Shaded ones are observed and unshaded ones are unobserved (hidden). The plates indicate repetitions. d represents video sequence, z is the activity category, r is the interaction variety and w is the activity representation bag-of-word. The parameters of this model are learnt in an unsupervised manner using an improved EM algorithm.
Suppose we have a set of M (j = 1, . . . , M) video sequences containing bags of interaction-representation words quantized from a vocabulary of size V (i = 1, . . . , V). The corpus of videos is summarized in a V-by-M co-occurrence table M, where m(w_i, d_j) is the number of occurrences of a word w_i ∈ W = {w_1, . . . , w_V} in video d_j ∈ D = {d_1, . . . , d_M}. In addition, there are two latent topic variables z ∈ Z = {z_1, . . . , z_K} and r ∈ R = {r_1, . . . , r_R}, which represent the activity category and the interaction variety residing in a certain activity. The variable r_t is sequentially associated with each occurrence of a word w_i in video d_j. Extending the traditional pLSA model, the joint probability P(d, w), which translates the inference process in Fig. 2, is expressed as follows:

P(d, w) = P(d) \sum_{z \in Z} \sum_{r \in R} P(z|d) P(r|z) P(w|r).   (5)
It is worth noting that an equivalent symmetric version of the model can be obtained by inverting the conditional probability P(z|d) with the help of Bayes' rule, which results in

P(d, w) = \sum_{z \in Z} \sum_{r \in R} P(z) P(d|z) P(r|z) P(w|r).   (6)
The standard procedure for maximum likelihood estimation in latent variable models is Expectation Maximization (EM). For the proposed pLSA model in the symmetric parametrization, Bayes' rule yields the E-step as

P(z, r|d, w) = \frac{P(z) P(d|z) P(r|z) P(w|r)}{\sum_{z'} \sum_{r'} P(z') P(d|z') P(r'|z') P(w|r')},   (7)

P(z|d, w) = \sum_{r} P(z, r|d, w), \qquad P(r|d, w) = \sum_{z} P(z, r|d, w).   (8)

By standard calculations, one arrives at the following M-step re-estimation equations:

P(d|z) = \frac{\sum_{w} m(d, w) P(z|d, w)}{\sum_{d', w} m(d', w) P(z|d', w)}, \qquad P(w|r) = \frac{\sum_{d} m(d, w) P(r|d, w)}{\sum_{d, w'} m(d, w') P(r|d, w')},   (9)

P(z) = \frac{1}{R} \sum_{d, w} m(d, w) P(z|d, w), \qquad P(z, r) = \frac{1}{R} \sum_{d, w} m(d, w) P(z, r|d, w),   (10)

P(r|z) = \frac{P(z, r)}{P(z)}, \qquad R \equiv \sum_{d, w} m(d, w).   (11)
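The EM recursion of Eqs. (7)–(11) could be prototyped as follows. This is a minimal illustrative sketch rather than the authors' implementation; it materializes the full (V, M, K, R) responsibility array for clarity, which a practical implementation would avoid by iterating only over nonzero counts.

```python
# Sketch: EM for the extended pLSA of Fig. 2. `m` is the V x M word-by-video
# count matrix; K activity topics, Rn interaction patterns.
import numpy as np

def extended_plsa(m, K, Rn, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    V, M = m.shape
    Pz  = np.full(K, 1.0 / K)                      # P(z)
    Pdz = rng.random((M, K));  Pdz /= Pdz.sum(0)   # P(d|z)
    Prz = rng.random((Rn, K)); Prz /= Prz.sum(0)   # P(r|z)
    Pwr = rng.random((V, Rn)); Pwr /= Pwr.sum(0)   # P(w|r)
    for _ in range(n_iter):
        # E-step: responsibilities P(z, r | d, w), Eq. (7).
        joint = (Pz[None, None, :, None] * Pdz[None, :, :, None]
                 * Prz.T[None, None, :, :] * Pwr[:, None, None, :])   # (V, M, K, Rn)
        joint /= joint.sum(axis=(2, 3), keepdims=True) + 1e-12
        w_joint = m[:, :, None, None] * joint                          # m(d,w) P(z,r|d,w)
        # M-step, Eqs. (9)-(11).
        Nz  = w_joint.sum(axis=(0, 1, 3))               # sum over w, d, r -> (K,)
        Nzr = w_joint.sum(axis=(0, 1))                  # (K, Rn)
        Pdz = w_joint.sum(axis=(0, 3)) / (Nz + 1e-12)   # P(d|z)
        Pwr = w_joint.sum(axis=(1, 2)) / (w_joint.sum(axis=(0, 1, 2)) + 1e-12)  # P(w|r)
        Pz  = Nz / m.sum()                              # P(z), with R = sum m(d,w)
        Prz = (Nzr / (Nz[:, None] + 1e-12)).T           # P(r|z) = P(z,r) / P(z)
    return Pz, Pdz, Prz, Pwr
```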
3.2 Generative Activity Recognition
Given that our algorithm has learnt the activity category models using the extended pLSA, our goal is to categorize new video sequences. We have obtained the activity-category-specific interaction distribution P(r|z) and the interaction-pattern-specific video-word distribution P(w|r) from a different set of training sequences at the learning stage. When given a new video clip, the unseen video is "projected" onto the simplex spanned by the learnt P(r|z) and P(w|r). We need to find the mixing coefficients P(z_k|d_test) such that the Kullback-Leibler divergence between the measured empirical distribution P(w|d_test) and

\hat{P}(w|d_{test}) = \sum_{k=1}^{K} \sum_{r \in R} P(z_k|d_{test}) P(r|z_k) P(w|r)

is minimized [22]. Similar to the learning scenario, we apply the EM algorithm to find the solution. The sole difference between recognition and learning is that the learnt P(r|z) and P(w|r) are never updated during inference. Thus, a categorization decision is made by selecting the activity category that best explains the observation, that is,

Activity Category = \arg\max_{k} P(z_k|d_{test}).   (12)
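The fold-in step described above could be sketched as follows, assuming the learnt P(w|r) and P(r|z) from the previous section; collapsing the two fixed distributions into P(w|z) is a simplification made here for illustration.

```python
# Sketch: recognizing a new clip by re-estimating only P(z|d_test) (Sec. 3.2).
# `hist` is the V-dimensional bag-of-quaternion histogram of the unseen video.
import numpy as np

def recognize(hist, Prz, Pwr, n_iter=30):
    K = Prz.shape[1]
    Pwz = Pwr @ Prz                        # P(w|z) = sum_r P(w|r) P(r|z), shape (V, K)
    Pzd = np.full(K, 1.0 / K)              # P(z|d_test), to be estimated
    for _ in range(n_iter):
        resp = Pwz * Pzd[None, :]          # proportional to P(z|d_test, w)
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        Pzd = (hist[:, None] * resp).sum(axis=0)
        Pzd /= Pzd.sum() + 1e-12
    return int(np.argmax(Pzd)), Pzd        # decision of Eq. (12)
```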
3.3 Interaction Pattern Exploration and Discovery Based on Generative Model
A generative model facilitates the inference of the dependences among the different distributions in the recognition flow. In the extended pLSA paradigm, the distribution of the interaction patterns is modeled as a hidden variable r bridging the category topic z and the visual-word observation w, explicitly encoded by P(r|z) and P(w|r), respectively. Two tasks can be achieved by investigating one of these distributions, P(w|r), namely interaction amount discovery and pattern exploration. The aim of the discovery is to infer the optimal number of interaction patterns in the activities. The strategy is to traverse the sampled numbers of interaction patterns and observe the corresponding recognition performance. We define K = {1, . . . , R} as the candidate set of interaction pattern amounts. Given an interaction pattern amount k ∈ K, the corresponding extended pLSA model is learnt and the recognition performance is denoted as m_k. Therefore, we can obtain the optimal interaction pattern amount OptInterNo in the underlying activities as

OptInterNo = \arg\max_{k} \{m_k\}.   (13)
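In practice, this search reduces to a simple model-selection loop; in the sketch below, train_and_eval is a hypothetical helper (not defined in the paper) that trains the extended pLSA with a given number of interaction patterns and returns the recognition accuracy m_k.

```python
# Sketch of the pattern-amount search of Eq. (13).
def discover_pattern_amount(candidates, train_and_eval):
    scores = {R: train_and_eval(R) for R in candidates}   # m_k per candidate amount
    return max(scores, key=scores.get)                    # OptInterNo

# e.g. discover_pattern_amount([8, 16, 32, 64, 128, 256], train_and_eval)
```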
4 Experiments
To demonstrate the effectiveness of our approach, we performed thorough experiments on two realistic human activity databases: the HOHA-2 database of movie videos used in [19] and the HGA database of surveillance videos used in [18]. These two databases are chosen for evaluation because they exhibit the difficulties of recognizing realistic human activities with multiple participants, in contrast to the controlled settings in other related databases. The HOHA-2 database is composed of 8 single activities (i.e., AnswerPhone, DriveCar, Eat, GetOutCar, Run, SitDown, SitUp, and StandUp) and 4 group activities (i.e., FightPerson, HandShake, HugPerson, and Kiss), of which the 4 group activities are selected as the evaluation set. The HGA database consists of 6 group activities, all of whose samples are employed for evaluation. A brief summary of the two databases used in the experiments is provided in Table 2. More details about the databases can be found in [19,18]. To facilitate efficient processing, we employ the bag-of-words method to describe the quaternion descriptors of one activity video footage as a compact feature vector, namely the bag-of-quaternion. We construct a visual vocabulary with 3000 words by the K-means method over the sampled quaternion descriptors. Then, each quaternion descriptor is assigned to its closest (in the sense of Euclidean distance) visual word.
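The quantization step could be implemented along the following lines; scikit-learn's K-means is used here purely for illustration and is not implied by the paper.

```python
# Sketch: 3000-word vocabulary and hard-assignment bag-of-quaternion histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(sampled_quaternions, n_words=3000, seed=0):
    return KMeans(n_clusters=n_words, random_state=seed).fit(sampled_quaternions)

def bag_of_quaternion(video_quaternions, vocab):
    words = vocab.predict(np.asarray(video_quaternions))       # closest word per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                          # normalized histogram
```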
4.1 Recognition Performance on HOHA Database
In the pre-processing for this evaluation, the trajectory atoms in every shot are first generated by salient-point matching using SIFT features. It has been demonstrated that this trajectory generation method is effective for motion capture in movie footage [20].

Table 2. A summary of the databases for human group activity recognition

Database             HOHA-2 group subset [19]     HGA [18]
Data source          Movie clips                  Surveillance recorders
# Class category     4                            6
# Training sample    823                          4 of 5 collected sessions
# Testing sample     884                          1 of 5 collected sessions

In the experiment, the appearance and dynamic representations are extracted from every salient-point trajectory. Salient points residing in the same region may share similar appearance and motion patterns, so extracting the causality and feedback descriptors on the raw trajectory atoms is unnecessary and computation-intensive. For efficiency, we perform spectral clustering based on the normalized cut algorithm [3] on the set of raw trajectory atoms in one video shot. The average trajectory atom, calculated as the representative of the corresponding cluster, is employed as the input for the extraction of the causality and feedback descriptors. To construct the graph for clustering, each node represents a trajectory atom and the similarity matrix W = [e_{i,j}] is formed with the similarities defined as

e_{i,j} = (P_i^T \cdot P_j) \cdot (S_i^T \cdot S_j),   (14)

where i and j represent the indices of trajectory atoms in the video shot, P = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is the set of spatial positions of a trajectory atom and S = {s_1, s_2, . . . , s_n} is its SIFT descriptor set.

To quantitatively evaluate the performance, we calculate the Average Precision (AP) as the evaluation metric. Note that the AP metric is calculated on the whole database for an equitable comparison with previous work, although we only investigate the 4 group activities. Fig. 3 shows the recognition results obtained by using different types of interaction features as well as their combination, compared with the state-of-the-art performance in [19]. In [19], SIFT, HoF and HoG descriptors are extracted from spatio-temporal salient points detected by 2D and 3D Harris detectors, and bag-of-words features are built as the compact representation of the video activities, which are the input of an SVM classifier. From Fig. 3, we can conclude that our quaternion descriptor and generative model yield higher AP performance than the latest reported results. More specifically, the Mean AP is improved from the previously reported 37.8% [19] to 44.9%.
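A sketch of assembling the similarity matrix of Eq. (14) for the normalized-cut clustering is given below; it assumes all trajectory atoms have been resampled to a common length and that a single (e.g., averaged) SIFT descriptor per trajectory is used, which is a simplification of the per-point descriptor sets in the paper.

```python
# Sketch: pairwise trajectory similarity W = [e_ij] of Eq. (14).
import numpy as np

def similarity_matrix(positions, sift_means):
    # positions: (N, 2T) flattened (x, y) sequences; sift_means: (N, 128)
    P = np.asarray(positions, float)
    S = np.asarray(sift_means, float)
    W = (P @ P.T) * (S @ S.T)      # e_ij = (P_i . P_j) * (S_i . S_j)
    return W
```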
Fig. 3. Comparison of our approach with the salient point features and discriminative model in [19] on HOHA-2 database
Another observation from the results is that the pairwise causality and feedback features outperform the other components of the quaternion descriptor, which demonstrates that the interaction features are indispensable for the task of group activity recognition.

4.2 Recognition Performance on HGA Database
The HGA database is mainly intended for human-trajectory-based group activity analysis. The humans in this database are much smaller in the frame, so the appearance features do not contribute to the recognition task. Consequently, the trajectory atoms of the HGA database are generated by blob tracking. Each human in the activity video is considered as a 2D blob, and the task is then to locate its positions in the frame sequence. Our tracking method is based on the CONDENSATION algorithm [7] with manual initialization. About 100 particles were used in the experiments as a tradeoff between accuracy and computational cost. Accordingly, the dynamic and causality/feedback components of the proposed quaternion descriptor are employed to describe the activity video footage. Fig. 4 lists the recognition accuracies in terms of confusion matrices obtained by using different types of interaction descriptors as well as their combination. From Fig. 4(a) and Fig. 4(b), we can observe that the causality/feedback component outperforms the dynamic component on the HGA database. This is easy to understand: for two different activity categories, e.g., walking-in-group and gathering, the motion trajectory segments of a specific person may be similar while the interactive relations differ, and the latter can be easily differentiated by the pairwise representation. Therefore, when combining the two types of interaction descriptors, the recognition performance is further improved, as shown in Fig. 4(c). Compared with the results reported in [18], in which the best performance is 74% average accuracy over all the activities, the proposed work achieves a better result with 87% average accuracy. Note that the confusions are reasonable in the sense that most of the misclassifications occur between very similar motions; for instance, there is confusion between run-in-group and walk-in-group.
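The blob tracking step could be sketched as a basic CONDENSATION-style particle filter, as below; the random-walk motion model and the likelihood function (e.g., a color-histogram score of the blob at a candidate position) are assumptions made for illustration, not the authors' exact tracker.

```python
# Sketch: ~100-particle CONDENSATION-style tracker producing one trajectory atom.
import numpy as np

def condensation_track(frames, init_xy, likelihood, n_particles=100, noise=5.0, seed=0):
    rng = np.random.default_rng(seed)
    particles = np.tile(np.asarray(init_xy, float), (n_particles, 1))   # (N, 2)
    trajectory = [tuple(init_xy)]
    for frame in frames[1:]:
        # Predict: diffuse particles with Gaussian motion noise (random walk).
        particles = particles + rng.normal(0.0, noise, particles.shape)
        # Measure: weight each particle by the observation likelihood.
        w = np.array([likelihood(frame, x, y) for x, y in particles])
        w = np.maximum(w, 1e-12)
        w /= w.sum()
        # Estimate the blob position and resample for the next frame.
        est = (particles * w[:, None]).sum(axis=0)
        trajectory.append(tuple(est))
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(trajectory)   # trajectory atom used by the quaternion descriptor
```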
Fig. 4. The confusion matrices of HGA database recognition results with different interaction representations
4.3 Interaction Pattern Exploration and Discovery
We further evaluate the capacity of the proposed extended pLSA for exploring and discovering the interaction patterns in the HGA database. The reason we select the HGA database as the evaluation set is that its trajectory atoms are human blob locations in spatio-temporal space, which bear intuitive semantics for visualization. Taking the pairwise causality and feedback interactions as an example, we explored different numbers of interaction patterns, varying from 8 to 256. The corresponding pLSA model was learnt for each number on the training sessions and then evaluated on the testing session. Fig. 5 shows the recognition performance against the varying number of interaction patterns. From Fig. 5, we can observe the relation between the assumed pattern amount and the recognition performance. The performance is significantly improved, by 17.2% in average recognition accuracy, when increasing the pattern number from 8 to 32, because more and more interactions can be covered by the learnt model. However, the performance degenerates, dropping by 10.5%, as the pattern amount increases to 256. This is due to the
Fig. 5. The exploration results of the recognition performance against the number of pairwise interaction patterns on HGA database
fact that a learnt model with a larger number of patterns tends to overfit the training data, resulting in a less generalizable model. Therefore, the number of interaction patterns in the HGA database is inferred to be 32.
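The pattern-number search can be prototyped as a standard model-selection loop. The sketch below fits a minimal pLSA by EM on interaction-word histograms and scores each candidate pattern count on held-out documents; held-out log-likelihood is used as a stand-in for the recognition accuracy reported above, and all variable names are illustrative.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal pLSA via EM. counts: (n_docs, n_words) term-count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_wz = rng.random((n_topics, n_words)); p_wz /= p_wz.sum(1, keepdims=True)
    p_zd = rng.random((n_docs, n_topics)); p_zd /= p_zd.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, w), shape (docs, words, topics)
        joint = p_zd[:, None, :] * p_wz.T[None, :, :]
        joint /= joint.sum(2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * joint
        # M-step: re-estimate p(w | z) and p(z | d)
        p_wz = weighted.sum(0).T
        p_wz /= p_wz.sum(1, keepdims=True) + 1e-12
        p_zd = weighted.sum(1)
        p_zd /= p_zd.sum(1, keepdims=True) + 1e-12
    return p_wz, p_zd

def fold_in(counts, p_wz, n_iter=50, seed=1):
    """Estimate p(z | d) for new documents with the topic-word table fixed."""
    rng = np.random.default_rng(seed)
    p_zd = rng.random((counts.shape[0], p_wz.shape[0]))
    p_zd /= p_zd.sum(1, keepdims=True)
    for _ in range(n_iter):
        joint = p_zd[:, None, :] * p_wz.T[None, :, :]
        joint /= joint.sum(2, keepdims=True) + 1e-12
        p_zd = (counts[:, :, None] * joint).sum(1)
        p_zd /= p_zd.sum(1, keepdims=True) + 1e-12
    return p_zd

def heldout_loglik(counts, p_wz, p_zd):
    return float((counts * np.log(p_zd @ p_wz + 1e-12)).sum())

def select_pattern_count(train_counts, val_counts,
                         candidates=(8, 16, 32, 64, 128, 256)):
    """Sweep the supposed number of interaction patterns, as in Fig. 5."""
    scores = {}
    for k in candidates:
        p_wz, _ = plsa(train_counts, k)
        scores[k] = heldout_loglik(val_counts, p_wz, fold_in(val_counts, p_wz))
    return max(scores, key=scores.get), scores
```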
References

1. Turaga, P., Chellappa, R., Subrahmanian, V.S., Udrea, O.: Machine recognition of human activities: a survey. T-CSVT (2008)
2. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. T-PAMI (2001)
3. Shi, J., Malik, J.: Normalized cuts and image segmentation. T-PAMI (2000)
4. Wang, Y., Mori, G.: Human action recognition by semilatent topic models. T-PAMI (2009)
5. Niebles, J.C., Wang, H., Li, F.F.: Unsupervised learning of human action categories using spatial-temporal words. IJCV (2008)
6. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
7. Isard, M., Blake, A.: CONDENSATION - conditional density propagation for visual tracking. IJCV (1998)
8. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica (1969)
9. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: CVPR (2009)
10. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
11. Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. In: ICCV (2003)
12. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR (2004)
13. Liu, Z., Sarkar, S.: Simplest representation yet for gait recognition: averaged silhouette. In: ICPR (2004)
14. Andrade, E., Blunsden, S., Fisher, R.: Modelling crowd scenes for event detection. In: ICPR (2006)
15. Ryoo, M.S., Aggarwal, J.K.: Hierarchical recognition of human activities interacting with objects. In: CVPR (2007)
16. Turaga, P., Veeraraghavan, A., Chellappa, R.: From videos to verbs: mining videos for activities using a cascade of dynamical systems. In: CVPR (2007)
17. Zhou, Y., Yan, S., Huang, T.: Pair-activity classification by bi-trajectory analysis. In: CVPR (2008)
18. Ni, B., Yan, S., Kassim, A.: Recognizing human group activities with localized causalities. In: CVPR (2009)
19. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR (2009)
20. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T.S., Li, J.: Hierarchical spatio-temporal context modeling for action recognition. In: CVPR (2009)
21. Mortensen, E., Deng, H., Shapiro, L.: A SIFT descriptor with global context. In: CVPR (2005)
22. Hofmann, T.: Probabilistic latent semantic indexing. In: ACM SIGIR (1999)
23. Breiman, L.: Probability. Society for Industrial Mathematics (1992)
24. Jury, E.I.: Sampled-Data Control Systems. John Wiley & Sons, Chichester (1958)
Grid-Based Retargeting with Transformation Consistency Smoothing

Bing Li 1,2, Ling-Yu Duan 2, Jinqiao Wang 3, Jie Chen 2, Rongrong Ji 2,4, and Wen Gao 1,2

1 Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China
3 National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
4 Visual Intelligence Laboratory, Department of Computer Science, Harbin Institute of Technology, Heilongjiang 150001, China
Abstract. Effective and efficient retargeting is critical to improving the user browsing experience on mobile devices. One important issue in previous work lies in the semantic gap in modeling user focus and intentions from low-level features, which results in data noise in the constructed importance maps. Towards noise-tolerant learning for effective retargeting, we propose a generalized content-aware framework from a supervised learning viewpoint. Our main idea is to revisit the retargeting process as working out an optimal mapping function that approximates the desired output (pixel-wise or region-wise changes) from the training data. We therefore adopt a prediction error decomposition strategy to measure the effectiveness of previous retargeting methods. In addition, taking into account the data noise in importance maps, we also propose a grid-based retargeting model, which is robust to data noise and effective for real-time retargeting function learning. Finally, with different mapping functions, our framework generalizes previous works such as seam carving [9,13] and mesh-based methods [3,18]. Extensive experimental comparison with state-of-the-art works has shown promising results for the proposed framework.
1 Introduction
More and more consumers prefer to view images on versatile mobile devices. As image resolutions vary widely and the aspect ratios of mobile displays differ from device to device, properly adapting images to a target display is useful to make wise use of expensive display resources. Image retargeting aims to maximize the viewing experience when the size or aspect ratio of a display differs from the original one. Undoubtedly, users are sensitive to any noticeable distortion in retargeted pictures, so preserving the consistency and continuity of images is important. We therefore propose a generic approach to effective and efficient image retargeting that is applicable to mobile devices.
Fig. 1. Illustrating image/video retargeting from a supervised learning viewpoint
Many content-aware retargeting methods have been proposed, such as cropping [4,5,1], seam carving [13,9,15,11], multi-operator retargeting [10], and mesh-based retargeting [12,18,3,7,17]. Cropping [4,5,1] relies on an attention model to detect important regions and then crops the most important region to the display. Seam carving [13,9,15] iteratively removes a group of optimal seams based on an energy map of the image/video. Rubinstein et al. propose to combine different retargeting operators, including scaling, cropping and seam carving, in [10]. In addition, mesh-based methods [12,18,3,7] partition source images/videos so that more or less deformation is allowed by adjusting the shape of a mesh, while for important regions the shapes of the relevant meshes are kept as far as possible. Generally speaking, content-aware retargeting may be considered a kind of supervised learning process. Under the supervision of a visual importance map, content-aware retargeting aims to figure out a mapping function in charge of removing, shrinking or stretching less important regions as well as preserving the shape of important regions, as illustrated in Fig. 1. Either a user study or an image similarity measurement can be used to evaluate the effectiveness of a retargeting method. On the other hand, the result of content-aware methods relies heavily on the quality of the importance map. Most importance maps are generated from low-level visual features such as gradient and color contrast. Due to the lack of high-level features, an importance map cannot recover the meaningful object regions exactly so as to assign proper values to objects. As the importance map cannot truly represent a user's attention, content-aware retargeting guided by a noisy importance map is actually a weakly supervised learning process. From a learning viewpoint, a good model should avoid overfitting, and low variance with higher bias is preferred to deal with data noise. However, the seam carving method [9] removes the 8-connected seam containing the lowest energy each time. It can be considered a kind of local approximation to keep the shape of salient regions. As a result, a seam carving method has high variance and low bias and is very sensitive to noise. For example, when the seams crossing an object produce the lowest energy, removing those seams tends to fragment the object in the resulting images. Similarly, by global optimization, mesh-based methods have lower variance and reduce the negative influence of noisy data, similar to filter smoothing, so their resulting images are smoother than those of pixel-wise methods. Unfortunately, serious shape transformation leads to an overly complex model involving many degrees of freedom. When an object covers several meshes and each mesh is assigned a different importance value, object inconsistency can occur, e.g.
a big head and a small body, or a skewed structural object. As a variant of the mesh-based approach, Wang [18] uses the vertices of the mesh to describe quad deformation in an objective function. However, it is not easy to control the shape transformation of quad grids well in the optimization; moreover, most grids become irregular quadrilaterals in their results. The resulting grids may therefore fail to preserve the structure of complex backgrounds, although efforts have been made to minimize the bending of grid lines. To summarize, given a certain amount of noisy training data, the existing retargeting methods are sensitive because of their models' higher variance. Undoubtedly, too many degrees of freedom in a retargeting model lead to spatial inconsistency of salient objects and discontinuity in less important regions. Thus, we propose a grid-based optimization approach to retargeting an image. The basic motivation is to reduce the model variance by constraining the grid-based shape transformation over rectangular grids. The aspect ratio of a display can then be matched by arranging a set of rectangles, where the change of a grid's aspect ratio is used to measure the distortion energy. A nonlinear objective function is employed to allocate the unavoidable distortion to unimportant regions so as to reduce the discontinuity within less important regions. In addition, as the nonlinear optimization model we build is a convex program, a global optimal solution can be obtained by an active-set method. Overall, as our model confines the degrees of freedom, it is effective at accommodating the weak supervision of noisy importance maps computed from low-level features. This makes our method more generic in a sense. Our major contributions can be summarized as follows:

1. We propose a generalized retargeting framework from a supervised learning viewpoint, which introduces an optimized retargeting strategy selection approach in terms of adapting to the training data quality. By adopting different learning functions, previous retargeting approaches, such as seam carving [9] and mesh-based approaches [18][3], can be derived from our model.
2. We present a grid-based model that effectively reduces the complexity of the mapping function and is robust to importance-map noise (from cluttered backgrounds) and inferiority (in delineating the salient objects). By a quadratic programming approximation, the complexity of optimizing our objective function can be linear in the training data.
3. Our proposed objective function makes the best use of the unimportant regions to trade off consistency of meaningful objects (in important regions) against content continuity in non-important regions. It also enables parameter adjustment to favor results matching user preferences (shown in Fig. 4).
2 Visual Retargeting Framework
In this section, we describe a general content-aware retargeting framework from the point of view of a supervised learning process. To keep important regions well at the cost of distorting less important ones, retargeting methods work out an optimal mapping function g such that

    g : IM → SP   subject to boundary constraints                (1)
IM is the importance of pixels/regions, and SP denotes the desired pixel- or region-wise changes such as removing, shrinking, stretching and preserving.

2.1 Retargeting Transformation in Either Local or Global Manner
Our framework aims to abstract any retargeting transformation in a unified way from the local and global points of view. A typical learning problem involves training data (im_1, sp_1), ..., (im_n, sp_n) containing input vectors im and corresponding outputs sp. The mapping function g(im) approximates the output from the training data. The approximation can be local or global.

Local Methods. For a local method, the input/output data come from a local region, {(im_{k1,e1}, sp_{k1,e1}), ..., (im_{kn,en}, sp_{kn,en})}, where {(k_1, e_1), ..., (k_n, e_n)} form the local region and (k_i, e_i) is the position of a pixel. The function g is a local approximation of the output, similar to K-nearest neighbors. In the training data, sp may be set to several values for different region operations such as removing and shrinking. For example, an image is partitioned into several regions according to importance measurements, where the regions can be determined in different ways, such as detecting objects, locating seams with the lowest energy, or spotting a window with large importance. As a local approximation, the mapping function leads to a sort of independent retargeting of each individual region; in other words, keeping the important regions is independent of shrinking/stretching the less important regions. For simplicity, we set sp_r = 1 for a region/pixel requiring good local preservation and sp_r = -1 otherwise. The function can then be simply defined as:

    ŝp_{k,e} = g(im_{k,e}) = -1 if (k, e) ∈ unimportant region;  1 if (k, e) ∈ important region                (2)

Global Methods. For a global method, the input/output come from a whole image, {(im_{r1}, sp_{r1}), ..., (im_{rn}, sp_{rn})}, where the whole image is partitioned into regions or pixels with r_1 ∪ ... ∪ r_n = source image. The mapping function approximates the output in a global manner. To accomplish a satisfactory global fit on the training data, the empirical risk R_emp(g) is defined as

    R_emp(g) = Σ_{r_i ∈ image} L(sp_{r_i}, g(im_{r_i}))                (3)

L(sp_{r_i}, g(im_{r_i})) measures a weighted discrepancy between an original region and a target region, and g(im) is obtained by minimizing R_emp(g). As a typical global approach, mesh-based methods impose a mesh partition on the source image, and g(im) measures each region's original shape. In mesh-based methods, L(sp_{r_i}, g(im_{r_i})) is defined as:

    L(sp_{r_i}, g(im_{r_i})) = D(r_i) · w(im_{r_i})                (4)

where D(r_i) measures the distortion of the mesh r_i and w(im_{r_i}) is a weighting function of the importance of region r_i. The distortion of a mesh can be increased or reduced by adjusting w(im_{r_i}) accordingly.
2.2 On the Effectiveness of a Retargeting Method
To measure effectiveness is an important issue in designing a good retargeting method, and it is closely related to measuring the performance of the mapping function. The performance of a mapping function strongly depends on correctly choosing important regions in the training data. With noisy data, an overly complex mapping function leads to overfitting, such as distortion of an object. To select a good model, a performance measurement of the mapping function should be provided so that the best model can be selected for different types of noisy or clean data. [14] introduced the prediction error to measure the effectiveness of a mapping function. A learnt function's prediction error is related to the sum of the bias and the variance of the learning algorithm [6], which can be formulated as in [14]. For the training data Q and any im, the prediction error decomposition is:

    E_Q[(g(im; Q) − E[sp|im])²] = (E_Q[g(im; Q)] − E[sp|im])² + E_Q[(g(im; Q) − E_Q[g(im; Q)])²]                (5)

(E_Q[g(im; Q)] − E[sp|im])² is the bias term and E_Q[(g(im; Q) − E_Q[g(im; Q)])²] is the variance term. To avoid overfitting and preserve the generality of a retargeting function, our goal is to decrease the variance in fitting the particular input im. For a local method, the variance of the mapping function depends on the number of pixels k in each local region. When k is too small, the function has higher variance but lower bias; such a function exhibits higher complexity and incurs many degrees of freedom, so retargeting is sensitive to noise and may produce artifacts in objects with rich structure, as in seam carving [13,9]. When k is large (e.g., all the pixels of a cropping window), the variance is lower, so the impact of noisy data is decreased; however, taking cropping methods [4,5,1] as an example, some objects or parts may be discarded when several important objects are far from each other. For global methods [18,3], the mapping function depends not only on region importance but also on the region distribution, so the model is more complex; overall, the mapping function is smoother than that of a local method.
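To make the decomposition in (5) concrete, the toy simulation below estimates bias² and variance of a k-nearest-neighbour regressor at a fixed query point over many resampled training sets. The 1-D target function and noise level are invented for illustration and stand in for the noisy importance-map "training data" of the retargeting analogy.

```python
import numpy as np

def true_f(x):                      # the underlying clean signal E[sp | im]
    return np.sin(3 * x)

def knn_predict(x_train, y_train, x0, k):
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

def bias_variance_at(x0, k, n_train=60, noise=0.3, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(-1, 1, n_train)
        y = true_f(x) + rng.normal(0, noise, n_train)   # one noisy training set Q
        preds[t] = knn_predict(x, y, x0, k)
    bias2 = (preds.mean() - true_f(x0)) ** 2            # (E_Q[g] - E[sp|im])^2
    variance = preds.var()                              # E_Q[(g - E_Q[g])^2]
    return bias2, variance

for k in (1, 5, 25):
    b2, v = bias_variance_at(0.2, k)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={v:.4f}")
```

Small k gives low bias but high variance, and large k the reverse, mirroring the seam-level versus window-level discussion above.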
2.3 An Instance of Our Proposed Framework
As discussed above, a good retargeting method has to seek a tradeoff between bias and variance based on the quality of the training data. In this section, we present an instance that takes the quality of the importance map into account. As a visual importance map cannot recover the regions of salient objects exactly, the actual training data are noisy. We therefore choose a mapping function with lower variance to reduce the negative influence of noisy data, so our instance prefers a global method and is committed to maintaining low variance. An optimization approach with lower variance is applied to reduce the influence of noisy data: we constrain the grid-based shape transformation over rectangular grids, and the change of a grid's aspect ratio is used to measure the distortion energy in retargeting. This is advantageous over Wang's model [18], in which too many degrees of freedom often lead
to deformation of objects. Moreover, we provide the user with a few parameters to optimize the use of unimportant regions for keeping the shape of important regions. In the presence of noisy data, a lower variance reduces the influence of the noise, but the shape transformation of unimportant regions becomes less flexible, which in turn affects the preservation of the important regions' shapes. We therefore introduce user input to alleviate this disadvantage: by setting a few parameters in our objective function, the difference between the importance values of important and unimportant regions can be amplified. The objective function is described as follows.

Objective Function. We use the edges of grids rather than the coordinates of vertices to measure the distortion energy of each grid. A nonlinear objective function is employed to reallocate distortion to a large proportion of (or all) unimportant regions so as to avoid discontinuity. To minimize the grid distortion energy, the objective function is defined as:

    min Σ_{i,j} (y_i(t) − a_rs · x_j(t))^m · s_ij^n                (6)

where m is an even number no smaller than 2 and n ≥ 1, a_rs is the aspect ratio of the original grid, x_j and y_i are the width and height of the target grid g_ij, respectively, and s_ij is the importance value of grid g_ij. The weight s_ij^n controls the distortion of grid g_ij: the larger s_ij^n is, the better the grid is preserved. As our approach has high bias and low variance, restricting the shape transformation of grids comes at the expense of a less flexible adjustment of unimportant regions, which in turn affects the shape preservation of the remaining important regions. To optimize the use of the unimportant regions' shape, we introduce user input to alleviate this disadvantage. By adjusting the parameters n and m, we can obtain two types of retargeting effects: 1) the adapted results tend to preserve important regions more; or 2) more smoothness is allowed between grids within unimportant regions. More details can be found in the subsequent section.
3 Grid-Based Image Retargeting
In this section, we introduce our grid-based image retargeting, which involves rectangular-grid shape-transformation constraints as well as a nonlinear objective function that reallocates distortion to less important regions. Our method consists of three basic stages. First, we compute a gradient map and a visual attention map to determine important regions. Second, we divide the source image into grids, each of which receives an importance measure; the grid optimization model is solved at the granularity of grids, and the optimal solution is applied to transform source grids into target grids. Finally, the retargeted image is produced by a grid-based texture mapping algorithm [2].

3.1 Importance Map
We combine a gradient map and a visual attention map [16] to generate the importance map, which is defined as:

    IMS_{k,e} = α · GS_{k,e} + (1 − α) · AS_{k,e},   α ≥ 0                (7)
Fig. 2. Influence of the importance variant’s power on the resulting weights
GS_{k,e} and AS_{k,e} are the importance values of the pixel at position (k, e) in the gradient map and the attention map, respectively. For different types of images, such as nature and buildings, different empirical values of α can be set to obtain good results. It is worth noting that for homogeneous regions the distortions are rarely perceived by humans after transformation, while for irregular textured regions, i.e., regions with higher values in the gradient map, people can also tolerate such distortions. Consequently, adding an attention map reduces the overall importance values in irregular textured regions.
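A minimal sketch of the importance-map fusion in (7): the gradient map is computed with Sobel filters and combined with a precomputed visual-attention (saliency) map. The normalization choices and the default α are assumptions made for the example, not values prescribed by the paper.

```python
import numpy as np
from scipy import ndimage

def importance_map(gray, attention, alpha=0.5):
    """IMS = alpha * GS + (1 - alpha) * AS, with both maps scaled to [0, 1].

    gray:      2-D float array, the luminance channel of the source image
    attention: 2-D float array, a saliency/attention map of the same size
    """
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    gs = np.hypot(gx, gy)
    gs = gs / (gs.max() + 1e-12)                 # gradient map GS
    att = attention / (attention.max() + 1e-12)  # attention map AS
    return alpha * gs + (1.0 - alpha) * att
```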
3.2 Grid-Based Resizing Model
The grids are constructed as follows. An image is divided into N × N grids, denoted by M = (V, E), in which V contains the 2D grid coordinates and E the edges of the grids. Each grid is denoted by G = {g_11, g_12, ..., g_ij, ..., g_NN} with its location (i, j). Owing to the rectangular-grid constraint, all the grids in a row have the same height and all the grids in a column have the same width, so the edges are simply denoted by E = {(x_1, y_1), ..., (x_N, y_N)}, where x_j and y_i are the width and the height of grid g_ij, respectively. Clearly, our model has fewer parameters to optimize than [18].

Computing the importance of a grid. With the uniform division into N × N grids, the importance of grid g_ij is calculated as:

    s_ij = ( Σ_{(k,e) ∈ g_ij} IMS_{k,e} / S_total ) × N²                (8)

S_total is the sum of the importance values of all the pixels in the image. A mean importance value of 1 is thus imposed to separate important grids from unimportant ones, instead of a value between 0 and 1 as in [18]; a grid is considered unimportant if its importance value is less than 1. From Fig. 2, for an important grid g_ij, the larger n is, the bigger s_ij^n becomes, while for an unimportant grid g_ij, the larger n is, the smaller s_ij^n becomes. Thus increasing n further raises the importance of important grids while decreasing that of unimportant grids. However, n cannot be too large; otherwise it would break the visual continuity in unimportant regions. Empirically, the range of n is [1,7].
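The per-grid importance of (8) reduces to block sums of the importance map, normalized so that the mean grid importance is 1. The sketch below assumes, for brevity, that the image dimensions are (cropped to be) divisible by N.

```python
import numpy as np

def grid_importance(ims, n=20):
    """s_ij = (sum of IMS over grid g_ij / S_total) * N^2 (mean value 1)."""
    h, w = ims.shape
    gh, gw = h // n, w // n
    blocks = ims[:gh * n, :gw * n].reshape(n, gh, n, gw)
    block_sums = blocks.sum(axis=(1, 3))   # (N, N) per-grid importance sums
    return block_sums / ims.sum() * n * n
```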
Fig. 3. The effects of choosing different m on the retargeted image (ars = 2)
Boundary constraints. We introduce the following constraints:

    Σ_{i=1}^{N} y_i(t) = H_T,    Σ_{j=1}^{N} x_j(t) = W_T                (9)

H_T and W_T are the height and width of the target image, respectively. Note that the minimum height or width of a grid is set to one pixel, as adjacent grids should not overlap.

The Objective Function. To minimize the sum of the grid distortion energies, we employ the objective function (6) introduced in Section 2. Increasing m improves the continuity of the whole image: with a larger m, grids with similar weights are subject to similar shape changes, and the edge disparity between heavily weighted and lightly weighted grids also becomes smaller. To clarify this, Fig. 3 shows a three-dimensional graph of the objective function for m = 2 and m = 4. For simplicity, the images are divided into 2 × 2 grids and the heights of the grids remain unchanged; the 3D graph is projected onto a 2D curve to indicate the flat trend for m = 2 and the steep trend for m = 4, and a two-way arrow and a hollow circle illustrate how the solutions constrained by the equality constraint move as m increases. In addition, m cannot be too large; otherwise the effect is similar to simply scaling the image. We empirically set m = 2, 4, 6.

Global Solution. To obtain a global solution, we employ an active-set method to solve this optimization problem. The nonlinear program is a convex program, and any local solution of a convex program is a global solution: each term (y_i(t) − a_rs · x_j(t))^m · s_ij^n is a convex function, so the objective, a sum of such terms, is convex. Moreover, the equality constraints are linear and the inequality constraints define a convex feasible set. Hence, once a local solution is found, it is also the global solution.
For this convex program, the Hessian matrix of the objective function is positive semidefinite. The complexity is similar to that of a linear program and depends on the number of grids rather than the actual resolution of the image.
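Under the stated convexity, the resizing model can be prototyped with an off-the-shelf constrained solver. The sketch below minimizes Σ_ij (y_i − a_rs x_j)^m s_ij^n subject to the equality constraints (9) and the one-pixel lower bounds, using SciPy's SLSQP routine as a stand-in for the active-set solver used in the paper; the parameter defaults are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def retarget_grid(s, h_src, w_src, h_t, w_t, m=4, n_pow=3):
    """Solve for target row heights y (N,) and column widths x (N,)."""
    N = s.shape[0]
    a_rs = (h_src / N) / (w_src / N)          # aspect ratio of an original grid
    weights = s ** n_pow                      # importance weights s_ij^n

    def objective(v):
        y, x = v[:N], v[N:]
        d = y[:, None] - a_rs * x[None, :]    # per-grid distortion terms
        return np.sum((d ** m) * weights)

    cons = [{'type': 'eq', 'fun': lambda v: v[:N].sum() - h_t},
            {'type': 'eq', 'fun': lambda v: v[N:].sum() - w_t}]
    bounds = [(1.0, None)] * (2 * N)          # each grid stays at least one pixel
    v0 = np.concatenate([np.full(N, h_t / N), np.full(N, w_t / N)])
    res = minimize(objective, v0, method='SLSQP',
                   bounds=bounds, constraints=cons)
    return res.x[:N], res.x[N:]
```

Because m is even, each term is convex, so the local optimum returned by the solver is also the global one, in line with the argument above.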
4 Experiments
To evaluate the effectiveness and efficiency of our retargeting method, we conducted experiments on a variety of images. In total, 120 images were collected to cover six typical classes: landscape, structure (e.g., indoor/outdoor architecture), computer graphics & painting, and daily life (with a human/animal as the subject), where daily life is further divided into long shot, medium shot, and close-up. To avoid confusion in picture classification, we apply two rules: (1) an image is classified into daily life as long as it contains a human/animal; otherwise, (2) an image containing a building is classified into structure, no matter whether it belongs to computer graphics/painting or not. These categories relate, in a sense, to different types of noisy or clean data. For instance, computer graphics & painting or a long shot is comparatively cleaner (i.e., the importance map tends to delineate the salient object exactly) than the other classes.

Efficiency. Our algorithm was implemented on a standard laptop with a 2.26 GHz dual-core CPU and 2 GB memory. Our dataset consists of images of diverse sizes and aspect ratios; the largest image size is 1920×1200 and the smallest is 288×240. Each optimization costs less than 0.015 s for 20×20 grids. As the complexity of our algorithm depends on the grid division instead of the actual image resolution, our algorithm is much more efficient than seam carving.

Effectiveness. Fig. 4 illustrates the effects of the parameters m and n on the retargeting results. Note that the girl's head and the volleyball have higher importance values, so they are kept well in Fig. 4(c)(d)(e)(f). Comparing Fig. 4(c)(d)(e), we find that for the same m, increasing n better preserves the shape of the important region (e.g., the girl's head) while the unimportant regions (background) are distorted more. Comparing Fig. 4(d)(f), we find that for the same n, increasing m improves the visual content continuity in unimportant regions. In the subsequent user study, most subjects prefer (f), since its overall consistency is best, even though the grids covering the girl are somewhat squeezed. Furthermore, our method is compared with scaling and two other representative methods [18,9]. Note that when the noisy importance maps of source images (especially for structure and close-up) cannot delineate important objects from non-important regions, the retargeting problem becomes more challenging. In our empirical dataset, each type has at least 20 images. Fig. 5 demonstrates that our method is more effective in preserving the consistency of objects and the continuity of unimportant regions, even in the case of serious noise (e.g., the sixth row in Fig. 5). In contrast, the seam carving method [9] brings about considerable shape artifacts on objects, especially more structured ones (rows 1, 2, 3, 4, 6 and 7 in Fig. 5), since, in object regions, some seams
Fig. 4. Results for a daily life picture at medium shot. (a) Original image. (b) Scaling. (c) m = 2, n = 1. (d) m = 2, n = 3. (e) m = 2, n = 5. (f) m = 4, n = 3.
with lower importance are falsely removed. Wang's method [18] distorts objects, as shown in rows 1, 2, 3 and 4 of Fig. 5. When an important object covers several grids whose importance values differ from each other, most grids with low importance become irregular quadrilaterals after retargeting, so Wang's method may fail to preserve the structure of the object.

User Study. A subjective evaluation was further performed by a user study. The results of several popular methods were shown to subjects, and effectiveness was measured quantitatively by user preference and scoring. In total, 10 students participated in the user study. We showed each participant an original image and a randomly ordered sequence of retargeting results from different methods, including scaling, non-homogeneous resizing [8], seam carving [9], Wang's method [18] and our method. Each participant was required to choose the result most visually similar to the original image. As listed in Table 1, most participants prefer our results. Referring to Fig. 5, let us investigate the results. For landscape, our results are comparable to Wang's method, but seam carving may greatly alter the content of an image, sometimes without any noticeable distortion (see the 7th row in Fig. 5). For a long shot, most methods except seam carving and scaling produce similar results, because the unimportant region occupies a large portion of the image and has low values in the computed importance map; seam carving, however, tends to change the depth of field (see the fifth row of Fig. 5). For a close-up, subjects prefer our method and scaling, since the computed importance maps are often seriously noisy: the major part of the image is important, yet these parts receive low values in the importance map, and the unimportant part is not large enough to adapt the image to the target size. Consequently, non-homogeneous resizing, seam carving, and Wang's method distort important regions that received low importance; the objects are distorted, while our method produces smoother results. It is easy to observe that for structure, medium shot, and computer graphics & painting, users generally prefer our results, as discussed above.

Limitations. Like all content-aware retargeting methods, our method is still affected by the importance map. Our results may be reduced to those of scaling when the major part of an image is considered important by the visual content computation; in that case the retargeting results are actually not preferred by users. If we could lower the importance values in irregular textured regions, more space would be saved for important objects. Ideally, if some
Fig. 5. Comparison results. Columns from left to right: (a) source image, (b) scaling, (c) Rubinstein et al.'s results [9], (d) Wang et al.'s results [18], (e) our results. Rows from top to bottom: (1) computer graphics, (2) architecture, (3) indoor, (4) medium shot, (5) long shot, (6) close-up, (7) landscape.
Table 1. Preference statistics of ten participants in the user study (%)

Method                    Landscape   Structure   CG & Painting   Daily: Long   Daily: Medium   Daily: Close-up
Scaling                       2.1         3.5          4.5             5.1           1.3             24.4
Non-homo resizing [8]         8.4         6.2          8.9            22.4          11.8             20.2
Seam Carving [9]             15.2        10.6          9.4            11.2          12.7              1.4
Wang's method [18]           30.1        10.7         23.7            25.8          17.3             19.3
Our results                  44.2        79.5         53.5            35.6          56.9             34.7
descriptors were available that could distinguish irregular textured regions from objects, our method would be able to produce more desirable results.
5 Conclusion and Discussions
We have proposed a general content-aware retargeting framework from a supervised learning viewpoint and proposed to measure retargeting performance based on the prediction error decomposition [14]. We further proposed a grid-based retargeting model that ensures transformation consistency in model learning; this grid-based model is optimized by solving a nonlinear programming problem. There are two merits in the proposed framework: (1) our framework generalizes previous works, in that by incorporating different learning functions, many state-of-the-art retargeting methods can be derived from it; (2) thanks to its grid-based learning structure, our model is suitable for real-time applications on mobile devices.

Acknowledgements. This work was supported in part by the National Basic Research Program of China (973 Program) 2009CB320902, in part by the National Natural Science Foundation of China under grants No. 60902057 and No. 60905008, and in part by the CADAL Project and the NEC-PKU Joint Project.
References

1. Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., Cohen, M.: Gaze-based interaction for semi-automatic photo cropping. In: SIGCHI Conference on Human Factors in Computing Systems (2006)
2. Hearn, D., Baker, M.: Computer Graphics with OpenGL (2003)
3. Guo, Y., Liu, F., Shi, J., Zhou, Z., Gleicher, M.: Image retargeting using mesh parametrization. IEEE Transactions on Multimedia (2009)
4. Chen, L., Xie, X., Fan, X., Ma, W., Zhang, H., Zhou, H.: A visual attention model for adapting images on small displays. Multimedia Systems (2003)
5. Liu, H., Xie, X., Ma, W., Zhang, H.: Automatic browsing of large pictures on mobile devices. In: ACM Multimedia (2003)
6. James, G.M.: Variance and bias for general loss functions. Mach. Learn. (2003)
7. Shi, L., Wang, J., Duan, L., Lu, H.: Consumer video retargeting: context assisted spatial-temporal grid optimization. In: ACM Multimedia (2009)
8. Wolf, L., Guttmann, M., Cohen-Or, D.: Non-homogeneous content-driven video retargeting. In: ICCV (2007)
9. Rubinstein, M., Shamir, A., Avidan, S.: Improved seam carving for video retargeting. ACM Transactions on Graphics (2008)
10. Rubinstein, M., Shamir, A., Avidan, S.: Multi-operator media retargeting. ACM Transactions on Graphics (2009)
11. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Discontinuous seam-carving for video retargeting. In: CVPR (2010)
12. Gal, R., Sorkine, O., Cohen-Or, D.: Feature-aware texturing. In: Eurographics Symposium on Rendering (2006)
13. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Transactions on Graphics (2007)
14. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Comput. (1992)
15. Kopf, S., Kiess, J., Lemelson, H., Effelsberg, W.: FSCAV: fast seam carving for size adaptation of videos. In: ACM Multimedia (2009)
16. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.: Learning to detect a salient object. In: CVPR (2007)
17. Niu, Y., Liu, F., Li, X., Gleicher, M.: Warp propagation for video resizing. In: CVPR (2010)
18. Wang, Y., Tai, C., Sorkine, O., Lee, T.: Optimized scale-and-stretch for image resizing. ACM Transactions on Graphics (2008)
Understanding Video Sequences through Super-Resolution

Yu Peng, Jesse S. Jin, Suhuai Luo, and Mira Park

The School of Design, Communication and IT, University of Newcastle, Australia
{yu.peng,jesse.jin,suhuai.luo,mira.park}@uon.edu.au
Abstract. Human-centred multimedia applications are activities in which humans interact directly with multimedia in its different forms. Among all media, video is a primary resource from which people obtain sensory information. Owing to limitations of imaging devices and shooting conditions, we usually cannot acquire video records of the desired quality; this problem can be addressed by super-resolution. In this paper we propose a novel scheme for the super-resolution problem and make three contributions: (1) in the image registration stage of previous approaches, the reference image is picked by inspection or at random; we instead use a simple but efficient method to select the base image; (2) a median-value image, rather than the average image used previously, is adopted as the initialization of the super-resolution estimate; (3) we adapt traditional Cross Validation (CV) to a weighted version in the process of learning parameters from the input observations. Experiments on synthetic and real data illustrate the effectiveness of our approach.

Keywords: Super-resolution, reference image selection, median-value image, Weighted Cross Validation.
1 Introduction

With the dramatic development of digital imaging and the Internet, multimedia that seamlessly blends text, pictures, animation, and movies has become one of the most dazzling and fastest-growing areas of information technology [1]. Multimedia is widely used in entertainment, business, education, training, science simulations, digital publications, exhibits and much more, and in all of these human interaction and perception is crucial. Low-level features alone, however, cannot produce an accurate understanding of sensory information. As an important medium, video provides people with visual information but often suffers from low quality owing to the limitations of the imaging device itself and the shooting conditions. For example, people cannot recognize faces or identify car license plates in a video record with low resolution and noise. We usually tackle this problem through super-resolution, a powerful technique that reconstructs a high-resolution image from a set of low-resolution observed images while also performing de-blurring and noise removal. The 'super' here refers to the recovered characteristics being close to the original scene.
To put this work in perspective, a brief review of super-resolution is needed. As a well-studied field, it has been surveyed comprehensively by Borman and Stevenson [2] in 1998 and by Park et al. [3] in 2003. Although there are many approaches to super-resolution, the formulation of the problem usually falls into either the frequency or the spatial domain. Historically, the earliest super-resolution methods were posed in the frequency domain [4,5,6], where the underlying mechanism of resolution improvement is the restoration of frequency components beyond the Nyquist limit of the individual observed image samples [3]. However, the unrealistic assumption of global translation makes these methods too limited for wide use. Spatial-domain methods [7,8,9,10], on the other hand, soon became dominant because of their better handling of noise and their more natural treatment of point-spread blur in imaging. Many super-resolution techniques are derived from Maximum Likelihood (ML) methods, which seek the super-resolution image that maximizes the probability of the observed low-resolution input images under a given model, or from Maximum a Posteriori (MAP) methods, which use assumed prior information about the high-resolution image. Recently, Pickup [10] proposed a completely simultaneous algorithm for super-resolution. In this MAP-framework-based method, the parameters of the prior distributions are learnt automatically from the input data rather than being set by trial and error; in addition, superior results are obtained by optimizing over both the registrations and the image pixels. In this paper, we take an adaptive approach to super-resolution based on this simultaneous framework and present three improvements. First, image registration at the beginning of super-resolution is crucial for accurate reconstruction, yet most previous approaches ignore the selection of the reference image, the frame to which the other sensed images are registered; we introduce an effective criterion for reference image selection. Second, a median-value image is used as the initialization in place of the average image, and its advantage is demonstrated. Finally, we adopt Weighted Cross Validation (WCV) to learn the prior parameters; as we show, it is more efficient than traditional Cross Validation. The paper is organized as follows. Notation for reasoning about the super-resolution problem is laid out in Section 2, which also reviews popular methods for handling the super-resolution problem. Section 3 gives particular consideration to image registration and prior information, which are crucial to the whole process. Section 4 introduces the adaptive super-resolution scheme based on the simultaneous model with our three developments. All experiments and results are given in Section 5, while Section 6 concludes the paper and highlights several promising directions for further research.
2 The Notation of Super-Resolution

As we know, in the process of recording video, some spatial resolution is unavoidably lost for a number of reasons. The most popular choice is to describe this information-loss process through a generative model, i.e., a parameterized, probabilistic forward-process model of how the observed data are generated. Several comprehensive
generative models with subtle differences have been proposed [2,10,11,12,13]. Generally, these models describe the corruptions with respect to geometric transformation, illumination variation, blur caused by limited shutter speed or the sensor point-spread effect, down-sampling, and the noise that occurs within the sensor or during transmission. As in [8,14], the warping, blurring and sub-sampling of the original high-resolution image x, which has N pixels, is modelled by a sparse M × N matrix W^(k); x is assumed to generate the set of K low-resolution images y^(k), each of which has M pixels. We can express this model with the following equation:

    y^(k) = λ_α^(k) W^(k) x + λ_β^(k) + ε^(k)                (2.1)

where the scalars λ_α^(k) and λ_β^(k) characterize the global affine photometric correction resulting from multiplication and addition across all pixels, respectively, and ε^(k) is the noise occurring in this process. In terms of the generative model, the simple Maximum Likelihood (ML) solution to the super-resolution problem is the super-resolution image which maximizes the likelihood of the observed data {y^(k)}. Although ML super-resolution is very fast, it is an ill-conditioned problem, and some of its solutions are subjectively very implausible to a human viewer: the observations are corrupted by noise, whose probability is also maximized when we maximize the probability of the observations. The Maximum a Posteriori (MAP) solution is introduced to circumvent this deficit. The MAP solution is derived from Bayes' rule with a prior distribution over x to avoid infeasible solutions; it is given as:

    x_MAP = arg max_x  p(x) ∏_k p(y^(k) | x)                (2.2)
More recently, Pickup [10] proposed a simultaneous method within the MAP framework, in which the super-resolution image is found at the same time as the registration is optimized, rather than reconstructing the super-resolution image after fixing the registration of the low-resolution frames.
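As a hedged illustration of the forward model in (2.1), the snippet below synthesizes one low-resolution observation from a high-resolution image by shifting, blurring with a Gaussian PSF, decimating, applying a global affine photometric change, and adding noise. The shift-only warp and all parameter values are simplifications for the example, not the paper's exact setup.

```python
import numpy as np
from scipy import ndimage

def observe(x_hr, shift=(0.3, -0.7), psf_sigma=1.0, zoom=2,
            lam_alpha=0.95, lam_beta=4.0, noise_std=2.0, seed=0):
    """Generate y = lambda_alpha * (W x) + lambda_beta + noise for one frame."""
    rng = np.random.default_rng(seed)
    warped = ndimage.shift(x_hr, shift, order=3, mode='nearest')   # geometric warp
    blurred = ndimage.gaussian_filter(warped, psf_sigma)           # PSF blur
    low = blurred[::zoom, ::zoom]                                  # down-sampling
    y = lam_alpha * low + lam_beta                                 # photometric model
    return y + rng.normal(0.0, noise_std, low.shape)               # sensor noise
```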
3 Image Registration and Prior Distribution

Almost all multi-frame image super-resolution algorithms need an estimate of the motion relating the low-resolution inputs to sub-pixel accuracy in order to set up the constraints on the high-resolution pixel intensities that are necessary for estimating the super-resolution image. Generally speaking, image registration is the process of overlaying two or more images of the same scene taken under different conditions; it geometrically aligns two images, the reference and the sensed image. The majority of registration methods consist of four steps: feature detection, feature matching, transform-model estimation, and image re-sampling with transformation [15]. In the super-resolution setting, a typical way to register images is to find interest points in the low-resolution image set and then use robust methods such as RANSAC [16] to estimate point correspondences and compute homographies between images.
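A typical realization of that registration step, sketched with OpenCV: ORB keypoints are matched between the reference and a sensed frame, and a homography is estimated robustly with RANSAC. The detector choice and thresholds are illustrative assumptions; the paper does not prescribe them.

```python
import cv2
import numpy as np

def register_to_reference(ref_gray, img_gray, ransac_thresh=3.0):
    """Estimate the homography mapping img_gray onto ref_gray (8-bit grayscale)."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(ref_gray, None)
    kp2, des2 = orb.detectAndCompute(img_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des2, des1), key=lambda m: m.distance)
    src = np.float32([kp2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return H, inliers
```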
In standard super-resolution methods, the super-resolution image is estimated after the registration between the low-resolution observations has been fixed. Accurate registration of the low-resolution images is critical for the success of super-resolution algorithms: even very small errors in the registration of the low-resolution frames can have negative consequences for the estimation of the other super-resolution components. Many authors treat image registration and super-resolution estimation as two distinct stages. We prefer to optimize the registration at the same time as recovering the super-resolution image, since the two stages are not truly independent; the super-resolution problem is akin to a neural network in which the components and their parameters are cross-connected with one another. Besides registration, the prior distribution is another core element of super-resolution. We need it to constrain the original high-resolution image in order to steer this inverse problem away from infeasible solutions. In practice the exact selection of image priors is a tricky job, as we usually have to balance reconstruction accuracy against computational cost, since some priors are much more expensive to evaluate than others. In general, a Markov Random Field (MRF) with a smoothness function over the original image is used to enable robust reconstruction. The practicality of MRF models rests on the fact that the information contained in the local physical structure of images is sufficient to obtain a good global image representation [14]; MRFs therefore allow an image to be represented in terms of its local characteristics. A typical MRF model assumes that the image is locally smooth, except at natural discontinuities such as boundaries or edges, to avoid the excessive noise amplification at high frequencies in the ML solution. We also want the MAP solution to be a unique optimal super-resolution image; since the basic ML problem is already convex, convexity-preserving smoothness priors are attractive. Gaussian image priors quickly give a closed-form solution with a least-squares-style penalty on image gradient values. Natural images, however, contain edges, where locally high image gradients should not be smoothed out. A good substitute is the Huber function, which is quadratic for small inputs but linear for large ones, so it penalizes edges less severely than a Gaussian prior, whose potential function is purely quadratic. The smoothness functions vary with their parameters, and appropriate prior parameters considerably benefit the super-resolution results. Several methods are available for learning the values of the prior parameters, to which we will return later.
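For reference, a small sketch of a Huber-MRF penalty on image gradients, quadratic below a threshold α and linear above it, of the kind discussed here; the finite-difference neighbourhood and the default α are illustrative choices rather than the paper's settings.

```python
import numpy as np

def huber(z, alpha):
    """Huber potential: quadratic for |z| <= alpha, linear beyond."""
    a = np.abs(z)
    return np.where(a <= alpha, z ** 2, 2 * alpha * a - alpha ** 2)

def huber_mrf_prior(x, alpha=0.05):
    """Sum of Huber penalties over horizontal and vertical pixel differences."""
    dx = np.diff(x, axis=1)
    dy = np.diff(x, axis=0)
    return huber(dx, alpha).sum() + huber(dy, alpha).sum()
```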
4 The Adaptive Simultaneous Approach

4.1 Overview of the Adaptive Simultaneous Method

The simultaneous model consists of two distinct stages. As shown in Fig. 1, the first stage covers the initialization of the image registration, the point spread function, the prior parameters, and a coarse estimate of the super-resolution image. The second stage is a large optimization loop: an outer loop updates the prior parameters, the super-resolution image and the registration parameters, and two inner loops return to the beginning of the second stage whenever the maximum absolute change of any parameter updated in the outer loop exceeds a preset convergence threshold.
Fig. 1. Basic structure of the adaptive simultaneous method. The first part is initialization as the start point for super-resolution. The second part is optimizing for accurate solutions.
4.2 Reference Image Selection

Image registration, the first step of this simultaneous method, is crucial for the final super-resolution image. As explained above, photometric registration depends on geometric registration, which we usually achieve with a standard algorithm, RANSAC [16]. The problem of reference image selection, however, is too often ignored in current super-resolution techniques. In geometric registration the other images are aligned to the reference image, so an inappropriate choice of reference image leads to poor registration results and hence a poor super-resolution image. In the vast majority of papers, the authors either pick it at random or omit this step. In our proposed super-resolution scheme, we handle this problem with a simple but effective measure, the Standard Mean Deviation (SMD), with which the information content of a low-resolution image is estimated as:

    SMD = (1 / (m·n)) Σ_{k=1}^{m} Σ_{e=1}^{n} | I(k, e) − Ī |,   where  Ī = (1 / (m·n)) Σ_{k=1}^{m} Σ_{e=1}^{n} I(k, e)                (4.1)
where m and n are the numbers of rows and columns of pixels in each input observation, Ī is the mean intensity, and I(k, e) is the intensity of the pixel at position (k, e). The approach we propose for reference image selection is practically sensible: in most super-resolution cases the reference image is hard to identify by inspecting the image content, since there is no large difference, only a subtle shift, between the low-resolution images. By (4.1), a higher SMD means richer intensity variation, which implies that the image carries more information. By choosing the most informative image as the reference we retain as much detail as possible, providing a sound foundation for the subsequent operations in the whole pipeline.

4.3 Median-Value Image

In traditional methods, an average image is always chosen as the starting point for the super-resolution estimate. The average image is formed by a simple re-sampling
scheme applied to the registered input images: every pixel of the average image is a weighted combination of pixels in the observations, which were generated from the original high-resolution image according to the weights in W. Although it is very robust to noise in the observations, the average image is overly smooth, which introduces inaccuracy into the super-resolution estimate. We overcome this problem by choosing a median-value image instead, each pixel of which is obtained by:

    x̃_j = median_k { v_j^(k) }                (4.2)

    v_j^(k) = ( Σ_i W_ij^(k) (y_i^(k) − λ_β^(k)) / λ_α^(k) ) / Σ_i W_ij^(k)                (4.3)

where v_j^(k) is the value of super-pixel j recovered from the low-resolution pixels of observation y^(k), each of which is generated from the super-pixels contained within that pixel's Point Spread Function (PSF) according to the weights W; v_j^(k) therefore depends on the point spread function. The process of acquiring the median image is illustrated in Fig. 2. We obtain each pixel of the median-value image by selecting the median of the values recovered from all low-resolution images, as in (4.2); (4.3) takes care of the lighting change.
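A simplified version of this initialization: each low-resolution frame is photometrically corrected, up-sampled and shifted back onto the super-resolution lattice, and the starting estimate is the pixel-wise median of that stack rather than its mean. Working on pre-registered, up-sampled frames instead of the full W matrices is a simplification of (4.2)-(4.3) made for clarity.

```python
import numpy as np
from scipy import ndimage

def median_value_image(frames, shifts, lam_alpha, lam_beta, zoom=2):
    """frames: list of low-res arrays; shifts: their sub-pixel offsets (dy, dx)."""
    stack = []
    for y, (dy, dx), a, b in zip(frames, shifts, lam_alpha, lam_beta):
        corrected = (y - b) / a                                    # undo lighting change
        up = ndimage.zoom(corrected, zoom, order=1)                # back to HR lattice
        stack.append(ndimage.shift(up, (-dy * zoom, -dx * zoom)))  # undo registration
    stack = np.stack(stack)
    return np.median(stack, axis=0)   # np.mean here would give the average image
```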
Fig. 2. Obtaining the median-value image. Going from the original high-resolution image to the low-resolution images is the imaging process. A pixel of the median-value image depends on certain pixels of the low-resolution images, each of which is created from the original high-resolution image according to a specific point spread function.
Fig. 3. Comparing the median image with the average image. The left is the ground-truth image; the center and the right are the average and median images, respectively, obtained from 5 synthetic low-resolution images. The median image outperforms the average one, which is overly smooth.
Note that in the resulting images shown in Fig. 3, the average image is overly smooth compared with the original high-resolution image, while the median-value image preserves edge information as well as removing noise.

4.4 Learning Prior Parameters via Weighted Cross-Validation

As explained in Section 3, an appropriate prior distribution, acting as a regularizer, is crucial to keep the super-resolution image away from infeasible solutions. Different super-resolution results are obtained for different prior parameters, such as the prior strength and the 'shape' of the Huber-MRF potential. Rather than choosing the prior parameters empirically, it is more sensible to learn them from the inputs. Cross-validation is a typical method for estimating such parameters. Recently, Pickup [10] proposed a pixel-wise validation approach, which has a clear advantage over whole-image-wise cross-validation. Leave-one-out is the basic philosophy of Cross-Validation (CV); in the super-resolution case, the idea is that a good selection of prior parameters should predict a missing low-resolution image from the remaining inputs. That is, if an arbitrary observation is left out, then the super-resolution image regularized with the selected prior parameters should be able to predict it fairly well. In previous CV for super-resolution, the input observations are split into a training set, from which the super-resolution image is obtained, and a validation set, on which the error is measured. Mistakes occur, however, when some validation images are misregistered. Instead, in the pixel-wise method, validation pixels are selected at random from the collection of all pixels of all input observations. The cross-validation error is measured by holding back a small percentage of the low-resolution pixels in each image, performing Huber-MAP super-resolution using the remaining pixels, projecting the obtained super-resolution image down into the low-resolution frames under the corresponding generative model, and recording the mean absolute error. From this recorded error we determine whether the super-resolution estimate has improved and is therefore worth considering for gradient descent when we optimize the prior parameters. We leave out a small percentage of pixels from each low-resolution image and seek the prior parameters that minimize the prediction error, measured by the CV function:

    CV(θ) = Σ_k | y_v^(k) − W_v^(k) x̂_θ |                (4.5)
32
Y. Peng et al.
paper, we utilized Weighted Cross Validation (WCV) [17] method in the process of learning prior parameters, which could avoid misregistration problem as well as keep as many pixels in training set as possible. The WCV method is a “leave-weighted one-out” prediction method, we leave out part pixels of an observation in turn and optimize the value of that minimizes the prediction error. The WCG method in one image is illustrated in Fig. 4, we should notice that the operation should be implemented for every low-resolution image in each iteration. In fact, in leaving out the whole ith observation, the derivation seeks to minimize | | the prediction error , when x is the minimizer of ∑
|
,
|
(4.6)
Fig. 4. Weigthed cross validation for one low-resolution image. For each input observation, we first leave out part pixels from it as the validation set, the remaining pixels with all other images are taken as training set for super-resolution. We then project this obtained superresolution into the low-resolution frame, the error is recorded.
We could define the matrix 1,1,1,1 the jth entry, then the above minimization is equivalent to min
|
|
1,0,1
1 , where 0 is
(4.7)
The weighted cross validation method is derived in a similar manner, but we instead use a weighted “leave-one- out” philosophy. We define a new matrix 1,1,1,1 1, √1 ,1 1 , consider 0 1, where √1 is the jth diagonal entry of . By using the WCV method, we seek a super-resolution estimate to the minimization problem, min
|
|
(4.8)
Understanding Video Sequences through Super-Resolution
33
5 Experiments Results A set of selected frames from a clip of video record are used to test the proposed approach. Fig. 5 shows one of the sequences consisting of 9 frames, and the area of the interest that the logo of Newcastle University, Austrtalia. The logo sequences are unclear as a result of down-sampling, warping, and noise. In our experiment, we compare the result obtained by this raised method with the reconstructions via ML and tradition MAP. We choose Gaussian PSF with std=0.375 with zoom factor of 2 for all three methods, Huber prior with 0.08 and =0.04 for MAP, and Weighted Cross Validation with w=0.2 for learning prior parameters. As the three super-resolution images shown in Fig. 6, this proposed method outperforms the others.
Fig. 5. Input low-resolution frames from a video record. The top is one of the 9 frames in the clip. The bottom row shows the areas of interest from all 9 frames.
Fig. 6. Super-resolution obtained by three methods. The left is the ML result, in which the corruption by amplified noise is clearly visible. The middle is the reconstruction from MAP. The right is the super-resolution by the proposed adaptive simultaneous method, which shows a significant improvement.
6 Conclusion

A novel simultaneous approach to super-resolution is presented in this paper. We make three contributions. First, we introduce an effective method for reference-image selection in image registration, an issue that is usually ignored. Next, a median-value image is used in place of the average image as the initialization, and we show its advantages. Finally, we utilize weighted cross-validation for learning the prior parameters; this method keeps as many pixels as possible in the training set while avoiding the
misregistration problem. Experimental results show that this adaptive method outperforms the alternatives. Further work will include more robust methods for reference-image selection and super-resolution combined with object tracking.
References

1. Multimedia Design and Entertainment, http://www.clearleadinc.com
2. Borman, S., Stevenson, R.L.: Spatial Resolution Enhancement of Low-resolution Image Sequences. Technical report, Department of Electrical Engineering, University of Notre Dame, Notre Dame, Indiana, USA (1998)
3. Park, S.C., Park, M.K., Kang, M.G.: Super Resolution Image Reconstruction: A Technical Overview. IEEE Signal Processing Magazine, 1053–5888 (2003)
4. Tsai, R.Y., Huang, T.S.: Uniqueness and Estimation of Three Dimensional Motion Parameters of Rigid Objects with Curved Surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 13–27 (1984)
5. Tom, B.C., Katsaggelos, A.K., Galatsanos, N.P.: Reconstruction of a High Resolution Image from Registration and Restoration of Low Resolution Images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 553–557 (1994)
6. Nguyen, N., Milanfar, P., Golub, G.: Efficient Generalized Cross Validation with Applications to Parametric Image Restoration and Resolution Enhancement. IEEE Transactions on Image Processing 10(9), 1299–1308 (2001)
7. Peleg, S., Keren, D., Schweitzer, L.: Improving Image Resolution using Subpixel Motion. Pattern Recognition Letters 5(3), 223–226 (1987)
8. Irani, M., Peleg, S.: Motion Analysis for Image Enhancement: Resolution, Occlusion, and Transparency. Journal of Visual Communication and Image Representation 4, 324–355 (1993)
9. Capel, D.P.: Image Mosaicing and Super-resolution. PhD thesis, University of Oxford (2001)
10. Pickup, L.: Machine Learning in Multi-frame Image Super-resolution. PhD thesis, University of Oxford (2001)
11. Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and Robust Multi-Frame Super Resolution. IEEE Transactions on Image Processing 13(10), 1327–1344 (2004)
12. Irani, M., Peleg, S.: Improving Resolution by Image Registration. Graphical Models and Image Processing 53, 231–239 (1991)
13. Suresh, K.V., Mahesh Kumar, G., Rajagopalan, A.N.: Super-resolution of License Plates in Real Traffic Videos. IEEE Transactions on Intelligent Transportation Systems 8(2) (2007)
14. Zhao, W., Sawhney, H.S.: Is Super-resolution with Optical Flow Feasible? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 599–613. Springer, Heidelberg (2002)
15. Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21, 977–1000 (2003)
16. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
17. Chung, J., Nagy, J.G., O'Leary, D.: Weighted-GCV Method for Lanczos-Hybrid Regularization. Electronic Transactions on Numerical Analysis 28, 149–167 (2008)
Facial Expression Recognition on Hexagonal Structure Using LBP-Based Histogram Variances Lin Wang1 , Xiangjian He1,2 , Ruo Du2 , Wenjing Jia2 , Qiang Wu2 , and Wei-chang Yeh3 1
Video Surveillance Laboratory Guizhou University for Nationalities Guiyang, China
[email protected] 2 Centre for Innovation in IT Services and Applications (iNEXT) University of Technology, Sydney Australia {Xiangjian.He,Wenjing.Jia-1,Qiang.Wu}@uts.edu.au,
[email protected] 3 Department of Industrial Engineering and Engineering Management National Tsing Hua University Taiwan
[email protected]
Abstract. In our earlier work, we proposed an HVF (Histogram Variance Face) approach and demonstrated its effectiveness for facial expression recognition. In this paper, we extend the HVF approach and present a novel approach for facial expression recognition that takes into account the human perspective and understanding of facial expressions. For the first time, we propose to use the Local Binary Pattern (LBP) defined on the hexagonal structure to extract local, dynamic facial features from facial expression images. The dynamic LBP features are used to construct a static image, namely the Hexagonal Histogram Variance Face (HHVF), for the video representing a facial expression. We show that HHVFs representing the same facial expression (e.g., surprise, happiness, sadness) are similar even when the performers and frame rates differ. Therefore, the proposed approach can be utilised for dynamic expression recognition. We have tested our approach on the well-known Cohn-Kanade AU-Coded Facial Expression database and found improved accuracy of HHVF-based classification compared with the HVF-based approach. Keywords: Histogram Variance Face, Action Unit, Hexagonal structure, PCA, SVM.
1 Introduction

Human facial expression reflects human emotions, moods, attitudes and feelings. Recognising expressions can help computers learn more about human mental activities and react more sophisticatedly, so it has enormous potential in the field of human-computer interaction (HCI). Explicitly, expressions are facial muscular movements. Six basic emotions (happiness, sadness, fear, disgust, surprise and anger) were
defined in the Facial Action Coding System (FACS) [4]. FACS consists of 46 action units (AUs) which depict basic facial muscular movements. Fundamentally, how to capture expression features precisely is vital for expression recognition. Approaches to obtaining expression features can be divided into two categories: spatial and spatio-temporal. Because spatial approaches [5][10][24] do not model the dynamics of facial expressions, spatio-temporal approaches have dominated recent research on facial expression recognition. In their spatio-temporal approaches, Maja et al. [17][18][23] detected AUs by using individual-feature GentleBoost [22] templates built from Gabor wavelet features and tracked temporal AUs based on a Particle Filter (PF); SVMs [2][19][20] were then applied for classification. Petar et al. [1] treated facial expression as a dynamic process and proposed that the performance of an automatic facial expression recognition system could be improved by modeling the reliability of different streams of facial expression information using multistream Hidden Markov Models (HMMs) [1]. Although the above spatio-temporal approaches model dynamic features, the model parameters are often hard to obtain accurately. In our previous approach [3], the extraction of expression features saved the dynamic features into a Histogram Variance Face (HVF) image by computing the texture histogram variances among the frames of a face video. The common Local Binary Pattern (LBP) [12][13] was employed to extract the face texture. We classified the HVFs using Support Vector Machines (SVMs) [2][19][20] after applying Principal Component Analysis (PCA) [15][16] for dimensionality reduction. The accuracy of HVF classification was very encouraging and highly matched human perception of the original videos. In this paper, we extend the work in [3]. For the first time, we apply LBPs defined on the hexagonal structure [6] to extract Hexagonal Histogram Variance Faces (HHVFs). This novel approach not only greatly reduces the computation cost but also improves the accuracy of facial expression recognition. The rest of the paper is organised as follows. Section 2 introduces LBPs on the hexagonal structure and describes the approach to generating the HHVF image from an input video based on the LBP operator. Section 3 presents the dimensionality reduction using PCA, and the training and recognition steps using SVMs. Experimental results are demonstrated in Section 4. Section 5 concludes the paper.
2 Hexagonal Histogram Variance Face (HHVF)

The HVF image is a representation of the dynamic features in a face video [3]. An extension of the HVF represented on the hexagonal structure, namely the HHVF, is presented in this section.

2.1 Fiducial Point Detection and Face Alignment

For different expression videos, the scales and locations of human faces in frames normally vary. To make all the HHVFs have the same scale and location, it is critical to detect the facial fiducial points. Bilinear interpolation is used to scale the face images to the same size. To detect the fiducial points, we apply the Viola-Jones face detector [21], a real-time face detection scheme based on Haar-like features and AdaBoost learning. We detect and locate the positions of the eyes for the images in the Cohn-Kanade
expression database [9]. Each face image is cut and scaled according to the eyes' positions and the distance between the eyes.

2.2 Conversion from Square Structure to Hexagonal Structure

We follow the work in [7] to represent images on the hexagonal structure. As shown in Figure 1, the hexagonal pixels appear only on the columns where the square pixels are located. As illustrated in Figure 1, for a given hexagonal pixel (denoted by X but not shown in the figure), there exist two square pixels (denoted by A and B, again not shown in the figure), lying on two consecutive rows and the same column as X, such that X falls between A and B. Therefore, we can use linear interpolation to obtain the light intensity value of X from the intensities of A and B. When we display the image on the hexagonal structure, every two hexagonal rows as shown in Figure 1 are combined into one single square row with their columns unchanged.
Fig. 1. A 9x8 square structure and a constructed 14x8 hexagonal structure [7]
2.3 Preprocessing and LBP Texturising

After each input face image is aligned, cut, rescaled and normalized, we replace the values of the pixels outside the elliptic area around the face with 255, and keep the pixel values unchanged inside the elliptic face area. To eliminate illumination interference, we employ an LBP operator [6][12][13] to extract the texture (i.e., LBP) values in each masked face. LBP was originally introduced by Ojala et al. in [11] as a texture descriptor and defined on the traditional square image structure. The basic form of an LBP operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel by the grey value of the pixel (the centre). An illustration of the basic LBP operator is
Fig. 2. An example of computing LBP in a 3 × 3 neighborhood on square structure [6]
Fig. 3. An example of computing HLBP in a 7-pixel hexagonal cluster [6]
shown in Figure 2. Similar to the construction of the basic LBP on the square structure, the basic LBP on the hexagonal structure, called Hexagonal LBP (HLBP), is constructed as shown in Figure 3 [6] on a cluster of 7 hexagonal pixels. By defining the HLBPs on the hexagonal structure, the number of different patterns is reduced from 2^8 = 256 on the square structure to 2^6 = 64 on the hexagonal structure. More importantly, because all neighboring pixels of a reference pixel have the same distance to it, the grey values of the neighboring pixels make equal contributions to the reference pixel on the hexagonal structure.
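For illustration, a minimal sketch of the thresholding step is given below (not the authors' implementation). The square-grid neighbour offsets and the bit ordering are assumptions; the hexagonal variant differs only in using the 6 neighbours of a 7-pixel cluster, which yields 6-bit codes in [0, 63].

```python
import numpy as np

def lbp_code(img, r, c):
    """Basic 8-bit LBP: threshold the 3x3 neighbourhood by the centre pixel."""
    # clockwise neighbour offsets, starting at the top-left (an assumed ordering)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[r, c]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr, c + dc] >= centre:
            code |= 1 << bit
    return code                       # in [0, 255]

def hlbp_code(neighbours, centre):
    """Hexagonal LBP on a 7-pixel cluster: 6 neighbours give a 6-bit code in [0, 63]."""
    code = 0
    for bit, v in enumerate(neighbours):
        if v >= centre:
            code |= 1 << bit
    return code
```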
2.4 Earth Mover's Distance (EMD)

Earth Mover's Distance (EMD) [14] is a cross-bin approach and is able to address the shift problem caused by noise, because slight histogram shifts do not affect the EMD much. EMD is also consistent with human vision because, in most cases, two histograms have a greater EMD value if they look more different. EMD can be formalized as the following linear programming problem. Let

P = {(p_1, w_{p_1}), ..., (p_m, w_{p_m})}    (1)

be the first signature with m clusters, where p_i is the cluster representative and w_{p_i} is the weight of the cluster. Let

Q = {(q_1, w_{q_1}), ..., (q_n, w_{q_n})}    (2)

be the second signature with n clusters. Let also D = [d_{ij}] be the ground distance matrix, where d_{ij} is the ground distance between clusters p_i and q_j. Then the EMD problem is to find a flow F = [f_{ij}], where f_{ij} is the flow between p_i and q_j, that minimizes the overall cost [14]
WORK(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}    (3)
subject to the following constraints [14]:

1. f_{ij} ≥ 0, for i ∈ [1, m], j ∈ [1, n];
2. \sum_{j=1}^{n} f_{ij} ≤ w_{p_i}, for i ∈ [1, m];
3. \sum_{i=1}^{m} f_{ij} ≤ w_{q_j}, for j ∈ [1, n]; and
4. \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = min( \sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j} ).

EMD is defined as the resulting work normalized by the total flow [14]:

EMD(P, Q) = \frac{ \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij} }{ \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} }    (4)
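The linear program above can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch, not the authors' code; it assumes SciPy is available and uses the squared-Euclidean ground distance chosen later in this section.

```python
import numpy as np
from scipy.optimize import linprog

def emd(p_vals, p_w, q_vals, q_w):
    """Earth Mover's Distance between signatures (p_vals, p_w) and (q_vals, q_w)."""
    m, n = len(p_vals), len(q_vals)
    d = (np.subtract.outer(p_vals, q_vals) ** 2).ravel()   # d_ij = (p_i - q_j)^2

    # inequality constraints: row sums <= p_w, column sums <= q_w
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([np.asarray(p_w, float), np.asarray(q_w, float)])

    # equality constraint: total flow equals the smaller of the two total weights
    A_eq = np.ones((1, m * n))
    b_eq = [min(np.sum(p_w), np.sum(q_w))]

    res = linprog(d, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    flow = res.x
    return float(d @ flow / flow.sum())

# e.g. for two 64-bin grey-level histograms h1, h2:
# emd(np.arange(64), h1, np.arange(64), h2)
```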
In general, the ground distance d_{ij} can be any distance and is chosen according to the problem in question. We employ EMD to measure the distance between two histograms when calculating histogram variances in the temporal direction. In our case, p_i and q_j are the grayscale pixel values, which are in [0, 63]; w_{p_i} and w_{q_j} are the pixel distributions at p_i and q_j respectively. The ground distance d_{ij} that we choose is the square of the Euclidean distance between p_i and q_j, i.e., d_{ij} = (p_i − q_j)^2.

2.5 Histogram Variances

The following steps for computing the histogram variance are similar to the ones shown in [3].

1. Suppose that a sequence consists of P face texture images. We first break down each image evenly into M × N blocks, denoted by B_{x,y;k}, where x is the row index, y is the column index and k indicates the k-th frame in the sequence. Here, the block size, rows and columns correspond to those of the images displayed on the square structure as described in Subsection 2.2. We then calculate the gray-value histogram of every B_{x,y;k}, denoted by H_{x,y;k}, where x = 0, 1, ..., M − 1; y = 0, 1, ..., N − 1; k = 0, 1, ..., P − 1.

2. Calculate the histogram variance var(x, y):

var(x, y) = \frac{1}{P} \sum_{k=0}^{P-1} EMD(H_{x,y;k}, \mu_{x,y}),    (5)

where \mu_{x,y} is the mean histogram

\mu_{x,y} = \frac{1}{P} \sum_{k=0}^{P-1} H_{x,y;k}    (6)

and EMD(H_{x,y;k}, \mu_{x,y}) is the Earth Mover's Distance between H_{x,y;k} and \mu_{x,y}.
3. Construct an M × N 8-bit grayscale image as the HHVF. Suppose that hhvf(x, y) denotes the pixel value at coordinate (x, y) in an HHVF image. Then hhvf(x, y) is computed by

hhvf(x, y) = 255 − \frac{255 · var(x, y)}{max{var(x, y)}}.    (7)

Figure 4 shows some HHVF examples extracted from happiness, surprise and sadness videos respectively. To examine whether different block sizes affect the recognition, we obtain HHVFs with block partitions of 3 × 3 in our experiments, and then with 6 × 6 and 12 × 12.
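A compact sketch of steps 1–3 follows (a simplified illustration, not the authors' code). It assumes the input is a list of HLBP texture frames with integer values in [0, 63], and reuses the `emd` routine sketched earlier.

```python
import numpy as np

def hhvf(frames, M, N, n_bins=64):
    """Build an M x N Hexagonal Histogram Variance Face from HLBP texture frames."""
    P, H, W = len(frames), frames[0].shape[0], frames[0].shape[1]
    bh, bw = H // M, W // N
    var = np.zeros((M, N))
    bins = np.arange(n_bins)
    for x in range(M):
        for y in range(N):
            # gray-value histogram of block (x, y) in every frame
            hists = [np.bincount(f[x*bh:(x+1)*bh, y*bw:(y+1)*bw].ravel(),
                                 minlength=n_bins).astype(float) for f in frames]
            mu = np.mean(hists, axis=0)                                  # Eq. (6)
            var[x, y] = np.mean([emd(bins, h, bins, mu) for h in hists])  # Eq. (5)
    return np.uint8(255 - 255 * var / var.max())                          # Eq. (7)
```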
Fig. 4. Examples of HHVF images
3 Classifying HHVF Images Using PCA+SVMs

The HHVF records the dynamic features of an expression. As we can see in Figure 4, for the expressions of happiness, surprise and sadness, HHVFs of the same expression look similar, while HHVFs belonging to different expressions have very distinct features. To verify the discriminative power of the HHVF features, we utilise the typical facial recognition technologies PCA+SVMs.

3.1 PCA Dimensionality Reduction

In our experiments, all pixel values of an HHVF image form an n × 1 column vector z_i ∈ R^n, and an n × l matrix Z = {z_1, z_2, ..., z_l} denotes the training set, which consists of l sample HHVF images. The PCA algorithm finds a linear transformation orthonormal matrix W_{n×r} (n >> r), projecting the original high n-dimensional feature space into a much lower r-dimensional feature subspace. x_i denotes the new feature vector:

x_i = W^T · z_i, (i = 1, 2, ..., l).    (8)
The columns of the matrix W (i.e., the eigenfaces [16]) are the eigenvectors corresponding to the largest r eigenvalues of the scatter matrix S:

S = \sum_{i=1}^{l} (z_i − μ)(z_i − μ)^T    (9)

where μ is the mean image of all HHVF samples, μ = \frac{1}{l} \sum_{i=1}^{l} z_i.
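A minimal sketch of this projection step is given below (an illustration only, not the authors' code). In practice the eigenvectors are usually obtained via an SVD of the centred data rather than by forming the n × n scatter matrix explicitly, since n is large.

```python
import numpy as np

def pca_project(Z, r):
    """Z: n x l matrix of HHVF column vectors; returns W (n x r) and the r x l projections."""
    mu = Z.mean(axis=1, keepdims=True)
    # economy SVD of the centred data: the columns of U are eigenvectors of the scatter matrix
    U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    W = U[:, :r]                  # top-r eigenfaces
    X = W.T @ (Z - mu)            # projections; Eq. (8) omits the mean subtraction used here
    return W, X
```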
3.2 SVMs Training and Recognition

The SVM [2][20][19] is an effective supervised classification algorithm; its essence is to find a hyperplane that separates the positive and negative feature points with maximum margin in the feature space. Suppose α_i (i = 1, 2, ..., l) denotes the Lagrange parameters that describe the separating hyperplane ω in an SVM. Finding the hyperplane involves obtaining the nonzero solutions α_i (i = 1, 2, ..., l) of a Lagrangian dual problem. Once we have found all α_i for a given labeled training set {(x_i, y_i) | i = 1, 2, ..., l}, the decision function becomes:

f(x) = sgn( \sum_{i=1}^{l} α_i y_i K(x, x_i) + b ),    (10)
where b is the bias of the hyperplane, l is the number of training samples, x_i (i = 1, 2, ..., l) is the vector of PCA projection coefficients of an HHVF, y_i is the label of the training datum x_i, and K(x, x_i) is the kernel mapping. In this paper, we use linear SVMs, so

K(x, x_i) = ⟨x, x_i⟩,    (11)

where ⟨x, x_i⟩ represents the dot product of x and x_i. Since the SVM is basically a two-class classification algorithm, we adopt pairwise classification (one-versus-one) for the multi-class case. In the pairwise classification approach, there is a two-class SVM for each pair of classes to separate the members of one class from the members of the other. There are at most C_6^2 = 15 two-class SVM classifiers for the classification of six classes of expressions. To recognise a new HHVF image, all 15 two-class classifiers are applied and the winning class is the one that receives the most votes.
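A rough sketch of the one-versus-one voting scheme on the PCA features is given below. It is a simplified stand-in using scikit-learn's linear SVM rather than the authors' implementation; the data arrays `X` and `y` are assumed to be the PCA projections and their expression labels.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_pairwise(X, y):
    """Train one linear SVM per pair of expression classes (one-versus-one)."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = LinearSVC().fit(X[mask], y[mask])
    return models

def predict_vote(models, x):
    """Classify one HHVF feature vector by majority vote over all pairwise SVMs."""
    votes = {}
    for clf in models.values():
        c = clf.predict(x.reshape(1, -1))[0]
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)
```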
4 Experiments

4.1 Dataset

Our experiments use the Cohn-Kanade AU-Coded Facial Expression Database [9]. We randomly select 49 subjects from the database; each subject has up to 6 expressions, and the total number of expression images is 241. The image sequences belonging to the same expression have similar durations, but their frame rates vary from 15 to 30 frames per second. For each expression, we use about 80% of the HHVFs for PCA+SVMs training and use all HHVFs for classification (i.e., testing).
4.2 HHVF Generation

The faces of the selected subjects are detected and cut. These faces are then scaled to a size of 300 × 300 and aligned. The cut and rescaled images are then converted to images on the hexagonal structure; the size of the new images is 259 × 300. The HLBP operator is then applied to the images on the hexagonal structure. The final data dimension is reduced to 220 after the PCA operation.

4.3 Training and Recognition

For supervised learning, the training data (HHVFs) are labeled according to their classes. Since the Cohn-Kanade database contains only the AU-coded combinations for image sequences instead of expression definitions (i.e., surprise, happy, anger, etc.), we need to label each HHVF with an expression definition manually according to FACS before feeding it to the SVMs. In terms of human perception, we are quite confident in recognising the original image sequences of happiness and surprise, so the training data for these two classes can be labeled with high accuracy. This implies that these two expressions have evidently unique features. Our experimental results (Table 1) testify to this point with high recognition rates when we train and test only these two expressions. In Table 1, FPR stands for false positive rate.

Table 1. Recognition rates of happy and surprise HHVFs

            3 × 3 blocks              6 × 6 blocks              12 × 12 blocks
            Recognition rate   FPR    Recognition rate   FPR    Recognition rate   FPR
Happy       100%               2.17%  97.78%             2.17%  97.78%             2.17%
Surprise    97.82%             0.00%  97.82%             2.22%  97.82%             2.22%
When we manually label the HHVFs of anger, disgust, fear and sadness, nearly half of them are very challenging to classify correctly. Neither the AU-coded combinations nor human perception can satisfactorily classify these facial expression sequences, especially the anger and sadness sequences. In fact, the AU-coded prototypes in FACS (2002 version) [9] create large overlaps for these expressions. From the perspective of human vision, one expression appearance of a person may reflect several different expressions, so the image sequence of a facial expression can often be interpreted as several different expressions. An investigation [8] of facial expression recognition by humans indicates that, compared to the expressions of happiness and surprise, the expressions of anger, fear, disgust and sadness are much more difficult for people to recognise (see Table 2). Despite the above-mentioned difficulties, we have tried our best to manually label the expressions. We have also conducted the following experiments:

1. We test only the anger, disgust, fear and sadness classes. For these four tough expressions, we train a set of C_4^2 = 6 two-class classifiers and test the HHVFs based on majority voting. Table 3 shows our results.
Table 2. A recent investigation of facial expression recognition by human in [8]

                      Happy  Surprise  Angry  Disgust  Fear  Sadness
Recognition rate (%)  97.8   79.3      55.9   60.2     36.8  46.9
Table 3. Recognition rates of anger, disgust, fear and sadness HHVFs

           3 × 3 blocks              6 × 6 blocks              12 × 12 blocks
           Recognition rate   FPR    Recognition rate   FPR    Recognition rate   FPR
Angry      74.36%             13.51% 76.92%             13.51% 74.36%             14.41%
Disgust    73.68%             10.71% 73.68%             9.82%  71.05%             10.71%
Fear       68.57%             8.69%  71.43%             7.83%  68.57%             8.69%
Sadness    68.42%             5.36%  68.42%             5.36%  68.42%             5.35%
2. We consider all HHVFs and train C_6^2 = 15 two-class classifiers. We use this set of classifiers to recognise all HHVFs based on voting. The experimental results are shown in Table 4.

Table 4. Recognition rates of all sorts of HHVFs

           3 × 3 blocks              6 × 6 blocks              12 × 12 blocks
           Recognition rate   FPR    Recognition rate   FPR    Recognition rate   FPR
Happy      93.33%             0.0%   95.56%             0.0%   95.56%             0.0%
Surprise   91.3%              0.51%  93.48%             0.51%  93.48%             0.51%
Angry      82.05%             7.43%  82.05%             5.94%  76.92%             5.94%
Disgust    78.95%             4.93%  81.58%             4.43%  81.58%             4.93%
Fear       77.14%             5.34%  80.00%             5.34%  77.14%             5.34%
Sadness    78.95%             0.49%  78.95%             0.49%  78.95%             1.48%
4.4 Discussion

1. From Table 1, we can see that both happy and surprise HHVFs produce very high recognition rates. For example, the happy HHVFs reach a 100% recognition rate with only a 2.17% false positive rate (FPR). These results coincide with our observation of the original image sequences, as humans can also easily identify the original happy and surprise sequences from the Cohn-Kanade database.

2. From Table 3, the recognition rates for the anger, fear, disgust and sadness HHVFs are much lower. This reflects the challenges we encountered when labeling the training data: nearly half of the training data for these four expressions could not be labeled with full confidence because of the entanglement of expression features.

3. Table 4 shows the recognition results when all six expressions are fed to the SVMs for training. We can see that the happy and surprise HHVFs still stand out, while the rest are hampered by the entanglement of features. Taking into account our difficulties in labeling the training data, the recognition rates in Table 4 are reasonable.
4. The size of blocks is not critical to our results, although the segmentation into 6 × 6 blocks has the best performance in our experiments as shown in Table 4.
5 Conclusions

Our experiments demonstrate that the HHVF is an effective representation of the dynamic and internal features of a face video or an image sequence. They show that the accuracy has been improved compared with the results in [3]: when considering only the 6 × 6 block size, the average accuracy increases from 83.68% (see Table 5 in [3]) to 85.27% (see Table 4 above). The HHVF integrates the dynamic features of a certain duration of expression into a static image, through which static facial recognition approaches can be utilised to recognise dynamic expressions. The application of HHVFs fills the gap between expression recognition and facial recognition.
Acknowledgements This work is supported by the Houniao Program through Guizhou University for Nationalities, China.
References

1. Aleksic, P.S., Katsaggelos, A.K.: Automatic facial expression recognition using facial animation parameters and multistream HMMs. In: Information Forensics and Security, vol. 1, pp. 3–11 (2006)
2. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
3. Du, R., Wu, Q., He, X., Jia, W., Wei, D.: Local binary patterns for human detection on hexagonal structure. In: IEEE Workshop on Applications of Computer Vision, pp. 341–347 (2009)
4. Ekman, P., Friesen, W.: Facial Action Coding System. Consulting Psychologists Press, Palo Alto (1978)
5. Feng, X., Pietikäinen, M., Hadid, A.: Facial expression recognition with local binary patterns and linear programming. Pattern Recognition and Image Analysis 15(2), 546–548 (2005)
6. He, X., Li, J., Chen, Y., Wu, Q., Jia, W.: Local binary patterns for human detection on hexagonal structure. In: IEEE International Symposium on Multimedia, pp. 65–71 (2007)
7. He, X., Wei, D., Lam, K.-M., Li, J., Wang, L., Jia, W., Wu, Q.: Local binary patterns for human detection on hexagonal structure. In: Advanced Concepts for Intelligent Vision Systems (to appear, 2010)
8. Jinghai, T., Zilu, Y., Youwei, Z.: The contrast analysis of facial expression recognition by human and computer. In: ICSP, pp. 1649–1653 (2006)
9. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 46–53 (2000)
10. Littlewort, G., Bartlett, M., Fasel, I., Susskind, J., Movellan, J.: Dynamics of facial expression extracted automatically from video. In: CVPR (2004)
11. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59 (1996)
12. Ojala, T., Pietikäinen, M., Mäenpää, T.: Gray scale and rotation invariant texture classification with local binary patterns. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 404–420. Springer, Heidelberg (2000)
13. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence 24, 971–987 (2002)
14. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
15. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
16. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
17. Valstar, M., Patras, I., Pantic, M.: Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2005)
18. Valstar, M., Pantic, M.: Fully automatic facial action unit detection and temporal analysis. In: Computer Vision and Pattern Recognition Workshop, vol. 3, June 17-22, p. 149 (2006)
19. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
20. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
21. Viola, P., Jones, M.J.: Robust real-time object detection. In: ICCV (2001)
22. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor-feature-based boosted classifiers. In: SMC, pp. 1692–1698 (2005)
23. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor-feature-based boosted classifiers. In: Systems, Man and Cybernetics (ICSMC), vol. 2, pp. 1692–1698 (2005)
24. Ying, Z., Fang, X.: Combining LBP and AdaBoost for facial expression recognition. In: ICSP, October 26-29, pp. 1461–1464 (2008)
Towards More Precise Social Image-Tag Alignment Ning Zhou1,2 , Jinye Peng1 , Xiaoyi Feng1 , and Jianping Fan1,2 1
School of Electronics and Information, Northwestern Polytechnical University, Xi’an, P.R. China {jinyepeng,fengxiao}@nwpu.edu.cn 2 Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA {nzhou,jfan}@uncc.edu
Abstract. Large-scale user-contributed images with tags are increasingly available on the Internet. However, the uncertainty of the relatedness between the images and the tags prevents them from being precisely accessible to the public and from being leveraged for computer vision tasks. In this paper, a novel algorithm is proposed to better align images with their social tags. First, image clustering is performed to group the images into a set of image clusters based on their visual similarity contexts. By clustering images into different groups, the uncertainty of the relatedness between images and tags can be significantly reduced. Second, random walk is adopted to re-rank the tags based on a cross-modal tag correlation network which harnesses both image visual similarity contexts and tag co-occurrences. We have evaluated the proposed algorithm on a large-scale Flickr data set and achieved very positive results. Keywords: Image-tag alignment, relevance re-ranking, tag correlation network.
1 Introduction
Recently, as online photo-sharing platforms (e.g., Flickr [7], Photostuff [9]) have become increasingly popular, massive numbers of images have been uploaded onto the Internet and have been collaboratively tagged by a large population of real-world users. User-contributed tags provide semantically meaningful descriptors of the images, which are essential for large-scale tag-based retrieval systems to work in practice [14]. In such a collaborative image tagging system, users tag the images according to their own social or cultural backgrounds, personal knowledge and interpretation. Without controlling the tagging vocabulary and behavior, many tags in the user-provided tag list might be synonyms or polysemes or even spam. Also, many tags are personal tags which are weakly related or even irrelevant to the image content. A recent work reveals that many Flickr tags are imprecise and only around 50% of tags are actually related to the image contents [11]. Therefore, the collaboratively tagged images are weakly-tagged images, because the social tags may not have exact correspondences with the underlying
image semantics. The synonyms, polysemes, spam and content-irrelevant tags in these weakly-tagged images may either return incomplete sets of the relevant images or result in large amounts of ambiguous images or even junk images [4,6,5]. It is important to better align the images with the social tags to make them more searchable and to leverage these large-scale weakly-tagged images for computer vision tasks. By achieving more accurate alignment between the weakly-tagged images and their tags, we envision many potential applications:

1. Enabling Social Images to be More Accessible: The success of Flickr proves that users are willing to manually annotate their images with the motivation to make them more accessible to the general public [1]. However, the tags are provided freely, without a controlled vocabulary, based on users' own personal perceptions. Many social tags are synonyms or polysemes or even spam, which poses a big limitation for users trying to locate the images of interest accurately and efficiently. To enable more effective social image organization and retrieval, it is very attractive to develop new algorithms that achieve more precise alignment between the images and their social tags.

2. Creating Labeled Images for Classifier Training: For many computer vision tasks, e.g., object detection and scene recognition, machine learning models are often used to learn classifiers from a set of labeled training images [2]. To achieve satisfactory performance, large-scale training examples are required, because (a) the number of object classes and scenes of interest could be very large; (b) the learning complexity for some object classes and scenes could be very high because of visual ambiguity; and (c) a small number of labeled training images is incomplete or insufficient to interpret the diverse visual properties of large amounts of unseen test images. Unfortunately, hiring professionals to label large amounts of training images is costly, which is a key limitation for the practical use of some advanced computer vision techniques. On the other hand, the increasing availability of large-scale user-contributed digital images with tags on the Internet has provided new opportunities to harvest large-scale image sets for computer vision tasks.

In this paper, we propose a clustering-based framework to align images and social tags. After crawling a large number of weakly-tagged images from Flickr, we first cluster them into different groups based on visual similarities. For each group, we aggregate all the tags associated with the images within the cluster to form a larger set of tags. A random walk process is then applied to this tag set to further improve the alignment between images and tags, based on a tag correlation network which is constructed by combining both image visual similarity contexts and tag co-occurrences. We evaluated our proposed algorithm on a Flickr database with 260,000 images and achieved very positive results.
2 Image Similarity Characterization
To achieve more effective image content representation, four grid resolutions (whole image, 64 × 64, 32 × 32 and 16 × 16 patches) are used for image partition
and feature extraction. To sufficiently characterize the various visual properties of the diverse image contents, we extract three types of visual features: (a) grid-based color histograms; (b) Gabor texture features; and (c) SIFT (scale-invariant feature transform) features. For the color features, one color histogram is extracted for each image grid; thus there are \sum_{r=0}^{3} 2^r × 2^r = 85 grid-based color histograms. Each grid-based color histogram consists of 36 RGB bins to represent the color distribution in the corresponding image grid. To extract Gabor texture features, a Gabor filter bank containing twelve 21 × 21 Gabor filters in 3 scales and 4 orientations is used. The Gabor filters are generated by using a Gabor function class. To apply the Gabor filters to an image, we need to calculate the convolutions of the filters and the image; we transform both the filters and the image into the frequency domain, take the products, and then transform them back to the spatial domain. This process computes the Gabor-filtered images more efficiently. Finally, the mean values and standard deviations are calculated from the 12 filtered images, making up the 24-dimensional Gabor texture features. By using multiple types of visual features for image content representation, we are able to characterize the various visual properties of the images more sufficiently. Because each type of visual feature characterizes one certain type of visual property of the images, the visual similarity contexts between the images are more homogeneous and can be approximated more precisely by using one particular type of base kernel. Thus one specific base kernel is constructed for each type of visual feature. For two images u and v, their color similarity relationship can be defined as:
κ_c(u, v) = \frac{1}{2} \frac{D_0(u, v)}{R} + \frac{1}{2} \sum_{r=0}^{R-1} \frac{D_r(u, v)}{R-r+1}    (1)
where R = 4 is the total number of grid resolutions for image partition, D_0(u, v) is the color similarity relationship according to their full-resolution (image-based) color histograms, and D_r(u, v) is the grid-based color similarity at the r-th resolution:

D_r(u, v) = \sum_{i=1}^{36} D(H_i^r(u), H_i^r(v))    (2)
where H_i^r(u) and H_i^r(v) are the i-th components of the grid-based color histograms at the r-th image partition resolution. Their local similarity relationship can be defined as:

κ_s(u, v) = e^{−d_s(u,v)/σ_s},    (3)

d_s(u, v) = \frac{\sum_i \sum_j ω_i(u) ω_j(v) d_1(s_i(u), s_j(v))}{\sum_i \sum_j ω_i(u) ω_j(v)}    (4)
where σ_s is the mean value of d_s(u, v) over our images, and ω_i and ω_j are the Hessian values of the i-th and j-th interest points. Their textural similarity relationship can be defined as:

κ_t(u, v) = e^{−d_t(u,v)/σ_t},   d_t(u, v) = d_1(g_i(u), g_j(v))    (5)
where σ_t is the mean value of d_t(u, v) over our images, and d_1(g_i(u), g_j(v)) is the L1-norm distance between two Gabor textural descriptors. The diverse visual similarity contexts between the online images can be characterized more precisely by using a mixture of these three base image kernels (i.e., mixture-of-kernels) [3]:

κ(u, v) = \sum_{i=1}^{3} β_i κ_i(u, v),   \sum_{i=1}^{3} β_i = 1    (6)
where u and v are two images, βi ≥ 0 is the importance factor for the ith base kernel κi (u, v). Combining multiple base kernels can allow us to achieve more precise characterization of the diverse visual similarity contexts between the two images.
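A small sketch of how such a mixture-of-kernels similarity could be assembled is shown below. It is illustrative only: the example base kernels and the weights β are stand-ins, not the authors' definitions or values.

```python
import numpy as np

def mixture_kernel(feats_u, feats_v, base_kernels, betas):
    """Combine base kernel values with non-negative weights that sum to one (Eq. 6)."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and abs(betas.sum() - 1.0) < 1e-9
    return sum(b * k(fu, fv)
               for b, k, fu, fv in zip(betas, base_kernels, feats_u, feats_v))

# illustrative base kernels, loosely matching the forms of Eqs. (3) and (5)
kappa_color = lambda hu, hv: 1.0 - 0.5 * np.abs(hu - hv).sum()          # histogram-based stand-in
kappa_gabor = lambda gu, gv, sigma=1.0: np.exp(-np.abs(gu - gv).sum() / sigma)
kappa_sift  = lambda su, sv, sigma=1.0: np.exp(-np.abs(su - sv).sum() / sigma)

# kappa = mixture_kernel((hu, gu, su), (hv, gv, sv),
#                        [kappa_color, kappa_gabor, kappa_sift], betas=[0.4, 0.3, 0.3])
```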
3 Image Clustering
To achieve more effective image clustering, a graph is first constructed for organizing the social images according to their visual similarity contexts [8], where each node on the graph is one particular image and an edge between two nodes is used to characterize the visual similarity context between two images, κ(·, ·). By taking such an image graph as the input measure of the pairwise image similarity, automatic image clustering is achieved by passing messages between the nodes of the image graph through affinity propagation [8]. After the images are partitioned into a set of image clusters according to their visual similarity contexts, the cumulative inter-cluster visual similarity context s(G_i, G_j) between two image clusters G_i and G_j is defined as:

s(G_i, G_j) = \sum_{u∈G_i} \sum_{v∈G_j} κ(u, v)    (7)
The cumulative intra-cluster visual similarity contexts s(G_i, G_i) and s(G_j, G_j) are defined as:

s(G_i, G_i) = \sum_{u∈G_i} \sum_{v∈G_i} κ(u, v),   s(G_j, G_j) = \sum_{u∈G_j} \sum_{v∈G_j} κ(u, v)    (8)
By using the cumulative intra-cluster visual similarity context to normalize the cumulative inter-cluster visual similarity context, the inter-cluster correlation c(G_i, G_j) between the image clusters G_i and G_j is defined as:

c(G_i, G_j) = \frac{2 s(G_i, G_j)}{s(G_i, G_i) + s(G_j, G_j)}    (9)
Three experimental results for image clustering are given in Fig. 1. From these experimental results, one can observe that visual-based image clustering can provide a good summarization of large amounts of images and discover comprehensive knowledge effectively. The images in the same cluster will share similar visual properties and their semantics can be effectively described by using a same set of tags.
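A brief sketch of this clustering step is given below, using scikit-learn's affinity propagation on a precomputed kernel matrix. It is an illustration under the assumption that the pairwise κ values have already been computed; the parameters are library defaults, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_images(K):
    """K: n x n matrix of pairwise kernel similarities kappa(u, v)."""
    ap = AffinityPropagation(affinity='precomputed', random_state=0)
    return ap.fit_predict(K)

def inter_cluster_correlation(K, labels, i, j):
    """Inter-cluster correlation c(G_i, G_j) of Eq. (9)."""
    Gi, Gj = np.where(labels == i)[0], np.where(labels == j)[0]
    s_ij = K[np.ix_(Gi, Gj)].sum()
    s_ii = K[np.ix_(Gi, Gi)].sum()
    s_jj = K[np.ix_(Gj, Gj)].sum()
    return 2.0 * s_ij / (s_ii + s_jj)
```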
Fig. 1. Image clustering results and their associated tags
4 Image and Tag Alignment
In order to filter out the misspelled and content-irrelevant tags, we adopt a knowledge-based method [12] to discern content-related tags from content-unrelated ones, and we retain only the content-related tags to form our tag vocabulary. In particular, we first choose a set of high-level categories, including "organism", "artifact", "thing", "color" and "natural phenomenon", as a taxonomy of the domain knowledge in the computer vision area. The content relatedness of a particular tag is then determined by resorting to the WordNet [15] lexicon, which preserves a semantic structure among words. Specifically, for each tag, if it is in the noun set of WordNet, we traverse along the path of hypernyms of the tag until one of the pre-defined categories is matched. If a match is found, the tag is considered content-related; otherwise it is deemed content-unrelated. Rather than indexing the visually similar images in the same cluster by loosely using all these tags, a novel tag ranking algorithm is developed for aligning the tags with the images according to their relevance scores. The alignment algorithm works as follows: (a) the initial relevance scores for the tags are calculated via tag-image co-occurrence probability estimation; (b) the relevance scores for these tags are then refined according to their inter-tag cross-modal similarity contexts based on the tag correlation network; and (c) the most relevant tags (the top k tags with the largest relevance scores) are selected automatically for image semantics description. Our contributions lie in integrating the tag correlation network and random walk for automatic relevance score refinement.
4.1 Tag Correlation Network
When people tag images, they may use multiple different tags with similar meanings to describe the semantics of the relevant images alternatively. On the other hand, some tags have multiple senses under different contexts. Thus the tags are strongly inter-related and such inter-related tags and their relevance scores should be considered jointly. Based on this observation, a tag correlation network is generated automatically for characterizing such inter-tag similarity contexts more precisely and providing a good environment to refine the relevance
Fig. 2. Different views of our tag correlation network
scores for the inter-related tags automatically. In our tag correlation network, each node denotes a tag, and each edge indicates the pairwise inter-tag correlation. The inter-tag correlations consist of two components: (1) inter-tag co-occurrences; and (2) inter-tag visual similarity contexts between their relevant images. Rather than constructing such a tag correlation network manually, an automatic algorithm is developed for this aim. For two given tags (one tag pair) t_i and t_j, their visual similarity context γ(t_i, t_j) is defined as:

γ(t_i, t_j) = \frac{1}{|C_i||C_j|} \sum_{u∈C_i} \sum_{v∈C_j} κ(u, v)    (10)
where C_i and C_j are the image sets for the tags t_i and t_j, |C_i| and |C_j| are the numbers of web images in C_i and C_j, and κ(u, v) is the kernel-based visual similarity context between two images u and v within C_i and C_j respectively. The co-occurrence β(t_i, t_j) between two tags t_i and t_j is defined as:

β(t_i, t_j) = −P(t_i, t_j) \log \frac{P(t_i, t_j)}{P(t_i) + P(t_j)}    (11)
where P(t_i, t_j) is the co-occurrence probability for the two tags t_i and t_j, and P(t_i) and P(t_j) are the occurrence probabilities for the tags t_i and t_j. Finally, we define the cross-modal inter-tag correlation between t_i and t_j by combining their visual similarity context and co-occurrence:

φ(t_i, t_j) = α · γ(t_i, t_j) + (1 − α) · β(t_i, t_j)    (12)
where α is the weighting factor and it is determined through cross-validation. The combination of such cross-modal inter-tag correlation can provide a powerful framework for re-ranking the relevance scores between the images and their tags.
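A rough sketch of how the two components of Eqs. (10)–(12) could be combined is shown below. It is illustrative only; the tag-to-image index, the kernel matrix, the probability tables and the value of α are assumptions.

```python
import numpy as np

def tag_correlation(ti, tj, tag2imgs, K, P_single, P_pair, alpha=0.5):
    """Cross-modal inter-tag correlation phi(t_i, t_j) of Eq. (12)."""
    Ci, Cj = tag2imgs[ti], tag2imgs[tj]            # image index lists for each tag

    # visual similarity context, Eq. (10)
    gamma = K[np.ix_(Ci, Cj)].sum() / (len(Ci) * len(Cj))

    # co-occurrence term, Eq. (11)
    p_ij = P_pair.get((ti, tj), 0.0)
    beta = 0.0 if p_ij == 0 else -p_ij * np.log(p_ij / (P_single[ti] + P_single[tj]))

    return alpha * gamma + (1.0 - alpha) * beta
```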
The tag correlation network for tagged image collections from Flickr is shown in Fig. 2, where each tag is linked to the multiple most relevant tags whose values of φ(·, ·) are over a threshold. By seamlessly integrating the visual similarity contexts between the images and the semantic similarity contexts between the tags for tag correlation network construction, the tag correlation network provides a good environment for addressing the issues of polysemy and synonyms more effectively and for disambiguating the image senses precisely, which may allow us to find more suitable tags for more precise image-tag alignment.
4.2 Random Walk for Relevance Refinement
In order to take advantage of the tag correlation network to achieve more precise alignment between the images and their tags, a random walk process is performed for automatic relevance score refinement [13,10,16]. Given the tag correlation network with the n most frequent tags, we use ρ_k(t) to denote the relevance score for the tag t at the k-th iteration. The relevance scores for all these tags at the k-th iteration form a column vector ρ(t) ≡ [ρ_k(t)]_{n×1}. We further define Φ as an n × n transition matrix, whose element φ_{ij} defines the probability of the transition from the tag i to its inter-related tag j. φ_{ij} is defined as:

φ_{ij} = \frac{φ(i, j)}{\sum_k φ(i, k)}    (13)
where φ(i, j) is the pairwise inter-tag cross-modal similarity context between i and j as defined in (12). The random walk process is thus formulated as:

ρ_k(t) = θ \sum_{j∈Ω_j} ρ_{k−1}(j) φ_{tj} + (1 − θ) ρ(C, t)    (14)
where Ω_j is the set of first-order nearest neighbors of the tag j on the tag correlation network, ρ(C, t) is the initial relevance score for the given tag t, and θ is a weight parameter. This random walk process promotes the tags which have many nearest neighbors on the tag correlation network, e.g., the tags which have close visual-based interpretations of their semantics and higher co-occurrence probabilities. On the other hand, it weakens the isolated tags on the tag correlation network, e.g., the tags which have weak visual correlations and low co-occurrence probabilities with other tags. The random walk process is terminated when the relevance scores converge. For a given image cluster C, all its tags are re-ranked according to their relevance scores. By performing random walk over the tag correlation network, the refinement algorithm can leverage both the co-occurrence similarity and the visual similarity simultaneously to re-rank the tags more precisely. The top-k tags, which have the highest relevance scores with respect to the image semantics, are then selected as the keywords to interpret the images in the given image cluster C. Such an image-tag alignment process provides better understanding of the cross-media information (images and tags) as it couples different sources of information
Fig. 3. Image-tag alignment: (a) image cluster; (b) ranked tags before performing random walk; (c) re-ranked tags after performing random walk
together and allows us to resolve the ambiguities that may arise from single-media analysis. Some experimental results for re-ranking the tags are given in Fig. 3. From these results, one can observe that our image-tag alignment algorithm can effectively find the most relevant keywords to better align images with their corresponding tags.
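A compact sketch of the iterative re-ranking of Eqs. (13)–(14) is given below (a simplified illustration; the initial scores, θ, the convergence tolerance and the assumption that every tag has at least one neighbour are ours, not the authors').

```python
import numpy as np

def rerank_tags(Phi_raw, rho0, theta=0.8, tol=1e-6, max_iter=1000):
    """Random-walk refinement of tag relevance scores.

    Phi_raw : n x n matrix of cross-modal correlations phi(i, j)
    rho0    : length-n vector of initial relevance scores rho(C, t)
    """
    # row-normalize the correlations into transition probabilities, Eq. (13)
    Phi = Phi_raw / Phi_raw.sum(axis=1, keepdims=True)
    rho = rho0.copy()
    for _ in range(max_iter):
        # Eq. (14): propagate scores from neighbours, blended with the initial scores
        rho_new = theta * (Phi @ rho) + (1.0 - theta) * rho0
        if np.abs(rho_new - rho).max() < tol:
            break
        rho = rho_new
    return rho   # sort tags by descending rho to obtain the re-ranked list
```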
5 Algorithm Evaluation
All our experiments are conducted on a Flickr data set collected using the Flickr API; there are 260,000 weakly-tagged images in this set. To assess the effectiveness of our proposed algorithms, our evaluation focuses on the effectiveness of image clustering and of random walk in our image-tag alignment algorithm. The accuracy rate is used to assess the effectiveness of the algorithms for image-tag alignment, given as:

accuracy = \frac{\sum_{i=1}^{N} δ(L_i, R_i)}{N}    (15)

where N is the total number of images, L_i is the set of most relevant tags for the i-th image obtained by the automatic image-tag alignment algorithms, and R_i is the set of keywords for the i-th image given by a benchmark image set. δ(x, y) is a delta function:

δ(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise} \end{cases}    (16)
Fig. 4. Image-tag alignment accuracy rate, where top 20 images are evaluated interactively. Average accuracy: Without Clustering = 0.6516, Without Random Walk = 0.7489, Integration = 0.8086.
Fig. 5. Image-tag alignment accuracy rate, where top 40 images are evaluated interactively. Average accuracy: Without Clustering = 0.5828, Without Random Walk = 0.6936, Integration = 0.7373.
It is hard to obtain a suitable large benchmark image set for our algorithm evaluation task. To avoid this problem, an interactive image navigation system is designed to allow users to provide their assessments of the relevance between the images and the ranked tags. To assess the effectiveness of image clustering and random walk for image-tag alignment, we have compared the accuracy rates of our image-tag alignment algorithm under three different scenarios: (a) image clustering is not performed to reduce the uncertainty of the relatedness between the images and their tags; (b) random walk is not performed for relevance re-ranking; and (c) both image clustering and random walk are performed to achieve more precise alignment between the images and their most relevant social tags. As shown in Fig. 4 and Fig. 5, incorporating image clustering for uncertainty reduction and performing random walk for relevance re-ranking significantly improves the accuracy rates for image-tag alignment.
6 Conclusions
In this paper, we proposed a cluster-based framework to provide more precise image-tag alignment. By clustering visually similar images into clusters, the uncertainty between images and social tags is reduced dramatically. To further refine the alignment, we adopted a random walk process to re-rank the tags based on a cross-modal tag correlation network generated using both image visual similarity and tag co-occurrences. Experimental results on a real Flickr data set have empirically justified the developed algorithm. The proposed research on image-tag alignment may enable two potential applications: (a) more effective tag-based web social image retrieval with higher precision rates, by finding more suitable tags for social image indexing; and (b) by harvesting socially tagged images from the Internet, our proposed research can create more representative image sets for training a large number of object and concept classifiers more accurately, which is a long-term goal of the multimedia research community. Acknowledgments. This research is partly supported by NSFC-61075014 and NSFC-60875016, by the Program for New Century Excellent Talents in University under Grants NCET-07-0693, NCET-08-0458 and NCET-10-0071, and by the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20096102110025).
References

1. Ames, M., Naaman, M.: Why we tag: motivations for annotation in mobile and online media. In: CHI, pp. 971–980 (2007)
2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40(2) (2008)
3. Fan, J., Gao, Y., Luo, H.: Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image annotation. IEEE Transactions on Image Processing 17(3), 407–426 (2008)
4. Fan, J., Luo, H., Shen, Y., Yang, C.: Integrating visual and semantic contexts for topic network generation and word sense disambiguation. In: CIVR (2009)
5. Fan, J., Shen, Y., Zhou, N., Gao, Y.: Harvesting large-scale weakly-tagged image databases from the web. In: Proc. of CVPR 2010 (2010)
6. Fan, J., Yang, C., Shen, Y., Babaguchi, N., Luo, H.: Leveraging large-scale weakly-tagged images to train inter-related classifiers for multi-label annotation. In: Proceedings of the First ACM Workshop on Large-scale Multimedia Retrieval and Mining, LS-MMRM 2009, pp. 27–34. ACM, New York (2009)
7. Flickr. Yahoo! (2005), http://www.flickr.com
8. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
9. Halaschek-Wiener, C., Golbeck, J., Schain, A., Grove, M., Parsia, B., Hendler, J.: PhotoStuff - an image annotation tool for the semantic web. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729. Springer, Heidelberg (2005)
10. Hsu, W.H., Kennedy, L.S., Chang, S.-F.: Video search reranking through random walk over document-level context graph. In: ACM Multimedia, pp. 971–980 (2007)
11. Kennedy, L.S., Chang, S.-F., Kozintsev, I.: To search or to label?: predicting the performance of search-based automatic image classifiers. In: Multimedia Information Retrieval, pp. 249–258 (2006)
12. Liu, D., Hua, X.-S., Wang, M., Zhang, H.-J.: Retagging social images based on visual and semantic consistency. In: WWW, pp. 1149–1150 (2010)
13. Liu, D., Hua, X.-S., Yang, L., Wang, M., Zhang, H.-J.: Tag ranking. In: WWW, pp. 351–360 (2009)
14. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th International Conference on World Wide Web (WWW 2008), Beijing, China, April 21-25, pp. 327–336 (2008)
15. Stark, M.M., Riesenfeld, R.F.: WordNet: An electronic lexical database. In: Proceedings of the 11th Eurographics Workshop on Rendering. MIT Press, Cambridge (1998)
16. Tan, H.-K., Ngo, C.-W., Wu, X.: Modeling video hyperlinks with hypergraph for web video reranking. In: Proceedings of the 16th ACM International Conference on Multimedia, MM 2008, pp. 659–662. ACM, New York (2008)
Social Community Detection from Photo Collections Using Bayesian Overlapping Subspace Clustering Peng Wu1, Qiang Fu2, and Feng Tang1 1
Multimedia Interaction and Understanding Lab, HP Labs 1501 Page Mill Road, Palo Alto, CA, USA {peng.wu,feng.tang}@hp.com 2 Dept. of Computer Science & Engineering University of Minnesota, Twin Cities
[email protected]
Abstract. We investigate the discovery of social clusters from consumer photo collections. People's participation in various social activities is the basis on which social clusters are formed. The photos that record those social activities can reflect the social structure of the people to a certain degree, depending on the extent to which the photos cover those activities. In this paper, we propose to use the Bayesian Overlapping Subspace Clustering (BOSC) technique to detect such social structure. We first define a social closeness measure that takes people's co-appearance in photos, the frequency of co-appearances, etc. into account, from which a social distance matrix can be derived. BOSC is then applied to this distance matrix for community detection. BOSC possesses two merits that fit well with the social community context: one is that it allows overlapping clusters, i.e., one data item can be assigned multiple memberships; the other is that it can distinguish insignificant individuals and exclude them from the cluster formation. The experimental results demonstrate that, compared with a partition-based clustering approach, this technique can reveal a more sensible community structure. Keywords: Clustering, Social Community.
1 Introduction

Social relationships are formed through people's social activities. An individual's daily activities define, evolve and reflect his or her social relationships with the rest of the world. Given the tremendous business value embedded in the knowledge of such relationships, many industrial and academic efforts have been devoted to revealing (partial) social relationships by studying certain types of social activities, such as email, online social networking, and instant messaging. In this paper, we present our work on discovering people's social clusters and relationship closeness by analyzing photo collections. The work presented in [1] focuses on human evaluation of the social relationships in photos but does not construct the relationships through photo analysis. In [5], the social relationships are considered known and are used to improve face recognition performance.
Outside the scope of photo media, many works have been reported that discover social relationships from other kinds of social activities, such as the one in [4], which aims to construct the social structure by analyzing emails. In [6], a graph partition based clustering algorithm is proposed to address social community detection from a collection of photos. Given a collection of photos, the faces identified through tagging or face detection are first grouped into different people identities. These identities form the vertices of the social graph, and the strength of the connection among them is measured by a social distance metric that takes a number of factors into account. Given the vertices of people and the distance measure between vertices, an undirected graph is constructed. Social community detection is then formulated as a graph clustering problem, and an eigenvector-based graph clustering approach [3] is adopted to detect the communities. Although the eigenvector-based graph clustering algorithm detects a reasonable set of communities, there are two prominent shortcomings of the graph partition based clustering approach in detecting social communities. First, one vertex can only belong to one cluster. In a social context, that is equivalent to enforcing that one person plays only one social role in society, which is almost universally untrue in real life. Secondly, every vertex has to belong to a cluster. It is common that a photo collection contains not only the active members of communities, but also some individuals who just happen to be present or are passing-through attendees of social events. However, as long as they are captured in the social graph, they will be assigned to a certain community, although they should really be treated as "noise" data from an analysis perspective. These shortcomings motivate us to use the Bayesian Overlapping Subspace Clustering technique [7] to address the social community detection challenge, as it has an intrinsic treatment of noise data entries and supports overlapping memberships. The rest of the paper is organized as follows: Section 2 introduces a distance measurement of the social closeness of people in a photo collection and the resulting distance matrix; Section 3 provides an overview of the BOSC framework and, in particular, its application to the distance matrix; experiment results are presented in Section 4; Section 5 concludes the paper with a short discussion.
2 Social Closeness

Denote all the photos in the collection as the set I = {I_1, I_2, ..., I_M}, and all the people that appear in the photos as the set P = {P_1, P_2, ..., P_N}. We identify people in the photo collection through two channels: 1) automatic face detection and clustering; 2) manual face tagging. If a person's face was manually tagged, face detection and clustering are skipped to respect the subjective indication. Otherwise, the face clustering algorithms proposed in [2] are applied to produce people identifiers. The end result of this identification process is a list of rectangles {R(P_i, I_j)} for any P_i ∈ P that is found in I_j ∈ I.

For any two people P_i and P_j, we define a function to indicate the social closeness of the pair by taking the following heuristics into account: 1) if P_i's and P_j's face locations are close/closer in photos, they are also close/closer in relationship; 2) the more faces are found in a photo, the less trustworthy the face location distance is; 3) the more co-appearances of P_i and P_j, the more trustworthy the face location distance is. These assumptions are captured in the following formula:

\[
d(P_i, P_j) = \frac{1}{q}\sum_{l=1}^{q}\big(d_{I_l}(P_i, P_j)\cdot f_{I_l} - 1\big)
\tag{1}
\]

where I_l, l = 1, ..., q, are all the photos that contain both P_i and P_j, f_{I_l} is the number of faces detected in photo I_l, and d_{I_l}(P_i, P_j) is the distance between the faces of P_i and P_j captured in photo I_l. Using the above distance measure, we derive a distance matrix X of dimension N × N, with each entry x_{uv} = d(P_u, P_v) = x_{vu}.
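To make the computation concrete, the following is a minimal Python sketch of building the distance matrix X from per-photo face detections. The representation of the detections and the use of the Euclidean distance between rectangle centers for d_{I_l} are assumptions for illustration only; the paper does not specify how the face-location distance is computed, and pairs that never co-appear are left at distance 0 here as a further simplifying choice.

```python
import itertools

import numpy as np


def face_center(rect):
    # rect = (x, y, width, height); the rectangle center stands in for the face location
    x, y, w, h = rect
    return np.array([x + w / 2.0, y + h / 2.0])


def social_distance_matrix(detections, people):
    """detections: dict photo_id -> {person_id: rect} for every face found in that photo.
    Returns the N x N symmetric distance matrix X of Eq. (1)."""
    index = {p: i for i, p in enumerate(people)}
    X = np.zeros((len(people), len(people)))
    for pi, pj in itertools.combinations(people, 2):
        terms = []
        for faces in detections.values():
            if pi in faces and pj in faces:
                d_il = np.linalg.norm(face_center(faces[pi]) - face_center(faces[pj]))
                f_il = len(faces)                 # number of faces detected in this photo
                terms.append(d_il * f_il - 1)
        if terms:                                 # average over the q co-appearance photos
            X[index[pi], index[pj]] = X[index[pj], index[pi]] = float(np.mean(terms))
    return X
```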
3 Community Detection Using BOSC

In this section, we first give an overview of the Bayesian Overlapping Subspace Clustering (BOSC) model and algorithm. We then modify the BOSC algorithm to handle distance matrices and detect overlapping communities.

3.1 BOSC Overview

Given a data matrix, the BOSC algorithm aims to find potentially overlapping dense sub-blocks and noise entries. Suppose the data matrix X has m rows, n columns and k sub-blocks. The BOSC model assumes that each sub-block j is modeled using a parametric distribution p_j(· | θ_j), [j]_1^k (i.e., j = 1, ..., k), from a suitable exponential family. The noise entries are modeled using another distribution p(· | θ_{k+1}) from the same family. The main idea behind the BOSC model is as follows. Each row u and each column v respectively have k-dimensional latent bit vectors z_r^u and z_c^v which indicate their sub-block memberships. The sub-block membership for any entry x_{uv} in the matrix is obtained by an element-wise (Hadamard) product of the corresponding row and column bit vectors, i.e., z = z_r^u ⊙ z_c^v. Given the sub-block membership z and the sub-block distributions, the actual observation x_{uv} is assumed to be generated by a multiplicative mixture model, so that

\[
p(x_{uv} \mid z_r^u, z_c^v, \theta) =
\begin{cases}
\dfrac{1}{c(z)} \displaystyle\prod_{j=1}^{k} p_j(x_{uv} \mid \theta_j)^{z_j} & \text{if } z \neq 0 \\[2ex]
p(x_{uv} \mid \theta_{k+1}) & \text{otherwise}
\end{cases}
\tag{2}
\]

where c(z) is a normalization factor that guarantees p(· | z_r^u, z_c^v, θ) is a valid distribution. If z = z_r^u ⊙ z_c^v = 0, the all-zeros vector, then x_{uv} is assumed to be generated from the noise component p(· | θ_{k+1}). The Hadamard product ensures that the matrix has uniform/dense sub-blocks with possible overlap while treating certain rows/columns as noise.

Since it can be tricky to work directly with the latent bit vectors, the BOSC model places suitable Bayesian priors on the sub-block memberships. In particular, it assumes that there are k Beta distributions Beta(α_r^j, β_r^j), [j]_1^k, corresponding to the rows, and k Beta distributions Beta(α_c^j, β_c^j), [j]_1^k, corresponding to the columns. Let π_r^{u,j} denote the Bernoulli parameter sampled from Beta(α_r^j, β_r^j) for row u and sub-block j, where [u]_1^m and [j]_1^k. Similarly, let π_c^{v,j} denote the Bernoulli parameter sampled from Beta(α_c^j, β_c^j) for column v and sub-block j, where [v]_1^n and [j]_1^k. The Beta-Bernoulli distributions are assumed to be the priors for the latent row and column membership vectors z_r^u and z_c^v. The generative process is shown in Fig. 1.

Let Z_r and Z_c be the m × k and n × k binary matrices that hold the latent row and column sub-block assignments for each row and column. Given the matrix X, the learning task is to infer the joint posterior distribution of (Z_r, Z_c) and compute the model parameters (α_r*, β_r*, α_c*, β_c*, θ*) that maximize log p(X | α_r, β_r, α_c, β_c, θ). We can then draw samples from the posterior distribution and compute the dense-block assignment for each entry.

The BOSC algorithm is an EM-like algorithm. In the E-step, given the model parameters (α_r, β_r, α_c, β_c, θ), the goal is to estimate the expectation of the log-likelihood E[log p(X | α_r, β_r, α_c, β_c, θ)], where the expectation is taken with respect to the posterior probability p(Z_r, Z_c | X, α_r, β_r, α_c, β_c, θ). The BOSC algorithm uses Gibbs sampling to approximate this expectation. Specifically, it computes the conditional probabilities of each row (column) variable z_r^{u,j} (z_c^{v,j}) and constructs a Markov chain based on these conditional probabilities. On convergence, the chain draws samples from the joint posterior distribution of (Z_r, Z_c), which in turn can be used to obtain an approximate estimate of the expected log-likelihood. In the M-step, the BOSC algorithm estimates the (α_r*, β_r*, α_c*, β_c*, θ*) that maximize this expectation.
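As a small illustration of the core of the model, the sketch below evaluates the entry-level likelihood of Eq. (2) for a single matrix entry, using one-dimensional Gaussian sub-block densities as a stand-in for the exponential-family components (the paper does not fix a particular family here). For Gaussians, the normalized product implied by c(z) is again a Gaussian, so the constant is available in closed form; the Beta-Bernoulli priors and the Gibbs sampler are not reproduced.

```python
import numpy as np
from scipy.stats import norm


def entry_likelihood(x_uv, z_r_u, z_c_v, block_params, noise_params):
    """Eq. (2): likelihood of one entry given the row/column membership bit vectors.
    block_params: list of (mean, std) for the k sub-blocks; noise_params: (mean, std)."""
    z = np.asarray(z_r_u) * np.asarray(z_c_v)        # Hadamard product of the bit vectors
    if not z.any():                                  # z = 0: entry comes from the noise component
        return norm.pdf(x_uv, noise_params[0], noise_params[1])
    active = [p for p, z_j in zip(block_params, z) if z_j]
    # The normalised product of Gaussian densities is a Gaussian whose precision is the
    # sum of the component precisions; this realises the 1/c(z) normalisation of Eq. (2).
    prec = sum(1.0 / s ** 2 for _, s in active)
    mean = sum(m / s ** 2 for m, s in active) / prec
    return norm.pdf(x_uv, mean, np.sqrt(1.0 / prec))
```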
Fig. 1. BOSC generative process
Fig. 2. BOSC generative process for distance matrix
3.2 BOSC for Distance Matrices

The BOSC framework deals with general matrices. However, distance matrices are symmetric and their rows and columns represent the same set of users in the social network, which implies that the row and column cluster assignments have to be identical, i.e., Z_r = Z_c. Suppose there are N individuals in the social network and the distance matrix is X. We can slightly modify the BOSC generative process to support the symmetric nature of the distance matrix, as shown in Fig. 2. Note that because the matrices are symmetric, we only need k Beta distributions Beta(α^j, β^j), [j]_1^k, for the individuals in the social network. The learning task now becomes inferring the joint posterior distribution of Z and computing the model parameters (α*, β*, θ*) that maximize log p(X | α, β, θ). The EM-like process is similar to the description in Section 3.1.
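The following toy sketch shows what the symmetric specialization amounts to in code: a single N x k membership matrix Z replaces (Z_r, Z_c), and the membership of a distance entry x_uv is simply Z[u] ⊙ Z[v]. The small example values are hypothetical and only meant to illustrate overlapping membership and noise entries.

```python
import numpy as np

N, k = 8, 2
# One membership matrix Z replaces (Z_r, Z_c); a person may belong to several communities.
Z = np.zeros((N, k), dtype=int)
Z[:5, 0] = 1          # community 0: persons 0-4
Z[3:, 1] = 1          # community 1: persons 3-7 (persons 3-4 belong to both)


def entry_membership(Z, u, v):
    """Sub-block membership of distance entry x_uv under the symmetric model (Z_r = Z_c)."""
    return Z[u] * Z[v]


print(entry_membership(Z, 3, 4))   # [1 1] -> the entry lies in both overlapping communities
print(entry_membership(Z, 0, 7))   # [0 0] -> all zeros, so x_07 is explained by the noise component
```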
4 Experiment Results

The image set used in the experiment contains 4 photo collections, each corresponding to a distinct social event. The 4 social events are "Birthday party" (78 photos), "Easter party" (37 photos), "Indiana gathering" (38 photos) and "Tennessee gathering" (106 photos). There are 54 individuals captured in those photos. The core people that establish linkages among these events are a couple (Dan and Oanh) and their son (Blake). The "Birthday party" is held for the son and attended by his kid friends, their accompanying parents, members of Oanh's family and the couple. The "Easter party" is attended by members of the wife's family, the couple and the son. Both the "Indiana gathering" and the "Tennessee gathering" are attended by members of the husband's family, in addition to the couple and the son. Table 1 lists the number of photos and the individuals that appear in each collection.
The clustering results using the graph partition based method [6] can be found in Table 2. Table 3 shows the clustering results obtained using BOSC. In the BOSC implementation, we set k = 4. Several observations emerge from comparing the results in Table 2 and Table 3:

• Handling of noise data: The most noticeable difference is that Cluster 4 of Table 2 is no longer a cluster in Table 3. In fact, the two individuals "adult 2" and "adult 3" do not appear in any of the clusters in Table 3. Examining the photo collection shows that "adult 2" and "adult 3" co-appear in just one photo and only with each other. Apparently, excluding such individuals from the community construction is more sensible than treating them as a singular social cluster. In addition, a few other individuals, such as "adult 8", who appears in just one photo with one active member in the "Birthday party" collection, are also excluded from the community formation in Table 3. This difference is due to BOSC's capability of treating some data entries as noise, which makes the community detection less susceptible to interference from noise.

• Overlapping cluster membership: Another key observation is that the clusters are overlapping. As seen in Table 3, Dan appears in all 4 clusters, Blake in 3 clusters and Oanh in 2 clusters. This result reflects the pivotal role played by the family in associating the different communities around them. This insight is revealed by BOSC's overlapping clustering capability, and is not achievable through the graph partition approach.

• Modeling capability and limitation: As shown in Table 3, Clusters 1 and 3 both consist of people who attended the "Indiana gathering" and the "Tennessee gathering", and Cluster 3 is actually a subset of Cluster 1. Clusters 2 and 4 consist of most of the people who attended the "Birthday party" and the "Easter party", and Cluster 2 is almost a subset of Cluster 4. One can clearly see two camps of social clusters: one is Dan's family members and the other is Blake's friends and Oanh's family members. Considering that there are twice as many "Birthday party" photos as "Easter party" photos and that the "Birthday party" is attended by both Blake's friends and Oanh's family members, such a partition is reasonable. Another factor that may contribute to the weaker distinction between Blake's friends and Oanh's family members is the EM-like algorithm's convergence to a local maximum, which can be remedied algorithmically.

Overall, the experiment results validate the merits of the BOSC algorithm in treating noise data and supporting overlapping membership.

Table 1. Photo collections and people appearing in each collection
Collection | # Photos | People
Birthday party | 78 | adult 1, adult 4, adult 5, adult 6, adult 7, adult 8, adult 9, adult 10, Alec, Anh, Blake, Blake's friend 1, Blake's friend 2, Blake's friend 3, Blake's friend 4, Blake's friend 5, Blake's friend 6, Blake's friend 7, Blake's friend 8, Blake's friend 9, Dan, Jo, Landon, Mel, Nicolas, Oanh
Easter party | 37 | Alec, Anh, Anthony, Bill, Blake, Calista, Dan, Jo, Landon, Mel, Nicolas, Nini, Oanh, adult 4, adult 2, adult 3
Indiana gathering | 38 | Allie, Amanda, Blake, Dad, Dan, Hannelore, Jennifer, Lauren, Mom, Rachel, Stan, Tom, Tracy
Tennessee gathering | 106 | Allie, Amanda, Blake, Bret, Cindy, Dad, Dan, Grace, Hannelore, Jennifer, Jillian, Katherine, Kevin, Kurt, Lauren, Mom, Oanh, Tracy, Phil, Rachel, Reid, Rich, Sandra, Stan, Tom
Table 2. Detected social clusters

Cluster | People
Cluster 1 | Stan, Lauren, Dad, Reid, Amanda, Bret, Kevin, Hannelore, Cindy, Jennifer, Tom, Rachel, Allie, Sandra, Jillian, Grace, Oanh, Phil, Tracy, Kurt, Dan, Rich, Katharine, Mom
Cluster 2 | adult 10, Blake's friend 2, Blake's friend 1, Blake's friend 5, adult 6, adult 9, Nicolas, Blake's friend 3, Blake's friend 4, Blake's friend 6, Blake's friend 7, Alec, Blake's friend 9, Blake's friend 8, adult 1, adult 5, Blake
Cluster 3 | Jo, Landon, Anthony, Anh, Bill, adult 4, adult 7, Nini, Calista, Mel, adult 8
Cluster 4 | adult 3, adult 2
Table 3. Detected social clusters using BOSC

Cluster | People
Cluster 1 | Lauren, Reid, Amanda, Bret, Kevin, Cindy, Tom, Rachel, Sandra, Oanh, Jillian, Phil, Kurt, Dan, Rich, Blake, Mom, Katherine, Dad, Stan, Hannelore, Jennifer, Allie, Grace, Tracy
Cluster 2 | adult 10, adult 9, Dan, Blake, Blake's friend 2, Blake's friend 1, Blake's friend 4, Blake's friend 6, Blake's friend 8
Cluster 3 | Lauren, Reid, Amanda, Bret, Cindy, Tom, Rachel, Jillian, Phil, Kurt, Dan, Mom, Dad, Jennifer, Allie, Grace
Cluster 4 | adult 10, Landon, adult 7, Mel, Oanh, Anh, adult 1, Calista, Dan, Blake, Anthony, Jo, Bill, Nini, Nicolas, Alec, Blake's friend 1, Blake's friend 2, Blake's friend 5, Blake's friend 6, Blake's friend 3, Blake's friend 4, Blake's friend 9, Blake's friend 7
5 Conclusions

In this work, we first described a metric to measure people's social distance by examining their co-appearances in photo collections. A subspace clustering algorithm is then applied to the social distance matrix of the people to detect the social communities embedded in the photo collections. The experiment results illustrate that meaningful social clusters within photo collections can be revealed effectively by the proposed approach.
References 1. Golder, S.: Measuring social networks with digital photograph collections. In: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia (June 2008) 2. Gu, L., Zhang, T., Ding, X.: Clustering consumer photos based on face recognition. In: Proc. of IEEE International Conference on Multimedia and Expo, Beijing, pp. 1998–2001 (July 2007) 3. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74, 036104 (2006)
4. Rowe, R., Creamer, G., Hershkop, S., Stolfo, S.J.: Automated social hierarchy detection through email network analysis. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis (August 2007) 5. Stone, Z., Zickler, T., Darrell, T.: Autotagging Facebook: Social network context improves photo annotation. In: Computer Vision and Pattern Recognition Workshops (June 2008) 6. Wu, P., Tretter, D.: Close & closer: social cluster and closeness from photo collections. In: ACM Multimedia 2009, pp. 709–712 (2009) 7. Fu, Q., Banerjee, A.: Bayesian Overlapping Subspace Clustering. In: ICDM 2009, pp. 776– 781 (2009)
Dynamic Estimation of Family Relations from Photos

Tong Zhang, Hui Chao, and Dan Tretter

Hewlett-Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94304, USA
{tong.zhang,hui.chao,dan.tretter}@hp.com
Abstract. In this paper, we present an approach to estimating dynamic relations among the major characters in family photo collections. This is a fully automated procedure which first identifies major characters in photo collections through face clustering. Then, based on demographic estimation, facial similarity, co-appearance information and people's positions in photos, social relations such as husband/wife, parents/kids, siblings, relatives and friends can be derived. A workflow is proposed to integrate the information from these different aspects in order to give the optimal result. In particular, based on the timestamps of photos, dynamic relation trees are displayed to show the evolution of people's relations over time.

Keywords: consumer photo collections, family relation estimation, dynamic relation tree, face analysis, face clustering.
1 Introduction

We can often figure out people's relations after looking at somebody's family photo collections for a while. Then, what about letting the computer automatically identify important people and their relations, and even form a family relation tree like the one shown in Fig. 1, by analyzing the photos in a family collection? Such a technology would have a wide range of applications. First of all, the major people involved in images and their relations form important semantic information which is undoubtedly useful in browsing and searching images. For example, the information may be used to generate automatic recommendations of pictures for photo products such as photo albums and calendars. Secondly, it may have other social and business values and broader uses in social networking, personalized advertisement and even homeland security. Moreover, a dynamic relation tree that shows the evolution of family relations over a relatively long period of time (e.g. multiple years) provides a kind of life log offering at-a-glance perspectives of people, events and activities. Existing work on discovering people's relations based on image content analysis is quite rare. Golder presented a preliminary investigation on measuring social closeness in consumer photos, where a weighted graph was formed with people co-appearance information [1]. A similar approach was taken in [2], but a graph clustering algorithm was employed to detect social clusters embedded in photo collections. Compared with such prior work, our proposed approach aims at integrating information from
multiple aspects of image analyses and revealing more details of the social relationships among the people involved in an image collection.

Fig. 1. One example of a family relation tree automatically estimated based on face analysis
2 Estimation of People's Relations

2.1 What Can Be Obtained with Image Analysis

With results from previously developed image analysis techniques, including face recognition and clustering, demographic estimation and face similarity measurement, as well as contextual information such as the co-appearance of people in photos, people's relative positions in photos and photo timestamps, the following clues can be obtained to discover people's relations.

•
Major characters in a family photo collection
Based on state-of-the-art face recognition technology, we developed a face clustering algorithm in earlier work which automatically divides a photo collection into a number of clusters, with each cluster containing photos of one particular person [3]. As a result, major clusters (that is, those having a relatively large number of photos), corresponding to frequently appearing people, may be deemed to contain the main characters of the collection. Shown in Fig. 2 are the major clusters obtained from one consumer image collection by applying face clustering. Each cluster is represented by one face image in the cluster, called a face bubble, and the size of the bubble is proportional to the size of the cluster. In such a figure, it is straightforward to identify the people who appear frequently in the photo collection.
Fig. 2. Main characters identified in a photo collection through face clustering
•
Age group and gender of major characters
We applied learning-based algorithms developed in prior work to estimate the gender and age of each main character in a photo collection [4][5]. In age estimation, a person is categorized into one of five groups: baby (0-1), child (1-10), youth (10-18), adult (18-60) and senior (>60). Even though each classifier only has an accuracy rate of around 90% for an individual face image, it can be much more reliable when applied to a cluster with a large enough number of face images. Table 1 shows the number of faces classified into each gender or age group within each face cluster of a photo set, where M and F indicate male and female, respectively, and B, C, A and S indicate baby, child, adult and senior, respectively. Since the photos span a period of several years, some subjects may belong to multiple age groups, such as B/C and A/S. The numbers in red indicate the gender/age group estimated for the cluster by majority vote, and the numbers in orange indicate the second dominant age group, if available. As can be seen, while there are mistakes in estimating gender and age for individual faces in each cluster, cluster-wise, all the classification results match the ground truth correctly.
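A minimal sketch of this cluster-level majority vote is given below. The per-face (gender, age) predictions are assumed to come from the face-level classifiers of [4][5]; the threshold used to keep a second dominant age group (as in the B/C and A/S rows of Table 1) is a hypothetical choice, since the paper does not state one.

```python
from collections import Counter


def cluster_demographics(face_predictions, second_share=0.2):
    """face_predictions: list of (gender, age_group) pairs, one per face in the cluster,
    e.g. [('F', 'B'), ('F', 'C'), ...]. Returns (gender, dominant_age_groups)."""
    genders = Counter(g for g, _ in face_predictions)
    ages = Counter(a for _, a in face_predictions)
    gender = genders.most_common(1)[0][0]                 # majority vote over the cluster
    ranked = ages.most_common()
    dominant = [ranked[0][0]]
    # keep a second age group (e.g. B/C) if it covers a sizeable share of the faces
    if len(ranked) > 1 and ranked[1][1] >= second_share * len(face_predictions):
        dominant.append(ranked[1][0])
    return gender, dominant
```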
•
Who look similar to each other
Since each face cluster contains images of one person, the similarity measure between two clusters indicates how similar the two people look. In our work, each cluster is represented by all of its member faces, with which every face in another cluster is compared, and the distance between the two clusters is determined by the average distance of the K nearest neighbors. It has been found that clusters of adults with blood relations, such as parents/children and siblings, normally have high similarity measures with each other.

Table 1. Estimation of gender and age group of main characters in a photo collection

Face Cluster | Ground Truth | Detected Faces | Gender M | Gender F | Age B | Age C | Age A | Age S
No.1 | F, B/C | 436 | 178 | 258 | 94 | 274 | 61 | 7
No.2 | F, B/C | 266 | 109 | 157 | 88 | 162 | 14 | 2
No.3 | F, A | 247 | 37 | 210 | 2 | 30 | 186 | 29
No.4 | M, A | 215 | 171 | 44 | 4 | 44 | 115 | 52
No.5 | M, C | 86 | 46 | 40 | 12 | 71 | 2 | 1
No.6 | F, A | 72 | 13 | 59 | 1 | 1 | 59 | 11
No.7 | M, S | 65 | 49 | 16 | 1 | 7 | 11 | 46
No.8 | M, S | 62 | 41 | 21 | 2 | 1 | 3 | 56
No.9 | F, S | 57 | 17 | 40 | 0 | 6 | 11 | 40
No.10 | M, A/S | 40 | 22 | 18 | 1 | 4 | 8 | 27
No.11 | F, A | 33 | 2 | 31 | 0 | 0 | 33 | 0
No.12 | F, S | 30 | 4 | 26 | 0 | 4 | 4 | 22
No.13 | M, A | 18 | 16 | 2 | 0 | 4 | 12 | 2
No.14 | M, C | 17 | 13 | 4 | 1 | 13 | 3 | 0

•
Who are close with each other (appear in the same photo, and how often)
Whether and how often two people appear together in photos reveals how close they are to each other. A co-appearance matrix containing the number of co-occurrences between people can be obtained for the major clusters in a photo collection. It provides complementary information to the similarity matrix of clusters introduced above. As shown in Fig. 3, for one person, the people who look most like him (e.g. siblings, parents) and the people who are closest to him (e.g. wife, kids) are listed separately.
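A minimal sketch of building such a co-appearance matrix from per-photo people lists is shown below; the input format is an assumption for illustration.

```python
import itertools

import numpy as np


def co_appearance_matrix(photo_people, people):
    """photo_people: iterable of person-id sets, one per photo.
    Returns C with C[i, j] = number of photos in which persons i and j co-appear."""
    index = {p: i for i, p in enumerate(people)}
    C = np.zeros((len(people), len(people)), dtype=int)
    for group in photo_people:
        present = [p for p in group if p in index]
        for a, b in itertools.combinations(present, 2):
            C[index[a], index[b]] += 1
            C[index[b], index[a]] += 1
    return C
```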
•
Who are in the same social circle (appear in the same event)
Applying clustering to the co-appearance matrix and using photo timestamps, we can find out groups of people who appear together in the same event, and thus figure out social circles in a photo dataset. Particularly, people who appear in a series of group photos taken in one event (e.g. family reunion, company outing, alumni reunion) may be recognized as belonging to one circle of relatives, colleagues or classmates. •
Who are intimate with each other
In a family photo collection, couples, sibling kids and nuclear family members often have exclusive photos of themselves. Besides that, people’s positions in photos also
provide useful cues regarding intimate relations. For example, couples usually stand or sit next to each other in group photos. Parents/grandparents tend to hold the family kids in photos. Touching-faces position (two faces that touch each other) usually only happens between husband and wife, lovers, parents/grandparents and kids, or siblings.
Fig. 3. Face clustering result of CMU Gallagher’s dataset [6]. On the left: face clusters in order of cluster size. On the right: first row – clusters most similar to selected cluster in face; second row – clusters having most co-appearances with selected cluster. Lower-right panel: images in selected cluster.
•
Evolution of relations
People's relations evolve over time. With the help of photo timestamps, for a photo collection spanning a relatively long period of time (e.g. from a few years to dozens of years), the changes in major characters can be detected. Events such as the addition or passing away of family members and occasional visits of extended family members, relatives or friends can be discovered. Also, people who appeared at different stages of one's life may be identified.

2.2 Constructing a Relation Tree

We propose a workflow to identify people's relations and derive a relation tree from photo collections using the clues introduced above. Members of the nuclear family of the photo owner are first recognized, followed by extended family members, and then other relatives and friends.

•
Identifying nuclear family members
From all major clusters, which are defined as clusters containing more than M images (M is empirically determined, e.g. M=4 or 5), the most significant clusters are selected through a GMM procedure in the kid’s clusters (baby, child, junior) and adult clusters
(adult, senior), respectively, in order to find candidates for the kids and parents of the nuclear family [7]. Among these candidates, a link is computed between each pair of people based on the number of co-appearances, with weights added for cases such as exclusive photos, close positions in group photos, face-touching positions and baby-holding positions. An example is shown in Fig. 4, where four of the candidates have strong links between every two of them, while the fifth person has only weak links with some of the others. Thus, the nuclear family members were identified as shown on the right.

Fig. 4. Nuclear family estimation. Left: working sheet (the number under each face is the size of the face cluster of that character); right: identified nuclear family members.

•
Identifying extended family members
Extended family members are defined as the parents and siblings of the husband or wife, as well as the siblings' families. These people are identified through facial similarity, co-appearances and positions in photos. As shown in Fig. 5, major senior characters that have high facial similarity with the husband or wife and strong links with nuclear family members are found. Each senior couple can be further recognized as the parents of the husband or of the wife according to the link between them and their links with the husband or wife. A sibling of the husband/wife usually has strong links not only with the husband/wife but also with the corresponding parents in terms of facial similarity and co-appearances. He/she also often co-occurs with the nuclear family kids in intimate positions. A sibling's family can be identified as those who have strong links with the sibling, as well as co-appearances with nuclear family members and the corresponding parents.

•
Identifying relatives and friends
The other major characters are determined to be relatives or friends of the family. People are identified as relatives if they co-appear not only with the nuclear family, but also with the extended family members in photos of different events or in a relatively large number of photos within one event. The remaining major characters are determined to be friends. In Fig. 6, a group of nine people was found to appear together in dozens of photos (including a number of group photos) taken within two events. The nuclear family and the husband's sister's family are also in the group photos. Therefore, these people are identified as one circle of relatives on the husband's side.
Fig. 5. Extended family estimation. Left upper: identifying parents of husband/wife; left lower: identifying a sibling of the husband and her family. Right: estimated extended family.

Fig. 6. Identifying circles of relatives and friends
3 Estimation of the Evolution in People's Relations

3.1 Dynamic Relation Tree – One Kind of Life Log

With people's relations identified in the above described process, a dynamic relation tree is built which places the people in their corresponding positions in the relation circles, and the view of the tree changes when different time periods are selected.
Fig. 7. Snapshots of a dynamic relation tree which was constructed from a family photo collection. This collection contains photos spanning seven years. One view of the tree was generated for each of the years from 2001 to 2007.
That is, while people and their relations are identified by analyzing all the images in the photo collection, only those people who appear in the specified period of time (e.g. a certain year or a certain event) are shown in one view of the dynamic tree. It thus reveals the evolution of relations over time, and provides a kind of life log with glance views that remind one of the events and people involved in one's past. Fig. 7 contains snapshots of a dynamic relation tree derived from a family photo collection containing photos spanning 7 years. One view is included for each of the years 2001-2007, showing the people appearing that year with face images taken during that time. Many stories can be told from viewing these snapshots. For example, it was at the end of 2001 that the family got their first digital camera, and only a few of them appeared. Grandparents on both sides got together with the nuclear family every year, except that the wife's parents did not make it in 2006. The family went to see the great-grandmother in 2003. In that same year, the husband's sister came to visit them. We can see the daughter's favorite teacher in 2004. Two neighbor families appear in almost every year, and the kids have grown up together. If we zoom into specific events in certain years, more details may be discovered, such as that the wife's parents were together with the family during Thanksgiving in 2004, while the husband's parents spent Christmas with them that year. From these snapshots, we can also see how the looks of the kids changed as they grew from pre-schoolers into early teenagers. Compared with a static tree showing who the major characters are and how they connect with each other [7], a dynamic tree presents additional details about the activities and changes in the people and their relations.
Fig. 8. Determining a kid’s age based on photo timestamps and age group classification
3.2 Discovery of Current Status and Changes in the Family Circle

With the help of photo timestamps and the estimation of people's relations, many events in the family circle can be discovered, such as the birth of a baby or the passing away of a family member. Here, in particular, we propose a method that predicts the current age of an individual based on the images in his/her face cluster over multiple years. One example is shown in Fig. 8. The age group classifier is applied to each face image in a person's cluster. The images are then divided into one-month periods, and within each period the percentage of images classified into each age group is computed, as plotted in the figure. Next, for each age class, a polynomial function is estimated to fit the data. The polynomial curves of the baby class and the child class over 4 years are shown. The crossing points of the polynomial functions of any two neighboring classes are found, and if a crossing point has a value larger than a threshold (e.g. 0.35), it is considered to be a significant transitional point that indicates the time at which the person transits from one age group to the next. In this example, the transitional time from the baby class to the child class is around June 2008; thus the person was estimated to be around one year old at that time, and his/her current age can be estimated accordingly.
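A compact sketch of this transition detection is shown below for the baby/child pair: each class's monthly share is fitted with a polynomial, and the crossing of the two curves is accepted as the transition when its value exceeds the 0.35 threshold mentioned above. The polynomial degree is an assumption; the paper does not state the order used.

```python
import numpy as np


def baby_to_child_transition(months, baby_share, child_share, degree=3, threshold=0.35):
    """months: month indices (e.g. 0..47); *_share: per-month fraction of faces classified
    into each class. Returns the crossing month, or None if no significant crossing."""
    t = np.asarray(months, dtype=float)
    p_baby = np.poly1d(np.polyfit(t, baby_share, degree))
    p_child = np.poly1d(np.polyfit(t, child_share, degree))
    for r in (p_baby - p_child).roots:                    # crossing points of the two curves
        if abs(r.imag) < 1e-9 and t.min() <= r.real <= t.max():
            if p_baby(r.real) >= threshold:               # significant transitional point
                return float(r.real)
    return None
```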
4 Experiments and Discussions

We tested the proposed approach on three typical family photo collections. Each one spans 7-9 years and contains photos of a large number of major people, including nuclear family members (husband, wife and kids), extended families on both sides, and other relatives and friends.

Using the approach described in this paper, the estimated relations match the ground truth quite accurately except for a few extreme cases. For example, the final
relation tree derived from the collection introduced in Section 2.2 is shown in Fig. 9. In this tree, one person identified as a "family relative or friend" is actually the portrait of Chairman Mao in Beijing's TianAnMen Square. This false alarm happened because the family took quite a few photos there at different times. Another issue is that the estimated gender is wrong for three people in this collection, showing that our gender estimation may not be reliable for young kids and seniors.

Fig. 9. Relation tree automatically derived from a family photo collection

Furthermore, the concept of the dynamic tree helps to resolve some confusing cases in a static tree. Still in this collection, one senior man is close with two senior women (one is his late wife, the other is his current wife), and both women have strong links with the nuclear family, which makes for a difficult case in a static tree. However, as the two women's appearances do not overlap in time, it is straightforward to place them in the dynamic tree. There are also cases that need extra rules to guide the system to the right result. For instance, in one collection, an adult male has a number of co-appearances with the wife's sister, including two exclusive ones; however, he also appears in photos with the husband's family, and he and the husband are highest on each other's facial similarity ranking. We had to add the rule that blood-relation estimation has higher priority than significant-other estimation, and thus assign the person as a sibling of the husband. We believe that with experiments on more family photo collections, new cases will keep appearing that require rules to be added or expanded to accommodate all the different variations of relations.
5 Conclusions and Future Work

We presented an approach to automatically figure out the main characters and their relations in a photo set based on face analysis technologies and image contextual information. On top of this, a dynamic relation tree is built in which the appearing people and their looks change over time to reveal the evolution and events in the family's life over multiple years. Experiments have shown that, with existing techniques, quite accurate results can be obtained on typical family photo collections. Only preliminary work has been done with this approach. Going forward, more family image datasets will be collected and tested on to make the rule-based system more robust. More learning elements will be added to the workflow to replace hard-coded rules so that the system can adapt to different relation cases by itself. We will also investigate use cases of the approach and produce more useful relation trees.
References 1. Golder, S.: Measuring Social Networks with Digital Photo-graph Collections. In: 19th ACM Conference on Hypertext and Hypermedia, pp. 43–47 (June 2008) 2. Wu, P., Tretter, D.: Close & Closer: Social Cluster and Closeness from Photo Collections. In: ACM Conf. on Multimedia, Beijing, pp. 709–712 (October 2009) 3. Zhang, T., Xiao, J., Wen, D., Ding, X.: Face Based Image Navigation and Search. In: ACM Conf. on Multimedia, Beijing, pp. 597–600 (October 2009)
4. Gao, W., Ai, H.: Face Gender Classification on Consumer Images in a Multiethnic Environment. In: The 3rd IAPR Conf. on Biometrics, Univ. of Sassari, Italy, June 2-5 (2009) 5. Gao, F., Ai, H.: Face Age Classification on Consumer Images with Gabor Feature and Fuzzy LDA Method. In: The 3rd IAPR International Conference on Biometrics, Univ. of Sassari, Italy, June 2-5 (2009) 6. Gallagher, A.C., Chen, T.: Using Context to Recognize People in Consumer Images. IPSJ Trans. on Computer Vision and Applications 1, 115–126 (2009) 7. Zhang, T., Chao, H., et al.: Consumer Image Retrieval by Estimating Relation Tree from Family Photo Collections. In: ACM Conf. on Image and Video Retrieval, Xi’an, China, pp. 143–150 (July 2010)
Semi-automatic Flickr Group Suggestion

Junjie Cai 1, Zheng-Jun Zha 2, Qi Tian 3, and Zengfu Wang 1

1 University of Science and Technology of China, Hefei, Anhui, 230027, China
2 National University of Singapore, Singapore, 639798
3 University of Texas at San Antonio, USA, TX 78249
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. Flickr groups are self-organized communities for sharing photos and conversations around a common interest, and they have gained massive popularity. Users in Flickr have to manually assign each image to the appropriate group. Manual assignment requires users to be familiar with the existing images in each group; it is intractable and tedious, and therefore prevents users from exploiting the relevant groups. As a solution to this problem, group suggestion, which aims to suggest groups to the user for a specific image, has attracted increasing attention recently. Existing works pose group suggestion as an automatic group prediction problem with the purpose of predicting the groups of each image automatically. Despite dramatic progress in automatic group prediction, the prediction results are still not accurate enough. In this paper, we propose a semi-automatic group suggestion approach with a Human-in-the-Loop. Given a user's image collection, we employ pre-built group classifiers to predict the group of each image. These predictions are used as the initial group suggestions. We then select a small number of representative images from the user's collection and ask the user to assign their groups. Once the user's feedback on the representative images is obtained, we infer the groups of the remaining images through group propagation over multiple sparse graphs among the images. We conduct experiments on 15 Flickr groups with 127,500 images. The experimental results demonstrate that the proposed framework is able to provide accurate group suggestions with quite a small amount of user effort.

Keywords: Flickr Group, Semi-automatic, Group Suggestion.
1 Introduction
In the Web 2.0 era, social networking is a popular way for people to connect, express themselves, and share interests. Popular social networking websites include MySpace (http://www.myspace.com/), Facebook (http://www.facebook.com/), LinkedIn (http://www.linkedin.com/), and Orkut (http://www.orkut.com/) for finding and organizing contacts, LiveJournal (http://www.livejournal.com/) and BlogSpot (http://googleblog.blogspot.com/) for sharing blogs, Flickr (http://www.flickr.com/) and YouTube (http://www.youtube.com/) for sharing images or videos, and so on. One important social connection feature on these websites is the Group, which refers to a self-organized community with a declared, common interest. In particular, groups in Flickr are communities where users gather to share their common interests in certain types of events or topics captured by photos. For example, the Flickr group "cars" is illustrated in Fig. 1. Most of the activities within a Flickr group start from sharing images: a user contributes his/her own images to the related group, comments on other members' images, and discusses related photographic techniques [1][3].

Fig. 1. Home page of the group "cars"

Flickr now obligates users to manually assign each image to the appropriate group. Manual assignment requires users to be familiar with the existing images in each group and to match the subject of each image in the user's collection with the topics of the various groups. This work is intractable and tedious, and thus prohibits users from exploiting the relevant groups [1][2]. To tackle this problem, group suggestion has been proposed recently, which aims to suggest appropriate groups to the user for a specific image. Existing works pose group suggestion as an automatic group prediction problem with the purpose of predicting the groups of each image automatically.
Fig. 2. Flowchart of our approach. The green rectangles indicate final group suggestions corrected by our method.

For example, Perez and
Negoescu analyzed the relationship between image tags and the tags in groups to automatically suggest groups to users [6]. In addition to the tags used in [6], visual content has also been utilized to predict the groups of images. Duan et al. presented a framework that integrates a PLSA-based image annotation model with a style model to provide users with groups for their images [4]. Chen et al. developed a system named SheepDog to automatically add images to appropriate groups by matching the semantics of images with the groups, where image semantics are predicted through image classification techniques [8]. Recently, as reported in [1][3], Yu et al. converted group suggestion into a group classification problem. They integrated both visual content and textual annotations (i.e. tags) to predict the events or topics of the images. Specifically, they first trained an SVM classifier to predict the group of each image and then refined the predictions through sparse graph-based group propagation over images within the same user collection. Although encouraging advances have been achieved in automatic group suggestion, the suggestion results are still not accurate enough.

Motivated by the above observations, in this paper we propose a semi-automatic group suggestion approach with a Human-in-the-Loop. Fig. 2 illustrates the flowchart of our approach. Given a user's image collection, we employ the pre-built group classifiers to predict the group of each image. These group predictions are used as the initial suggestions for the user. As aforementioned, the automatic group prediction is not accurate enough. We thus introduce the user into the loop [15] and conduct the following two steps repeatedly: (a) sample images are selected from the user's collection and the corresponding suggestions from the group classifiers are presented to the user, who amends the wrong ones, and (b) the groups of the remaining images are inferred based on the user's feedback and the group predictions of the last round. From a technical perspective, the semi-automatic group suggestion framework contains three components: (a) group classifier building, (b) representative image selection, and (c) group inference. Specifically, we employ the Multiple Kernel Learning (MKL) technique [9] to build an SVM classifier for each group with a one-vs-all strategy. To select representative images, we incorporate sample uncertainty [12] into the Affinity Propagation algorithm and select the images with high uncertainty and representativeness. After obtaining the user's feedback
on the selected images, we infer the groups of the remaining images through group propagation over multiple sparse graphs of the images. We conduct experiments on 15 Flickr groups with 127,500 images. The experimental results demonstrate that the proposed framework is able to provide accurate group suggestions with quite a small amount of user effort. The rest of this paper is organized as follows. The proposed semi-automatic group suggestion framework is elaborated in Section 2. The experimental results are reported in Section 3, followed by the conclusions in Section 4.
2 Approach

2.1 Image Features
Three popular visual descriptors, i.e., GIST, CEDD and Color Histogram, are extracted to represent image content [1].

GIST. The GIST descriptor [5] has recently received increasing attention in the context of scene recognition and classification tasks. To compute the color GIST description, the image is segmented by a 4-by-4 grid for which orientation histograms are extracted. Our implementation takes as input a square image of fixed size and produces a 960-dimensional feature vector.

CEDD. The Color and Edge Directivity Descriptor (CEDD) [10] is a new low-level feature which incorporates color and texture information into a histogram. The CEDD size is limited to 54 bytes per image, rendering this descriptor suitable for use in large image databases. A 144-dimensional CEDD feature vector is extracted for each image.

Color Histogram. The color histogram is a widely used visual feature. We extract a 512-bin RGB color histogram by dividing the RGB color space into 8*8*8 bins. Finally, a 512-dimensional color feature vector is extracted for each image.
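As an illustration of the simplest of the three descriptors, the sketch below computes the 8*8*8 RGB histogram with OpenCV. The L1 normalization is an assumption; the paper does not state how (or whether) the histogram is normalized.

```python
import cv2


def color_histogram(image_path):
    """512-dimensional color histogram with 8 x 8 x 8 bins over the RGB cube."""
    img = cv2.imread(image_path)                      # OpenCV loads images as BGR
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256]).flatten()
    total = hist.sum()
    return hist / total if total > 0 else hist        # L1 normalization (assumed)
```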
2.2 Group Classifier Building
The curse of dimensionality has always been a critical problem for many machine learning tasks. Directly concatenating different types of visual features into a long vector may lead to poor statistical modeling and high computational cost. In order to bypass this problem, we employ the Multiple Kernel Learning (MKL) method [9] to build an SVM classifier for each group. Denoting the kernel similarity between samples x and y over the j-th feature by K_j(x, y), we combine the multiple kernels as a convex combination:

\[
K(x, y) = \sum_{j=1}^{K} \beta_j K_j(x, y), \quad \text{with } \beta_j \geq 0, \ \sum_{j=1}^{K} \beta_j = 1.
\tag{1}
\]

The kernel combination weights β_j and the parameters of the SVM can be jointly learned by solving a convex but non-smooth objective function. We follow the implementation at http://asi.insa-rouen.fr/enseignants/arakotom/.
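Given per-feature Gram matrices and learned weights, the combined kernel of Eq. (1) is a simple convex combination, as sketched below; learning the weights β themselves is left to the SimpleMKL implementation referenced above and is not reproduced here.

```python
import numpy as np


def combined_kernel(gram_matrices, betas):
    """gram_matrices: list of K precomputed kernel matrices (one per feature type);
    betas: convex-combination weights (beta_j >= 0, summing to 1), as in Eq. (1)."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and abs(betas.sum() - 1.0) < 1e-8
    return sum(b * K for b, K in zip(betas, gram_matrices))


# The combined Gram matrix can then be plugged into an SVM with a precomputed kernel,
# e.g. sklearn.svm.SVC(kernel='precomputed').fit(combined_kernel(train_grams, betas), y_train)
```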
2.3 Representative Image Selection
We use the Affinity Propagation (AP) method to identify a small number of images that accurately represent the user's image collection [11][14]. Different from typical AP, sample uncertainty is incorporated into AP as the prior of the images. This modified AP is thus able to select images with high representativeness as well as high uncertainty. Entropy has been widely used to measure sample uncertainty [12]. The uncertainty in a binary classification problem is defined as follows:

\[
H(x) = -\sum_{y \in \{0,1\}} p(y \mid x) \log p(y \mid x),
\tag{2}
\]

where p(y|x) represents the distribution of the estimated class membership. We simply extend Eq. 2 to compute sample uncertainty in multiple group prediction as

\[
H(x) = -\sum_{i=1}^{k} \sum_{y_i \in \{0,1\}} p(y_i \mid x) \log p(y_i \mid x).
\tag{3}
\]

The resulting H(x) is used as the preference of sample x in the AP algorithm. AP aims to cluster the image set I = {I_i}_{i=1}^N into M (M < N) clusters based on the sample similarity s(I_i, I_j). Each cluster is represented by its most representative image, called an "exemplar". In AP, all the images are considered as potential exemplars. Each of them is regarded as a node in a network. Real-valued messages are recursively transmitted via the edges of the network until a good set of exemplars and their corresponding clusters emerge. Let I_e = {I_{e_i}}_{i=1}^M denote the final exemplars and e(I) represent the exemplar of image I. In brief, the AP algorithm propagates two kinds of information between images: 1) the "responsibility" r(i, j) transmitted from image i to image j, which measures how well-suited I_j is to serve as the exemplar for I_i while simultaneously considering other potential exemplars for I_i, and 2) the "availability" a(i, j) sent from candidate exemplar I_j to I_i, which reflects how appropriate it is for I_i to choose I_j as its exemplar while simultaneously considering other images that may choose I_j as their exemplar. This information is iteratively updated by

\[
\begin{aligned}
r(i, j) &\leftarrow s(I_i, I_j) - \max_{j' \neq j} \{a(i, j') + s(I_i, I_{j'})\},\\
a(i, j) &\leftarrow \min\Big\{0,\; r(j, j) + \sum_{i' \notin \{i, j\}} \max\{0, r(i', j)\}\Big\}.
\end{aligned}
\tag{4}
\]

The "self-availability" a(j, j) is updated by a(j, j) := \sum_{i' \neq j} \max\{0, r(i', j)\}. The above information is iteratively propagated until convergence. Then, the exemplar e(I_i) of image I_i is chosen as e(I_i) = I_j by solving arg max_j {r(i, j) + a(i, j)}.
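The sketch below shows one way to realize this modified AP using scikit-learn's AffinityPropagation with a precomputed similarity matrix, where the per-sample preference is set to the uncertainty score of Eq. (3); the authors' own implementation may of course differ from this.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def select_representatives(similarity, class_probs):
    """similarity: N x N matrix s(I_i, I_j); class_probs: N x k matrix of per-group
    probabilities from the group classifiers. Returns indices of the exemplar images."""
    p = np.clip(class_probs, 1e-12, 1 - 1e-12)
    # Eq. (3): sum of per-group binary entropies, used as the AP preference
    H = -np.sum(p * np.log(p) + (1 - p) * np.log(1 - p), axis=1)
    ap = AffinityPropagation(affinity='precomputed', preference=H, random_state=0)
    ap.fit(similarity)
    return ap.cluster_centers_indices_
```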
2.4 Group Inference
After obtaining the user's feedback on the selected images, our task is to infer the groups of the remaining images. It is reasonable to assume that many images in a user's image collection are similar, and that similar images should be assigned to the same group. Therefore, the groups of the remaining images can be inferred by propagating the user's feedback to these images. Let I = {I_1, ..., I_l, I_{l+1}, ..., I_N} denote the images in a certain user's collection, containing l labeled samples and N − l unlabeled samples. X_g = {x_{g,1}, ..., x_{g,l}, x_{g,l+1}, ..., x_{g,N}} denotes the feature vectors on the g-th modality, where x_{g,i} ∈ R^{d_g} represents the i-th sample. Here, we infer the groups of {I_{l+1}, ..., I_N} by resorting to group propagation over multiple sparse graphs among the images I. Let G_g = {I, W_g} denote the sparse graph on the g-th modality. W_g is the affinity matrix, in which W_{g,ij} indicates the affinity between samples i and j. W_g can be obtained by solving the following optimization problem [13]:

\[
W_g = \arg\min \|W_g\|_1, \quad \text{s.t. } x_{g,i} = A_{g,i} W_g,
\tag{5}
\]

where the matrix A_{g,i} = [x_{g,1}, ..., x_{g,i−1}, x_{g,i+1}, ..., x_{g,N}]. Afterwards, we conduct group propagation over the K sparse graphs {G_g}_{g=1}^K. Wang et al. [7] have proposed an optimized multi-graph-based label propagation algorithm, which is able to integrate multiple complementary graphs into a regularization framework. The typical graph-based propagation framework estimates the class labels (i.e. groups in our case) of images over the graphs such that they satisfy two properties: (a) they should be close to the given labels on the labeled samples, and (b) they should be smooth on the whole graphs. Here, we extend Wang et al.'s method to further require that the estimated groups be consistent with the initial predictions from our group classifiers. The group inference problem can then be formulated as the following optimization problem:

\[
f^{*} = \arg\min_{f} \Bigg\{ \sum_{g=1}^{K} \alpha_g \Bigg( \sum_{i,j} W_{g,ij} \bigg| \frac{f_i}{\sqrt{D_{g,ii}}} - \frac{f_j}{\sqrt{D_{g,jj}}} \bigg|^{2} + \mu \sum_{i} |f_i - y_i|^{2} + \nu \sum_{i} |f_i - f_i^{0}|^{2} \Bigg) \Bigg\},
\tag{6}
\]

where D_{g,ii} = \sum_j W_{g,ij}, f_i is the to-be-learned confidence score of sample i with respect to a certain group, f_i^0 is the initial prediction from the corresponding group classifier, and y_i is the user's feedback on sample i. α = {α_1, α_2, ..., α_K} are the weights, which satisfy α_g > 0 and \sum_{g=1}^{K} α_g = 1. The regularization framework consists of three components: the first term is a regularizer that addresses label smoothness over the graphs (the second property); the second term is a loss function that penalizes deviation from the user's feedback (the first property); and the last term is a regularizer that prefers consistency between the estimated group assignment and the initial prediction. If the weights α are fixed, we can derive the optimal f as

\[
f = \Big(I + \frac{1}{\mu}\sum_{g=1}^{K}\alpha_g L_g + \frac{\nu}{\mu}I\Big)^{-1} y + \Big(I + \frac{1}{\nu}\sum_{g=1}^{K}\alpha_g L_g + \frac{\mu}{\nu}I\Big)^{-1} f_0,
\tag{7}
\]

where L_g = D_g^{-1/2}(D_g − W_g)D_g^{-1/2} is the normalized graph Laplacian. However, the weights α, which reflect the utilities of the different graphs, are crucial to the propagation performance. Thus, α should also be optimized automatically to reflect the utility of the multiple graphs. To achieve this, we make a relaxation by changing α_g to α_g^τ, τ > 1 [7]. Note that \sum_g α_g^τ achieves its minimum when α_g = 1/K under the constraint \sum_g α_g = 1. We then solve the joint optimization of f and α by using the alternating optimization technique: we iteratively optimize f with α fixed and then optimize α with f fixed until convergence [7]. The solutions for α and f can be obtained as

\[
\alpha_g = \frac{\left( \dfrac{1}{f^{T} L_g f + \mu |f - y|^{2} + \nu |f - f_0|^{2}} \right)^{\frac{1}{\tau-1}}}{\sum_{g'=1}^{K} \left( \dfrac{1}{f^{T} L_{g'} f + \mu |f - y|^{2} + \nu |f - f_0|^{2}} \right)^{\frac{1}{\tau-1}}},
\tag{8}
\]

\[
f = \Bigg(I + \frac{\sum_{g=1}^{K}\alpha_g^{\tau} L_g}{\mu \sum_{g=1}^{K}\alpha_g^{\tau}} + \frac{\nu}{\mu}I\Bigg)^{-1} y + \Bigg(I + \frac{\sum_{g=1}^{K}\alpha_g^{\tau} L_g}{\nu \sum_{g=1}^{K}\alpha_g^{\tau}} + \frac{\mu}{\nu}I\Bigg)^{-1} f_0.
\tag{9}
\]
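The alternating updates of Eqs. (8) and (9) reduce to a small amount of linear algebra; a hedged numpy sketch for a single group is given below. It assumes the normalized Laplacians have already been built from the sparse affinity matrices of Eq. (5), solves the f-step as the linear system equivalent to Eq. (9), and takes y to hold the user's feedback with zeros on unlabeled images, matching the way the quadratic fitting term is written in Eq. (6).

```python
import numpy as np


def multi_graph_group_inference(laplacians, y, f0, mu=50.0, nu=5.0, tau=2.0, n_iter=20):
    """Alternating optimization of f (Eq. 9) and the graph weights alpha (Eq. 8)."""
    K, N = len(laplacians), len(y)
    alpha = np.full(K, 1.0 / K)
    I = np.eye(N)
    f = f0.copy()
    for _ in range(n_iter):
        # f-step: closed-form minimizer of the alpha^tau-weighted objective (Eq. 9)
        w = alpha ** tau
        L_bar = sum(wg * Lg for wg, Lg in zip(w, laplacians)) / w.sum()
        f = np.linalg.solve(L_bar + (mu + nu) * I, mu * y + nu * f0)
        # alpha-step (Eq. 8): graphs with smaller cost receive larger weights
        costs = np.array([f @ Lg @ f
                          + mu * np.sum((f - y) ** 2)
                          + nu * np.sum((f - f0) ** 2) for Lg in laplacians]) + 1e-12
        alpha = (1.0 / costs) ** (1.0 / (tau - 1.0))
        alpha /= alpha.sum()
    return f, alpha
```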
3 Experiments

3.1 Data and Methodologies
We collect 127,500 user images from 15 groups automatically via the Flickr Group API (http://www.flickr.com/groups/). All of the groups are related to popular visual concepts: "baby", "bird", "animal", "architecture", "car", "flower", "green", "music", "night", "tree", "red", "wedding", "sky", "snow" and "sunset". Each group contributes 8,500 images to our dataset on average. For each group, we assign a user and his/her images to the testing subset if the user contributes more than 100 images to that group. As a result, the testing subset contains 203 users with 24,367 images. The remaining 103,133 images are used as training samples. Some sample images from each group are illustrated in Fig. 3.
Fig. 3. Sample images from our dataset
In the experiment, we utilize a simple but effective linear kernel K(x_i, x_j) = x_i^T x_j for the SVM classifiers. We employ the modified AP algorithm to select representative images from each user's collection. The visual similarity between two images is calculated as \sum_{g=1}^{K} \exp(-\|x_{g,i} - x_{g,j}\|^2), where x_{g,i} is the feature vector of image I_i on the g-th modality. For each representative image selected by AP, there is a score reflecting the significance of that image. In each round, we choose the five representative images with the highest scores and query the user for their groups.

Table 1. The comparison of average accuracy on our dataset

Visual Feature Descriptor | Average Accuracy
Color Histogram | 39.0%
GIST | 43.7%
CEDD | 51.2%
MKL | 54.0%
3.2 Evaluation of Group Classifiers
We first evaluate our group classifiers and compare them against SVMs trained on each of the three features described in Section 2.1. Table 1 shows the comparison of average prediction accuracy, while Fig. 4 illustrates the comparison of accuracy over each class. We can see that our MKL classifiers achieve the best performance among the four methods, with around 15.0%, 10.3% and 2.8% improvements compared to SVMs with the Color Histogram, GIST and CEDD visual feature descriptors, respectively.
Fig. 4. Classification performance comparison on our dataset
3.3 Evaluation of Group Inference
We evaluate the effectiveness of group inference, which is achieved through multi-sparse-graph group propagation based on the initial predictions from the group classifiers as well as the user's feedback on the selected images. The weights in Eq. 6 are initialized as α_1 = α_2 = α_3 = 1/3. The parameters μ and ν in Eq. 7 are set empirically to 50 and 5, respectively.
Fig. 5. Group suggestion performance (prediction accuracy) comparison over ten iterations

Fig. 6. Prediction accuracy comparison on our dataset: initial label propagation vs. multi-graph label propagation (10 iterations)
Fig. 5 shows the group suggestion performance over ten iterations. We can see that group inference plays an active role in the loop of group suggestion and effectively improves the group prediction accuracy. Compared with the predictions from the group classifiers, group inference improves the prediction accuracy by 8% and 24% in the first and last iteration, respectively. In each iteration, it takes only 0.5 seconds to infer the groups of images and select new representative images for the next iteration. Fig. 6 provides the detailed comparison of prediction accuracy over each group. It shows that the average prediction accuracy is significantly improved from 0.62 in the first iteration to 0.78 in the last iteration. Fig. 7 illustrates the suggested groups for some sample images within three users' collections. The first image in each collection is the selected representative image, and the arrow indicates the change of group suggestions: the group on the left side is the initial suggestion from the group classifiers, while the group on the right side is either the one input by the user for a selected image or the final group suggestion generated by group inference. From the above experimental results, we can see that our semi-automatic group suggestion approach outperforms the automatic approach while requiring only a small amount of user effort.
Fig. 7. Final group suggestions for sample images within three users' collections. Initial prediction results lie on the left of the arrow, while the final suggestion results lie on the right.
4 Conclusion

In this paper, we have proposed a semi-automatic group suggestion framework with a human in the loop. The framework contains three components: group classifier building, representative image selection, and group inference. Specifically, we employ the pre-built group classifiers to predict the group of each image in each user's collection. After obtaining the user's feedback on some selected images, we infer the groups of the remaining images through group propagation over multiple sparse graphs of the images. The extensive experiments demonstrate that our proposed framework is able to provide accurate group suggestions with minimal user effort.
References

1. Yu, J., Jin, X., Han, J., Luo, J.: Mining Personal Image Collection for Social Group Suggestion. In: IEEE International Conference on Data Mining Workshops, Washington, DC, USA, pp. 202–207 (2009)
2. Yu, J., Joshi, D., Luo, J.: Connecting People in Photo-Sharing Sites by Photo Content and User Annotations. In: Proceedings of the International Conference on Multimedia and Expo, New York, USA, pp. 1464–1467 (2009)
3. Yu, J., Jin, X., Han, J., Luo, J.: Social Group Suggestion from User Image Collections. In: Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, pp. 1215–1216 (2010)
4. Duan, M., Ulges, A., Breuel, T.M., Wu, X.: Style Modeling for Tagging Personal Photo Collections. In: Proceedings of the International Conference on Image and Video Retrieval, Santorini, Greece, pp. 1–8 (2009)
5. Douze, M., Jegou, H., Sandhawalia, H., Amsaleg, L., Schmid, C.: Evaluation of GIST Descriptors for Web-Scale Image Search. In: Proceedings of the International Conference on Image and Video Retrieval, Santorini, Greece, pp. 1–8 (2009)
6. Negoescu, R.A., Gatica-Perez, D.: Analyzing Flickr Groups. In: Proceedings of the International Conference on Image and Video Retrieval, Niagara Falls, Canada, pp. 417–426 (2008)
7. Wang, M., Hua, X.-S., Yuan, X., Song, Y., Dai, L.-R.: Optimizing Multi-Graph Learning: Towards a Unified Video Annotation Scheme. In: Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, pp. 862–871 (2007)
8. Chen, H.-M., Chang, M.-H., Chang, P.-C., Tien, M.-C., Hsu, W., Wu, J.-L.: SheepDog: Group and Tag Recommendation for Flickr Photos by Automatic Search-Based Learning. In: Proceedings of the 16th ACM International Conference on Multimedia, Canada, pp. 737–740 (2008)
9. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of Machine Learning Research 9, 2491–2521 (2008)
10. Chatzichristofis, S., Boutalis, Y.: CEDD: Color and Edge Directivity Descriptor. A Compact Descriptor for Image Indexing and Retrieval. In: Computer Vision Systems, pp. 312–322 (2008)
11. Frey, B., Dueck, D.: Clustering by Passing Messages Between Data Points. Science 315(5814), 972–976 (2007)
12. Lewis, D.D., Gale, W.A.: A Sequential Algorithm for Training Text Classifiers. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12 (1994)
13. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 210–227 (2009)
14. Zha, Z.-J., Yang, L., Mei, T., Wang, M., Wang, Z.: Visual Query Suggestion. In: Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, pp. 15–24 (2009)
15. Liu, D., Wang, M., Hua, X.-S., Zhang, H.-J.: Smart Batch Tagging of Photo Albums. In: Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, pp. 809–812 (2009)
A Visualized Communication System Using Cross-Media Semantic Association

Xinming Zhang1,2, Yang Liu1,2, Chao Liang1,2, and Changsheng Xu1,2

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 China-Singapore Institute of Digital Media, Singapore
{xmzhang,liuyang,cliang,csxu}@nlpr.ia.ac.cn
Abstract. Can you imagine that two people who have different native languages and cannot understand each other's language are able to communicate without a professional interpreter? In this paper, a visualized communication system is designed to facilitate such people chatting with each other via visual information. Differing from online instant messaging tools such as MSN, Google Talk and ICQ, which are mostly based on textual information, the visualized communication system resorts to vivid images relevant to the conversation context, in addition to text, to overcome the language barrier. A multi-phase visual concept detection strategy is applied to associate the text with corresponding web images. A re-ranking algorithm then attempts to return the most related and highest-quality images at the top positions. In addition, sentiment analysis is performed to help people understand each other's emotion and further reduce the language barrier. A number of daily conversation scenes are implemented in the experiments and the performance is evaluated by a user study. The experimental results show that the visualized communication system is able to effectively help people with a language barrier to better understand each other.

Keywords: Visualized Communication, Sentiment Analysis, Semantic Concept Detection.
1 Introduction

The growing trend toward globalization has brought many opportunities and favorable conditions for transnational trade and traveling abroad, and it is inevitable that people have to communicate with foreigners frequently. Besides face-to-face chatting, many online instant messaging tools, such as MSN, Google Talk and ICQ, are designed to help people communicate with each other wherever and whenever they are. However, it is difficult for such tools to enable people who have different native languages, and do not understand each other's language, to communicate smoothly. The existing instant messaging systems mentioned above purely transmit textual information, but the language barrier makes them of little use for foreigners' conversation. Machine translation techniques can help such users understand each other, but sometimes the translated result may mislead them because of its inaccuracy. Therefore, it is necessary to provide a solution for overcoming the language barrier.
In daily communication with foreigners, visual information such as hand signs and body gestures is often more understandable when people do not share a language, since such visual information is a kind of common experience for people from all over the world. For example, if you perform the body gesture of “running”, anybody knows that you want to express the meaning of “run”. This suggests that it is possible to assist people in communication by resorting to visual information such as images and graphic signs. To the best of our knowledge, few studies have investigated assisting foreigners' communication. Some research [1][2][3] aims at detecting the chatting topic from the textual information, whose performance mainly depends on the quality of the chat text, rather than making the conversation understandable and easy. In this paper, we propose a visualized communication system to provide a solution for foreigners' conversation with multimedia information. Different from traditional chatting systems, our system focuses on intuitive visual information rather than textual information. A multi-phase semantic concept detection strategy is proposed to associate the textual information with images collected from the web. Then, image quality assessment is conducted with a re-ranking strategy to prune unsatisfactory images and ensure that good-quality images appear at the top positions of the retrieval list. In addition, sentiment analysis is applied to express the chatters' emotion with corresponding graphic signs to make the conversation much more understandable. Finally, the representative images and sentiment graphic signs are displayed in a proper organization to assist the chatters in understanding the conversation. Our contributions can be summarized as follows:
1. Visualization is a novel technique to assist human communication, which can easily overcome the language barrier and help people understand each other.
2. The multi-phase semantic concept detectors not only pay attention to the objects in a sentence, but also focus on the activities referred to in the conversation, which makes the visualized system more precise and vivid.
3. Embedding sentiment analysis and image re-ranking can significantly reduce the ambiguity in conversation and express the chatters' intention.
The remaining sections are organized as follows: Section 2 briefly reviews the related work. Section 3 introduces the framework of our system. The technical details of each component are described in Section 4. The experimental results are reported in Section 5. We conclude the paper with future work in Section 6.
2 Related Work

The rapid growth of the Internet has led to more and more chatting tools such as MSN and ICQ. These tools usually cannot help people of different countries understand each other due to the language barrier. To tackle this problem, various cartoon expressions or animations are applied to vividly convey the speaker's intention or emotion. However, the current image-assistant function is rather limited as well as inconvenient: the available image resources in current chatting tools are quite limited and the users have to manually specify the proper image. Motivated by web
image analysis and annotation, we believe a visualized communication system can greatly promote the mutual understanding between people. To the best of our knowledge, little work has been directly conducted on such an image-facilitated mediation system. In the following, we briefly review related work on topic detection in users' chat messages, semantic concept extraction, and re-ranking with sentiment analysis.

Existing research on instant messaging mainly concentrates on processing the textual information. The approach in [3] used an extended vector space model to cluster instant messages based on the users' chat content for topic detection. The method in [1] aimed to extract certain topics from users' talk, and both of these approaches were implemented in an offline manner. In contrast, Dong et al. [2] analyzed the structure of the chat with online techniques for topic detection. In a conversation among foreigners where the talkers do not understand each other's language, these schemes can do little to facilitate mutual understanding. In contrast to the limited expressive ability of textual language, images are extremely useful in such cross-language mediation: as long as semantically related images are presented, people with different native languages can accurately catch each other's meaning. To implement such visualized communication, an ideal system should recommend images closely related to the users' conversation. To extract semantics from images and videos, a series of well-known concept detectors have been proposed, such as Columbia374 [8], VIREO-374 [9] and MediaMill-101 [7]. In [7], the challenge problem proposed for automatic video detection is to pay attention to the intermediate analysis steps that play an important role in video indexing, and a method was proposed to generate the corresponding concept detectors, namely MediaMill-101. The main problem with these handy detectors is that their concepts all belong to the noun domain. The noun detectors are indeed helpful for mapping low-level features to high-level semantics, but they are limited to important noun patterns and ignore the fact that many necessary patterns in people's chatting involve concepts accompanying an action. It is more meaningful to construct concept detectors for verb-noun phrases to detect the users' real meaning during the dialogue. Meanwhile, the work in [4] concentrated on keypoint-based semantic concept detection. It covered five representation choices that can influence the performance of keypoint-based semantic concept detection: the size of the visual word vocabulary, the weighting scheme, stop word removal, feature selection and spatial information. Its experiments showed that with an appropriate configuration of the above items, good performance in keypoint-based semantic concept detection can be obtained. The other issue is the effect on users of visualizing the images filtered by the semantic concept detectors; re-ranking and sentiment analysis can help to solve this problem. Current re-ranking techniques can be partitioned into two categories. One primarily depends on visual cues extracted from the images [19]; these visual cues in the re-ranking step are usually different from the visual information used in the original search (e.g., [11]). Another re-ranking technique is the co-reranking method that jointly utilizes textual and visual information (e.g., [12]), but these works cannot guarantee the quality of the images. Ke et al. [10] proposed a novel algorithm to assess the quality of a photo; their work can judge whether a photo is professional or amateur. Finally, sentiment analysis can be considered as an additional function for enriching the users' chatting experience.
3 Framework

The framework of our proposed visualized communication system is shown in Fig. 1. The system consists of two parts: the offline training phase (highlighted by the yellow background in the figure) and the online communication phase (highlighted by the dark green background).
Fig. 1. The framework of our visualized communication system
The purpose of the offline phase is to associate proper images with specific text information by resorting to semantic concept detectors. We collect image sets for the predefined concepts which usually appear in human daily conversation. Then, the semantic concept detectors are trained on low-level visual features with an SVM classifier. Four modules are contained in the online phase: natural language processing (NLP), semantic concept detection, re-ranking and sentiment analysis. First, the NLP module extracts the object and activity keywords referred to in the conversation. At the same time, the translated sentence pair is directly transmitted to the users' interface. Then, we retrieve relevant images by querying the keywords on Google Image Search, and concept detection is applied to filter out noisy images. An image-quality based re-ranking module is adopted to present the conversation meaning with high-quality images. The sentiment analysis module helps people clearly understand each other's conversation mood. Finally, the recommended images and sentiment graphic signs are organized and displayed to assist the conversation. The details of each part are described in Section 4.
4 Image Recommendation

In this section, we introduce the technical details of the four modules in the online phase of Fig. 1.

4.1 Google Translation and NLP

It is intuitive to use translation software or an online translation service (e.g., Google Translation) to help people speaking different languages communicate with each other. However, the state-of-the-art machine translation techniques are far from real applications in general domains. One of the problems of machine translation is translation ambiguity, which may mismatch the user's original intent. For example, a Chinese student who cannot speak English and is going to visit his old friend living at Nanyang Technological University does not know the meaning of the phrase in a road sign saying “Go straightly by 500m to NTU”, so he seeks help on the Google translation website. However, the translation result mismatches the user's original intent, rendering “NTU” as “National Taiwan University” in Chinese instead of “Nanyang Technological University”. After reading the translation, the Chinese student may be confused and may wonder, “Is this direction right?” From this example, we can see that the translated result has its ambiguity. To solve this problem, it is necessary to add some images associated with the specific word or sentence to be translated. If the word “NTU” is sent to the Google image search engine, images describing both Nanyang Technological University and National Taiwan University are returned; it is then easy to understand that NTU in the above example means “Nanyang Technological University”. To obtain the keywords to which images are assigned, we utilize NLP tools [13] to analyze the structure of a specific sentence. In our work, the NLP tools first remove stop words, and a Part-of-Speech (POS) tool returns the part of speech of each word in the sentence. Then we extract the noun patterns and the combinations of a transitive verb and nouns; such a combination represents the activity of the object depicted by the sentence.

4.2 Multi-phase Concept Detector

For the visualized communication system, the crucial issue is how to visualize the conversation content; in other words, the system should automatically associate visual information such as images with textual information at the semantic level. Therefore, high-level concept detection is applied to tackle this problem. Semantic concept detection is a hot research topic, as it provides semantic filters to help the analysis and search of multimedia data [4]. It is essentially a classification task that determines whether an image is relevant to a given semantic concept, which can also be defined by text keywords. In our system, semantic concept detection is applied to associate the conversation content with images. However, traditional concept detectors usually focus on noun phrases, which mostly denote the objects in an image, whereas many activities appear in human daily conversation. Therefore, it is necessary to extend the current concept detection scope from pure nouns to activity concept detectors that clearly express the actions in the conversation. Figure 2 gives an example of the difference between a noun concept and an activity concept.
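A minimal sketch of the keyword-extraction step of Section 4.1 is given below. It assumes NLTK's off-the-shelf tokenizer and POS tagger rather than the Stanford tagger [13] used by the authors, and its small verb-plus-noun window is an illustrative simplification of the transitive-verb-with-noun extraction.

```python
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def extract_keywords(sentence):
    """Return noun keywords and verb+noun 'activity' phrases from one chat sentence."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    nouns = [w for w, t in tagged if t.startswith("NN")]
    activities = []
    for i, (w, t) in enumerate(tagged):
        if t.startswith("VB"):                      # a verb ...
            for w2, t2 in tagged[i + 1:i + 4]:      # ... followed closely by a noun
                if t2.startswith("NN"):
                    activities.append(f"{w} {w2}")
                    break
    return nouns, activities

print(extract_keywords("Do you know how to drive a car?"))
# e.g. (['car'], ['drive car']); the exact output depends on the tagger
```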
Fig. 2. The difference between the noun and activity concepts
From Figure 2, we can see that the images describing the noun concept and the activity concept are apparently different. Suppose that a man likes cars but cannot drive. He says to another foreign person via a chatting system, “Do you know how to drive a car?” If the chatting system is mainly based on noun concepts, it will detect the concept “car” and return images like those in the first row of Figure 2. After watching the images, the foreigner still cannot get the meaning of driving a car. However, if we train not only the noun concept detectors but also activity concept detectors representing human activities, the images in the second row of Figure 2 can be presented to the user. The main difference between a noun concept detector and an activity concept detector is that the activity concept detector can express the human action. To implement the detection of actions in the users' conversation, two tasks are involved. First, we pre-define the most usual transitive-verb-with-noun phrases in chat and collect the corresponding web images to train the concept detectors representing human activities, following the approach in [4]. Once these concept detectors are prepared, they can judge whether a new image contains the concept they require.

4.3 Re-ranking

Due to the varying quality of the images returned by the search engine, traditional re-ranking algorithms may not be a good choice for our system. Although these re-ranking methods can correlate the images to the query, the quality of the re-ranked list may be degraded if the top images are blurry. We follow the three criteria proposed in [10], which aim to assess the quality of photos taken by different people, to conduct the re-ranking of searched images. This is different from traditional re-ranking techniques, which rank the image list again based on another visual cue. The image set filtered by the semantic concept detectors of Section 4.2 only guarantees that the images are related to the noun or activity concept; therefore, the traditional re-ranking approaches do not seem useful for ensuring that high-quality images are ranked at the top positions. Take the noun concept “volleyball” as an example. If one of the speakers wants to learn what a volleyball looks like, the “volleyball” concept detector can return images after filtering out the images that do not contain a volleyball. Suppose that the images in the first row of Fig. 3 are obtained after concept detector filtering. We can see clearly that the first two images are closer to the speaker's intent, while the third one depicts volleyball in a match, which looks disorganized with respect to the user's intention. Therefore, if we recommend the third image to the user, the user will be very unsatisfied.

Fig. 3. The professional and amateurish images

To assess whether an image is professional, there are three factors. The first is simplicity: compared with snapshots, which are often unstructured, busy and cluttered, it is easy to separate the subject from the background in professional photos. The second is realism, another quality that differentiates snapshots from professional photos: snapshots look “real” while professional photos look “surreal”. The last concerns the photographer's basic techniques: it is extremely rare for an entire photo taken by a professional to be blurry.

4.4 Sentiment Analysis

Sometimes the same sentence attached to different user attitudes will generate different meanings. If two or more users cannot know each other's attitude, they may misunderstand a text sentence that does not convey the other's emotion. Therefore, it is essential to augment the system with sentiment information.
Fig. 4. The predefined four sentiment sets
Here we mainly want to visualize the mood of a user's conversation via images. We predefine four sentiment sets, representing the approving opinion, the opposing opinion, a happy mood and an angry mood, respectively, as shown in Fig. 4. These four sets contain most of the daily-used words that express the emotion of a person, as well as some punctuation. At the same time, we associate emotional pictures with these sets. When we detect elements belonging to one or more sets, the corresponding pictures are presented to the users.
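A minimal sketch of this lexicon-matching step follows; the four word and punctuation sets shown are hypothetical placeholders, since the paper does not list the full sets.

```python
import re

# Hypothetical word/punctuation lists; the paper's actual four sets are not given in full.
SENTIMENT_SETS = {
    "approve": {"yes", "sure", "agree", "ok"},
    "oppose":  {"no", "never", "disagree", "impossible"},
    "happy":   {"happy", "great", "wonderful", "haha", "!"},
    "angry":   {"angry", "annoying", "terrible"},
}

def detect_sentiments(sentence):
    """Return the sentiment sets triggered by one chat sentence (simple lexicon matching)."""
    tokens = set(re.findall(r"[\w']+|[!?]", sentence.lower()))
    return {label for label, words in SENTIMENT_SETS.items() if words & tokens}

print(detect_sentiments("Great, I agree!"))   # -> {'approve', 'happy'}
```

Each triggered label would then map to one of the predefined graphic signs shown in Fig. 4.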
5 Experiments

In our experiments, we define 15 scenes (namely traveling, asking the way, and so on) which involve about 90 concept detectors in total, consisting of 30 noun semantic concept detectors and 60 activity semantic concept detectors, respectively. The data are collected via the Google image search engine, Flickr and other sources. In order to train the concept detectors, we follow the rule that the combination of local and global features can boost the performance of the semantic concept detectors [4]. Therefore, we utilize three low-level features: bag-of-visual-words (BoW), color moment (CM) and wavelet texture (WT). For BoW, we generate a 500-dimensional vector for each image. For CM, the first 3 moments of the 3 channels in HSV color space over an 8 × 8 grid partition are used to form a 384-dimensional color moment vector. For WT, we use 3 × 3 grids, and each grid is represented by the variances in 9 Haar wavelet sub-bands to generate an 81-dimensional wavelet texture vector. The raw outputs from the SVM classifiers are then converted to posterior probabilities using Platt's method, and the probabilities are combined by average fusion into a score that indicates the confidence of detecting a concept in an image. We divide our experiments into the following parts. The first part (Section 5.1) gives the accuracy of 20 activity concept detectors selected from all the scenes. An interactive user interface of our system is shown in Section 5.2. We conduct a user study to evaluate the visualized communication system in Section 5.3.

5.1 The Accuracy of the Semantic Activity Concept Detectors
We selected 6 semantic concept detectors from the traveling scene. The accuracy of human action detection for the traveling scene is shown in Fig. 5. There are 6 groups along the horizontal axis, each consisting of 4 bars representing the accuracy of CM, WT, BoW and their average fusion, respectively. In general, BoW outperforms CM and WT due to the superiority of the local feature, while CM achieves results comparable to WT. Moreover, the fusion result is the best in most cases, which can be attributed to the complementary superiority of collaborating both local and global features. In some specific cases, such as diving and go surfing, the results of CM are the best, which is mainly due to the homogeneous background in images related to these concepts.

5.2 User Interface of the Visualized Communication System
The user interface of our system is shown in Fig. 6. It can be divided into two parts. The first part, in the title region, lists the users' names and the current system time; users can select the scene in this region.
Fig. 5. The accuracy of the semantic activity concept detectors (bars: CM, WT, BoW, fusion) for climb the mountains, diving, drink water, eat biscuits, go surfing and ride bikes
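The per-feature training and average fusion behind the detectors in Fig. 5 can be sketched as follows. This is only an outline: scikit-learn's probability=True option implements Platt scaling internally, and binary labels in {0, 1} are assumed.

```python
import numpy as np
from sklearn.svm import SVC

def train_concept_detector(features_per_view, labels):
    """One binary SVM per low-level feature (BoW, CM, WT); outputs are
    Platt-calibrated posteriors, later fused by simple averaging."""
    return [SVC(kernel="linear", probability=True).fit(X, labels)
            for X in features_per_view]

def detect(models, test_features_per_view):
    probs = [m.predict_proba(X)[:, 1] for m, X in zip(models, test_features_per_view)]
    return np.mean(probs, axis=0)   # average fusion -> confidence of the concept
```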
Fig. 6. User Interface of Visualized Communication System
The other part consists of two sections. The left section recommends the images related to the users' conversation. The right section shows the current user list and the conversation. Our system can be used by users speaking different languages; the translated sentences are displayed under the original sentence.

5.3 The User Study
We invited 20 foreigners to evaluate our visualized communication system. We predefine five ranking scores for the proposed system, standing for the degree of satisfaction, namely “Very satisfied”, “Satisfied”, “Just so-so”, “Not satisfied” and “Disappointed”. The evaluation scores are shown in Fig. 7.
From the evaluation results, we can see that the users are satisfied in most cases. However, there are still some disappointed votes for our system. We think the reasons may come from two aspects. One is the deficiency in the essential association between the textual and visual information: it is hard to find a proper image to represent every keyword due to the semantic gap. The other is that a poor concept detection result will hurt the performance of the image recommendation. Nevertheless, the result in Fig. 7 shows that our system can enhance the quality of chatting between users speaking different languages by resorting to the sentiment graphics and recommended images.
Fig. 7. The Evaluation Result of User Study
6 Conclusion

The visualized communication system, which incorporates multimedia items via visual information, can really help users who face an obstacle in communicating with each other. The multi-phase visual concept detection strategy can detect most of the actions related to the conversation, so that it can provide assisting and relevant images for the users. The experimental results show that the visualized communication system is able to effectively help people with a language barrier to better understand each other. In the future, we plan to further study the user's profile information, such as age, gender and education background, to provide rich multimedia images and videos to facilitate the mediation process. In addition, user feedback technology will also be utilized to improve the system's utility with less operation but more suitable accommodation.
References 1. Adams, P.H., Martell, C.H.: Topic Detection and Extraction in Chat. In: 2008 IEEE International Conference on Semantic Computing, pp. 581–588 (2008) 2. Dong, H., Hui, S.C., He, Y.: Structural analysis of chat messages for topic detection. Online Information Review, 496–516 (2006) 3. Wang, L., Jia, Y., Han, W.: Instant message clustering based on extended vector space model. In: Proceedings of the 2nd International Conference on Advances in Computation and Intelligence, pp. 435–443 (2007)
4. Jiang, Y.-G., Yang, J., Ngo, C.-W., Hauptmann, A.G.: Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study. IEEE Transitions on Multimedia, 42–53 (2009) 5. Jiang, Y.G., Ngo, C.W., Chang, S.F.: Semantic context transfer across heterogeneous sources for domain adaptive video search. In: Proceedings of the Seventeen ACM International Conference on Multimedia, pp. 155–164 (2009) 6. Snoek, C.G.M., Huurnink, B., Hollink, L., de Rijke, M., Schreiber, G., Worring, M.: Adding semantics to detectors for video retrieval. IEEE Transaction on Multimedia 9(5), 975–986 (2007) 7. Snoek, C.G.M., Worring, M., Van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M.: The challenge problem for automated detection of 101 semantic concepts in multimedia. In: Proceedings of the 14th Annual ACM International Conference on Multimedia, p. 430 (2006) 8. Yanagawa, A., Chang, S.-F., Kennedy, L., Hsu, W.: Columbia university’s baseline detectors for 374 lscom semantic visual concepts. In: Columbia University ADVENT Technical Report #222-2006-8 (2007) 9. Jiang, Y.-G., Ngo, C.-W., Yang, J.: Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, p. 501 (2007) 10. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: CVPR 2006 (2006) 11. Natsev, A., Haubold, A., Tesic, J., Xie, L., Yan, R.: Semantic concept-based query expansion and re-ranking for multimedia retrieval. In: ACM Multimedia, p. 1000 (2007) 12. Yao, T., Mei, T., Ngo, C.W.: Co-reranking by Mutual Reinforcement for Image Search. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (2010) 13. http://nlp.stanford.edu/software/tagger.shtml 14. Shih, J.-L., Chen, L.-H.: Color image retrieval based on primitives of color moments. In: IEE Proceedings-Vision, Image, and Signal Processing, p. 370 (2002) 15. Van de Wouwer, G., Scheunders, P., Dyck, D.V.: Statistical Texture Characterization from Discrete Wavelet Representations. IEEE Transactions on Image Processing, 592-598 (1999) 16. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 213–238 (2007) 17. Keshtkar, F., Inkpen, D.: Using Sentiment Orientation Features for Mood Classification in Blogs. IEEE, Los Alamitos (2009) 18. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural Language Processing, vol. 10 (2002) 19. Zha, Z.-J., Yang, L., Mei, T., Wang, M., Wang, Z.: Visual query suggestion. In: Proceedings of ACM International Conference on Multimedia, pp. 15–24 (2009)
Effective Large Scale Text Retrieval via Learning Risk-Minimization and Dependency-Embedded Model

Sheng Gao and Haizhou Li

Institute for Infocomm Research (I2R), A-Star, Singapore, 138632
{gaosheng,hli}@i2r.a-star.edu.sg
Abstract. In this paper we present a learning algorithm to estimate a risk-sensitive and document-relation embedded ranking function, so that the ranking score can reflect both the query-document relevance degree and the risk of estimating relevance when the document relation is considered. With proper assumptions, an analytic form of the ranking function is attainable, with the ranking score being a linear combination of the expectation of the relevance score, the variance of the relevance estimation and the covariance with the other documents. We provide a systematic framework to study the roles of the relevance, the variance and the covariance in ranking documents and their relations with different performance metrics. The experiments show that incorporating the variance in the ranking score improves both relevance and diversity.

Keywords: Risk minimization, Diversity search, Language Model.
1 Introduction

The task of an information retrieval (IR) system is to retrieve documents relevant to the information needs of users and to rank them with respect to their relevance to the query. Thus, the design of the ranking function (or ranker) that computes the relevance score has been a central topic in the past decades. One of the popular ranking principles is based on Bayesian theory, where the documents are ranked according to the odds of the relevance and irrelevance probabilities for a query-document pair (e.g. [6, 7, 9, 10]). Many rankers have been derived with different assumptions on the query model and document model, such as Okapi [12], Kullback-Leibler divergence and query log-likelihood [6, 7, 8], and cosine distance with tf-idf features [2]. These functions share three common properties: 1) the uncertainty of parameter estimation for the query or document models is ignored; 2) the ranking score thus omits the uncertainty of its calculation; and 3) the relevance score, often calculated for each query-document pair independently, excludes the document relationship. For example, in the LM approach the document or query models are point estimates obtained with the ML criterion [6, 7, 8], and so is the ranking score. When we use a point estimate of relevance in ranking, a risk arises due to the uncertainty of estimation. In addition, because of the independence assumption on documents when calculating the relevance score, the diversity related to the query topic deteriorates. Recent research in diversity search tries to address this problem so that the top-N documents deliver as many relevant sub-topics as possible [3, 4, 19, 20, 21].
However, none of these methods can calculate the ranking score in a principled way when the document dependency is embedded; most of them use an ad-hoc method to interpolate the relevance and divergence scores. In this paper we present a principled method to learn a ranking function when the document relation is embedded. The novel ranker is derived by minimizing a risk function, which measures the expected loss between the true ranking score and the predicted one on a set of documents of interest. (For clarity, we use the ranking score to refer to the final value for ranking documents. It is the sum of the relevance score and the uncertainty score. The relevance score holds its traditional meaning, referring to query-document similarity, while the uncertainty score measures the risk in estimating the similarity score and is related to the diversity.) With suitable assumptions (see Section 2), an analytic form results, and the ranking score becomes a linear combination of the expectation of the relevance score, the uncertainty (variance) of the relevance estimation, and the abstract of the covariance between the document and the others, which measures the document dependency. The evaluation is done on the ImageCLEF 2008 set, officially used for ad-hoc photo image retrieval, for diversity search. We study the roles of variance and covariance in ranking and their effects on mean average precision (MAP, a relevance measure) and cluster recall at the top-N (CR@N, a diversity measure). Our analysis shows that incorporating the variance improves both the relevance and the diversity performance, and the addition of the covariance further improves the diversity performance at the cost of relevance performance. In the next section, the proposed risk-minimization and dependency-embedded ranking function (RDRF) is introduced. Then we report the experiments in Section 3. The related work and our findings are discussed in Sections 4 and 5, respectively.
2 Learning RDRF Ranker

In this section, we will elaborate the RDRF ranker using a statistical methodology. Firstly, the ranking function is treated as a random variable rather than a deterministic value; secondly, the objective function is defined, which measures the expected loss between the true ranking scores and the predicted ones over the documents; lastly, a particular RDRF ranker is derived based on the query log-likelihood ranking function.

2.1 Ranking Functions as Random Variables

In LM based IR, the relevance score between a query q = (q_1, q_2, ..., q_{|V|}) (q_i: the i-th term frequency in the query; |V|: the size of the vocabulary) and a document d = (d_1, d_2, ..., d_{|V|}) (d_i: the i-th term frequency in the document) can be calculated from the query log-likelihood generated by the document model θ_d = (θ_{d,1}, ..., θ_{d,|V|}), or from the discrepancy between the query LM θ_q = (θ_{q,1}, ..., θ_{q,|V|}) and the document LM. Here the query log-likelihood function (see Eq. (1)) is chosen as the ranking function to develop the RDRF ranker, while the principle can be applied to the others.

r(q, d) = \sum_{i=1}^{|V|} q_i \log \theta_{d,i}    (1)
In this case, the query model is the query term frequency and the document model is a unigram LM. Conventionally, the document LM θ_d is computed using the ML criterion before calculating the ranking scores. Obviously, the ranking score is one value of Eq. (1) at the estimated point of the document model. To improve the robustness of score estimation, we could sample as many points as possible from the distribution p(θ_d | d) to collect a large number of scores for each query-document pair and then use their average as the ranking score. Such an estimation should be more robust than point estimation; however, it is time consuming. Fortunately, we will soon find that sampling is not necessary. Because θ_d is random, r(q, d) is thus a random variable that characterizes the distribution of the query-document relevance score, and θ_d is integrated out with some suitable assumptions.

2.2 Risk-Minimization Based RDRF Ranker

For each query-document pair, we have r(q, d) to characterize the distribution of the relevance score. Thus, given a set of documents (or the corpus), we have a set of such random variables to describe the joint distribution of the set of relevance scores. The RDRF algorithm tries to measure and minimize the loss between the true ranking scores and the predicted ones on the documents in order to find the optimal one.

2.2.1 Loss Function on the Set

We first define a few notations used in the discussion. C: the corpus with |C| documents. d_i: the i-th document in C. r_i: the ranking function of d_i (see Eq. (1)). p(r_i | q, d_i, C \ d_i): the distribution of r_i depending on the query q, the document d_i and the set of documents excluding d_i. \hat{r}_i: the predicted ranking score of d_i. R: the overall ranking score on the set of documents (the whole corpus is used here for discussion; the findings are applicable in the other cases). p(R | q, C): the distribution of R. Now the overall ranking score is defined as (w_i: the weight of d_i),

R = \sum_{i=1}^{|C|} w_i r_i    (2)
We try to seek a set of estimates \{\hat{r}_i\} so that Eq. (2) is optimized. It is noted that r_i and p(\theta_{d_i} | d_i) are connected by Eq. (1), and p(R | q, C) should be known in theory as a function of the document models, which are estimated from the documents. We denote the predicted overall ranking score as \hat{R}. Its optimal estimation is found by minimizing Eq. (3),

\mathrm{Risk}(\hat{R}) = E\left[L(R, \hat{R}) \mid q, C\right] = \int L(R, \hat{R})\, p(R \mid q, C)\, dR    (3)

Here E[\cdot \mid \cdot] is the expected loss and L(\cdot) is the loss function, which is the LINEX function in Eq. (4) [13, 15, 18] (b is the risk weight),

L(R, \hat{R}) = e^{b(\hat{R}-R)} - b(\hat{R}-R) - 1    (4)

And the optimal Bayesian estimation is

\hat{R} = -\frac{1}{b}\log E\left[e^{-bR}\right] = \sum_{n \ge 1} \frac{(-1)^{n+1} b^{\,n-1}}{n!}\,\kappa_n = \kappa_1 - \frac{b}{2}\kappa_2 + \frac{b^{2}}{6}\kappa_3 - \frac{b^{3}}{24}\kappa_4 + \cdots    (5)
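Before unpacking the cumulants below, a quick numerical sanity check of Eq. (5): for a synthetic Gaussian R, the exact LINEX-optimal estimate -(1/b) log E[e^{-bR}] coincides with the two-cumulant approximation κ_1 - (b/2)κ_2. The numbers here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 2.0                                              # risk weight in the LINEX loss
R = rng.normal(loc=-5.0, scale=0.4, size=200_000)    # samples of the overall score R

exact = -np.log(np.mean(np.exp(-b * R))) / b         # -(1/b) log E[exp(-bR)]
approx = R.mean() - 0.5 * b * R.var()                # kappa_1 - (b/2) kappa_2
print(exact, approx)   # for Gaussian R the two agree (higher cumulants vanish)
```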
Herein the κ_n's are the cumulants. For example, κ_1 is the mean of R, κ_2 is the variance of R, and κ_3 is a measure of the skewness of R, while κ_n is determined by the moments of R up to order n. Eq. (5) tells us that the optimal overall ranking score is only related to the moments of p(R | q, C); in this paper, only the mean and variance are used, and it is not our target to know p(R | q, C) itself. In the following, we introduce how to connect \hat{R} with the estimation of the individual ranking scores when the document dependency is embedded.

2.2.2 Document-Dependent Relevance Score

From Eq. (2), it is easy to verify that κ_1 and κ_2 are calculated as in Eq. (6) and Eq. (7),

\kappa_1 = \sum_{i=1}^{|C|} w_i E_i    (6)

\kappa_2 = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} w_i w_j C_{ij}    (7)

Here E_i is the mean of r_i and C_{ij} is the covariance between r_i and r_j (C_{ii} is the variance of r_i). Replacing Eq. (6) and Eq. (7) into Eq. (5), we can get the overall ranking score as

\hat{R} = \sum_{i=1}^{|C|} w_i \left[ E_i - \frac{b}{2} w_i C_{ii} - \frac{b}{2} \sum_{j \ne i} w_j C_{ij} \right]    (8)
In Eq. (8) there are three components in the square bracket. The first is the expectation of the relevance score for d_i. The second is its variance, an uncertainty measure coming from the document itself. The last is the covariance coming from the other documents, an uncertainty measure due to the dependency on the other documents. The last two summarize the overall uncertainty in estimating r_i. Here the variance and covariance are treated separately because of their different effects on the ranking relevance (see Section 3 for experimental studies). Now we separate the terms related to d_i and get a novel ranking function as

\tilde{r}_i = E_i - \frac{b}{2} C_{ii} - \frac{b}{2} \sum_{j \ne i} C_{ij}    (9)
To calculate \tilde{r}_i, we need to know E_i and C_{ij}. However, it is not easy. In the following, we discuss two practical ways for the computation of Eq. (9).

A) Documents are independent of each other when calculating E_i and C_{ii}. Therefore, for the query log-likelihood function, the mean is calculated as

E_i = E\left[\sum_{w=1}^{|V|} q_w \log \theta_{d_i,w}\right] = \sum_{w=1}^{|V|} q_w \int \log \theta_{d_i,w}\; p(\theta_{d_i} \mid d_i)\, d\theta_{d_i}    (10)
where p(\theta_{d_i} \mid d_i) is the posterior distribution. Here it is the Dirichlet distribution,

p(\theta_{d_i} \mid d_i) = \frac{\Gamma(|d_i| + |V|)}{\prod_{w=1}^{|V|} \Gamma(d_{i,w} + 1)} \prod_{w=1}^{|V|} \theta_{d_i,w}^{\,d_{i,w}}    (11)

where \Gamma(\cdot) is the gamma function and |d_i| = \sum_{w} d_{i,w} is the document length.
According to the properties of the Dirichlet distribution,

E\left[\log \theta_{d_i,w}\right] = \psi(d_{i,w} + 1) - \psi(|d_i| + |V|)    (12)
· : the digamma function),
Therefore, the mean is calculated as ( ∑| Similarly,
|
(13)
is derived as in Eq. (14) ( ∑|
· : trigamma function),
|
(14)
B) Documents are dependent when calculating covariance To get the covariance, we calculate from Eq.(15). ∑|
103
| ,
log
. log
(15)
In order to get log log , we need to estimate to the Bayes’ rule, it is found that,
,
|
,
. According
,
(16)
is independent of given . Thus, we When we derive Eq. (16), we assume induce an order in the document pair, i.e. Eq.(16) measuring the information flow to . In general, it is not equal to the information flowed from to . We from will soon discuss its effect on the covariance. According to the properties of the Dirichlet distribution, Eq.(17) is derived to calculate
log
log
(
: a pseudo-document with the term frequency
), log
log
log
log
(17)
Now the covariance is computed as, ∑|
|
log
∑|
|
log
log
(18)
In Eq. (18), the first sum is the expectation of relevance score of . The second sum is the difference between the expectation of relevance score for conditioned on , ) and the expectation of (Document model is (Document model is | ). Since , , , Eq. (18) is asymmetric. Strictly saying, it is not a covariance. But here we still use the term to measure the dependency between the documents. 2.2.3 Discussions Substituting Eqs.(13, 14, 18) into Eq.(9), we will get the ranking score for each querydocument pair in case of the document dependency included. It is obvious that the value of the covariance (plus variance) is not comparable in the scale to that of expectation. It is tough to balance them by adjusting the risk weight , because the range of
104
S. Gao and H. Li
the risk weights depend on the size of document set. To normalize such effect, herein we introduce 3 types of methods to calculate the covariance abstract. A) Covariance average The average of covariance (plus variance) is used as the covariance abstract, i.e., ̃
| |
∑|
̃ will have the comparable range with noted that . The prior weights
|
(19)
for the variable size of | |. It is are discarded in the paper.
B) Maximal covariance The average covariance can smooth the effects of the uncertainties coming from different documents. Like in MMR [3] to use the maximal margin as the measure of diversity, we can also use the maximal covariance (plus variance) as the abstract, i.e., ̃
max
,| |
(20)
Thus, if a document has the higher covariance with the others, i.e. more similar in the content with others, it will get larger penalty in calculating ranking score. It means that the ranker will promote the documents which contain novel information. C) Minimal covariance Similarly, the minimal covariance can also be used. ̃
min
,| |
(21)
In the above discussions, we know that we need to find a working set for each document in order to calculate the covariance. For simplicity, the whole corpus is selected in the above. In practice, the working set may depend on the individual document and thus vary according to the applications. For example, if we use the rankers in Eqs. (19-21) to rank all documents in the corpus, the working set is the corpus. But if we want to re-rank the documents in a ranking list in a top-down manner, the working set for a document, saying , may only include the documents that are ranked higher than in the ranking list.
3 Experiments We implement the proposed rankers based on the Lemur toolkit1. Lemur is the representative of the up-to-date technologies developed in IR models. It is used to develop the benchmark system and our ranking systems. 3.1 Evaluation Sets The experiments are carried out on the ImageCLEF 2008 (CLEF08) set, which is officially used for evaluating the task of the ad-hoc image retrieval. It has 20,000 documents with the average length 19.33 words. In the paper we do experiments only on the text modality. The query contains the terms in the field of title with an average 1
http://www.lemurproject.org
Effective Large Scale Text Retrieval
105
length 2.41 terms. Totally 39 queries are designed. Because the queries are designed to evaluate the diversity, they have much ambiguity. 3.2 System Setup and Evaluation Metrics The Jelinek-Mercer LM2 is chosen for document representation. The baseline system is built on the traditional query log-likelihood function (See Eq. (1)). Although the proposed rankers can be applied to retrieve and rank the documents from the whole corpus, considering the computation cost, we run our rankers on a subset which contains the top-1000 documents in the initial ranking list generated by the baseline. The performances of different systems are compared in terms of multiple metrics including mean average precision (MAP) for the relevance performance and cluster recall at the top 5 (CR@5) and 20 (CR@20) for the diversity performance. The cluster recall at the top-N documents measures the accuracy of the sub-topics in the top-N documents for a query. It is calculated by the number of sub-topics covered in the topN documents divided by the total sub-topics of the query. In the ImageCLEF08 set, the sub-topic label is tagged for each document and query in the pooled evaluation set besides the relevance label3 . 3.3 Result Analysis Now we study the behaviors of the RDRF rankers. From the discussion in Section 2, we know that the RDRF based ranking scores contain 3 components: expectation of the relevance score, variance and covariance. In addition, there is a tuning ter . In the following, we will study the performance as a function of the risk weight in the following conditions: 1) variance without document relation (See section 3.3.1), 2) covariance (See section 3.3.2), 3) various covariance abstract methods (See section 3.3.3) and 4) working set selection (See section 3.3.4). In the first two studies, (Eq. (20)) is chosen while the working set is same in the first 3 stuthe ranker dies4. 3.3.1 Effects of Variance Figure 1 depicts the changing performance as a function of the increasing risk weight in the case of only the variance being considered. Obviously, the performances are improving as the weight is increasing. At some points, they reach their maximum and then drops. The maximal MAP, CR@5 and CR@20 are 0.1310 ( 3), 0.1340 ( 1) and 0.2316 ( 4), respectively. In comparison with the baseline performance, which has 0.1143 for MAP, 0.1182 for CR@5 and 0.1990 for CR@20, the significant improvements are observed5. The relatively improvements are 14.6% (MAP), 13.4% (CR@5) and 16.4% (CR@20), respectively. Results are reported here based on the smoothing parameter 0.1. Findings are similar for others. 3 http://imageclef.org/2008/photo 4 The working set for a document contains the documents ranked higher than it in the initial ranking list. For computational consideration, currently only the top-100 documents are reranked using the learned rankers and the other documents in the initial ranking list keep their ranking positions. 5 We test statistical significance using t-test (one-tail critical value at the significance level 0.05). 2
106
S. Gao and H. Li
Therefore, the Bayesian estimation of ranking score gets the better performance than the traditional point estimation based method. The addition of variance further improves the performance. In the experiment, the performances with the zero risk weight are 0.1195 (MAP), 0.1244 (CR@5) and 0.2022 (CR@20) respectively, which are worse than the performances obtained with the optimal risk weights. Our investigations on Eqs.(9, 14, 20) reveal that the variance functions as a penalty added to the relevance estimation. With the same term frequency, the term penalty in a long document is higher than that in a short document. The trigamma function in Eq.(14) also normalizes the effect of the document lengths on the term contribution. Due to the normalization of term and document length, the estimated ranking score become robust and has the positive effect on the performances. MAP
CR@5
CR@20
0.2 0.1 0 0
2
4
6
8
10
Fig. 1. Performance vs. risk weight b (only the variance is considered) VAR MAX_COV
MAX_COV
0.2
0.2
0.15
0.15
0.1
0.1
0.05
0.05
0
0 0
2
4
(a) MAP
6
8 10
VAR MAX_COV
VAR 0.3 0.2 0.1 0 0
2
4
(b) CR@5
6
8
10
0
2
4
6
8
10
(c) CR@20
Fig. 2. Performance vs. risk weight b for the covariance based ranker (MAX_COV) and the variance only ranker (VAR)
3.3.2 Effects of Covariance We now add the covariance into the ranking score. Its effect on the performance is illustrated in Figure 2 and is compared with the performance when the dependency is ignored (i.e. only variance, see section 3.3.1). We have the following findings. 1) The inclusion of covariance improves the diversity performance. In the experiments, adding the dependency obviously improves the diversity performance compared with the case where only the variance is included. We study
Effective Large Scale Text Retrieval
107
maximum of CR@5 and CR@20 for the covariance based ranker (MAX_COV) and the variance based ranker (VAR). The CR@5 has a relative increment 12%, which reaches 0.1501, while CR@20 improves 4.7% which achieves 0.2424. Since the covariance measures the document dependency, it is not surprising to see it improve the diversity performance. 2) Incorporating the covariance decreases MAP, which coincides with the observations in [3, 4, 19, 20]. Since the diversity only concerns the novelty among the documents rather than the similarity, we can understand that in most of times, the diversity algorithms might have negative effects on the relevance metric. 3.3.3 Effects of Covariance Abstract Methods Figure 3 illustrates the performances for the rankers based on 3 covariance abstract methods, i.e. covariance average (AVG_COV), maximal covariance (MAX_COV) and minimal covariance (MIN_COV) (see Sec. 2.2.3 for details). AVG_COV MAX_COV MIN_COV
0.15
AVG_COV MAX_COV MIN_COV
AVG_COV MAX_COV MIN_COV
0.3
0.15
0.05
0.1
0.05 0
2
4
6
8 10
0
(a) MAP
2
4
6
8
0
10
(b) CR@5
2
4
6
8
10
(c) CR@20
Fig. 3. Performance vs. risk weight b for the rankers with covariance average (AVG_COV), maximal covariance (MAX_COV) and minimal covariance (MIN_COV) MAX_COV_N MAX_COV
0.2
MAX_COV_N MAX_COV
MAX_COV_N MAX_COV
0.2
0.2
0.1
0.1 0
0 0
2
4
(a) MAP
6
8
10
0 0
2
4
(b) CR@5
6
8
10
0
2
4
6
8
10
(c) CR@20
Fig. 4. Performance as a function of risk weight b for the rank-listed working set (MAX_COV) and the working set with all documents (MAX_COV_N)
We find that MAX_COV works best. MIN_COV is the worst while AVG_COV is in the middle. When we compare MAX_COV with AVG_COV, the biggest gap is
108
S. Gao and H. Li
found for CR@20 which reaches 20% relative increment. For CR@5, MAX_COV obtained 8.2% increment. 3.3.4 Working Set Selection In the above experiments, we collect all documents that are ranked higher than the document of interest as the working set. Thus, each document has a different working set. We call it the rank-listed working set. But if the ranking order is not available, we can use all documents as the working set. Figure 4 compares the performances between two working set schemes. One is used in the above experiments (MAX_COV) and another is the working set which contains all documents to be ranked (MAX_COV_N). In the experiments, the latter working set includes top-100 documents to be re-ranked for each document. Figure 4 shows that the rank-listed working set works better. But their maximal performances are similar. It is seen that the performance of MAX_COV_N drops quicker than that of MAX_COV off the optimal setting. In other words, MAX_COV is more stable. This is because that the MAX_COV_N includes more documents in its working set than MAX_COV set. Thus, the MAX_COV_N incurs more noises in calculating the document dependency measure.
4 Related Work Lafferty & Zhai [6, 16] presented the risk minimization framework for IR model, where the user browsing documents is treated as the sequential decision. They proved that the commonly used ranking functions are the specials of the framework with the particular choices of loss functions, query and document models and presentation schemes. Although finding the optimal decision is formulated as minimizing the Bayesian loss, in practice they approximated the objective function using the point estimation in their work. In our work the full Bayesian approach is exploited by integrating out the model parameters. Thus, our work results in a novel ranker, which contains: 1) the similarity expectation, a measure of the average relevance and 2) the covariance (plus variance), a measure of the risk of the expectation estimation that is related to diversity performance. The above novelties also make our work different from the risk-aware rankers presented by [13, 18]. In their work, each term is treated as an individual ranker and the uncertainty of the term ranker is estimated independently. Therefore, they need to combine all term rankers to get the query-document ranking score. In comparison, we directly estimate the expectation and the uncertainty. This gives us a single value for each query-document pair to measure the expected relevance score and the uncertainty. Incorporating the covariance into the ranking score is now a natural result in our risk-minimization and document-dependency embedded framework. In contrary, the alternative method is used to calculate the covariance in [13]. When applied to diversity search, our work is quite different from the methods such as [3, 4, 19, 20]. In their works, they developed the methods to estimate the diversity degree of the documents and then linearly combined the diversity scores with the similarity scores in the original ranking list. This means that their similarity scores
and diversity scores are estimated separately and may therefore follow different criteria; in our work they are derived under a unified risk-minimization framework.
5 Conclusion
We have presented a learning algorithm to obtain a ranking function in which the uncertainty of relevance estimation and document relations are embedded. Learning the ranking function is formulated in the framework of Bayesian risk minimization; with proper assumptions, an analytic form of the ranking function is attainable. The resulting ranking score is a linear combination of the expectation of the relevance score, the variance of the expectation estimate, and the covariance with other documents. The presented algorithm provides a systematic way to study the relation among relevance, diversity and risk. The roles of the variance and covariance in ranking are studied empirically. Including the variance in the ranker improves both relevance and diversity performance, and incorporating the covariance can further improve diversity at some cost in relevance (MAP). The tunable risk weight allows us to balance relevance and diversity. In the future, we will investigate how to embed an adaptive query model in the framework.
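The closed-form ranker itself is derived earlier in the paper; as a rough illustration of how the three components and the risk weight interact, the Python sketch below (our own simplification, not the authors' exact formula) greedily re-ranks documents by expected relevance penalized by the variance and by the covariance with the documents already ranked above:

import numpy as np

def risk_adjusted_rerank(mu, cov, b=2.0):
    """mu: (N,) expected relevance scores; cov: (N, N) covariance of the estimates;
    b: risk weight balancing relevance against risk/diversity."""
    remaining = list(range(len(mu)))
    ranking = []
    while remaining:
        best, best_score = None, -np.inf
        for d in remaining:
            # variance of the estimate plus covariance with documents ranked above
            risk = cov[d, d] + sum(cov[d, r] for r in ranking)
            score = mu[d] - b * risk
            if score > best_score:
                best, best_score = d, score
        ranking.append(best)
        remaining.remove(best)
    return ranking

# toy example with two strongly correlated documents and one uncorrelated one
mu = np.array([0.9, 0.85, 0.4])
cov = np.array([[0.04, 0.03, 0.00],
                [0.03, 0.04, 0.00],
                [0.00, 0.00, 0.02]])
print(risk_adjusted_rerank(mu, cov, b=2.0))   # [0, 1, 2]
print(risk_adjusted_rerank(mu, cov, b=10.0))  # [0, 2, 1]: a larger risk weight promotes the uncorrelated document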
References
1. Allan, J., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. on Information Systems 20(4), 357–389 (2002)
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
3. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proc. of SIGIR 1998 (1998)
4. Chen, H., Karger, D.R.: Less is more: probabilistic models for retrieving fewer relevant documents. In: Proc. of SIGIR 2006 (2006)
5. Jelinek, F., Mercer, R.: Interpolated estimation of Markov source parameters from sparse data. Pattern Recognition in Practice, 381–402 (1980)
6. Lafferty, J.D., Zhai, C.: Document language models, query models and risk minimization for information retrieval. In: Proc. of SIGIR 2001 (2001)
7. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: Proc. of SIGIR 2001 (2001)
8. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. of SIGIR 1998 (1998)
9. Robertson, S.E., Sparck Jones, K.: Relevance weighting of search terms. Journal of the American Society for Information Science 27(3), 129–146 (1976)
10. Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In: Proc. of SIGIR 1994 (1994)
11. Robertson, S.E.: The probability ranking principle in IR. Readings in Information Retrieval, 281–286 (1997)
12. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gatford, M., Payne, A.: Okapi at TREC-4. In: Proc. of Text Retrieval Conference, TREC (1995)
13. Wang, J., Zhu, J.H.: Portfolio theory of information retrieval. In: Proc. of SIGIR 2009 (2009)
14. Zaragoza, H., Hiemstra, D., Tipping, M., Robertson, S.E.: Bayesian extension to the language model for ad hoc information retrieval. In: Proc. of SIGIR 2003 (2003)
15. Zellner, A.: Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association 81(394), 446–451 (1986)
16. Zhai, C., Lafferty, J.D.: A risk minimization framework for information retrieval. Information Processing and Management 42(1), 31–55 (2006)
17. Zhai, C., Lafferty, J.D.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. on Information Systems 22(2), 179–214 (2004)
18. Zhu, J.H., Wang, J., Cox, I., Taylor, M.: Risky business: modeling and exploiting uncertainty in information retrieval. In: Proc. of SIGIR 2009 (2009)
19. Zhai, C., Cohen, W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: Proc. of SIGIR 2003 (2003)
20. Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: Proc. of WWW 2009 (2009)
21. Radlinski, F., Kleinberg, R., Joachims, T.: Learning diverse rankings with multi-armed bandits. In: Proc. of ICML 2008 (2008)
Efficient Large-Scale Image Data Set Exploration: Visual Concept Network and Image Summarization
Chunlei Yang1,2, Xiaoyi Feng1, Jinye Peng1, and Jianping Fan1,2
1 School of Electronics and Information, Northwestern Polytechnical University, Xi'an, P.R.C.
2 Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
Abstract. When large-scale online images come into view, it is very important to construct a framework for efficient data exploration. In this paper, we build exploration models based on two considerations: inter-concept visual correlation and intra-concept image summarization. For inter-concept visual correlation, we have developed an automatic algorithm to generate a visual concept network which is characterized by the visual correlation between image concept pairs. To incorporate reliable inter-concept correlation contexts, multiple kernels are combined and a kernel canonical correlation analysis algorithm is used to characterize the diverse visual similarity contexts between the image concepts. For intra-concept image summarization, we propose a greedy algorithm that sequentially picks the best representatives of the image concept set. The quality score for each candidate summary is computed from the clustering result and considers relevancy, orthogonality and uniformity terms at the same time. Visualization techniques are developed to assist users in assessing the coherence between concept pairs and in investigating the visual properties within each concept. We have conducted experiments and user studies to evaluate both algorithms. We observed very good results and received positive feedback.
1 Introduction
With the exponential growth in the availability of high-quality digital images, there is an urgent need to develop new frameworks for image summarization and for interactive image navigation and category exploration [1-2]. The Large-Scale Concept Ontology for Multimedia (LSCOM) project is the first such effort to facilitate more effective end-user access to large-scale image/video collections in a large semantic space [3-4]. There are more than 2000 concepts and 61901 labels for each concept in the LSCOM project, which is still a small subset compared to all the available online resources. Commercial image collection web sites such as Flickr.com host 2 billion images, and the number is still increasing. Considering the scale of the problem we are dealing with, effectively organizing the relationships between a large number of concepts and better summarizing the large number of images within each concept are the focus of our work. To organize inter-concept relationships, a concept ontology can be used to navigate and explore large-scale image/video collections at the concept level according to hierarchical inter-concept relationships such as “IS-A” and “part-of” [4]. However, the following issues make most existing techniques for concept ontology construction unable to support effective navigation and exploration of large-scale image collections:
(a) Only the hierarchical inter-concept relationships are exploited for concept ontology construction [5-6]. When large-scale online image collections come into view, the inter-concept similarity relationships can be more complex than the hierarchical ones (i.e., a concept network) [7]. (b) Only the inter-concept semantic relationships are exploited for concept ontology construction [5-6]; thus the concept ontology cannot allow users to navigate large-scale online image collections according to their visual similarity contexts at the semantic level. It is well accepted that the visual properties of images are very important for users searching for images [1-4, 7]. Thus it is very attractive to develop a new algorithm for visual concept network generation that exploits more precise inter-concept visual similarity contexts for image summarization and exploration.
To reduce the scale of the image set of each concept while maintaining as many visual perspectives of the concept as possible, image summarization methods can be used to select a subset of images with the most representative visual properties. Jing et al. [8] used “local coherence” information to find the most representative image, i.e., the one with the maximum number of edge connections in the group. Simon et al. [9] proposed a greedy algorithm to find representative images iteratively, also considering the likelihood and orthogonality scores of the images. Unfortunately, neither algorithm takes into account global clustering information, such as the distribution of the clusters and the size, mean and variance of each cluster.
Based on these observations, this paper focuses on: (a) integrating multiple kernels and incorporating kernel canonical correlation analysis (KCCA) to enable more accurate characterization of inter-concept visual similarity contexts and to generate a more precise visual concept network; (b) supporting similarity-preserving visual concept network visualization and exploration to assist users in perceptual coherence assessment; (c) iterative image summarization generation that exploits the global clustering information, specifically characterized by relevancy, orthogonality and uniformity scores; and (d) visualizing the image summarization results within each concept.
The remainder of this paper is organized as follows. Section 2 introduces our approach for image content representation, similarity determination and kernel design. Section 3 introduces our work on automatic visual concept network generation. Section 4 describes our intra-concept image summarization algorithm. We visualize and evaluate our visual concept network and image summarization results in Section 5. We conclude the paper in Section 6.
2 Data Collection, Feature Extraction and Similarity Measurement
The images used in this work are collected from the Internet. In total, 1000 keywords are used to construct our data set, with some of the keywords derived directly from the Caltech-256 [13] and LSCOM concept lists. It is not easy to determine meaningful text terms for crawling images from the Internet. Many people use the text-term architecture from WordNet; unfortunately, most of the text terms from WordNet may not be meaningful for image concept interpretation, especially when you need a
Fig. 1. The taxonomy for text term determination for image crawling
Fig. 2. Feature extraction frameworks: image-, grid-, and segment-based
large number of keywords. Other people use the most popular tags from Flickr or other commercial image-sharing web sites as query keywords. The problem, again, is that these tags may not represent a concrete object or something that has a visual form; examples are “2010”, “California”, “Friend”, “Music”, etc. Based on the above analysis, we have developed a taxonomy for natural objects and scene interpretation [10], and we follow this pre-defined taxonomy to determine the meaningful keywords for image crawling, as shown in Fig. 1. Because there is no explicit correspondence between the image semantics and the keywords extracted from the associated text documents, the returned images are sometimes junk images or weakly related images. We apply the algorithms introduced in [11] to cleanse the images crawled from the Internet (i.e., filtering out the junk images and removing the weakly tagged images). For each keyword, or concept as used in this paper, approximately 1000 images are kept after cleansing.
For image classification applications, the underlying framework for image content representation should: (a) characterize the image contents as effectively as possible; and (b) keep the computational cost of feature extraction tolerable. Based on these observations, we have incorporated three frameworks for image content representation and feature extraction, as shown in Fig. 2: (1) an image-based framework; (2) a segment-based framework; and (3) a grid-based framework. The segment-based framework has the most discriminative power and gives the best representation of the image; however, it suffers from a large computational burden and sometimes over-segments the image. On the other hand, the image-based framework is the most computationally efficient but is too coarse to model local information. As a tradeoff, we find
the grid-based framework most suitable for our system in terms of both efficiency and effectiveness. Feature extraction is conducted on the grids of the images as described above. Global visual features such as color histograms can provide the global region statistics and the perceptual properties of the entire region, but they may not be able to capture the object information, or in other words the local information, within the region [12,14]. On the other hand, local visual features such as SIFT (Scale Invariant Feature Transform) features allow object-level recognition against cluttered backgrounds [12,14]. In our implementation, we incorporate three types of visual features, namely color histogram, Gabor texture and SIFT, which are described in detail as follows.
Color histogram: We extract the histogram vectors in the HSV color space, which outperforms the RGB color space through its invariance to changes in illuminance. The histogram is computed over 18 bins of the Hue component and 4 bins of the Saturation component, which yields a 72-dimensional feature vector.
Gabor texture: We apply Gabor filters at different orientations and scales of the image region, and the homogeneous texture of the region is represented by the mean and standard deviation of the transformed coefficients. Specifically, the extraction is conducted at 3 scales and 4 orientations, which yields a 24-dimensional feature vector.
SIFT: We use the SURF (Speeded-Up Robust Features) [14] descriptor, which is inspired by the SIFT descriptor but is much faster to compute in practice. The parameters are configured as a blob response threshold of 0.001 and a maximum number of interpolation steps of 5.
The similarity measurement for the color histogram and Gabor texture features is defined as follows:

Similarity(i, j) = \max_{i, j \in X} \left( -\|x_i - x_j\|^2 \right)    (1)
where i and j range over the grid set X, which is composed of 5 regional grids. The similarity score is calculated for each grid pair, and the maximum score is taken as the similarity between the two images. The similarity measurement for the SIFT feature is defined as

Similarity(i, j) = \frac{\text{total number of matches}}{\text{total number of interest points}}    (2)
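The following Python sketch implements the two measures of Eqs. (1) and (2); the function names are ours, and the per-grid feature vectors and the SIFT/SURF match counts are assumed to be computed elsewhere:

import numpy as np

def grid_similarity(grids_a, grids_b):
    """Eq. (1): negative squared Euclidean distance of the closest grid pair.
    grids_a, grids_b: lists of per-grid feature vectors (e.g. 72-d color
    histograms or 24-d Gabor vectors) for the two images."""
    best = -np.inf
    for x in grids_a:
        for y in grids_b:
            best = max(best, -float(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))
    return best

def sift_similarity(num_matches, num_interest_points):
    """Eq. (2): fraction of interest points that found a match between the two images."""
    return num_matches / float(num_interest_points)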
We have also studied the statistical properties of the images under each feature subset introduced above. The gained knowledge of these statistical properties has been used to design the basic image kernel for each feature subset. Because different basic image kernels play different roles in characterizing the diverse visual similarity relationships between images, the optimal kernel for diverse image similarity characterization can be approximated more accurately by a linear combination of these basic image kernels with different importance factors. For a given image concept Cj, the diverse visual similarity contexts between its images can be characterized more precisely by using a mixture of these basic image kernels (i.e., mixture-of-kernels) [11, 15-17]:
\kappa(u, v) = \sum_{i=1}^{5} \alpha_i\, \kappa_i(u_i, v_i), \qquad \sum_{i=1}^{5} \alpha_i = 1    (3)
where u and v are the visual features for two images in the given image concept Cj , ui and vi are their ith feature subset, αi ≥ 0 is the importance factor for the ith basic image kernel κi (ui , vi ).
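A minimal Python sketch of the mixture-of-kernels combination in Eq. (3); the RBF base kernels and the example weights are illustrative assumptions of ours, not the paper's exact kernel design:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mixture_of_kernels(base_kernels, alphas):
    """Combine per-feature-subset kernel matrices as in Eq. (3).
    base_kernels: list of (n, n) kernel matrices, one per feature subset.
    alphas: non-negative importance factors, normalized here to sum to 1."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0)
    alphas = alphas / alphas.sum()          # enforce sum(alpha_i) = 1
    return sum(a * K for a, K in zip(alphas, base_kernels))

# example with two hypothetical feature subsets (color histogram, Gabor texture)
X_color, X_texture = np.random.rand(10, 72), np.random.rand(10, 24)
K = mixture_of_kernels([rbf_kernel(X_color), rbf_kernel(X_texture)], alphas=[0.6, 0.4])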
3 Visual Concept Network Generation
We determine the inter-concept visual similarity contexts for automatic visual concept network generation with the image features and kernels introduced above. The inter-concept visual similarity context γ(Ci, Cj) between the image concepts Ci and Cj can be determined by performing kernel canonical correlation analysis (KCCA) [18] on their image sets Si and Sj:

\gamma(C_i, C_j) = \max_{\theta, \vartheta} \frac{\theta^T \kappa(S_i)\,\kappa(S_j)\,\vartheta}{\sqrt{\theta^T \kappa^2(S_i)\,\theta \cdot \vartheta^T \kappa^2(S_j)\,\vartheta}}    (4)
The detailed explanation of the parameters can be found in our previous report [22]. When large numbers of image concepts and their inter-concept visual similarity contexts are available, they are used to construct a visual concept network. However, the strength of the inter-concept visual similarity context between some image concepts may be very weak, so it is not necessary for each image concept to be linked with all the other image concepts in the visual concept network. Eliminating the weak inter-concept links not only increases the visibility of the image concepts of interest dramatically, but also allows our visual concept network to concentrate on the most significant inter-concept visual similarity contexts. Based on this understanding, each image concept is automatically linked with the most relevant image concepts, i.e., those with larger values of the inter-concept visual similarity context γ(·, ·) (values above a threshold δ = 0.65 on a scale from 0 to 1). Compared with the Flickr distance [19], our algorithm for inter-concept visual similarity context determination has several advantages: (a) it deals with the sparse distribution problem more effectively by using a mixture-of-kernels to achieve a more precise characterization of the diverse image similarity contexts in the high-dimensional multi-modal feature space; (b) by projecting the image sets for the image concepts into the same kernel space, our KCCA technique achieves a more precise characterization of the inter-concept visual similarity contexts.
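To make the link-elimination step concrete, the following sketch (our own helper, not the authors' code) builds the concept-network adjacency from a precomputed γ matrix using the δ = 0.65 threshold mentioned above:

import numpy as np

def build_concept_network(gamma, delta=0.65):
    """Link concept pairs whose inter-concept visual similarity exceeds delta.
    gamma: (C, C) symmetric matrix of KCCA similarity contexts in [0, 1].
    Returns an adjacency list {concept index: set of linked concept indices}."""
    C = gamma.shape[0]
    edges = {i: set() for i in range(C)}
    for i in range(C):
        for j in range(i + 1, C):
            if gamma[i, j] > delta:      # keep only the strong similarity contexts
                edges[i].add(j)
                edges[j].add(i)
    return edges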
4 Image Summarization Algorithm
When the user explores the inside of a concept, there are still thousands of images to be displayed. In order to summarize the data set, we need to find the most representative images, which form a subset of the original data set. The summarization problem is essentially a subset selection problem and can be stated formally as follows. Given an image data set V of N images, our goal is to
find a subset S ⊂ V that best represents the original data set V. We introduce a quality term Q_v for each v ∈ V expressing its capability to represent the entire data set. The v with the highest value is considered the best candidate summarization of V and is added into S; we call S the “summarization pool”. Traditional image summarization models partition or cluster the data set and find a summarization image within each cluster. We propose to build a model that not only makes use of the cluster information but also considers the inter-cluster relationships of the data, and then to build a global objective function.
We apply the affinity propagation algorithm to cluster the data set into several clusters and record the size, mean and variance of each cluster. Affinity propagation has demonstrated its superiority in automatically choosing the number of clusters, faster convergence and more accurate results compared with other methods such as k-means. The similarity measurement used in affinity propagation is defined in Eqn. (1). For our proposed image summarization algorithm, we go through each element in V and pick the best Q_v to be added as a candidate summarization. The representativeness of an image v is reflected by the following three aspects:
1. Relevancy: the relevance score of v is determined by the size, mean and variance of the cluster that v belongs to, i.e., v ∈ c(v). A candidate summarization comes from a cluster with a big size, a small variance and a small distance d(v, v_mean) to the mean. In other words, it should be most similar to the other images.
2. Orthogonality: the orthogonality score penalizes candidates from the same cluster, or candidates that are too similar to each other. It is not recommended to select multiple candidates from one cluster, because such candidates tend to bring redundancy to the final summarization.
3. Uniformity: the uniformity score penalizes candidates that appear to be outliers of the original data set. Although outliers always show a unique perspective of the original set, or in other words achieve a high relevancy score, they should not be considered as a summarization.
Based on the above criteria, we formulate the final quality score as a linear combination of the three terms:

Q_v = R(v) + \alpha\, O(v, S) + \beta\, U(v, \hat{S})    (5)
where R(v), O(v, S) and U(v, \hat{S}) represent the Relevancy, Orthogonality and Uniformity scores, respectively. We further define the three terms as follows. For the Relevancy score:

R(v) = \frac{|c(v)| \cdot L_{c(v)}}{\sigma_{c(v)} + d(v, \mu_{c(v)}) + \epsilon_1}    (6)
where c(v) denotes the cluster that contains v, i.e., v ∈ c(v), |c(v)| is the number of elements in c(v), μ_{c(v)} and σ_{c(v)} are the mean and standard deviation of the cluster, and L_{c(v)} is the number of links from v. Within each cluster, similar images are linked together; the similarity measurement is defined by the SIFT feature as in Eqn. (2), and a similarity score above a pre-set threshold (0.6) is counted as a match. Matched image pairs are linked together, while unmatched pairs are not. L_{c(v)} can also be seen as the degree of v in the match graph. For the Orthogonality score:
O(v, S) = \begin{cases} 0, & \text{if } J(v') = \emptyset \\ -\sum_{v' \in J(v')} \dfrac{1}{d(v, v') + \epsilon_2}, & \text{if } J(v') \neq \emptyset \end{cases}    (7)
where J(v') = \{v' \mid c(v') = c(v),\ v' \in S\}. J(v') is empty if none of the elements v' in S comes from the same cluster as v; otherwise, J(v') is not empty and a penalty term is applied. For the Uniformity score:

U(v, \hat{S}) = -\frac{1}{d(v, \mu_{\hat{S}}) + \epsilon_3}    (8)
where \hat{S} = V \setminus S and μ_{\hat{S}} is the mean value of \hat{S}. In the above terms, d(·,·) is the Euclidean distance and ε_i, i = 1, 2, 3, is a small positive number that keeps the denominator of each fraction non-zero. Applying the formulation of Q_v gives the best summarization of the concept set. For a fixed number of summarizations, |S| = k, we iterate the calculation of Q_v k times. The proposed procedure is close to a greedy algorithm and can be summarized as follows:
Algorithm 1. Cluster-based Image Summarization Algorithm
1: Initialization: S = ∅
2: For each image v ∈ V, compute Q_v = R(v) + αO(v, S) + βU(v, \hat{S})
3: The v with the maximum quality Q_v is added into S: S = S ∪ {v}
4: If the stop criterion is satisfied, stop. Otherwise, set V = V \ {v} and repeat from step 2.
At each iteration, Q_v is calculated for every v in V, and the best v is found and added into S. For the fixed-size summarization problem, the stopping criterion is that S has reached the pre-defined size. For the automatic summarization problem, the stopping criterion is that Q_v stays above a pre-defined value, which is 0 in our implementation. The parameters α and β are determined experimentally; in our implementation, α = 0.3 and β = 0.4.
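A compact Python sketch of Algorithm 1 under the scoring of Eqs. (5)-(8); the Euclidean affinity used for clustering, the scalar cluster spread, and the default match-graph degree are simplifications of ours rather than the paper's exact choices:

import numpy as np
from sklearn.cluster import AffinityPropagation

def summarize(X, k=5, alpha=0.3, beta=0.4, degree=None, eps=1e-6):
    """X: (N, d) image feature vectors for one concept.
    degree: per-image count of SIFT match links inside its cluster (L_{c(v)});
    defaults to 1 for every image when no match graph is available."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    degree = np.ones(N) if degree is None else np.asarray(degree, dtype=float)

    # The paper clusters with affinity propagation using the similarity of Eq. (1);
    # the default negative squared Euclidean affinity is used here for brevity.
    labels = AffinityPropagation(random_state=0).fit(X).labels_
    S, V = [], list(range(N))

    def relevancy(v):                      # Eq. (6)
        members = X[labels == labels[v]]
        mu = members.mean(axis=0)
        sigma = members.std()              # scalar spread measure for the cluster
        return (len(members) * degree[v]) / (sigma + np.linalg.norm(X[v] - mu) + eps)

    def orthogonality(v):                  # Eq. (7): penalize picks from the same cluster
        same = [s for s in S if labels[s] == labels[v]]
        return -sum(1.0 / (np.linalg.norm(X[v] - X[s]) + eps) for s in same)

    def uniformity(v):                     # Eq. (8): penalize outliers of V \ S
        mu_rest = X[V].mean(axis=0)
        return -1.0 / (np.linalg.norm(X[v] - mu_rest) + eps)

    while V and len(S) < k:
        q = {v: relevancy(v) + alpha * orthogonality(v) + beta * uniformity(v) for v in V}
        best = max(q, key=q.get)
        S.append(best)
        V.remove(best)
    return S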
5 System Visualization and Evaluation
For inter-concept exploration, to allow users to assess the coherence between the visual similarity contexts determined by our algorithm and their perceptions, it is very important to enable a graphical representation and visualization of the visual concept network, so that users can obtain a good global overview of the visual similarity contexts between the image concepts at first glance. It is also very attractive to enable
Fig. 3. System User Interface: left: global visual concept network; right: cluster of the selected concept node
interactive visual concept network navigation and exploration according to the inherent inter-concept visual similarity contexts, so that users can easily assess the coherence with their perceptions. Based on these observations, our approach for visual concept network visualization exploits hyperbolic geometry [20]. The essence of our approach is to project the visual concept network onto a hyperbolic plane according to the inter-concept visual similarity contexts, and to lay out the visual concept network by mapping the relevant image concept nodes onto a circular display region. Our visual concept network visualization scheme thus takes the following steps: (a) the image concept nodes of the visual concept network are projected onto a hyperbolic plane according to their inter-concept visual similarity contexts by performing multi-dimensional scaling (MDS) [21]; (b) after such a similarity-preserving projection of the image concept nodes is obtained, the Poincaré disk model [20] is used to map the image concept nodes on the hyperbolic plane onto 2D display coordinates. The Poincaré disk model maps the entire hyperbolic space onto an open unit circle and produces a non-uniform mapping of the image concept nodes to the 2D display coordinates. The visualization results of our visual concept network are shown in Fig. 3, where each image concept is linked with multiple relevant image concepts with larger values of γ(·, ·). By visualizing large numbers of image concepts according to their inter-concept visual similarity contexts, our visual concept network allows users to navigate large numbers of image concepts interactively according to their visual similarity contexts.
For algorithm evaluation, we focus on assessing whether our visual similarity characterization techniques (i.e., mixture-of-kernels and KCCA) have good coherence with human perception. We have conducted both subjective and objective evaluations. For the subjective evaluation, users were invited to explore our visual concept network and assess the visual similarity contexts between the concept pairs. In such an interactive visual concept network exploration procedure, users can score the coherence between the inter-concept visual similarity contexts provided by our visual concept network and their perceptions. For the user study listed in Table 2, 10 sample concept pairs are selected equidistantly from the indexed sequence of concept pairs. One can observe that our visual concept network has good coherence with human perception on the underlying inter-concept visual similarity contexts.
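A loose Python sketch of the layout steps (a)-(b) above; the exact hyperbolic embedding and Poincaré projection of [20] are not reproduced here, so the radial tanh compression below is only a stand-in that mimics the non-uniform mapping onto the open unit circle:

import numpy as np
from sklearn.manifold import MDS

def layout_concept_network(gamma, random_state=0):
    """gamma: (C, C) inter-concept visual similarity matrix in [0, 1] with ones on
    the diagonal. Returns 2D coordinates inside the open unit disk."""
    gamma = np.asarray(gamma, dtype=float)
    dissimilarity = 1.0 - gamma            # similar concepts end up close together
    np.fill_diagonal(dissimilarity, 0.0)
    pos = MDS(n_components=2, dissimilarity="precomputed",
              random_state=random_state).fit_transform(dissimilarity)

    # Radial tanh compression into the open unit circle: distant nodes get
    # squeezed toward the rim, mimicking the non-uniform disk mapping.
    r = np.linalg.norm(pos, axis=1, keepdims=True)
    return pos * np.tanh(r / (r.mean() + 1e-9)) / (r + 1e-9)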
For the objective evaluation, we take center concepts and their first-order neighbors as clusters. By clustering similar image concepts into the same concept cluster, we are able to deal with the issue of synonymous concepts effectively, e.g., multiple image concepts may share the same meaning for object and scene interpretation. Because only the inter-concept visual similarity contexts are used for concept clustering, some of the clusters may not be semantically meaningful to human beings; it is therefore very attractive to integrate both the inter-concept visual similarity contexts and the inter-concept semantic similarity contexts for concept clustering. As shown in Table 2, we have also compared our KCCA-based approach with the Flickr distance approach [19] for inter-concept visual similarity context determination. The normalized distance to human perception is 0.92 and 1.42, respectively, in terms of Euclidean distance, which means the KCCA-based approach performs 54% better than the Flickr distance on the randomly selected sample data.

Table 1. Image concept clustering results
group 1: urban-road, street-view, city-building, fire-engine, moped, brandenberg-gate, buildings
group 2: knife, humming-bird, cruiser, spaghetti, sushi, grapes, escalator, chimpanzee
group 3: electric-guitar, suv-car, fresco, crocodile, horse, billboard, waterfall, golf-cart
group 4: bus, earing, t-shirt, school-bus, screwdriver, hammock, abacus, light-bulb, mosquito
Table 2. Evaluation results of perception coherence for inter-concept visual similarity context determination: KCCA and Flickr distances

concept pair            user score   KCCA (γ)   Flickr Distance
urbanroad-streetview    0.76         0.99       0.0
cat-dog                 0.78         0.81       1.0
frisbee-pizza           0.56         0.80       0.26
moped-bus               0.50         0.75       0.37
dolphin-cruiser         0.34         0.73       0.47
habor-outview           0.42         0.71       0.09
monkey-humanface        0.52         0.71       0.32
guitar-violin           0.72         0.71       0.54
lightbulb-firework      0.48         0.69       0.14
mango-broccoli          0.48         0.69       0.34
For intra-concept exploration, we simply display the top 5 image summarization results for each concept, as shown in Fig. 4. This gives the user a direct impression of the visual properties of the concept. Access to the full concept set is also provided, as shown in Fig. 3. For the image summarization evaluation task, efficiency, effectiveness and satisfaction are three important metrics, and we designed a user study based on them:
1. A group of 10 users was asked to find the top 5 summaries of three concepts: “Building”, “Spaghetti” and “Bus”.
2. Their manual selections were gathered to obtain the ground-truth summarization of the given concept by picking the image views with the highest votes.
Fig. 4. Top 5 summarization results for “Building”, “Spaghetti” and “Bus”
3. Users were guided to evaluate our summarization system and give satisfaction feedback compared with their own understanding of the summarization of the concept.
As a result, a user took around 50 seconds on average to find the top 5 summarizations in a concept data set. In comparison, given the affinity propagation clustering results, the calculation of the top image summaries can be finished almost in real time, which is a big advantage for large-scale image summarization. On average, 2 out of 5 images generated by our algorithm coincided with the ground-truth summarizations we derived from the users. We define “coincide” as strong visual closeness, even if the images are not identical. Considering the size of the concept set and the limited number of summarizations derived, the performance is quite convincing. After exploring our system, the users also provided positive feedback on the user-friendly operation interface, fast response speed and reasonable returned results.
6 Conclusion
To deal with the large-scale image collection exploration problem, we have proposed novel algorithms for inter-concept visual network generation and intra-concept image summarization. The visual network reflects the diverse inter-concept visual similarity relationships more precisely in a high-dimensional multi-modal feature space by incorporating multiple kernels and kernel canonical correlation analysis. The image summarization iteratively generates the most representative images of the set by incorporating the global clustering information while computing the relevancy, orthogonality and uniformity terms of the candidate summaries. We designed an interactive system to run the algorithms on a self-collected image data set; the experiments showed very good results and the user study provided positive feedback.
This research is partly supported by NSFC-61075014 and NSFC-60875016, by the Program for New Century Excellent Talents in University under Grants NCET-07-0693, NCET-08-0458 and NCET-10-0071, and by the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20096102110025).
References
1. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. PAMI 22(12), 1349–1380 (2000)
2. Hauptmann, A., Yan, R., Lin, W.-H., Christel, M., Wactlar, H.: Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans. on Multimedia 9(5), 958–966 (2007)
3. Benitez, A.B., Smith, J.R., Chang, S.-F.: MediaNet: A multimedia information network for knowledge representation. In: Proc. SPIE, vol. 4210 (2000)
4. Naphade, M., Smith, J.R., Tesic, J., Chang, S.-F., Hsu, W., Kennedy, L., Hauptmann, A., Curtis, J.: Large-scale concept ontology for multimedia. IEEE Multimedia (2006)
5. Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Trans. Knowledge and Data Engineering 19 (2007)
6. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Boston (1998)
7. Wu, L., Hua, X.-S., Yu, N., Ma, W.-Y., Li, S.: Flickr distance. In: ACM Multimedia (2008)
8. Jing, Y., Baluja, S., Rowley, H.: Canonical image selection from the web. In: Proceedings of the 6th ACM International CIVR, Amsterdam, The Netherlands, pp. 280–287 (2007)
9. Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: ICCV 2007 (2007)
10. Fan, J., Gao, Y., Luo, H.: Hierarchical classification for automatic image annotation. In: ACM SIGIR, Amsterdam, pp. 11–118 (2007)
11. Gao, Y., Peng, J., Luo, H., Keim, D., Fan, J.: An interactive approach for filtering out junk images from keyword-based Google search results. IEEE Trans. on Circuits and Systems for Video Technology 19(10) (2009)
12. Lowe, D.: Distinctive image features from scale invariant keypoints. Intl. Journal of Computer Vision 60, 91–110 (2004)
13. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)
14. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
15. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. Journal of Machine Learning Research 7, 1531–1565 (2006)
16. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Intl. Journal of Computer Vision 73(2), 213–238 (2007)
17. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting procedures for multiclass object detection. In: IEEE CVPR (2004)
18. Huang, J., Kumar, S.R., Zabih, R.: An automatic hierarchical image classification scheme. In: ACM Multimedia, Bristol, UK (1998)
19. Vasconcelos, N.: Image indexing with mixture hierarchies. In: IEEE CVPR (2001)
20. Barnard, K., Forsyth, D.: Learning the semantics of words and pictures. In: IEEE ICCV, pp. 408–415 (2001)
21. Naphade, M., Huang, T.S.: A probabilistic framework for semantic video indexing, filtering and retrieval. IEEE Trans. on Multimedia 3(1), 141–151 (2001)
22. Yang, C., Luo, H., Fan, J.: Generating visual concept network from large-scale weakly-tagged images. In: Advances in Multimedia Modeling (2010)
A Study in User-Centered Design and Evaluation of Mental Tasks for BCI
Danny Plass-Oude Bos, Mannes Poel, and Anton Nijholt
University of Twente, Faculty of EEMCS, PO Box 217, 7500 AE Enschede, The Netherlands
{d.plass,m.poel,a.nijholt}@ewi.utwente.nl
Abstract. Current brain-computer interfacing (BCI) research focuses on detection performance, speed, and bit rates. However, this is only a small part of what is important to the user. From human-computer interaction (HCI) research, we can apply the paradigms of user-centered design and evaluation, to improve the usability and user experience. Involving the users in the design process may also help in moving beyond the limited mental tasks that are currently common in BCI systems. To illustrate the usefulness of these methods to BCI, we involved potential users in the design process of a BCI system, resulting in three new mental tasks. The experience of using these mental tasks was then evaluated within a prototype BCI system using a commercial online role-playing game. Results indicate that user preference for certain mental tasks is primarily based on the recognition of brain activity by the system, and secondly on the ease of executing the task. Keywords: user-centered design, evaluation, brain-computer interfacing, multimodal interaction, games.
1 Introduction
The research field of brain-computer interfaces (BCI) originates from the wish to provide fully paralyzed people with a new output channel that enables them to interact with the outside world despite their handicap. As the technology improves, the question arises whether BCI could also be beneficial for healthy users in some way, for example by improving quality of life or by providing private, hands-free interaction [11,18]. There are still a lot of issues to solve, such as delays, poor mental task recognition rates, long training times, and cumbersome hardware [8]. Current BCI research concentrates on improving recognition accuracy and speed, which are two important factors in how BCI systems are experienced. On the other hand, there is a lot of interest in making BCI a more usable technology for healthy users [1,13]. But in order for this technology to be accepted by the general public, other factors of usability and user experience have to be taken into account as well [14,19]. There is some tentative research in this direction, such as Ko et al., who evaluated the convenience, fun and intuitiveness of a BCI game they developed [6], and Friedman
et al., who looked into the level of presence experienced in two different navigation experiments [3]. But a lot of research still needs to be conducted.
In this paper we focus on applying HCI principles to BCI design. In Section 2 we apply a user-centered design method to the selection of mental tasks for shapeshifting in the very popular massively multiplayer online role-playing game World of Warcraft®, developed by Blizzard Entertainment, Inc. Within the large group of healthy users, gamers are an interesting target group. Fed by a hunger for novelty and challenges, gamers are often early adopters of new paradigms [12]. Besides, it has been suggested that users are able to stay motivated and focused for longer periods if a BCI experiment is presented in a game format [4]. Afterwards, in Sections 3 and 4, the selected mental tasks are evaluated in a user study; in this evaluation the focus is on the user preferences for the designed mental tasks, which among others involve recognition, immersion, effort, and ease of use. The main research questions we try to answer are: Which mental tasks do the users prefer, and why? How may this preference be influenced by the detection performance of the system? The results are discussed in Section 5 and the conclusions of this user-centered design and evaluation can be found in Section 6.
2 User-Centered Design: What Users Want
One of the problems facing BCI research is the discovery of usable mental tasks that trigger detectable brain activity. The tasks (by convention indicated by the name of the corresponding brain activity) that are currently most popular are: slow cortical potentials, imaginary movement, P300, and steady-state visually evoked potentials [15,17]. Users regularly indicate that these tasks are either too slow, nonintuitive, cumbersome, or just annoying to use for control BCIs [10,12]. Current commercial applications are a lot more complex and offer many more interaction possibilities than the applications used in BCI research. Whereas current game controllers have over twelve dimensions of input, BCI games are generally limited to one- or two-dimensional controls. Also, the mental tasks that are available are limited in their applicability for intuitive interaction. New mental tasks are needed that can be mapped in an intuitive manner onto system actions. One way to discover mental tasks that are suitable from a user perspective is to simply ask the users what they would like to do to trigger certain actions.
In World of Warcraft®, the user can play an elf druid who can shape-shift into animal forms. As an elf, the player can cast spells to attack or to heal. In bear form, the player can no longer use most spells, but is stronger and better protected against direct attacks, which is good for close combat.
In an open interview, we asked four World of Warcraft® players of varying expertise and ages what mental tasks they would prefer to use to shape-shift from the initial elf form to bear, and back again. The participants were not informed about the limits of current BCI systems, but most people did need an introduction to start thinking about tasks that would have a mental component.
They were asked to think of using the action in the game, and what it means to them, what it means to their character in the game, what they think about when doing it, what they think when they want to use the action, and then to come up with mental tasks that fit naturally with their gameplay. The ideas that the players came up with can be grouped into three categories. For the user evaluation of these mental tasks in Section 3, these three categories were translated into concrete mental tasks, mapped to the in-game action of shapeshifting. Each category consists of a task and its reverse, to accommodate the shapeshifting action in the directions of both bear and elf form.
1. Inner speech: recite a mental spell to change into one form or the other. The texts of the spells subsequently used were derived from expressions already used in the game world. The user had to mentally recite “I call upon the great bear spirit” to change to bear; “Let the balance be restored” was the expression used to change back to elf form.
2. Association: think about or feel like the form you want to become. Concretely, this means the user had to feel like a bear to change into a bear, and to feel like an elf to change into an elf.
3. Mental state: automatically change into a bear when the situation demands it. When you are attacked, the resulting stress could function as a trigger. For the next step of this research, this had to be translated into a task that the users could also perform consciously. To change to bear form the users had to make themselves feel stressed; to shift into elf form, relaxed.
3 User Evaluation Methodology
The goal of the user evaluation was to answer the following questions in this game context: Which mental tasks do the users prefer, and why? How may this preference be influenced by the detection performance of the system? Fourteen healthy participants (average age 27, ranging from 15 to 56; 4 female) took part in the experiment voluntarily. All but one of the participants were right-handed. Highest finished education ranged from elementary school to a master's degree. Experience with the application World of Warcraft® ranged from “I never play any games” to “I raid daily with my level 80 druid”. Three participants were actively playing on a weekly basis. Written informed consent was obtained from all participants.
The general methodology to answer these questions was as follows. In order to measure the influence of the detection performance of the system, the participants were divided into two groups, a so-called “real-BCI” and a “utopia-BCI” group. The group that played World of Warcraft® with the “utopia-BCI” decided for themselves whether they had performed the mental task correctly, and pressed the button to shapeshift when they had. In this way a BCI system with 100% detection performance (a utopia) was simulated. The group that played World of Warcraft® with the “real-BCI” actually controlled their shapeshifting action with their mental tasks, at least insofar as the system could detect them.
The participants came in for experiments once a week for five weeks, in order to track potential changes over time. During an experiment, for each pair (change to bear, to elf) of mental tasks (inner speech, association, and mental state), the participant underwent a training and game session and filled in questionnaires to evaluate the user experience. The following sections explain each part of the methods in more detail.
3.1 Weekly Sessions and Measurements
The participants took part in five experiments, lasting about two hours each, over five weeks. The mental tasks mentioned above were evaluated in random order to eliminate any potential order effects, for example due to fatigue or user learning. For each task pair, a training session was done. The purpose of the training session was manifold: it gathered clean data to evaluate the recognizability of the brain activity related to the mental tasks, the user was trained in performing the mental tasks, the system was trained for those participants who played the game with the real BCI system, and the user experience could be evaluated outside the game context. A training session consisted of two sets of trials with a break in between, during which the participant could relax. Each set started with four watch-only trials (two per mental task), followed by 24 do-task trials (twelve per mental task). The trial sequence consisted of five seconds watching the character in their start form, followed by two seconds during which the shape-shifting task was presented. After this the participant had ten seconds to perform the mental task repeatedly until the time was up, or just watch if it was a watch-only trial. At the end of these ten seconds, the participant saw the character transform. See Figure 1 for a visualization of the trial sequence. During the watch-only trials, the participant saw exactly what they would see during the do-task trials, but they were asked only to watch the sequence. This watch-only data was recorded so it could function as a baseline: it is possible that simply watching the sequence already induces certain discriminable brain activity, and this data provides the possibility to test this. The character in the videos was viewed from the back, similar to the way the participant would see the avatar in the game.
At the end of the training session, the participant was asked to fill in forms to evaluate the user experience. The user experience questionnaire was loosely based on the Game Experience Questionnaire [5]. It contained statements for which the participants had to indicate their degree of agreement on a five-point Likert scale, for example: “I could perform these mental tasks easily”, “It was tiring to use these mental tasks”, and “It was fun to use these mental tasks”. The statements can be categorized into the following groups: whether the task was easy, doable, fun, intuitive, or tiring to execute, whether they felt positive or negative about the task, and whether the mapping to the in-game action made sense or not.
After the training session, the participant had roughly eight minutes to try the set of mental tasks in the game environment. The first experiment consisted only
Fig. 1. Training session trial sequence: first the character is shown in their start form, then the task is presented, after which there is a period during which this task can be performed. At the end the animation for the shape-shift is shown.
Fig. 2. Orange feedback bar with thresholds. The user has to go below 0.3 to change to elf, and above 0.7 to change to bear. In between no action is performed.
of a training session. For weeks two up to and including four, the participants were split into the “real-BCI” and “utopia-BCI” groups. The groups were fixed for the whole experiment. During the last week, all participants followed a training session and played the game with the real BCI system.
The “real-BCI” group received feedback on the recognition of their mental tasks in the form of an orange bar in the game (see Figure 2). The smaller the bar, the more the system had detected the mental task related to elf form; the larger, the more the system had interpreted the brain activity as related to bear form. When the thresholds were crossed, the shape-shift action was executed automatically. The “utopia-BCI” group had to interact with a BCI system with (a near) 100% performance. Since this is not technically feasible yet, one could rely on a Wizard of Oz technique [16]: the users wear the EEG cap and perform the mental tasks when they want to shapeshift, and the Wizard decides when a task has been performed correctly. In this case, however, the Wizard would have no way of knowing what the user is doing, as there is no external expression of the task. The only option left to simulate a perfect system is to let the participants evaluate for themselves whether or not they had performed the task correctly, after which they pressed the shape-shift button in the game manually.
At the end of the session, the user experience questionnaire was repeated to determine potential differences between the training and game sessions. The game session questionnaire contained an extra question to determine the perceived detection performance of the mental tasks. A final form concerning the experiment session as a whole was filled in at the end of the session. The participants were asked to put the mental tasks in order of preference, and to indicate why they chose this particular ordering.
3.2 EEG Analysis and Mapping
The EEG analysis pipeline, programmed in Python, was kept very general, as there was no certainty about how to detect these selected mental tasks. Common Average Reference was used as a spatial filter, in order to improve the signal-to-noise ratio [9]. The bandpass filter was set to 1–80 Hz. The data gathered during the training session was sliced into 10-second windows. These samples were whitened [2], and the variance was computed as an indication of the power in the window. A support vector machine (SVM) classifier trained on the training session data provided different weights for each of the EEG channels. To make the BCI control more robust to artifacts, two methods were applied to the mapping of classification results to in-game actions. A short dwelling was required to trigger the shape-shift, so it would not be activated by quick peaks in power. Secondly, hysteresis was applied: the threshold that needed to be crossed to change into a bear was higher than the threshold required to revert back to elf form. In between these two thresholds was a neutral zone in which no action was performed, see also Figure 2.
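A minimal Python sketch of this pipeline and mapping; the sampling rate, filter order, linear SVM kernel and dwell length are assumptions of ours, and the whitening step is omitted for brevity:

import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.svm import SVC

FS = 256.0  # sampling rate in Hz (an assumption; not stated in the paper)

def preprocess(eeg):
    """Common Average Reference followed by a 1-80 Hz band-pass filter.
    eeg: (channels, samples) raw EEG."""
    car = eeg - eeg.mean(axis=0, keepdims=True)              # spatial filter
    b, a = butter(4, [1.0 / (FS / 2), 80.0 / (FS / 2)], btype="band")
    return filtfilt(b, a, car, axis=1)

def window_features(eeg, win_sec=10.0):
    """Slice into 10-second windows and use per-channel log-variance as a power
    estimate (the paper additionally whitens the windows; omitted here)."""
    step = int(win_sec * FS)
    feats = []
    for start in range(0, eeg.shape[1] - step + 1, step):
        w = eeg[:, start:start + step]
        feats.append(np.log(w.var(axis=1) + 1e-12))
    return np.array(feats)

# An SVM trained on the training-session windows provides the channel weighting:
# clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)
# output = clf.predict_proba(window)[0, 1]   # value in [0, 1], shown as the orange bar

class ShapeShiftMapper:
    """Hysteresis plus dwell: the output must stay above 0.7 for `dwell` consecutive
    windows to trigger bear form, and below 0.3 to revert to elf; the zone in
    between triggers nothing (thresholds from Fig. 2)."""
    def __init__(self, low=0.3, high=0.7, dwell=3):
        self.low, self.high, self.dwell = low, high, dwell
        self.state, self.count = "elf", 0

    def update(self, output):
        target = "bear" if output > self.high else "elf" if output < self.low else None
        if target is None or target == self.state:
            self.count = 0
        else:
            self.count += 1
            if self.count >= self.dwell:
                self.state, self.count = target, 0
        return self.state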
4 Results
4.1 Which Mental Tasks Do Users Prefer and Why?
In the post-experiment questionnaire, the participants were asked to list the mental tasks in order of preference. The place in this list was used as a preference score, where a value of 1 indicated the first choice and 6 the least preferable. These values were rescaled to match the user experience questionnaire values, ranging from 1 to 5 with 5 most preferable, so that 3.0 indicates a neutral disposition in preference order. Sixty-nine measurements were obtained from 14 participants over five weeks; in one week, one participant had to leave early and could not fill in his preference questionnaire. The average preference scores show a general preference for the association tasks, and the mental state tasks seem to be disliked the most. But this paints a very simplistic picture, as there are large differences between the “real-BCI” and “utopia-BCI” groups. To better understand the effects of the different aspects, Figure 3 shows the preference and user experience scores for each of the three mental task pairs, separately for the two participant groups. Whereas for the “real-BCI” group the mental state tasks are the most liked, for the “utopia-BCI” group they are the most disliked. Similarly, the “utopia-BCI” group most preferred inner speech, which was least preferred by the “real-BCI” group. Because of these large differences, these two groups need to be investigated as two separate conditions.
4.2 What Is the Influence of Recognition Performance on Task Preference?
Although it is not possible to completely separate the influence of the recognition performance from the other aspects that differ between the participant groups, based
Fig. 3. User experience, preference, and perceived performance scores for the “utopia-BCI” and “real-BCI” groups, separately for the three mental task pairs, averaged over weeks 2 to 4. The bar plot is annotated with significant differences between task pairs (association, inner speech, mental state; with a line with a star above the two pairs) and game conditions (“utopia-BCI”, “real-BCI”; indicated by a star above the bar).
on the user experience scores, recognition perception scores, and the words the participants used to describe the reasoning for their preference, it is possible to explain the discrepancy between the two conditions and get an idea of the influence of recognition performance.
Inner speech is preferred by the “utopia-BCI” group, mainly because it is considered easy and doable. Although the inner speech tasks were rated highly by both groups, the system recognition had a heavy impact: it is the least preferred task pair for the “real-BCI” participants. The association tasks are valued mostly for their intuitiveness and the mapping to the in-game task by the “utopia-BCI” group. Whereas the poor detection of inner speech mainly affected the preference scores, for association the user experience is significantly different on multiple aspects: easy, intuitive, and positive. The opposite happens for mental state. This task pair is scored low across the board by both groups, yet it was preferred by the “real-BCI” group. It was also the task that was best recognized by the system, which is reflected in the perceived recognition scores.
Based on these results, it seems that the recognition performance has a strong influence on the user preference, and that it is the most important consideration for the “real-BCI” group. For the “utopia-BCI” group different considerations emerge, where the ease of the task pair seems to play a dominant role, followed by intuitiveness.
Fig. 4. Counts for the categories of words used to describe the reasoning behind the participants' preference rankings, totals for weeks 2 to 4
This view is confirmed by looking at the reasons the participants described for their preference ranking; see Figure 4. The words they used were categorized, and the number of occurrences within each category was used as an indication of how important that category was to the reasoning. To reduce the number of categories, words indicating a direct opposite and words indicating a similar reason were merged into one category. For example, difficult was recoded to easy, and tiring was recoded to effort. In the words used by the “real-BCI”
group, recognition performance is mentioned most often (n = 15), more than twice as often as any other word category (n ≤ 7). The “utopia-BCI” group mostly referred to the ease of executing the task (n = 12, where n ≤ 5 for the other word categories). Other issues that were often mentioned were effort, feels good, and speed.
4.3 Correlations between Preference and User Experience
Given that participants indicated task recognition and ease to be the most important considerations for their preference, do these aspects from the user experience questionnaire also show a correlation with the preference scores? Weeks 1 and 5 were excluded from this analysis as both groups performed the tasks under the same conditions in these weeks (“utopia” in week 1 and “real” in week 5). Therefore the number of samples for the correlation tests is 63 (3 weeks, 3 task pairs, 7 participants), except for one case with some missing samples due to a participant having to leave early. For the correlation tests with the two conditions (“real” and “utopia”) combined, there are twice as many samples. There are no perceived recognition scores for the “utopia-BCI” group.
Table 1. Pearson correlation coefficients and p-values for the correlation of user experience components and perceived recognition rate with the preference scores for the mental task pairs. Significance annotation: p ≤ 0.005 in bold.

           doable  easy   fun    intuitive  mapping  posit.  negat.  concentr.  tiring  recogn.
Utopia  r  -.482   -.542  -.478  -.384      -.395    -.579    .297    .273       .254
        p   .000    .000   .000   .002       .002     .000    .019    .032       .047
Real    r  -.218   -.293  -.294   .057       .079    -.189    .201    .246       .137    -.316
        p   .095    .023   .023   .665       .550     .147    .124    .058       .295     .014
All     r  -.360   -.415  -.391  -.203      -.190    -.410    .252    .259       .199    -.316
        p   .000    .000   .000   .025       .036     .000    .005    .004       .028     .014
For the “utopia-BCI” group, the expected correlations with the ease-related aspects doable and easy were found, as well as correlations with most of the other aspects. After correction for multiple tests, the correlations with concentration and tiring are not significant. There were no significant correlations of preference with any of the user experience aspects for the “real-BCI” group, but the most relevant correlation was with the perceived recognition performance by the system, which is as expected. Looking at the conditions combined, the most significant correlations are for doable, easy, fun, and positive. The relation between preference and recognition performance is not that apparent when investigating this correlation. Yet, the correlations do show the importance of easy, expressed in how easy and how doable it is to perform the task. They also show that other aspects can be important as well, such as fun, how intuitive it is, and the mapping to the in-game action.
A Study in User-Centered Design and Evaluation of Mental Tasks for BCI
5
131
Discussion
The participants were varied in both age and experience with the game. It would be interesting to investigate the influence of these aspects on the preference and user experience. The participants in the “utopia-BCI” group had to evaluate their execution of the mental tasks themselves. This means we had no control over whether they were really performing the tasks seriously. It is unsure how well people can evaluate their own task execution and how this may affect their experience. This is a problem in any methodology that relies on self-evaluation in any part of the experiment. Nonetheless, we assume that participants really did perform the mental tasks: during the game sessions short delays were perceived where the participants seemed to pause, after which the button was pressed. For the “utopia-BCI” participants the brain activity has been recorded as well, which would have been another motivator. The recordings cannot be used to be certain about the task execution as the execution may be different for different participants, and it is unsure whether there are clear differences between the brain signals for the task pairs to begin with. Aspects that were indicated by the participants to influence preference ratings overall, in order of number of mentions from high to low, are: ease, recognition, feels good. Certain terms could be grouped because of similarity, like ease and effort, which would make the results more clear. As expected, the perceived recognition performance of the system was low. This fact could have a big impact on the interpretation of the scores of the “real-BCI” group. It is however interesting to note that for the mental state task, the participants did feel the system responded to their mental actions. Over time, the perceived recognition increased, and the recognition scores for the assocation tasks also rise above the neutral response. Secondly, the participants have a tendency to be optimistic about the system recognition. A score of 1 would indicate really no recognition at all, and even though the recognition for inner speech for example would have been minimal, the average rating is still above 2. This optimism about self-assessment of the level of control you have is know from other studies [7]. This analysis focused on the features derived from explicit subject reporting, to show how valuable this information can be, or at least to indicate the importance of the user in it all. This information should preferably be combined with more objective measures, such as the speed and accuracy of the BCI system mentioned earlier. The goal when evaluating BCI systems should be to obtain a complete image in order to discover where the main problems lie. If the mental task is disliked, then perhaps it might be better to look into other tasks that are suitable for the application instead of trying to gloss it over with recognition performance improvements. It is also important to keep in mind that these findings come from a limited R context: the shapeshifting task in World of Warcraft . The task itself could have had impact on what the participants value most in a mental task. It is not possible to see task preference independent of the in-game action as they are
132
D. Plass-Oude Bos, M. Poel, and A. Nijholt
inherently linked. If the game is fast-paced, tasks that can be executed more quickly will be more suitable, both on a functional and experiential level. If a task matches the in-game situation it has an advantage compared to a task that is similar in all other aspects. If a task will be executed often the effort and time of execution become more important than when it is only to be done every now and then. Finally, the point was made that healthy users will be less accepting than patients of the problems that current BCIs still struggle with. But it is important to consider that patients will also benefit from usability improvements.
6
Conclusions
When evaluating BCIs, current research only pays attention to task recognition performance, speed, and the derivative: bit rates. Human-computer interaction research shows that for a user to accept and value this new means of interaction other aspects may be important as well, generally summarized as usability and user experience. For this research, we involved potential users in the design process of determining which mental tasks to use for certain actions within the application. This resulted in three categories of mental tasks that are not listed among the most frequently used tasks in current BCI applications, and which are appreciated by the users (mental state, inner speech, and association). The three pairs of mental tasks uncovered in the design process were evaluated based on user experience with the help of our prototype system. By asking the participants about their experience, we gained new insights into what potential users like or dislike about these particular mental tasks for this BCI system and why. In the context of this experiment, user preference for mental tasks seems to be based on (in order of preference) accuracy of task recognition by the system, ease of performing the mental task, and lastly by factors such as fun, intuitiveness, and suitability for the task. The fact that the recognition by the system was indicated to be so important to the participants in this experiment could validate the current focus of BCI research on speed and accuracy. To prove such a generalization, further research is required. However, though the speed and accuracy of detection for a given mental task can be improved with better hardware and analysis methods, the mental task itself remains the same. When the performance of BCI systems gets better, other aspects of user experience , such as ease, fun and intuitiveness, will play a more prominent role. When designing and evaluating systems it would, both for patients and healthy users, be beneficial to take the usability and user experience into account.
Acknowledgements We would like to thank Betsy van Dijk for her thorough feedback, and we also gratefully acknowledge the support of the BrainGain Smart Mix Programme of
A Study in User-Centered Design and Evaluation of Mental Tasks for BCI
133
the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture, and Science. This work was supported by the Information and Communication Technologies Coordination and Support Action “Future BNCI” within the FP7 framework, Project Number 248320.
References 1. Blankertz, B., Popescu, F., Krauledat, M., Fazli, S., Tangermann, M., M¨ uller, K.: Challenges for Brain-Computer Interface Research for Human-Computer Interaction Applications. In: ACM CHI Workshop on Brain-Computer Interfaces for HCI and Games (2008) 2. Dornhege, G., Blankertz, B., Krauledat, M., Losch, F., Curio, G., Muller, K.: Combined optimization of spatial and temporal filters for improving brain-computer interfacing. IEEE Transactions on Biomedical Engineering 53(11), 2274–2281 (2006) 3. Friedman, D., Leeb, R., Guger, C., Steed, A., Pfurtscheller, G., Slater, M.: Navigating virtual reality by thought: what is it like? Presence: Teleoperators and Virtual Environments 16(1), 100–110 (2007) 4. Graimann, B., Allison, B., Gr¨ aser, A.: New Applications for Non-invasive BrainComputer Interfaces and the Need for Engaging Training Environments. In: BRAINPLAY 2007 Brain-Computer Interfaces and Games Workshop at ACE (Advances in Computer Entertainment), pp. 25–28 (2007) 5. IJsselsteijn, W., de Kort, Y., Poels, K., Jurgelionis, A., Bellotti, F.: Characterising and measuring user experiences in digital games. In: International Conference on Advances in Computer Entertainment Technology (2007) 6. Ko, M., Bae, K., Oh, G., Ryu, T.: A study on new gameplay based on braincomputer interface. In: Barry, A., Helen, K., Tanya, K. (eds.) Breaking New Ground: Innovation in Games, Play, Practice and Theory: Proceedings of the 2009 Digital Games Research Association Conference, Brunel University (2009) 7. Langer, E.: The illusion of control. Journal of personality and social psychology 32(2), 311–328 (1975) 8. L´ecuyer, A., Lotte, F., Reilly, R.B., Leeb, R., Hirose, M., Slater, M.: BrainComputer Interfaces, Virtual Reality, and Videogames. Computer 41(10), 66–72 (2008) 9. McFarland, D., McCane, L., David, S., Wolpaw, J.: Spatial filter selection for EEG-based communication. Electroencephalography and clinical Neurophysiology 103(3), 386–394 (1997) 10. Molina, G.: Detection of High-Frequency Steady State Visual Evoked Potentials Using Phase Rectified Reconstruction. In: 16th European Signal Processing Conference, EUSIPCO 2008 (2008) 11. Nijholt, A., van Erp, J., Heylen, D.K.J.: BrainGain: BCI for HCI and Games. In: Proceedings AISB Symposium Brain Computer Interfaces and Human Computer Interaction: A Convergence of Ideas, pp. 32–35 (2008) 12. Nijholt, A., Tan, D., Pfurtscheller, G., Brunner, C., Mill´ an, J.d.R., Allison, B., Graimann, B., Popescu, F., Blankertz, B., M¨ uller, K.R.: Brain-computer interfacing for intelligent systems. IEEE Intelligent Systems pp. 76–83 (2008) 13. Nijholt, A., Tan, D., Mill´ an, J.d.R., Graimann, B., Jackson, M.: Brain-computer interfaces for HCI and games. In: Proceedings ACM CHI 2008: Art. Science. Balance, pp. 3925–3928 (2008)
134
D. Plass-Oude Bos, M. Poel, and A. Nijholt
14. Plass-Oude Bos, D., Reuderink, B., van de Laar, B., G¨ urk¨ ok, H., M¨ uhl, C., Poel, M., Nijholt, A., Heylen, D.: Brain-computer interfacing and games. In: Tan, D., Nijholt, A. (eds.) Brain-Computer Interfaces: Applying our Minds to Human-Computer Interaction, ch. 10. Springer, Heidelberg (2010) 15. Reuderink, B.: Games and Brain-Computer Interfaces: The State of the Art. Tech. Rep. TR-CTIT-08-81, Human Media Interaction, Faculty of EEMCS, University of Twente (2008) 16. Salber, D., Coutaz, J.: Applying the wizard of oz technique to the study of multimodal systems. Human-Computer Interaction 753, 219–230 (1993) 17. Tonet, O., Marinelli, M., Citi, L., Rossini, P., Rossini, L., Megali, G., Dario, P.: Defining brain–machine interface applications by matching interface performance with device requirements. Journal of Neuroscience Methods 167(1), 91–104 (2008) 18. Wolpaw, J.R., Loeb, G.E., Allison, B.Z., Donchin, E., Nascimento, O.F., Heetderks, W.J., Nijboer, F., Shain, W.G., Turner, J.N.: BCI meeting 2005 - workshop on signals and recording methods. IEEE Transactions on Neural System and Rehabilitation Engineering 14(2), 138–141 (2006) 19. Wolpaw, J., Birbaumer, N., McFarland, D., Pfurtscheller, G., Vaughan, T.: Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6), 767–791 (2002)
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes Keisuke Doman1 , Cheng Ying Kuai1, , Tomokazu Takahashi2 , Ichiro Ide1,3 , and Hiroshi Murase1 1
2
Graduate School of Information Science, Nagoya University Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan {kdoman,kuai,ide,murase}@murase.m.is.nagoya-u.ac.jp Faculty of Economics and Information, Gifu Shotoku Gakuen University 1-38 Naka-Uzura, Gifu 500-8288, Japan
[email protected] 3 Institute for Informatics, University of Amsterdam Science Park 904, 1098 XH Amsterdam, The Netherlands
[email protected]
Abstract. In this paper, we propose the concept of synthesizing a multimedia cooking recipe from a text recipe and a database composed of video clips depicting cooking operations. A multimedia cooking recipe is a cooking recipe where each cooking operation is associated with a corresponding video clip depicting it, aimed to facilitate the understanding of cooking operations. In order to synthesize such a multimedia cooking recipe from an arbitrary text-based cooking recipe, a large number of video clips describing various cooking operations should be prepared in the database. Thus, we propose a method to build a database composed of video clips depicting cooking operations, which detects and classifies cooking operations in the cook shows, and tags them with cooking operations in a corresponding cooking recipe text. We also introduce a prototype multimedia cooking recipe interface named “Video CooKing” to demonstrate our concept. Keywords: cooking recipe, visualization, cooking video, cooking operation, motion analysis, automatic tagging.
1
Introduction
Recently, the number of cooking recipe texts posted on the Web is increasing. For example, “Cookpad”1 is a recipe-based social networking service where users can post original recipes and also report results and comments. It is so popular that it is said that one fourth of Japanese women in their thirties accesses this service. However, most of the cooking recipes on the Web are text-based and do not have enough explanations about cooking terms, especially cooking operations. 1
Currently at Brother Industries, Ltd., Japan. COOKPAD Inc., “COOKPAD,” http://cookpad.com/
K.-T. Lee et al. (Eds.): MMM 2011, Part II, LNCS 6524, pp. 135–145, 2011. c Springer-Verlag Berlin Heidelberg 2011
136
K. Doman et al.
Although some cooking recipes may have image-based explanations, they are not always sufficient for the understanding of some cooking operations. Therefore, in this paper, we propose the concept of synthesizing a multimedia cooking recipe from a text-based recipe, and also introduce a prototype interface, named “Video CooKing”. As shown in Fig. 1, a multimedia cooking recipe is a cooking recipe where each cooking operation is associated with a corresponding video clip depicting it. Compared with an existing text-based cooking recipe, the multimedia cooking recipe makes each cooking operation more understandable with a visual explanation. In fact, a cooking assistance software for a portable game machine “Nintendo DS”, that provides users with multimedia explanations of cooking recipes, is already commercially available2. However, to create such a software, visual explanations should be prepared manually in advance. We consider that it is unrealistic to manually prepare numerous video clips depicting cooking operations on various ingredients. To solve this problem, we are aiming at obtaining such video clips from cook shows automatically. A cook show contains video clips depicting each cooking operation in its corresponding cooking recipe, and is broadcast with closedcaptions. We obtain a set of tagged video clips depicting cooking operations from many cook shows, and consequently build a database with them. Once such a database is built, we can apply the method proposed in this paper to cooking recipes without corresponding cook shows, for example cooking recipes on the Web. Meanwhile, Hamada et al. proposed the “Cooking Navi” system that analyzes the dependency structure in a cooking recipe text [2], aligns each step to a video segment obtained from a corresponding cook show [3], and also presents the steps along the dependency structure with the aligned video while a user cooks [4]. However, this system can generate a multimedia cooking recipe only when the cooking recipe has a corresponding cook show. This paper is organized as follows. The next section illustrates the method for associating a cooking recipe text with its corresponding cook show to build a database composed of video clips depicting cooking operations. Section 3 introduces a prototype multimedia cooking recipe interface named “Video CooKing”. The paper ends with a summary and discussion of future works in Section 4.
2
Composing a Database of Video Clips Depicting Cooking Operations
As shown in Fig. 2, the flow of associating a cooking recipe text with a cook show is composed of mainly three processes: 1) text processing, 2) image processing and 3) integration. The text processing part analyzes the cooking recipe text and the closed-captions (CC) to extract tags (a pair of an “ingredient” and a 2
Nintendo Co., Ltd., “It talks! DS Cooking Navi (in Japanese),” http://www.nintendo.co.jp/ds/a4vj/
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes
(Title)
137
(Title)
(Ingredients)
(Ingredients)
(Preparation Steps)
(Preparation Steps)
(a) Text-based cooking (b) Video clips of cooking recipe operations
(c) Multimedia cooking recipe
Fig. 1. Synthesis of a multimedia cooking recipe ((a) + (b) → (c))
Cooking recipe text
Text processing (“onion”, “slice”)
Closed-captions
Text processing
Cooking video
Image processing
Integration
Cooking show
Fig. 2. Process flow for tagging a video clip depicting cooking operations from a cook show
“cooking operation”). The image processing part extracts video clips depicting cooking operations from the cook show, and then classifies them. The integration part associates the tags with the video clips depicting cooking operations. Although the following explanation is based on the processing of Japanese cooking recipe text, it should be possible to be applied to other languages by a similar approach. Accordingly, some language-specific details are omitted in the following explanations for simplicity. 2.1
Text Processing: Extracting Cooking Operations and Corresponding Ingredients
In the text processing part, pairs of an “ingredient” and a “cooking operation” are extracted as tags for video clips depicting cooking operations from a cooking recipe text and CCs. The structure of a cooking recipe text and CCs is shown in Fig. 3. Generally, a cooking recipe is composed of “Ingredients” where all ingredients are listed, and “Preparation Steps” where cooking procedures are described as a numbered list. Meanwhile, CCs contain the transcript of the speech and the timing of its utterance in the audio track. The details of the process are described below.
138
K. Doman et al.
Analyzing the cooking recipe text and the CC. First, morphological analysis is applied to each sentence in “Ingredients” and “Preparation Steps” of the input cooking recipe text. Next, nouns (ingredients) that commonly appear in both “Ingredients” and “Preparation Steps”, together with verbs (cooking operations) are extracted from “Preparation Steps”, respectively. Then, considering the grammar, each cooking operation is associated with its target ingredients in “Preparation Steps”. As a result, pairs of an “ingredient” and a “cooking operation” are obtained from the cooking recipe text. Similarly, pairs of an “ingredient” and a “cooking operation” are obtained from the CC, too.
Time
(Title)
Speech transcript
(Ingredients)
(Preparation Steps)
(a) Cooking recipe text
(b) Closed-captions
Fig. 3. Structure of a cooking recipe text and closed-captions
2.2
Image Processing: Classifying the Video Clips Depicting Cooking Operations from the Cook Show
In general, as shown in Fig. 4, a “face shot” and a “hand shot” appear alternately in a cook show. The upper body of a person is captured in the face shot, whereas the hand of a person is zoomed-up in the hand shot. It is generally considered that the hand shot is more important for cooking assistance, since the closeup of the current state or a cooking operation is captured in the shot [6]. For this reason, we focus on the hand shot and classify scenes in them into the following three categories considering motion. – “Repetitious motions”: A scene containing motions that repeat several times. It is further classified into the following two categories: • “Converged”: A scene where the periodic changes of pixel values are observed in a specific area of a frame. (ex. cut) • “Distributed”: A scene where periodic changes of pixel values are observed in a wide area of a frame. (ex. fry, mix)
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes Cut
Cut Face shot
Cut Hand shot
Cut Face shot
139 Cut
Hand shot
t
Cooking operation scene
Face shot
Current state scene
Fig. 4. General structure of a cook show
– “Current state”: A scene containing no dynamic motion. (ex. stew, boil, “ingredient (noun)”) – “Other motions”: Other than the above. (ex. serve, season) The details of the classification of the scenes are described next. Classifying the hand shots. First, the input cook show is segmented into shots, and then only hand shots are extracted from them by the method proposed by Miura et al. [6]. Next, for each segment composed of several continuous frames in a hand shot, an eigenspace is constructed. Then each frame in the segment is projected onto the eigenspace. As the motion feature for the classification, we focus on the trajectory drawn in each eigenspace (Fig. 5 (center column)), especially that drawn on the first eigenaxis (Fig. 5 (right column)). Then, the trajectory of each segment is classified with the following conditions: ⎧ ⎪ ⎨ “Repetitious motions” “Current state” ⎪ ⎩ “Other motions”
if m ≥ θm if m < θm and Δr ≤ θΔr otherwise
(1)
where m is the number of peaks, Δr is the difference of the minimum and the maximum values of the trajectory, and θm and θΔr are the thresholds of m and Δr, respectively. Here, the peak is defined as a point that meets the following conditions: ⎧ g(t) − g(t + 1) ≥ θ1 , ⎪ ⎪ ⎪ ⎨ g(t) − g(t − 1) ≥ θ , 1 (2) ⎪ g(t + 1) − g(t + 2) ≥ θ2 , ⎪ ⎪ ⎩ g(t − 1) − g(t − 2) ≥ θ2
140
K. Doman et al.
0.1 0
-0.1 0.2
-0.2 -0.2
-0.1
1st eige0 n
0.1 0 xis na ige e 2nd -0.1
0.1
axis
0.2
1st eigen axis
3rd eigen axis
0.2
0.2
0.1
0
-0.1
-0.2
-0.2
0
64
128
191
255
191
255
191
255
Time [frame]
(a) Repetitious motions
0.1 0
-0.1 0.2
-0.2 -0.2
0.1 s axi n e eig 0
-0.1
1st eige0 n
-0.1 0.1
axis
0.2
-0.2
2nd
1st eigen axis
3rd eigen axis
0.2
0.2
0.1
0
-0.1
-0.2
0
64
128
Time [frame]
(b) Current state
0.1 0
-0.1 0.2
-0.2 -0.2
-0.1
1st eige0 n
0.1 0 xis na e g ei
-0.1 0.1
axis
0.2
-0.2
2nd
1st eigen axis
3rd eigen axis
0.2
0.2
0.1
0
-0. -0.1
-0. -0.2
0
64
128
Time [frame]
(c) Other motions Fig. 5. Example of the analysis results for each motion category. In each motion category, the left column is an input video clip, the center column is its trajectory on the eigenspace, the right column is the trajectory projected onto the first eigenaxis.
where g(t) is the value on the first eigenaxis at time t, θ1 and θ2 are the thresholds of the peak strengths. Finally, we regard a series of the segments classified into the same motion category as a scene. Classifying the repetitious motions. The second-stage of the classification is performed only for “repetitious motions”. First, frequency analysis is applied to continuous frames in an input scene, and then, after segmenting the frame into blocks, the temporal change of pixel values is calculated per block. Next, the number of repetitions in each block are counted, where an explicit peak at a certain frequency exists. Example of the results of the repetition count is shown in Fig. 6. Next, regarding each block as a sample point, eigenvalues λ1 , λ2 are calculated by applying PCA (Principal Component Analysis) to the distribution of repetition counts in a frame. As shown in Fig. 7, there is a clear difference between “converged” and “distributed” in the distribution of the repetition counts.
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes
141
Fig. 6. Example of the frequency analyses. The deeper the color, the larger the count.
(a) Converged
(b) Distributed
Fig. 7. Example of the results of PCA applied to the distribution of repetition counts in a frame. The dashed lines represent the first and the second eigenaxes. There is a clear difference between the distribution of repetition counts in “Converged” and “Distributed”.
Focusing on this, the input scene is classified further into two motion categories by the following condition:
“Converged” “Distributed”
if (λ1 − λ2 ) ≥ θλ otherwise
(3)
where θλ is the threshold of the variance on each axis. 2.3
Integration: Tagging the Video Clips Depicting Cooking Operations
Each video clip depicting a cooking operation classified according to the process described in 2.2 is associated with a tag (pair of an “ingredient” and a “cooking operation”) obtained according to the process described in 2.1 as follows: First,
142
K. Doman et al.
the pairs of an “ingredient” and a “cooking operation” that commonly appear in both cooking recipe text and CC are extracted by matching them. Next, each pair is tagged to a video clip depicting a cooking operation according to the time stamp in the CC. Here, only tags that correspond to the motion category of the video clip depicting a cooking operation are selectively tagged. By applying the method to many cook shows, we will obtain a set of tagged video clips depicting cooking operations, and consequently build a database composed of them. 2.4
Experiment
We evaluated the association performance of the method described above with the following experiment. Experimental conditions. Eight cooking recipes and corresponding cook shows3 (320 × 240 pixels, 30 fps and in total 75 min.) were used. A Japanese morphological analyzer MeCab4 was used to obtain the parts-of-speech of the terms. Cut detection and hand shot extraction from the cook shows were performed manually for this experiment, and the extracted hand shots without corresponging tags were excluded from the classification targets. We manually labeled each scene for the ground-truth. The size of the blocks and the window width for the classifications of “repetitious motions” were set to 16 × 16 pixels and 256 frames (about 8 seconds), respectively according to the result of a preliminary experiment. We evaluated the tagging accuracy based on two counting rules: 1) a tagging is judged as correct if a pair of an “ingredient” and a “cooking operation” is correctly tagged to a scene (pair-count) or 2) if only a “cooking operation” is correctly tagged (solo-count). Experimental results. 129 scenes were obtained as the result of the classification of the hand shots, and 135 pairs of an “ingredient” and a “cooking operation” were tagged to them. The tagging accuracy was 52.6% based on the pair-count rule, 68.1% based on the the solo-count rule. Discussions. In the text processing part, there were many mis-matchings of tags caused by different expressions of a similar cooking operation. We consider that we can cope with this problem by using a thesaurus of cooking terms. In the image processing part, we consider that there are two important points in order to obtain higher classification accuracy: 1) treatment of a scene with a large camerawork which yields a large motion, and 2) choice of the best thresholds for accurate classification of “repetitious motions” and “other motions”. In the 3 4
NHK Educational Corp., “Today’s cooking for everyone (in Japanese),” http://www.kyounoryouri.jp/. Kyoto University, “Japanese morphological analyzer MeCab,” http://mecab.sourceforge.net/
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes
143
integration part, most of the mis-taggings were caused by the existance of an operation that is not mainly-focused but captured in the frame (Fig. 8(a)), or operations or states that do not involve motion (Fig. 8(b)). As for the former (Fig. 8(a)), although the tagging result was (“water”, “boil”), the camerawork was focusing on dissolving scorch in the boiled water using a spoon. As for the latter (Fig. 8(b)), although the tagging result was (“egg”, “become hard”), we judged the tagging result as incorrect since such a cooking operation could not be observed as a motion. However, in view of the complexity of associating a cooking recipe text with a video clip depicting a cooking operation, we consider that the experimental results are sufficient for our purpose at the current stage, since we consider that the user may select what s/he needs from the video clips presented in an interface.
(a) Tagging result: (“water”, “boil”)
(b) Tagging hard”)
result:
(“egg”,
“become
Fig. 8. Examples of mis-tagging
3
Video CooKing: A Prototype Multimedia Cooking Recipe Interface
We implemented a prototype interface “Video CooKing” as shown in Fig. 9 to demonstrate the concept of the multimedia cooking recipe. In the interface, each cooking operation in “Preparation Steps” in the left column is linked with its one or more corresponding video clips depicting cooking operations. When a linked cooking operation is clicked, a list of ingredients that are the targets of the cooking operation is shown in the right column. Each ingredient is linked to a video clip depicting a cooking operation if there is more than one corresponding video clips that exist in the database. Users can play / stop the video clip themselves. We consider that this interface that enables an user to browse a multimedia cooking recipe, facilitates the understanding of cooking operations.
144
K. Doman et al.
Fig. 9. Video CooKing: a prototype multimedia cooking recipe interface. The original cooking recipe on the Web6 is shown in the left column, and video clips depicting cooking operations corresponding to each cooking operation in the recipe text is shown in the right column when clicked.
Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes
4
145
Conclusion
In this paper, we proposed the concept of synthesizing a multimedia cooking recipe and also introuced a prototype interface. Experimental results showed the effectiveness of our method in tagging a pair of an “ingredient” and a “cooking operation” to a video clip obtained from a cook show, and consequently, the capability of building a database composed of video clips depicting cooking operations. Future work includes the improvement of the interface and the tagging [5]. Acknowledgments. Parts of this work were supported by the Grants-in-Aid for Scientific Research (21013022) from the Japanese Ministry of Education, Culture, Sports, Science and Technology. The “Media Integration Standard Toolkit”5 was used for the implementation of the system, and some thumbnail images from the “Video database for evaluating video processing” [1] was used for explanations in this paper.
References 1. Babaguchi, N., Etoh, M., Satoh, S., Adachi, J., Akutsu, A., Ariki, Y., Echigo, T., Shibata, M., Zen, H., Nakamura, Y., Minoh, M., Matsuyama, T.: Video database for evaluating video processing (in Japanese). Tech. Rep. IEICE (PRMU2002-30) (June 2002) 2. Hamada, R., Ide, I., Sakai, S., Tanaka, H.: Structural analysis of preparation steps on supplementary documents of cultural TV programs. In: Proc. Fourth Int. Workshop on Information Retrieval with Asian Languages (IRAL 1999), Taipei, Taiwan, pp. 43–47 (November 1999) 3. Hamada, R., Miura, K., Ide, I., Satoh, S., Sakai, S., Tanaka, H.: Multimedia integration for cooking video indexing. In: Aizawa, K., Nakamura, Y., Satoh, S. (eds.) PCM 2004. LNCS, vol. 3332, pp. 657–664. Springer, Heidelberg (2004) 4. Hamada, R., Okabe, J., Ide, I., Satoh, S., Sakai, S., Tanaka, H.: Cooking Navi: Assistant for daily cooking in kitchen. In: Proc. Thirteenth ACM Int. Multimedia Conf. (ACMMM 2005), Singapore, pp. 371–374 (November 2005) 5. Ide, I., Shidochi, Y., Nakamura, Y., Deguchi, D., Takahashi, T., Murase, H.: Multimedia supplementation to a cooking recipe text for facilitating its understanding to inexperienced users. In: The Second Workshop on Multimedia for Cooking and Eating Activities (CEA 2010) (December 2010) 6. Miura, K., Hamada, R., Ide, I., Sakai, S., Tanaka, H.: Motion based automatic abstraction of cooking videos. In: Proc. ACM Multimedia 2002 Workshop on Multimedia Information Retrieva, lMIR 2002 (December 2002)
5
Nagoya University, “Media Integration Standard Toolkit: MIST,” http://mist.murase.m.is.nagoya-u.ac.jp/
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading Yu-Ming Hsu1, Yen-Liang Lin2, Winston H. Hsu1,2, and Brian Wang3 1
Department of Computer Science and Information Engineering, National Taiwan University 2 Institute of Networking and Multimedia, National Taiwan University 3 Institute for Information Industry, Taiwan {leafwind,yenliang}@cmlab.csie.ntu.edu.tw,
[email protected],
[email protected]
Abstract. The rise of electronic book services and the prevalence of mobile devices reveal the needs for mobile reading and the booming business opportunities in e-book developments. However, incompatible e-book readers, non-uniformed e-book formats, or even limited screen resolution causes the inconvenience of reading documents on handheld devices. Meanwhile, it is difficult to read physical magazines that do not have the corresponding digital copies. Therefore, we propose a system, Snap2Read, that can automatically segment the captured document images (i.e., from the physical magazines) in mobile phones into readable patches (e.g. text, title, image), and then scale them into suitable size so that users can easily browse the digitalized magazine pages via the mobile phone with simple clicks. Keywords: mobile reading, page segmentation.
1 Introduction When it comes to reading on handheld device, the presently-available e-book readers (e.g., “Kindle”) and their related software might come to our mind first. Although the release of Apple iPad heated up the competition fore-book readers, the companies that have survived (e.g., Amazon, B&N) stand firm because of their rich resources of digitalized book content. This shows us how important digital content is. Our work gives another possibility for mobile reading: automatically capture and analyze paper documents from physical magazines that users own and turn them into digitalized pages, adaptively readable in mobile devices (cf. Fig. 1). Users do not have to buy various versions of a certain e-book or their corresponding readers from different companies. As long as it is a paper book (i.e., physical magazine) they own, it can be captured (or scanned), cut into the proper size, and turned into the right format to be read on mobile device. Unlike the traditional process (scanning followed by OCR, i.e., Optical Character Recognition,) which limits reading (i.e., offer text format only, may have error, do not preserve layout, etc.), our approach can preserve the original appearance. Other than panning (i.e., dragging) the whole page while reading, not knowing the present location information and losing track of the reading progress, our system is much more flexible and humanized. K.-T. Lee et al. (Eds.): MMM 2011, Part II, LNCS 6524, pp. 146–156, 2011. © Springer-Verlag Berlin Heidelberg 2011
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading
147
Fig. 1. System overview. (A) A user is interested in a physical magazine page and then takes a snapshot at it1. (B) Page Segmentation step decomposes the whole page into homogeneous blocks. (C) Zone Classification step predicts the labels on the segmented blocks. (D) Mobile Adaptation step changes the classified blocks into readable patches and further determines their reading sequence. (E) The user can then be guided to read by simple clicks.
Mobile interaction is a common topic in recent years. Erol et al. [1] tried to link the physical paper to digital documents. They utilized BWC (Brick Wall Coding) features to retrieve digital documents, and their application aimed at different types of annotation on retrieved document. Liao et al. [2] further provided fine-grained user interface for users, and then the retrieved digital content can be selected, copied or queried. However, they did not address the reading process of users on small-size screen devices. Putting an entire large resolution document on mobile to read may be considered annoying, but if the content can be rendered properly. Many researches have addressed this issue which focuses on web page browsing. Xie et al. [3] used both spatial and content features to learn a block importance model so that they can classify the “importance” of each segmented block, and then simply aggregate them to the final output list by ranking their importance. Hattori et al. [4] created an “object list” to link each segmented element and proved to be more efficient than commercial mobile web browsers. The main differences between our work and those works focused on web page reading are as follows. First, segmenting web pages usually takes the “content” into account (i.e., html tags, text information, etc.), while our inputs only have visual features from the snapshots. Second, their output segmented blocks is DOM (Document Object model) based, which can be further decoded as plain texts. Thus adapting the content to 1
We make an assumption that the image is clear and does not have image distortion resulted from shooting angle. To snap and read the paper is possible if the rectification is conducted correctly and through the help of camera sensors through our preliminary developments.
148
Y.-M. Hsu et al.
fit the screen has never been a problem, while for document images, the segmented blocks should be further split and padded for different screen resolutions. Third, they only create a simple viewing list for vertical direction reading and do not preserve the layout. By applying the page segmentation techniques on document image, we segment the document directly on the appearance, so the layout of our output does not change. There are many algorithms can be categorized into several classes in page segmentation researches. (See the reviews in [5].)They usually focus on skew tolerance, time efficiency or certain document types and most of them, unavoidably, need a certain fixed threshold, while our requirement is that the system can be tolerant to different types of layout structure and font size. The key contributions of this paper are: 1. To the best of our knowledge, Snap2read represents one of the first attempts that transform physical papers (especially magazines) into electronic readings, thus presenting another venue for mobile reading services. 2. To make reading experience more comfortable, suitably rendering the segmentation blocks on mobile devices is nontrivial. We employ the classification technique to enhance page adaptation results by knowing the block types. 3. We propose an adaptive morphological approach for page segmentation, which aims to structure captured magazines on mobile devices. 4. Experiments on hundreds of manually collected magazine pages (with segmentation ground truth) show the promising results of our proposed system. The detailed description of our method is stated at section 2. Section 3 shows the experimental results, and section 4 gives a conclusion and possible future works.
2 The Method The purpose of our system is to segment the document image into homogeneous blocks with the maximum size (Section 2.1 - Page Segmentation), to classify them into a set of predefined categories (Section 2.2 - Zone Classification), and finally to render them to fit for different screen resolutions (Section 2.3 - Mobile Adaptation). 2.1 Page Segmentation Previous approaches usually focus on one single language or specific types of documents. However, we cannot assume our input to be a certain type of magazines for reading activities on mobile devices, so we do not give any fixed parameters related to character font size, line spacing or layout structure, which traditional page segmentation methods do. Like other morphological methods, our method is a bottom-up approach [5]. The main idea is to group those small connected components into larger regions by dilation. However, we dilate iteratively and automatically select the appropriate dilation kernel.
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading
149
2.1.1 Pre-processing The purpose of the preprocessing step is to filter out noises, dividing lines, and blocks that are possibly images. They tend to be merged with others during dilation, so we must make sure that other main components (mostly texts) will not be affected. The detailed steps: (1) To take efficiency into consideration, resize the image to a proper area measure (i.e., about 900*650). (2) Do global threshold binarization. (3) For those fore ground (i.e., white) pixels, apply connected component analysis. (4) For each component, if its proportion of height to width is significantly high (i.e., 20 times), then remove it from the original image. If the component is big enough (i.e., 1.25%) and its density, the total number of foreground pixels divided by the area size, is high enough (i.e., 0.1), it will be considered to be “possibly image block” and be extracted in advance. The examples are illustrated in Fig. 2.
(a)
(b)
(c)
(d)
Fig. 2. An illustration of the pre-processing steps: (a) Original image. (b) Binarization. (c) Connected component analysis. Components are in different colors. (d) Noise removal and image block pre-extraction.
Note that in the step (2), we have also tried edge detector and local threshold binarization method, which are widely used in document image processing [5]. However, although the former depicts the salient edges, it also turns complex image regions into many edge fragments. The latter is mostly used on scanned document images that only contain texts in order to make them robust to lighting change, but it, too, has the similar effect mentioned above on image regions. As a result, we decide to use global threshold binarization to preserve image blocks. 2.1.2 Adaptive Dilation Threshold In this work, we enlarge small components by dilation operation to group them together, and the square kernel is applied. It enlarges both vertical and horizontal regions of a component, but how to determine the kernel size is vital and nontrivial. A fact is that the main font size may change from one magazine to another, or from one language to another. Thus, using fixed dilation kernel size for segmentation may not work on every case, so we need to determine it dynamically. During the iterative dilation process, we find that the number of components will drop rapidly at a certain kernel size, and the most suitable size is at the turning point
150
Y.-M. Hsu et al.
(e.g., red arrow in Fig. 3).The physical meaning of this phenomenon is that a large number of characters and words are merged into lines or paragraphs at the same time. We then count the number of component versus kernel sizes (e.g. a blue line in figure 3), and the turning point can be found by applying approximate second order differential. To ensure that almost all components are sufficiently merged, we also add a constraint that the number of components should not be less than a certain number (i.e., 30). # of components 500
Kernel size
Fig. 3. The relationship of component numbers to kernel size, this figure includes20 pages (blue lines). The red arrow indicates the turning point which is suitable for the kernel size.
(a)
(b)
(c)
(d)
Fig. 4. Dilation for little blocks mergence and redundant blocks removal. (a) An example intermediate image from preprocessing step, in which title and footnote are split into several fragments. (b) Remove the large components and dilate again. (c) There exists some inside or non-informative block. (d) The final output example which does not contain those redundant blocks.
In the last step, minimum enclosing rectangles are extracted from those components which are large enough (i.e., at least0.3% of whole page area), but some remaining components are not merged well enough (over-segmentation) because the spacing is larger than main texts (e.g., title, footnote, etc.) Therefore, we use a 1.5-time-large kernel to dilate again for the remaining small components (cf. Fig. 4). Inside blocks and small blocks (i.e., less than 0.1% of whole page area) are also removed (cf. Fig. 4).
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading
151
2.2 Zone Classification The rectangles (blocks) produced by the segmentation step above will be rearranged into meaningful “patches” in accordance with different screen resolutions of the device during the next adaptation step, while images and text blocks have different adaptations: Images cannot be further segmented, but text blocks can be split if the yare too large to accommodate themselves to the screen. Therefore, it is necessary to classify images and text blocks (cf. Fig. 1). 2.2.1 Features for Classification We use the early fusion scheme – multimodal features concatenated as a long one – to combine features so that we can learn a classifier by SVM (Support Vector Machine) [6] for each label (e.g., text, image). The three features we used are spatial feature, GCM (Grid Color Moment), and PHOG (Pyramid of Histograms of Orientation Gradients.)[7]Their detailed descriptions are as follows. Spatial feature contains the coordinate and size of a specified region (i.e., x, y, width and height). As for color feature (GCM), we adopt the first order moment (mean) and the second order moment (variance) for color feature. The image will be partitioned into several (i.e., 8*8) sub-blocks, and for each block, calculate its mean and variance values. As a result, the GCM feature is avector with 8 * 8 (blocks)* 3 (color planes) * 2 (moments) = 384 dimensions. The shape feature (PHOG descriptor [7]) represents the “local shape,” and the “spatial layout” of the image. To calculate PHOG, first extract edge contours by Canny edge detector, and the image is divided into 4l 44 sub-blocks at level l. The HOG of each grid at each pyramid resolution level is then calculated. In this paper, we set level up to 2 (i.e., l = 0, 1, 2) and 8 bins for HOG. Thus, by concatenating different level of resolutions of HOGs, it can be formulated as a vector representation with (40 + 41 + 42 ) *8 = 168 dimensions. Their dimensions are 4, 384 and 168, respectively. The concatenated feature vector is then measured 556 dimension, and each dimension will be scaled into [-1, +1]. 2.2.2 Model Selection We use the RBF (Radial Basis Function) kernel for classification, so there are two main parameters g and c to be determined (i.e., gamma and cost, respectively). In order to select the suitable model for prediction, we apply 5-fold cross validation on total 1430 page segments and get average accuracy around 0.95. 2.3 Mobile Adaptation Although the segmented blocks are composed with homogeneous components (cf. Fig. 6 and Fig. 7), they cannot be read directly because we extract them as large as possible for each region type, which may not fit the screen resolution, so we must adjust the blocks to readable patches. As mentioned above, only text blocks may need to be further split (e.g., the wide text block in Fig. 1(E).). Generally speaking, the English articles tend to stretch in vertical direction, while the Chinese articles tend to expand horizontally. Thus, we
152
Y.-M. Hsu et al.
adopt a heuristic approach (take English language as an example): For each block, we scale it along its width, and split it into several patches according to its height, and pad those patches whose height are not sufficient to prevent distortion. (cf. Fig. 5) As for the reading sequence, images will have higher priority than texts, and then we rank them from upper left corner to the lower right corner (for Chinese magazines, rank from upper right corner to lower left corner). We also provide a transparent overview window at the upper right corner of the mobile interface to indicate the current location on whole page. Thus users do not have to zoom-in and zoom-out repeatedly to obtain the geometric information.
1
2
1
2 3a
3 4
5 6
(a)
3b
4a
6a
4b 5
6b
4c
6c (b)
Fig. 5. An illustration of adaptation. (a) Original segmented blocks (b) The adapted (scaled, padded) patches. The reading sequence is 1, 2, 3a, 3b, 4a, 4b, etc…, and it is used to guide users so that they can read the page conveniently through clicks without losing track of page context.
3 Experimental Results This section describes our dataset and how we label the ground truth, and it also shows the evaluation of our work. 3.1 Magazine Dataset To the best of our knowledge, there is no public dataset for page segmentation evaluation. Previous page segmentation researches usually tend to use their own private dataset depending either on their target document genre (e.g., newspaper, journal), or on a specific language. Although we know that in recent years, ICDAR Page Segmentation Competition has created their own dataset with rich types of sources, only those who participate in the competition can gain access to the dataset. Further more, they do not provide documents in Asian languages (e.g. Chinese, Japanese, Korean), which do not have clear bounding boxes for each word, while we expect our system can work well on both type of languages. Thus we create a dataset on our own. We selected 4 different popular magazines: for Chinese language, “Common Wealth” and “Business Weekly” are adopted; we also take 2 English magazine named
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading
153
“Business Week” and “Science”. For each magazine, we manually filter out advertisement pages and select 30 scanned pages, and the detail is listed in Table 1. To collect groundtruth, we use an editor named GEDI [8], a highly configurable document image annotation tool. It reads an image file, and when annotation is done, it produces a corresponding XML file in GEDI format. Table 1. Magazine dataset for segmentation and classification experiments Magazine Common Wealth Business Weekly Business Week Science
Language Chinese Chinese English English
# of pages 30 30 30 30
Scanned resolution 1184*1573 963*1280 944*1260 944*1203
3.2 Page Segmentation and Zone Classification Performance Because the evaluation metrics of the previous methods are usually computed pixelwise, which is aimed at OCR. However, our output is rectangle-based, which is aimed at locating reading patches. As a result, comparing the two does not make sense. Although we do not compare them directly, we adopt one of the most widely used metrics in ICDAR 2005 [9], and try to illustrate our performance with their intuition.We have annotated three types of entities (i.e., categories): text, image and footnote.For each entity, the EDM (Entity Detection Metric) is calculated. First, evaluate how much they overlapped between a ground truth zone and a result zone by keeping a global matrix MatchScore, which is defined by function ⎧ T (G j ∩ Ri ∩ I ) ⎫ , if ( g j = ri ) ⎪ ⎪ T ( G R I ) ∪ ∩ MatchScore(i, j ) = ⎨ ⎬ j i ⎪ ⎪ 0, otherwise ⎩ ⎭
(1)
Where I denotes all image pixels, Gj: all pixels inside the ground truth j, Ri: all pixels inside the result i, gj: the entity type of the ground truth j, ri: the entity type of the result i, and T(s):a function that counts the elements of sets. Second, three types of matches are defined (i.e., one-to-one, one-to-many and many-to-one) according to their MatchScore: If the MatchScore of ground truth zone j and result zone i is higher than the accept threshold (i.e., 0.6), then it is a one-to-one match. (See figure 8 for more explanation) If there are K ground truth zones jk (k = 1, 2…K) overlapping the same result zone i, and each of their MatchScore is between the accept threshold and the reject threshold (i.e., 0.1<MatchScore(i,jk) < 0.6, k = 1, 2…K), but their summation is higher than the accept threshold, then it is a many-to-one match, and vice versa. For simplicity, the acceptable matched number for each entity is defined as MatchNumber = (w1*one-to-one + w2*one-to-many + w3*many-to-one), where w1 = 1 and w2 = w3 = 0.75 for partial match penalty. Then DetectRatet and RecognAccuracyt for entityt are defined asDetectRatet = MatchNumber/Nt, and RecognAccuracyt = MatchNumbert/Mt. Nt is the number of ground truth regions for t’th entity, and Mt is the number of result regions for t’th entity. The DetectRatet and RecognAccuracyt
154
Y.-M. Hsu et al.
represent the acceptable ratio among all ground truth zones and all result zones for t’th entity, respectively. The Entity Detection Metric score for each entity (text, image, footnote) is then defined as
EDM t =
2* DetectRatet * RecognAccuracyt DetectRatet + RecognAccuracyt
(2)
There overall page segmentation performance is promising. See the breakdowns in Table 2. The page segmentation results are sampled in Fig. 6 and Fig. 7. We also
(a)
(b)
(c)
Fig. 6. Chinese magazine results. (The blue, red and green bounding boxes indicate text, image and footnote, respectively.) (a) A Common Wealth example (b) A Business Weeklyexample (c) Anover-segmentation examplewhich divides a flow chart intotext blocks.
(a)
(b)
(c)
Fig. 7. English magazine results. (a) A Business Week example (b) A Science example (c) An over-segmentation example results from figures with unclear bounding boxes.
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading
155
found the results are satisfactory as rendering them in reading patches in two Android phones with different resolutions. The result of Business Weekly seems to have better performance than others (cf. Table 2(a).), because its layout is less complicated than others and its text block size is mostly large and rectangle shaped, while Business Week has lower performance on image category (cf. Table 2(b).) because it has a lot of figures and tables combined with text explanation inside the bounding boxes, and Footnote category usually has lower performance because parts of them are removed during redudant rectangle removal step,but they are thought of as non-informative blocks, our system does not guide users to read them. Thus the lower performance on footnote category does not really matter.
Fig. 8. An illustration of accept threshold selection. The groundtruth blocks are marked as magenta and the result blocks are marked as blue. Low MatchScore mostly comes from those small blocks (about 0.8 for logos and 0.6 for footnotes), because a trifling difference (less than 10 pixels) between the two region boundary can result in a large number of percentage of area measure. Thus the smaller blocks tend to have the lower MatchScore. However, this situation does not impede the reading process of users. As a result, we set the accept threshold at 0.6. Table 2. Page Segmentation results (T: text. I: image, F: footnote) Common Wealth Business Weekly(a) Business Week T
Nt Mt DetectRatet RecognAccuracyt EDMt
I
F
T
I
F
T
I
213 44 55 203 49 52 195 91 199 53 42 209 54 40 251 134 0.80 0.93 0.49 0.91 0.91 0.62 0.89 0.76 0.86 0.77 0.64 0.89 0.83 0.80 0.70 0.51 0.83 0.84 0.56 0.90 0.87 0.70 0.78 0.61(b)
F
84 67 0.50 0.63 0.56
Science T
I
F
245 77 122 250 81 79 0.89 0.81 0.52 0.88 0.77 0.81 0.88 0.78 0.64
156
Y.-M. Hsu et al.
4 Conclusion and Future Works This work demonstrates a possibility that people can turn the physical magazines into mobile e-book automatically and read them everywhere by simply snapping a shot. Compared to the text only e-books, our method can preserve layout appearence and images, free from being restricted by certain formats and hardware. It is also possible to do magazine retrieval if there is a magazine database.Thus, if users see an interestingmagazine by chance, they can retrieve parts of the magazine instead of buying them at full price. Furthermore, if we can apply image rectification techniques, the angle of inclination resulting from taking a snapshot will not be under so many restrictions as before through the help of mobile sensors, and the retrieval performance can also be improved. We are developing the system for leveraging mobile sensors for boosting snapshot and rectification quality. Meanwhile, we are also evaluating the proposed mobile reading system on Android phones for subjective performance.
References 1. Erol, B., Antúnez, E., Hull, J.J.: HOTPAPER: Multimedia Interaction with Paper using Mobile Phones. In: ACM Conference (2008) 2. Liao, C., Liu, Q.: PACER: Toward A Cameraphone-based Paper Interface for Fine-grained and Flexible Interaction with Documents. In: ACM MM (2009) 3. Xie, X., Miao, G., Song, R., Wen, J.-R., Ma, W.-Y.: Efficient Browsing of Web Search Results on Mobile Devices Based on Block Importance Model. In: Proc. Pervasive Computing and Communications. IEEE, Los Alamitos (2005) 4. Hattori, G., Hoashi, K., Matsumoto, K., Sugaya, F.: Robust Web Page Segmentation for Mobile Terminal Using Content-Distances and Page Layout Information. In: ACM WWW (2007) 5. Okun, O., Doermann, D., Pietikäinen, M.: Page Segmentation and Zone Classification: The State of the Art. In: UMD (1999) 6. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm 7. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: CIVR (2007) 8. GEDI: Groundtruthing Editor, http://gedigroundtruth.sourceforge.net/ 9. Antonacopoulos, A., Gatos, B., Bridson, D.: ICDAR 2005 Page Segmentation Competition. In: ICDAR (2005)
Multimodal Interaction Concepts for Mobile Augmented Reality Applications Wolfgang Hürst and Casper van Wezel Utrecht University, PO Box 80.089, 3508 TB Utrecht, The Netherlands
[email protected],
[email protected]
Abstract. Augmented reality on mobile phones – i.e. applications where users look at the live image of the device’s video camera and the scene that they see is enriched by 3D virtual objects – provides great potential in areas such as cultural heritage, entertainment, and tourism. However, current interaction concepts are often limited to pure 2D pointing and clicking on the device’s screen. This paper explores different interaction approaches that rely on multimodal sensor input and aim at providing a richer, more complex, and engaging interaction experience. We present a user study that investigates the usefulness of our approaches, verifies their usability, and identifies limitations as well as possibilities for interaction development for mobile augmented reality. Keywords: Mobile augmented reality, multi-sensor input, interaction design.
1 Introduction In this paper, we explore new interaction metaphors for mobile augmented reality (mobile AR), i.e. applications where users look at the live image of the video camera on their mobile phone and the scene that they see (i.e. the reality) is enriched by integrated 3-dimensional virtual objects (i.e. an augmented reality; cf. Fig. 1). Even if virtual objects are registered in 3D and updated in real-time, current interaction concepts are often limited to pure 2D pointing and clicking via the device’s touch screen. However, in order to explore the tremendous potential of augmented reality on mobile phones for areas such as cultural heritage, entertainment, or tourism, users need to create, access, modify, and annotate virtual objects and their relations to the real world by manipulations in 3D. In addition to common problems with interaction on mobile phones such as small screen real estate, interaction design for mobile AR is particularly difficult due to several characteristics. First of all, since we are interacting with the (augmented) reality, interface design is somehow limited. Normally, an appropriate interface design can often deal with mobile interaction problems. For example, one can limit the number of buttons and increase their sizes so they can easily be hit despite small screen sizes. In mobile AR however, placement, style, and size of virtual objects are dictated by the video image and the virtual augmentation, resulting for example in sizes of objects that are hard to select (cf. Fig. 2, top). In addition, interfaces of normal applications can be operated while holding the device in a stable and comfortable position,
Fig. 1. Mobile Augmented Reality enhances reality (represented by the live video stream on your mobile) with 3D virtual objects (superimposed at a position in the video that corresponds to a related spot in the real world). In this example, a virtual flower grows in the real plant pot.
whereas mobile AR applications require users to point their phone to a certain scene, often resulting in uncomfortable and unstable positions of the phone (especially when touching the screen at the same time; cf. Fig. 2, center). Being forced to point the device to a specific position in space also limits the possibility to interact via tilting (which has become one of the most commonly used interaction metaphors for controlling virtual reality data on mobiles e.g. for mobile gaming). Finally, input signals resulting from touch screen interactions are 2-dimensional, whereas augmented reality requires us to manipulate objects in 3D (cf. Fig. 2, bottom). The goal of this paper is to explore alternatives to touch screen based interaction in mobile AR that take advantage of multimodal information delivered by the sensors integrated in modern smart phones, i.e. camera, accelerometer, and compass. In particular, the accelerometer combined with the compass can be used to get the orientation of the phone. In fact, these sensors are used to create the augmented reality in the first place by specifying where and how virtual 3D objects are registered in the scene of the real world. We present an approach that uses this information also to select and manipulate virtual 3D objects. We compare it with touch screen interaction and a third approach that analyzes the camera image. In the latter case, the tip of your finger is tracked when it is moved in front of the camera and gestural input is used for object manipulation. Using gestural input is currently a hot trend in human computer interaction as illustrated by projects such as Microsoft’s Natal (now Kinect) [1] and MIT’s SixthSense [2]. Gesture-based interaction on mobile phones has also been explored by both industry [3] and academia [4] but in relation to different applications than augmented reality and utilizing a user-facing camera. Work in mobile AR that takes advantage of the front facing camera include analysis of hand-drawn shapes [5] and 3D sketches [6] as well as various marker-based approaches such as tangible interfaces [7]. However, most of these have been applied in indoor environments under rather restricted and controlled conditions, and often rely on additional tools such as the utilization of a pen or markers. [8] is an example for work in traditional augmented reality (e.g. using head mounted displays) that utilizes hand gestures. However, we are unaware of any work applying this concept to augmented reality on mobile phones. The purpose of the study presented in this paper is to verify if this concept is also suitable in a mobile context. In the following, we describe the three interaction concepts and tasks we want to evaluate (section 2), present our user study (section 3) and results (section 4), and conclude with a discussion about the consequences and our resulting future work on interaction design for mobile AR (section 5).
Interface design. A good interface design allows us to deal with many interaction problems, e.g. by making buttons big enough or enlarging them during interaction for better visibility (left). In augmented reality, size and position of the objects is dictated by reality and thus objects might be too small to easily select or manipulate them (right).
Holding the device. Whereas normally one can hold the device in a stable and comfortable position during interaction (left), augmented reality requires us to point the phone to a certain spot in the real world, resulting in an unstable position esp. during interaction (right).
2D vs. 3D. Touch screen based interaction only delivers 2-dimensional data, hence making it difficult to control virtual objects in the 3D world (e.g. put a flower into a plant pot; right).
Fig. 2. Potential problems with mobile augmented reality interaction
2 Interaction Concepts and Tasks Our goal is to compare standard touch screen based interaction with two different interaction concepts for mobile AR: one that depends on how the device is moved (utilizing accelerometer and compass sensors) and one that tracks the user’s finger in front of the camera (utilizing the camera sensor). Our overall aim is to target more complex operations with virtual objects than pure clicking. In the ideal case, a system should support all kinds of object manipulations in 3D, i.e. selection, translation, scaling, and rotation. However, for an initial study and in order to be able to better compare it to touch screen based interaction (which per default only delivers 2dimensional data), we restrict ourselves here to the three tasks of selecting virtual objects, selecting entries in context menus, and translation of 3D objects in 2D (i.e. left/right and up/down). In order to better investigate the actual interaction experience and eliminate noise from hand and finger tracking, we use a marker that is attached to the finger that will be tracked (cf. below). For the standard touch screen based interaction, the three tasks have been implemented in the following way: selecting an object is achieved by simply clicking on it on the touch screen (cf. Fig. 3, left). This selection evokes a context menu that was implemented in a pie menu style [9] which has proven to be more effective for pen and finger based interaction (compared to list-like menus as commonly used for mouse or touchpad based interaction; cf. Fig. 3, right). One of these menu entries puts
Fig. 3. Touch screen interaction: select an object (left) or an entry from a context menu that pops up around the object after selection (right)
Fig. 4. Device based interaction: selection by pointing a reticule (left) to the target till the related bar is filled (right)
the object in “translation mode” in which a user can move an object around by clicking on it and dragging it across the screen. If the device is moved without dragging the object, it stays at its position with respect to the real world (represented by the live video stream). Leaving translation mode by clicking at a related icon fixes the object at its new final position in the real world. In terms of usability, we expect this approach to be simple and intuitive because it conforms to standard touch screen based interaction. In case of menu selection, we can also expect it to be reliable and accurate because we are in full control of the interface design (e.g. we can make the menu entries large enough and placed as far apart of each other so they can be hit easily with your finger). We expect some accuracy problems though when selecting virtual object, especially if they are rather small, very close to each other, or overlapping because they are placed behind each other in the 3D world. In addition, we expect that users might feel uncomfortable when interacting with the touch screen while they have to hold the device upright and targeted towards a specific position in the real world (cf. Fig. 2, center). This might be particularly true in the translation task for situations where the object has to be moved to a position that is not shown in the original video image (e.g. if users want to place it somewhere behind them). Our second interaction concept uses the position and orientation of the device (defined by the data delivered from the integrated accelerometer and compass) for interaction. In this case, a reticule is visualized in the center of the screen (cf. Fig. 4, left) and used for selection and translation. Holding it over an object for a certain amount of time (1.25 sec in our implementation) selects the object and evokes the pie menu. In order to avoid accidental selection of random objects, a progress bar is shown above the object to illustrate the time till it is selected. Menu selection works in the same way by moving the reticule over one of the entries and holding it till the bar is filled (cf. Fig. 4, right). In translation mode, the object sticks to the reticule in
Fig. 5. Interaction by using a green marker for tracking that is attached to the tip of the finger
the center of the screen while the device is moved around. It can be placed at a target position by clicking anywhere on the touch screen. This action also forces the system to leave translation mode and go back to normal interaction. Compared to touch screen interaction, we expect this “device based” interaction to take longer when selecting objects and menu entries because users cannot directly select them but have to wait till the progress bar is filled. In terms of accuracy, it might allow for a more precise selection because the reticule could be pointed more accurately at a rather small target than your finger. However, holding the device at one position over a longer period of time (even if it’s just 1.25 sec) might prove to be critical, especially when the device is held straight up into the air. Translation with this approach seems intuitive and might be easier to handle because people just have to move the device (in contrast to the touch screen where they have to move the device and drag the object at the same time). However, placing the object at a target position by clicking on the touch screen might introduce some inaccuracy because we can expect that the device shakes a little when being touched while held up straight in the air. This is also true for the touch screen based interaction, but might be more critical here because for touch screen interaction the finger already rests on the screen. Hence, we just have to release it, whereas here we have to perform an explicit click (i.e. a click and release action). Touch screen based interaction seems intuitive because it conforms to regular smart phone interaction with almost all common applications (including most current commercial mobile AR programs). However, it only allows remote control of the 3-dimensional augmented reality via 2D input on the touch screen. If we track the users’ index finger when their hand is moved in front of the device (i.e. when it appears in the live video on the screen), we can realize a finger based interaction where the tip of the finger can be used to directly interact with objects, i.e. select and manipulate them. In the ideal case, we can track the finger in all three dimensions and thus enable full manipulation of objects in 3D. However, since 3D tracking with a single camera is difficult and noisy (especially on a mobile phone with a moving camera and relatively low processing power) we restrict ourselves to 2D interactions in the study presented in this paper. In order to avoid influences of noisy input from the tracking algorithm, we also decided to use a robust marker based tracking approach where users attach a small sticker to the tip of their index finger (cf. Fig. 5). Object selection is done by “touching” an object (i.e. holding the finger at the position in the real world where the virtual object is displayed on the screen) till an associated
Fig. 6. Holding the finger too close to the camera makes it impossible to select small objects (left). Moving your hand away from the camera decreases the size of the finger in the image but can result in an uncomfortable position because you have to stretch out your arm (right).
progress bar is filled. Menu entries can be selected in a similar fashion. In translation mode, objects can be moved by “pushing” them. For example, an object is moved to the right by approaching it with the finger from the left side and pushing it rightwards. Clicking anywhere on the touch screen places the object at its final position and leaves translation mode. Gesture based interaction using finger or hand tracking can be a very powerful way for human computer interaction in a lot of situations. However, in relation to mobile AR, there are many shortcomings and potential problems, most importantly due to the limited range covered by the camera. For example, the finger becomes very large in the image when being close to the camera, making it unsuitable for selection of small or overlapping objects. Moving it further away should increase possibilities for interaction, but might result in uncomfortable positions where the arm has to be stretched out quite far in order to create a smaller image of the finger on the screen (cf. Fig. 6). In addition, moving an object by pushing it from the side might turn out to be difficult depending on which hand is used (e.g. pushing it from the right using the left hand) and either result in an awkward hand position or even force people to switch the hand in which they hold the device. We implemented all three interaction concepts on a Motorola Droid/Milestone phone with Android OS version 2.1. In the next chapter we present a user study to verify their intuitive advantages and disadvantages discussed above.
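As an illustration of the dwell-based selection shared by the device and finger concepts, the following sketch implements the progress-bar logic described above (a target is selected after the reticule or tracked fingertip has rested on it for 1.25 s). It is a reconstruction for illustration, not the implementation used in the study; class and method names are our own.

```python
# Illustrative sketch (not the authors' implementation) of dwell-based
# selection: a target counts as selected once the reticule or tracked
# fingertip stays over it for DWELL_TIME seconds; the elapsed fraction
# drives the on-screen progress bar shown above the object.

DWELL_TIME = 1.25  # seconds, as stated in the text


class DwellSelector:
    def __init__(self, dwell_time=DWELL_TIME):
        self.dwell_time = dwell_time
        self.current_target = None
        self.elapsed = 0.0

    def update(self, target_under_cursor, dt):
        """target_under_cursor: id of the object under the reticule, or None.
        dt: time since the last frame in seconds.
        Returns (selected_target_or_None, progress in [0, 1])."""
        if target_under_cursor != self.current_target:
            # The cursor moved to a different object: restart the dwell timer.
            self.current_target = target_under_cursor
            self.elapsed = 0.0
        elif target_under_cursor is not None:
            self.elapsed += dt
            if self.elapsed >= self.dwell_time:
                selected = self.current_target
                self.current_target, self.elapsed = None, 0.0
                return selected, 1.0
        if self.current_target is None:
            return None, 0.0
        return None, self.elapsed / self.dwell_time


# Example: the reticule rests on object "flower" for 50 frames at 30 fps.
selector = DwellSelector()
for _ in range(50):
    selected, progress = selector.update("flower", dt=1.0 / 30.0)
    if selected is not None:
        print("selected:", selected)
        break
```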
3 User Study We evaluated the interface concepts described in the previous section in a user study with 18 participants (12 male and 6 female; 5 users aged 15-20 years, 8 aged 21-30, 1 each aged 31-40 and 41-50, and 3 aged 51-52). For the finger tracking, users were free to choose where to place the marker on their finger tip. Only one user placed it on his nail. All others used it on the inner side of their finger as shown in Figures 5 and 6. Eleven participants held the device in their right hand and used the left hand for interaction. For the other seven it was just the other way around. No differences could be observed in the evaluation related to marker placement or hand usage.
Fig. 7. Object selection task: test case (single object, left), easy (multiple non-overlapping objects, center), and hard (multiple overlapping objects, right)
A within-group study was used, i.e. each participant tested each interface (subsequently called touch screen, device, and finger) and each task (subsequently called object selection, menu selection, and translation). Interfaces were presented in different orders to the participants to exclude potential learning effects. For each user, tasks were done in the following order: object selection, then menu selection, then translation, because this would also be the natural order in a real usage case.
Fig. 8. Translation task: move object to target (white square on the right)
For each task, there was one introduction test in which the interaction method was explained to the subject, one practice test in which the subject could try it out, and finally three “real” tests (four in the case of the translation task) that were used in the evaluation. Subjects were aware that the first two tests were not part of the actual experiment and were told to perform the other tests as fast and accurately as possible. The three tests used in the object selection task can be classified as easy (objects were placed far away from each other), medium (objects were closer together), and hard (objects overlapped; cf. Fig. 7). In the menu selection task, the menu contained three entries and users had to select one of the entries on top, at the bottom, and to the right in each of the three tests (cf. Fig. 3, 4, and 5, right). In the translation task, subjects had to move an object to an indicated target position (cf. Fig. 8). The view in one image covered a range of 72.5°. In two of the four tests, the target position was placed within the same window as the object that had to be moved (at an angle of 35° between target and initial object position, to the left and right, respectively). In the other two tests, the target was outside of the initial view but users were told in which direction they had to look to find it. It was placed at an angle of 130° between target and object, to the left in one case, and to the right in the other. The order of tests was randomized for each participant to avoid any order-related influences on the results. For the evaluation, we logged the time it took to complete the task, success or failure, and all input data (i.e. the sensor data delivered from the accelerometer, compass, and the marker tracker). Since entertainment and leisure applications play an important role in mobile computing, we were not only interested in pure accuracy and performance, but also in issues such as fun, engagement, and individual preference.
Hence, users had to fill out a related questionnaire and were interviewed and asked about their feedback and further comments at the end of the evaluation.
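For illustration only, the following sketch shows how a target placed at a known azimuth could be mapped onto the screen given the 72.5° field of view mentioned above; it assumes a simple pinhole-style projection and an arbitrary screen width, neither of which is specified in the text.

```python
# Hedged sketch of mapping a target azimuth onto the screen, assuming a
# pinhole-style projection and the 72.5-degree horizontal field of view
# stated above. Not the authors' code; the screen width is an assumption.
import math

FOV_DEG = 72.5
SCREEN_WIDTH = 480  # assumed screen width in pixels

def target_screen_x(device_heading_deg, target_azimuth_deg,
                    fov_deg=FOV_DEG, width=SCREEN_WIDTH):
    """Return the horizontal pixel position of the target, or None if it
    lies outside the current view."""
    # Signed angular offset in (-180, 180], positive = target to the right.
    offset = (target_azimuth_deg - device_heading_deg + 180.0) % 360.0 - 180.0
    half_fov = fov_deg / 2.0
    if abs(offset) > half_fov:
        return None  # outside the view, e.g. the 130-degree test cases
    # Perspective mapping of the angular offset to a pixel column.
    return width / 2.0 * (1.0 + math.tan(math.radians(offset)) /
                          math.tan(math.radians(half_fov)))

print(target_screen_x(0.0, 35.0))   # within view (35-degree tests)
print(target_screen_x(0.0, 130.0))  # None: user must turn the device first
```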
4 Results Figure 9 shows how long it took the subjects to complete the tests averaged over all users for each task (Fig. 9, top left) and for individual tests within one task (Fig. 9, top right and bottom). The averages in Figure 9, top left show that touch screen is the fastest approach for both selection tasks. This observation conforms to our expectations mentioned in the previous section. For device and finger, selection times are longer but still seem reasonable. Looking at the individual tests used in each of these tasks, times in the menu selection task seem to be independent of the position of the selected menu entry (cf. Fig. 9, bottom left). Almost all tests were performed correctly: only three mistakes happened overall with the device approach and one with the finger approach. In case of the object selection task (cf. Fig. 9, top right), touch screen interaction again performed fastest and there were no differences for the different levels of difficulty of the tasks. However, there was one mistake among all easy tests and in five of the hard tests the wrong object was selected, thus confirming our assumption that interaction via touch screen will be critical in terms of accuracy for small or close objects. Finger interaction worked more accurately with only two mistakes in the hard test. However, this came at the price of a large increase in selection time. Looking into the data we realized that this was mostly due to subjects holding the finger relatively close to the camera, resulting in a large marker that made it difficult to select an individual object that was partly overlapped by others. Once the users moved their hand further away from the camera, selection worked well as indicated by the low number of errors. For the device approach, there was a relatively large number of errors for the hard test, but looking into the data we realized that this was only due to a mistake that we made in the setup of the experiment: in all six cases, the reticule was already placed over an object when the test started and the subjects did not move the device away from it fast enough to avoid accidental selection. If we eliminate these users from the test set, the time for the device approach in the hard test illustrated in Figure 9, top right, increases from 10,618 msec to 14,741 msec, which is still in about the same range as the time used for the tests with easy and medium levels of difficulty. Since all tests in which the initialization problem did not happen have been solved correctly, we can conclude that being forced to point the device to a particular position over a longer period of time did not result in accuracy problems as we suspected. In the translation task, we see an expected increase in time in case of the finger and touch screen approach if the target position is outside of the original viewing window (cf. Fig. 9, bottom right). In contrast to this, there are only small increases in the time it took to solve the tasks when the device approach is used. In order to verify the quality of the results for the translation task, we calculated the difference between the center of the target and the actual placement of the object by the participants in the tests. Figure 10 illustrates these differences averaged over all users. It can be seen that
Fig. 9. Time it took to solve the tasks (in msec, averaged over all users). Top left: average time over all users and tests for the object selection, menu selection, and translation tasks. Top right: object selection task (individual tests: easy, medium, hard). Bottom left: menu selection task (individual tests: bottom, right, top). Bottom right: translation task (individual tests: 35° left, 35° right, 130° left, 130° right). Bars compare the device, finger, and touch screen approaches.
Fig. 10. Correctness of translation task (squared Manhattan distance in screen units): average distance from the target for the device, finger, and touch screen approaches in the 35° (left/right) and 130° (left/right) tests.
the device approach was not only the fastest but also very accurate, especially in the more difficult cases with an angle of 130° between target and initial object position. Finger based interaction on the other hand seems very inaccurate. A closer look into the data reveals that the reason for the high difference between target position and actual object placement is actually the many situations in which the participants accidentally hit the touch screen and thus placed the object at a random position before reaching the actual target. This illustrates a basic problem with finger based interaction, i.e. that users have to concentrate on two issues: moving the device and “pushing” the object with their finger at the same time. In case of device based interaction, on the other hand, users could comfortably take the device in both hands while moving around, thus resulting in no erroneous input and a high accuracy as well as a fast solution time. A closer look into the data also revealed that the lower performance in case of the touch based interaction is mostly due to one outlier who did extremely badly (most likely also due to an accidental input); otherwise the results are at about the same level of accuracy as the device approach (but of course at the price of a worse performance time; cf. Fig. 9).
166
W. Hürst and C. van Wezel
Based on this formal analysis of the data, we can conclude that touch based interaction seems appropriate for selection tasks. For translation tasks, the device based approach seems to be more suitable, whereas finger based interaction seems less useful in general. However, especially in entertainment and leisure applications, speed and accuracy are not the only relevant issues, but fun and engagement can be equally important. In fact, for gaming, mastering an inaccurate interface might actually be the whole purpose of the game (think of balancing a marble through a maze by tilting your phone appropriately – an interaction task that can be fun but is by no means easy to handle). Hence, we asked the participants to rank the interaction concepts based on performance, fun, and both. Results are summarized in Table 1. Rankings for performance clearly reflect the results illustrated in Figures 9 and 10 with touch based interaction ranked highest by eleven participants, and device ranked first by six. Only one user listed finger based interaction as top choice for performance. However, when it comes to fun and engagement, the vast majority of subjects ranked it as their top choice, whereas device and touch were ranked first only four and one time, respectively. Consequently, rankings are more diverse when users were asked to consider both issues. No clear “winner” can be identified here, and in fact, many users commented that it depends on the task: touch and device based interaction are more appropriate for applications requiring accurate placement and control, whereas finger based interaction was considered as very suitable for gaming applications. Typical comments about the finger based approach characterized it as “fun” and “the coolest” of all three. However, its handling was also criticized as summarized by one user who described it as “challenging but fun”. When asked about the usefulness of the three approaches for the three tasks of the evaluation, rankings clearly reflect the objective data discussed above. Table 2 shows that all participants ranked touch based interaction first for both selection tasks, but the vast majority voted for the device approach in case of translation. It is somehow surprising though that despite its low performance, finger based interaction was ranked second by eight, seven, and nine users, respectively, in each of the three tasks – another indication that people enjoyed and liked the approach although it is much harder to handle than the other two. Table 1. Times how often an interface was ranked first, second, and third with respect to performance versus fun and engagement versus both (T = touch screen, D = device, F = finger)
                 PERFORMANCE          FUN                  PERF. & FUN
                 T     D     F        T     D     F        T     D     F
Ranked 1st       11    6     1        1     4     13       8     3     7
Ranked 2nd       3     6     9        6     10    2        6     8     4
Ranked 3rd       4     6     8        11    4     3        4     7     7
Table 2. Times how often an interface was ranked first, second, and third with respect to the individual tasks (T = touch screen, D = device, F = finger)
                 OBJECT SELECT.       MENU SELECT.         TRANSLATION
                 T     D     F        T     D     F        T     D     F
Ranked 1st       18    0     0        18    0     0        3     15    0
Ranked 2nd       0     10    8        0     11    7        6     3     9
Ranked 3rd       0     8     10       0     7     11       9     0     9
5 Conclusion Interacting with augmented realities on mobile devices is a comprehensive experience covering many ranges, characteristics, and situations. Hence, we cannot expect to find a “one size fits all” solution for a good interface design, but most likely have to provide different means of interaction depending on the application, the goal of a particular interaction, and the context and preference of the user. Our study suggests that a combination of touch screen based interaction (which achieved the best results in the shortest time in both selection tasks and was rated highest by users in terms of performance) with device dependent input (which achieved the best results in the shortest time in the translation task and was highly ranked by the users for this purpose) is a promising approach for serious applications that require exact positioning and accurate control over the content. On the other hand, interaction via finger tracking (which had low performance values but was highly rated and appreciated by the users in terms of fun, engagement, and entertainment) seems to be a promising approach for mobile gaming and other leisure applications. Obviously, our study was only intended as a first step in the direction of better, more advanced interfaces for mobile AR. In relation to serious applications, our future work aims at further investigating touch and device based navigation, especially in relation to other tasks, such as rotation and scaling of objects. In relation to leisure games, our goal is to further investigate interaction via finger and hand tracking, especially in relation to “real” 3D interaction where moving the finger in 3D space can be used to manipulate objects in 3D, for example by not only pushing them left/right and up/down but also forward and backward.
References 1. Xbox Kinect (formerly known as Microsoft’s project Natal), http://www.xbox.com/en-US/kinect (last accessed 10/15/10) 2. Mistry, P., Maes, P.: SixthSense: a wearable gestural interface. In: ACM SIGGRAPH ASIA 2009 Sketches (2009) 3. EyeSight’s Touch Free Interface Technology Software, http://www.eyesight-tech.com/technology/ (last accessed 10/15/10) 4. Niikura, T., Hirobe, Y., Cassinelli, A., Watanabe, Y., Komuro, T., Ishikawa, M.: In-air typing interface for mobile devices with vibration feedback. In: ACM SIGGRAPH 2010 Emerging Technologies (2010) 5. Hagbi, N., Bergig, O., El-Sana, J., Billinghurst, M.: Shape recognition and pose estimation for mobile augmented reality. In: Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (2009) 6. Bergig, O., Hagbi, N., El-Sana, J., Billinghurst, M.: In-place 3D sketching for authoring and augmenting mechanical systems. In: Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (2009) 7. Billinghurst, M., Kato, H., Myojin, S.: Advanced Interaction Techniques for Augmented Reality Applications. In: Proceedings of the 3rd international Conference on Virtual and Mixed Realit, Part of HCI International 2009 (2009) 8. Lee, M., Green, R., Billinghurst, M.: 3D natural hand interaction for AR applications. In: Proceedings of Image and Vision Computing, New Zealand (2008) 9. Callahan, J., Hopkins, D., Weiser, M., Shneiderman, B.: An empirical comparison of pie vs. linear menus. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (1988)
Morphology-Based Shape Adaptive Compression Jian-Jiun Ding, Pao-Yen Lin, Jiun-De Huang, Tzu-Heng Lee, and Hsin-Hui Chen Graduate Institute of Communication Engineering, National Taiwan University, 10617, Taipei, Taiwan
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. In this paper, we use morphology to improve the performance of the object-oriented coding scheme. The object-oriented compression algorithm has become popular in recent years. However, due to the high frequency components of the edges, the reconstructed image is always seriously disturbed by the object edges (i.e., the ringing effect) when using this algorithm. In this paper, we use a morphological operation to separate an object into the interior and the contour sub-regions before applying the shape-adaptive DCT. Since the contour sub-region is separated out, the high frequency components in it will not disturb the interior region, and the reconstructed image will have much better quality. Simulation results show that the proposed algorithm outperforms the conventional object-oriented compression algorithm. Keywords: image coding, image processing, shape, object oriented methods, morphological operations.
1 Introduction Image compression using object-oriented segmentation is based on the idea that the color values of the pixels within the same object always have high correlation [1]. The object-based coding scheme has become popular in recent years [2]-[11]. It plays an important role in the MPEG-4 standard [3][4][5]. A variety of object-oriented image compression techniques have been proposed to realize the functionality of making visual objects available in compressed form [2], [6]-[11]. The shape-adaptive transformation is applied on the arbitrarily shaped region to achieve a higher compression rate by exploiting the high correlation between the pixels in the specific object. The earliest approach is to calculate customized DCT basis functions for all the image objects using a generalized orthogonal transform [2][6]. One of the most popular recent methods is the shape-adaptive DCT (SA-DCT) [7][8][9]. The SA-DCT flushes samples in an arbitrarily shaped object to a specific edge of its bounding box before performing the 1-D DCT in the flushed direction. However, the existence of high frequency components in boundary regions reduces the coding performance and image quality. In this paper, we propose an improved object-oriented compression scheme which can efficiently represent and compress the image contents, since the textures, patterns, and other characteristics in an object usually share the same or similar color values
and the color intensity variations are low. A higher compression ratio and better image quality of the reconstructed image can then be achieved. Moreover, the proposed method is totally backward compatible with the existing coding standards.
2 Proposed Morphological Segmentation Using Erosion For an arbitrarily-shaped object, the pixels within it have similar or identical color values. However, the contour region of an object contains a great portion of the high frequency components because the color values at the edge of an object usually vary significantly. When we use the object-oriented compression technique instead of cutting the entire image into the conventional 8×8 blocks, it is possible that the resulting arbitrary-shape DCT coefficients contain a considerable amount of high frequency components. Since the main cause of the potential distortion and noise artifacts is the fact that the high frequency components are truncated during the quantization process, using the object-oriented compression technique could amplify the distortion and noise artifacts. To avoid these artifacts, we could segment the entire image in a more detailed way. However, this would produce more boundary information that needs to be recorded. Since we know that the variations in pixel values occur around the contour region of an arbitrary shape, we can divide an arbitrarily-shaped image object into the contour sub-region and the interior sub-region. In Fig. 1, we show an example of the contour sub-region and the interior sub-region. Then we apply the shape-adaptive transformation and encode them separately. Note that we do not need to record the boundary of the overall interior sub-region because it can be retrieved from the original boundary in the decoding process. In Fig. 2, we depict the overall procedure of our proposed algorithm. The hatched block shows the role of the morphological operation relative to the other procedures in the whole image compression scheme. It should be noted that the morphological operation (erosion) is used to divide the object into two separate parts, as shown in Fig. 1. The proposed image segmentation method is discussed in detail below.
Fig. 1. The contour and interior sub-regions of an object (the overall object is split into a contour sub-region and an interior sub-region)
We use a simple test image depicted in Fig. 3, which has six arbitrarily-shaped image objects as an input image to be compressed using our proposed scheme. The way we divide an arbitrarily-shaped image object is by a morphological operation called
Fig. 2. Block diagram of our proposed algorithm
Fig. 3. Divide the image objects by morphological erosion
binary erosion [1]. The overall erosion process is illustrated in Fig. 3. As mentioned earlier, we obtain the shapes of the objects in binary format by filling the pixels inside the boundaries with zeroes. For two binary image sets A and B, the erosion of A by B, denoted A ⊖ B, is defined as
$A \ominus B = \{ z \mid (B)_z \subseteq A \}$,  (1)
where the translation of set B by point z = (z_1, z_2), denoted (B)_z, is defined as [1]
$(B)_z = \{ c \mid c = b + z, \ \text{for } b \in B \}$,  (2)
Morphology-Based Shape Adaptive Compression
171
By the erosion operation, each of the objects is divided into two parts: the contour sub-region and the interior sub-region. The contour sub-region β(A) obtained by mask B can be expressed as
$\beta(A) = A - (A \ominus B)$.  (3)
We take Fig. 3 as an example. We erode the binary arbitrary shapes by a structuring element B of size 5×5 pixels, as depicted in Fig. 3. The six contour sub-regions of the binary shapes are obtained by subtracting the eroded shapes from the original ones using (3). Finally, we multiply the original arbitrary-shape image objects by both the binary contour shapes and the interior shapes to get the final contour sub-regions and the interior sub-regions of the six objects, respectively. The shape-adaptive DCT [7][8][9] is then performed on both the contour regions and the interior regions, and the resulting DCT coefficients are quantized and encoded separately.
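A minimal sketch of this separation step (Eqs. (1)-(3)) with a 5×5 structuring element is given below. SciPy is used here only for convenience; the paper does not prescribe any particular library.

```python
# Minimal sketch of the proposed separation (Eqs. (1)-(3)): the interior
# sub-region is the binary erosion of the object mask by a 5x5 structuring
# element, and the contour sub-region is what erosion removes.
import numpy as np
from scipy.ndimage import binary_erosion


def split_object(mask, size=5):
    """mask: 2-D boolean array, True inside the arbitrarily shaped object."""
    structure = np.ones((size, size), dtype=bool)          # structuring element B
    interior = binary_erosion(mask, structure=structure)   # A eroded by B
    contour = mask & ~interior                              # beta(A) = A - (A erode B)
    return interior, contour


# Toy example: a filled 20x20 square inside a 32x32 image.
mask = np.zeros((32, 32), dtype=bool)
mask[6:26, 6:26] = True
interior, contour = split_object(mask)
# Each region would then be fed separately to the shape-adaptive DCT.
print(interior.sum(), contour.sum())  # 256 interior pixels, 144 contour pixels
```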
3 Quantization of the Coefficients Because of the high portion of high frequency components in the contour sub-region, we can actually use a higher quantization step size for the contour sub-region in order to achieve a high compression ratio with better image quality. Basically, most shape-adaptive transformations, such as the SA-DCT, are backward compatible. Block-based quantization is used after performing these transformations, whereas the approach using the generalized orthogonal transform to generate customized DCT basis functions may not use the original block-based quantizer. Therefore, we define an extendable and object-shape-dependent quantization array Q(k) as a linearly increasing line:
$Q(k) = Q_a k + Q_c$  (4)
for k = 1, 2, …, M. The two parameters Q_a and Q_c are the slope and the intercept of the line, respectively, and M is the number of the DCT coefficients within the same arbitrarily-shaped object. Each DCT coefficient F(k) is then divided by the corresponding quantization array Q(k) and rounded to the nearest integer:
$F_q(k) = \mathrm{Round}\!\left( \dfrac{F(k)}{Q(k)} \right)$,  (5)
where k = 1, 2, …, M. The parameter Q_c in (4) is important because it affects the quantization of the DC term and the low frequency components, whose values are usually greater than those of the high frequency components. The other parameter Q_a affects the high frequency portion of the arbitrary-shape DCT coefficients the most. As mentioned in Section 1, quantization is fundamentally lossy, especially at low bit-rate compression. This problem is the main motivation of our proposed morphological segmentation algorithm discussed in Section 2.
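The quantization of Eqs. (4)-(5) can be sketched as follows. The values of Q_a and Q_c below are illustrative only; the paper does not fix them.

```python
# Sketch of the linear quantization array of Eq. (4) and the rounding of
# Eq. (5). Qa and Qc are free parameters; the values below are illustrative.
import numpy as np


def quantize(F, Qa, Qc):
    """F: 1-D array of M shape-adaptive DCT coefficients of one object."""
    k = np.arange(1, F.size + 1)           # k = 1, 2, ..., M
    Q = Qa * k + Qc                        # Eq. (4): Q(k) = Qa*k + Qc
    return np.round(F / Q).astype(int), Q  # Eq. (5): Fq(k) = Round(F(k)/Q(k))


def dequantize(Fq, Q):
    return Fq * Q


# A larger Qa (steeper slope) can be chosen for the contour sub-region,
# whose coefficients are dominated by high frequencies.
F = np.array([812.0, 145.0, -63.0, 20.0, -7.0, 3.0])
Fq, Q = quantize(F, Qa=2.0, Qc=16.0)
print(Fq)                  # quantized coefficients
print(dequantize(Fq, Q))   # reconstruction used at the decoder
```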
4 Coding Technique of the Image Object After the quantization stage, the quantized arbitrarily-shaped DCT coefficients can be encoded by the same coding technique that is used with the shape-adaptive transformation in the previous stage. For the arbitrarily-shaped DCT using the generalized orthogonal transform method, we actually do not need to record the length of the arbitrarily-shaped DCT coefficients of each object because it can be obtained from the boundary information, which is not in the scope of this paper. Therefore, in the decoding process, we normally decode the bit stream of boundaries first. The number of points inside a boundary is then counted, and it should be exactly the length of the DCT coefficients in the corresponding object. Furthermore, the contour sub-region and interior sub-region can be obtained by eroding the mask of the object.
5 Simulation Results and Performance Comparison The comparisons of the conventional and the proposed object-oriented image compression algorithms are made in this section. First, we use the image in Fig. 3 as the testing image. The image, of size 256 × 256, is first divided into six small segments. In our algorithm, the erosion operation is performed on each small object before applying the SA-DCT. As shown in Fig. 4, it is obvious that our proposed method has
Fig. 4. Comparison of the PSNRs (in dB) versus bit rate (×10^4 bits) of the reconstructed images with the test image in Fig. 3. (Blue line): the result of the conventional method (SA-DCT). (Red line): the result of the proposed method (SA-DCT with erosion).
Fig. 5. Comparison of the PSNRs (in dB) versus bit rate (×10^4 bits) of the reconstructed images with test image Barbara. (Blue line): the result of the conventional method (SA-DCT). (Red line): the result of the proposed method (SA-DCT with erosion).
Fig. 6. Comparison of the PSNRs (in dB) versus bit rate (×10^4 bits) of the reconstructed images with test image Akiyo. (Blue line): the result of the conventional method (SA-DCT). (Red line): the result of the proposed method (SA-DCT with erosion).
greater performance than the conventional object-oriented compression algorithm, especially when the bit rate is low. This means that the contour sub-region extracted by the erosion operation contains large amounts of high frequency components that are later quantized by higher quantization coefficients. Moreover, the proposed method performs better in the low bit-rate condition. In Fig. 5 and Fig. 6, we show the comparison of the PSNRs with the Barbara and Akiyo images. We can find that the proposed method outperforms the conventional object-oriented compression algorithm. In Figs. 7, 8 and 9, we show some examples of the reconstructed images. The results in Fig. 7(b)(d), Fig. 8(b)(d), and Fig. 9(b) show that, using the proposed method, the blocky and blurry artifacts are much less pronounced than with the conventional object-oriented compression algorithm. It can be seen that the blurring phenomena caused by the boundary region are significantly reduced. Therefore, our algorithm can indeed achieve much better reconstructed images and hence improve the performance of shape-adaptive image compression.
Fig. 7. The reconstructed images of (Left) the conventional object-oriented compression scheme and (Right) the proposed scheme. The results show that the reconstructed images of our proposed method have much better quality.
Fig. 8. The reconstructed images of (Left) the conventional object-oriented compression scheme (SA-DCT) and (Right) the proposed scheme (SA-DCT with erosion). The results show that the reconstructed images of our proposed method have much better quality.
Fig. 9. The reconstructed images of (Left) the conventional object-oriented compression scheme (SA-DCT) and (Right) the proposed scheme (SA-DCT with erosion). The results show that the reconstructed images of our proposed method have much better quality.
The original object-oriented compression algorithm suffers from the blurring phenomena because there are many high frequency components in the boundary regions of the objects. Since our algorithm uses morphological erosion to separate an object into interior and contour parts, as in Fig. 3, the high frequency components in the boundaries will not affect the interior regions. This allows the reconstructed images to have higher quality.
6 Conclusions As evident from the simulation results, the proposed morphology-based shape adaptive compression makes significant improvements by separating out the contour sub-region, which contains a high portion of high frequency components. The proposed morphology-based arbitrarily-shaped image compression scheme has high adaptability because it is applicable to different inputs and different shape-adaptive transformations. The complexity of our proposed method is almost the same as that of the original object-oriented compression algorithm. However, as shown in the simulations in Figs. 7-9, the qualities of the reconstructed images are much improved.
References 1. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Prentice Hall, New Jersey (2002) 2. Chang, S.F., Messerschmitt, D.: Transform Coding of Arbitrarily Shaped Image Segments. In: Proc. 1st ACM Int. Conf Multimedia, Anaheim, CA, pp. 83–90 (1993) 3. Sikora, T.: MPEG-4 Video Standard Verification Model. IEEE Trans. Circuits Syst. Video Technol. 7, 19–31 (1997) 4. Marpe, D., Schwarz, H., Wiegand, T.: Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Trans. Circuits Syst. Video Technol. 13, 620–636 (2003) 5. Richardson, L.E.G.: H.264 and MPEG-4 Video Compression: Video Coding for NextGeneration Multimedia. John Wiley and Sons, Chichester (2004) 6. Gilge, M., Engelhardt, T., Mehlan, R.: Coding of Arbitrarily Shaped Image Segments Based on a Generalized Orthonormal Transform. Signal Process: Image Commun. 1, 153– 180 (1989) 7. Sikora, T., Makai, B.: Shape-Adaptive DCT for Generic Coding of Video. IEEE Trans. Circuits Syst. Video Technol. 5, 59–62 (1995) 8. Sikora, T.: Low Complexity Shape-Adaptive DCT for Coding of Arbitrarily Shaped Image Segments. Signal Process: Image Commun. 7, 381–395 (1995) 9. Kauff, P., Schüür, K.: Shape-Adaptive DCT with Block-Based DC Separation and a DC Correction. IEEE Trans. Circuits Syst. Video Technol. 8, 237–242 (1998) 10. Li, S., Li, W.: Shape-Adaptive Discrete Wavelet Transforms for Arbitrarily Shaped Visual Object Coding. IEEE Trans. Circuits Syst. Video Technol. 10, 725–743 (2000) 11. Moon, J.H., Park, G.H., Chun, S.M., Choi, S.R.: Shape-Adaptive Region Partitioning Method for Shape-Assisted Block-Based Texture Coding. IEEE Trans. Circuits Syst. Video Technol. 7, 240–246 (1997)
People Tracking in a Building Using Color Histogram Classifiers and Gaussian Weighted Individual Separation Approaches Che-Hung Lin, Sheng-Luen Chung, and Jing-Ming Guo Department of Electrical Engineering National Taiwan University of Science and Technology Taipei, Taiwan {M9507208,slchung}@mail.ntust.edu.tw,
[email protected]
Abstract. This paper focuses on people tracking within a building with the installation of distributed overhead cameras. Our primary concerns are to keep track of the number of people entering a particular area and of the whereabouts (trajectory) of a particular person within the monitored building. With images taken from ceiling-mounted cameras, a pedestrian’s physiologic contour is analyzed from four different viewing angles to form the person’s identity signature. In doing so, techniques to locate a person’s head, to predict his/her movement direction, to separate overlapped physiological blobs, and to differentiate different people by a color histogram classifier have been proposed. Special attention has been paid to system configurability such that the proposed software architecture can be deployed to different floor plans. We have conducted continuing surveillance monitoring on the third floor of EE in NTUST, and the result shows moderate surveillance performance: 93% accuracy in entrance counting, and 76% accuracy in identification checking. Keywords: people tracking system, surveillance, color histograms classifier, Gaussian weighted individual separation, movement prediction.
1 Introduction Surveillance is always a concerned issue for security reasons. The continuing cost reduction in cameras has increased the number of camera installations. With this trend of deploying more and more cameras, it is practically impossible to rely on human inspectors to track monitored objects (people) from different camera channels at the same time. In contrast to RFID-based solutions to tracking people, the focus of this work is to investigate solutions that do not require people’s conscious participation to identify them. Efficient solutions under the premise of no active involvement from people are likely to lead to more context awareness applications, where services are provided based on a particular person at a particular place and moment. Many methods for people tracking have been proposed in the literature: an approximate color histogram is used to recognize human positions by their silhouette [1]-[2]; however, the color histogram is easily influenced by changes of image distance.
RFID is proposed [3] to detect people’s location; however, this approach requires people to wear a tag to be identified. The scope of the people tracking problem addressed in this paper is outlined as follows: The target of monitored coverage is a floor plan composed of corridors, exits/entrances to a room or the floor, and access to elevators and stairways; the floor plan is partitioned into adjacent areas. The concerned issue is to track the whereabouts (a particular area in the floor plan) of a distinguished individual. Instead of distinguishing who is who, which requires a priori information on the names of people, we are interested in which is which, which for instance requires distinguishing a person with particular dressing from another. The equipment employed is a group of distributed ceiling-mounted cameras, each of which is installed at a fixed height and with a fixed coverage at the junction of two areas. To address the issue raised above, a Head Center Determination (HCD) approach is proposed to detect a person’s presence under a camera. The same notion of head center is also used to separate individuals from an overlapped image possibly containing several people using another proposed approach: Gaussian Weighted Individual Separation (GWIS). Since a person appears differently under different viewing angles from the ceiling-mounted camera, physiological contours viewed from four different view angles are taken into account. Moreover, a Color Histogram Classifier (CHC) is proposed for both easier matching operation and reduced sensitivity to lighting effects. Particular attention is also paid to the status tracking of each entrant and the coordination between different cameras to yield a comprehensive view on people’s whereabouts on the floor plan.
Fig. 1. ROI of an IP Cam. (a) Snapshots of PTS. (b) Specification of a ROI.
Fig. 2. Setting of each ROI through areas
2 People Tracking in a Building Floor Plan Our solution to the PTS problem covering the whole floor plan relies on the efficient approaches of HCD, GWIS and CHC for the processing of the image from each ROI. The three key elements are introduced below. A. Head Center Determination (HCD) With a fixed camera, a moving object can be obtained by applying background image subtraction. Easy to apply and efficient in delivering results, this method is suited for real-time processing. Background image subtraction is to maintain a background image first and then subtract the background image from subsequent images. After the subtraction, three difference images P_{d_r}, P_{d_g}, and P_{d_b} associated with the current R, G, and B color channels can be obtained. Additional Y, Cb, and Cr as well as H, S, and I components can be obtained from R, G, and B [5]. Among these, R, G, B, Y, Cb, Cr, and S are employed to separate hair, skin, and the rest of the foreground from the background. The thresholds are empirical values from numerous test samples. In this work, the head region is composed of the hair region and the skin region. The blue part in Fig. 3 represents the head region and the white part is considered as foreground. According to experimental results, the head size should be larger than 180 pixels to claim a potential pedestrian’s existence; without this threshold, some objects with a color similar to a human head but of smaller size may be incorrectly inferred as people. The number of foreground pixels is critical in determining whether to enter the background update procedure. In general, when the number of foreground pixels is lower than an empirical threshold of 640, it means the current image contains no pedestrian. Consequently, the PTS skips the tracking procedure and enters the background update phase, which can significantly reduce the computational complexity otherwise required.
Fig. 3. Image binarization method. (a) Current image. (b) Analyzed image.
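A sketch of the background-subtraction step described above is given below. The per-channel difference images and the RGB-to-YCbCr conversion follow the text; the hair/skin and difference thresholds are the authors' empirical values and are not listed, so the values used below are placeholders only.

```python
# Sketch of the background subtraction in HCD: per-channel difference images
# P_d_r, P_d_g, P_d_b are formed, and Y/Cb/Cr components are derived from
# RGB for the hair/skin tests. The thresholds below are placeholders only,
# not the authors' empirical values.
import numpy as np


def difference_images(frame, background):
    """frame, background: HxWx3 uint8 RGB images (160x120 in the paper)."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff[..., 0], diff[..., 1], diff[..., 2]  # P_d_r, P_d_g, P_d_b


def rgb_to_ycbcr(frame):
    r, g, b = [frame[..., i].astype(np.float32) for i in range(3)]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr = 0.500 * r - 0.419 * g - 0.081 * b + 128.0
    return y, cb, cr


def foreground_mask(frame, background, diff_threshold=30):
    # Placeholder rule: a pixel is foreground when any channel differs enough.
    p_d_r, p_d_g, p_d_b = difference_images(frame, background)
    return (p_d_r > diff_threshold) | (p_d_g > diff_threshold) | (p_d_b > diff_threshold)
```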
The background update procedure is to withstand the lighting effects in the environment when the system is operated in different time frames (morning, afternoon, and night). When the number of foreground pixels is lower than a predefined threshold for a total of 300 successive frames, the PTS updates the current image as the new background image. To give a rough estimate of how often the background is updated: it takes 16~79 ms to process one image from a camera, and there are a total of eight cameras. Hence, a total of 300 successive images takes about 4.8~23 seconds, which is roughly the time needed to enable the background update procedure in this work.
$C_{ch} = \begin{cases} 0, & P_f > T_{BU} \\ C_{ch} + 1, & P_f \le T_{BU} \end{cases}$  (1)
where the subscript ch denotes the channel index of each camera; P_f denotes the number of foreground pixels calculated by background image subtraction; T_BU denotes the foreground pixel threshold for background update, which is set at 400, for deciding the next object tracking step; and C_ch denotes the number of times that the number of foreground pixels of a channel is lower than the background update threshold. To avoid the number of foreground pixels of the newly updated background image being close to T_BU, once C_ch reaches three hundred, the PTS updates the background only when P_f < 0.8 T_BU. In the proposed PTS, the image fed from each camera is of size 160x120. Through analyzing more than seven hundred pedestrians passing through the environment, we conclude that the minimum head requires at least 180 pixels. A mask of size 17x17 is used to decide the head region: when a pixel of the image is known as a foreground pixel, the pixel is considered as the center of the mask, and the totality of the pixels of head color inside this mask is calculated. When the totality is greater than 180 pixels, the PTS concludes that the center point of the mask is a head region. The formula is given as
$P_{h,i,j} = \begin{cases} 0, & (P_f < T_{BU}) \ \text{OR} \ (C_{h,i,j} < 180) \\ 1, & (P_f > T_{BU}) \ \text{AND} \ (C_{h,i,j} \ge 180) \end{cases}$  (2)
where C_{h,i,j} denotes the number of head pixels with center position (i,j); P_{h,i,j} denotes the possible head region when the value is 1. Figure 4 shows a processed example, in which the blue pixels represent the head area, and the red pixels represent the possible head center.
Fig. 4. Analyzing position of human head. (a) Current image. (b) Analyzed image.
The parameter P_h,i,j is then used to analyze the number of people. First, we denote M_s as a 17x17 mask used for scanning possible human head locations, and P_m,i,j as a 160x120 array storing marks for human blobs, initially set to 0 for every pixel. The process is organized as

$$P_{m,i,j} = \begin{cases} 0, & \text{if } (\forall \{x,y\} \in M_s,\ P_{m,i+x,j+y} = 0) \ \text{AND} \ (P_{h,i,j}=0); \\ P_{m,i,j}+1, & \text{if } (\forall \{x,y\} \in M_s,\ P_{m,i+x,j+y} = 0) \ \text{AND} \ (P_{h,i,j}=1); \\ P_{m,i,j}, & \text{if } (\forall \{x,y\} \in M_s,\ P_{m,i+x,j+y} = 1) \ \text{AND} \ (P_{h,i,j}=1). \end{cases} \qquad (3)$$
Eq. (3) assigns a number to each head center. A more accurate head center can be obtained by averaging the coordinates of the pixels carrying the same number in P_m,i,j.

B. Gaussian Weighted Individual Separation (GWIS)

Once the head center is determined as stated in the previous sub-section, the proposed GWIS is applied to extract individual contours from the foreground image. According to experimental results, the maximum width of a human contour is around 50 pixels, so a mask of size 51x51 is employed to detect the human contour. In essence, pixels close to the head center have a higher likelihood of belonging to that person's contour; hence the mask has a Gaussian shape, with the head center mapped to the center of the mask. Notably, the red channel is more sensitive to lighting effects than blue and green; as a result, two different Gaussian masks are used for the two kinds of color channels, red versus green/blue. In this work, T_P_R and T_P_GB denote the adaptive Gaussian thresholds for red and green/blue, respectively, given in the following general form:

$$T_P = \begin{cases} C\left[\dfrac{1}{2\pi\sigma_x\sigma_y}\, e^{-\left(\frac{x^2}{2\sigma_x^2}+\frac{y^2}{2\sigma_y^2}\right)}\right] + T_L & \text{if } (x^2+y^2) < D \\[2mm] T_L & \text{if } (x^2+y^2) \ge D \end{cases} \qquad (4)$$
where (C, σ_x, σ_y, T_L, D) = (2000, 10, 10, 40, 254) and (2500, 13, 13, 30, 342) for T_P_R and T_P_GB, respectively. Figure 5 shows a binarized result using the proposed GWIS, where the blue frame indicates the head center found by the proposed HCD method. After the application of GWIS to human contour detection, overlapped pixels
Fig. 5. Binarized result using the proposed GWIS. (a) Test image. (b) Binarized result.
Fig. 6. Overlapped pixels elimination. (a) Current image. (b) Binarized result without overlapped pixels removal. (c) Binarized result with overlapped pixel removal.
may result when two walkers are too close to each other, as shown in Fig. 6(b). Similarity in color structure is used to separate the individual contours. To do so, the head pixels are removed from the contour pixels, and the average of the remaining contour pixels is calculated as A^i_BC, where i denotes the i-th person. The overlapped pixels are classified by comparing their values with each A^i_BC and are assigned to the most similar person's contour. A classified example is shown in Fig. 6(c); clearly, the proposed strategy can effectively solve the overlapped-pixel problem. Some small broken objects surrounding the main objects can be observed in Figs. 5-6. These spurious objects are eliminated using an image patch method, where two masks of sizes 3x3 and 5x5 are used to raster-scan the binarized image. When a foreground pixel is detected with the 3x3 mask, the 5x5 mask is applied to check whether the number of foreground pixels in this area exceeds 15; if not, the pixel is removed from the foreground candidates. Figure 7 shows a processed example, and a sketch of the Gaussian threshold mask of Eq. (4) is given after the figure.
Fig. 7. Result of image patch method. (a) Original binarized result. (b) Processed result by the image patch method.
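The adaptive threshold mask of Eq. (4) can be precomputed once per channel type and reused for every detected head center. The sketch below simply evaluates the reconstructed formula with the parameter sets quoted above; it is illustrative only and does not cover how the mask is compared against the difference image.

```python
import numpy as np

def gaussian_threshold_mask(C, sigma_x, sigma_y, T_L, D, size=51):
    """Adaptive Gaussian threshold mask of Eq. (4), centred on the head center."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    gauss = np.exp(-(x**2 / (2.0 * sigma_x**2) + y**2 / (2.0 * sigma_y**2)))
    gauss /= 2.0 * np.pi * sigma_x * sigma_y
    return np.where(x**2 + y**2 < D, C * gauss + T_L, float(T_L))

# Parameter sets (C, sigma_x, sigma_y, T_L, D) from the text.
T_P_R = gaussian_threshold_mask(2000, 10, 10, 40, 254)    # red channel
T_P_GB = gaussian_threshold_mask(2500, 13, 13, 30, 342)   # green/blue channels
```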
C. Color Histogram Classifiers (CHC)

This sub-section introduces how CHC is used to differentiate and characterize each individual. The color spaces employed in this work are HSV and YCbCr. The major aim of CHC is to distribute unstable data into different fuzzy sets [6] according to their degrees of membership. In this work, multiple fuzzy sets over the color range 0~255 are used. A fuzzy model is illustrated in Fig. 8(a), in which x denotes the gray level, f(x) is the degree of membership, and the range between S_0 and S_1 is called the fuzziness. The relationship between a fuzzy set and the degree of membership is given as

$$\mu_{A_1}(x) = \begin{cases} \dfrac{x-S_0}{S_1-S_0}, & S_0 < x < S_1; \\ 1, & S_1 \le x \le S_2; \\ \dfrac{S_3-x}{S_3-S_2}, & S_2 < x < S_3; \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
where μ_A1(x) denotes the characteristic function of the set A_1. According to tests on numerous samples, the optimized number of fuzzy sets is eight, and the optimized
Fig. 8. Color histograms classifier. (a) Model of fuzzy set. (b) Classifier.
Fig. 9. Range of four identities
fuzziness range is 30. Figure 8(b) shows the histogram classifier composed of the eight fuzzy sets in the proposed PTS. Through normalization, the three color histograms of the detected contour are stretched to 0~255 and multiplied by the color histogram classifier in Fig. 8(b) to obtain a 48-byte identification signature composed of Cb, Cr, and S. Every two bytes record the accumulated degree of membership of one fuzzy set as

$$C_{A_f} = C_{A_f} + V_x \cdot \mu_{A_f}(x), \qquad (6)$$
where A_f denotes the f-th fuzzy set in the classifier, C_A_f denotes the accumulated data of A_f, V_x denotes the color-histogram value at input level x in 0~255, and f = 0, ..., 7. In practice, we observed that the histogram of a detected contour is significantly influenced by the capturing angle of the ceiling-mounted camera. To accommodate these differences, an ROI is divided into four co-centric areas, as shown in Fig. 9. Four identification signatures, associated with the contour images taken in the four respective areas, are constructed for each individual; together they form the identity signature of the current user ID until he/she leaves the floor plan. Throughout the operation, the proposed PTS, composed of several ROIs, keeps the identity signatures of all people currently active in the system. Checking a person's identity signature against all active identity signatures is initiated when an ROI fails to track the object, or when the pedestrian walks into a different ROI. Following [7], the margin of error E_d is evaluated as

$$E_d = E_d + \left| C^{i}_{A_f} - C^{*}_{A_f} \right| \qquad (7)$$
where C*_A_f and C^i_A_f denote the current identity and the identities in the system's database, respectively. A threshold, empirically 800, is applied to E_d to determine whether a new user ID is created. When several C^i_A_f yield a margin of error smaller than the threshold, PTS selects the one with the minimum E_d as the detected object's identity. A sketch of the signature construction and matching follows.
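The sketch below outlines Eqs. (5)-(7). The break points of the eight fuzzy sets are not listed in the paper (only the number of sets and the fuzziness of 30 are given), so they are passed in as an assumed parameter; the error of Eq. (7) is accumulated as an absolute difference.

```python
import numpy as np

def trapezoid_membership(x, s0, s1, s2, s3):
    """Eq. (5): trapezoidal degree of membership of gray level x in one fuzzy set."""
    if s0 < x < s1:
        return (x - s0) / (s1 - s0)
    if s1 <= x <= s2:
        return 1.0
    if s2 < x < s3:
        return (s3 - x) / (s3 - s2)
    return 0.0

def fuzzy_signature(histogram, fuzzy_sets):
    """Eq. (6): accumulate a 256-bin color histogram (Cb, Cr or S) into the
    eight fuzzy sets; `fuzzy_sets` is a list of (S0, S1, S2, S3) tuples."""
    sig = np.zeros(len(fuzzy_sets))
    for x, v_x in enumerate(histogram):
        for f, params in enumerate(fuzzy_sets):
            sig[f] += v_x * trapezoid_membership(x, *params)  # C_Af += Vx * mu_Af(x)
    return sig

def matching_error(sig_db, sig_query):
    """Eq. (7): accumulated margin of error between two identity signatures."""
    return float(np.abs(np.asarray(sig_db) - np.asarray(sig_query)).sum())
```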
3 Experimental Results

Figure 10 shows the overall appearance of the GUI, which comprises eight components: 1) a real-time view of the images captured by all cameras installed on the floor plan; 2) a detected-object view for the current camera; 3) a background update window; 4) a message window showing each person's entrance into and departure from an ROI; 5) function buttons used to start the facility connection, manually update the background, start and stop operation, and delete data; 6) a database query for the people-tracking history of a selected ROI and time interval; 7) a response view for query results; and 8) a historical-trajectories view that replays a selected object's path from its entrance to the floor plan until departure.
Fig. 10. GUI of the proposed PTS
The proposed PTS was operated continuously on the third floor of NTUST for three hours, from 11:00 to 14:00 on 2008-08-25. The Correct Detection Rate (CDR) of the proposed method has been checked against the Ground Truth (GT) in the recorded video. The experimental results are organized into three types of activities:
1. Checking Leaving Direction (CLD): the number of people entering and then leaving the current ROI, i.e., the number of people passing through the current ROI.
2. Checking User ID (CUI): the number of correct user-ID matches when the designated person leaves one ROI and enters another, i.e., the number of people correctly tracked between ROIs.
3. Checking Merges (CM): the individual counts of people who enter an ROI as a group, i.e., the number of people separated from a cluster and counted individually.
An inspection of the failure cases suggests the following possible reasons:
1. Lighting effect: the lighting conditions under Cam 1, Cam 6, Cam 7, and Cam 8 are not equal. When a person walks through a dark area, PTS may mistakenly take his/her shadow as part of the foreground.
2. Unreachable area: the corridor widths covered by Cam 5 and Cam 8 are larger than the ROI, so a passing person may not be detected by the cameras above.
3. Object similarity: this work uses ceiling-mounted cameras to look down on the objects, so far fewer features are available than in lateral or frontal views; individuals with similar contours or outfit colors are therefore difficult to separate.
Table 1 summarizes the overall performance: with 750 people passing through all ROIs during the three-hour period, the CDRs of CLD, CUI, and CM are 93%, 76%, and 70%, respectively. Regarding processing time, the proposed PTS can process 84~495 frames/sec with images of size 160x120 using eight cameras. On average, it takes 1.706 ms to process an image when no object enters the system, and 10.8 ms when objects are present. We now compare the features of the proposed PTS with some related works in the literature. Under the assumption that a walking person is unlikely to change his/her clothes, the studies in [8] and [1] use color histograms of the foreground object as the signature. If two people remain overlapped, the signature obtained from color histograms cannot be trusted, and the CDR decreases accordingly. In [4], overhead analysis is also adopted by setting one camera on the ceiling to avoid the overlapping problem; however, the scope is limited to one camera and thus one ROI. In contrast, the coverage of the proposed PTS is a whole floor plan monitored by several cameras, so coordination among different ROIs is important. We likewise set the cameras on the ceiling to avoid the overlapping problem, and the CHC is employed to provide stable identification data across multiple cameras, which keeps the identifying color histograms from being corrupted by lighting effects.

Table 1. Overall performances

                         CLD    CUI    CM
Ground truth             750    750    79
Result                   701    572    55
Correct detection rate   93%    76%    70%
4 Conclusions

In this work, we investigate the design and implementation of a PTS based on distributed ceiling-mounted cameras. By dividing the monitored floor plan into adjacent areas, the
proposed PTS keeps track of not only the number of people entering each area but also the trajectory of each entrant. Each entrant's physiological contour in an ROI under a camera is characterized from four different viewing angles. With the proposed HCD and GWIS, the head centers and the associated physiological contours are determined. Surveillance experiments on the third floor of the EE department at NTUST have been conducted. With techniques geared to reducing the processing time at each ROI and to pedestrian handover between ROIs, the system response time collected from our experiment is satisfactory: with eight cameras installed, the proposed PTS updates images at 240 frames/sec for the whole system, or on average 10 frames per second for each camera. The accuracy of the proposed system has been checked against recorded video. The results show moderate surveillance performance: 93% accuracy in entrance counting and 76% accuracy in identification checking.
References

1. Bahadori, S., Iocchi, L., Leone, G.R., Nardi, F., Scozzafava, L.: Real-Time Location and Tracking Through Fixed Stereo Vision. Applied Intelligence 26(2), 83–97 (2007)
2. Krumm, J., Harris, S., Meyers, B., Brumitt, B., Hale, M., Shafer, S.: Multi-Camera Multi-Person Tracking for EasyLiving. In: Proceedings of the 2000 Third IEEE International Workshop on Visual Surveillance, pp. 3–10 (July 2000)
3. Nohara, K., Tajika, T., Shiomi, M., Kanda, T., Ishiguro, H., Hagita, N.: Integrating Passive RFID Tag and Person Tracking for Social Interaction in Daily Life. In: Proceedings of the 17th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN, art. no. 4600723, pp. 545–552 (2008)
4. Velipasalar, S., Tian, Y.L., Hampapur, A.: Automatic counting of interacting people by using a single uncalibrated camera. In: Proc. 2006 IEEE International Conference on Multimedia and Expo, pp. 1265–1268 (2006)
5. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision, 2nd edn. PWS Publishing (1998)
6. Negnevitsky, M.: Artificial Intelligence: A Guide to Intelligent Systems. Addison-Wesley, Reading (2002)
7. Piva, S., Zara, M., Gera, G., Regazzoni, C.S.: Color-based video stabilization for real-time on-board object detection on high-speed trains. In: Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 299–304 (July 2003)
8. Krumm, J., Harris, S., Meyers, B., Brumitt, B., Hale, M., Shafer, S.: Multi-camera multi-person tracking for EasyLiving. In: Proceedings of the 2000 Third IEEE International Workshop on Visual Surveillance, pp. 3–10 (July 2000)
Human-Centered Fingertip Mandarin Input System Using Single Camera

Chih-Chang Yu1, Hsu-Yung Cheng2, Bor-Shenn Jeng3, Chien-Cheng Lee3, and Wei-Tyng Hong3

1 Department of Computer Science and Information Engineering, Vanung University, No. 1, Wannung Rd., Jhongli City, Taoyuan, Taiwan
2 Department of Computer Science and Information Engineering, National Central University, No. 300, Jhongda Rd., Jhongli City, Taoyuan, Taiwan
3 Department of Communications Engineering / Communication Research Center, Yuan-Ze University, Chung-Li, Taoyuan, Taiwan
Abstract. Designing a user friendly Chinese input system is a challenging task due to the logographic nature of Chinese characters. Using fingertips and cameras to replace pens and touch panels as input devices could reduce the cost and improve ease-of-use and comfort of the computer-human interface. In this work, Chinese character entry is achieved using Mandarin Phonetic Symbol (MPS) recognition via on-line fingertip tracking. In the proposed system, particle filters are applied for robust fingertip tracking. Afterwards, MPS recognition is performed on the tracked fingertip trajectories using Hidden Markov Models. In the proposed system, the challenges of entering, leaving, and virtual strokes caused by video-based fingertip input can be overcome. We conduct experiments to validate that the MPS symbols written by fingertips are successfully and efficiently recognized using the proposed framework. Keywords: Vision-based; Mandarin Phonetic Symbol recognition.
1 Introduction

With the prevalence of mobile devices and all-in-one computers, consumers desire a better and easier-to-use input interface. Pen computing emerged from the urgent need for convenient input interfaces; however, pens are easily lost, which causes great inconvenience. Using fingertips instead of pens is therefore more favorable for users. Special devices have been investigated to provide more convenient input interfaces, such as the pressure-sensitive marking menus described in [1]. Touch panels that require no pens have also become very popular on mobile devices such as smart phones in recent years; however, touch screens are thicker and heavier than normal display devices. Video-based fingertip input is therefore a possible design for human interfaces in small, lightweight portable devices, since both high-tier and low-tier cell phones are equipped with cameras nowadays. As for laptop or desktop computers, touch panels are criticized as not conforming to ergonomics under some circumstances. Especially for the emerging all-in-one computers, using
touch panels as the input interface is not convenient because users need to reach toward the screen. By incorporating video-based fingertip tracking techniques, the proposed video-based input concept allows users to control the computer and input data with ease. The idea of the proposed input interface is to let users write with their fingers on any surface. A video-based fingertip input system is an alternative input method with high potential that provides improved ease-of-use and comfort to mobile device and computer users. Chinese character input is challenging because it is not as straightforward as typing alphabet-based languages. Hanyu Pinyin and Mandarin Phonetic Symbols (MPS) are two important methods for Chinese character entry. In Taiwan, the Mandarin Phonetic Symbol set, which includes the 37 symbols shown in Figure 1, is the most prevailing Chinese input method. However, many users find the MPS layout on keyboards or keypads unsatisfactory, especially on all-in-one and mobile devices. In this paper, we propose a Chinese input system via Mandarin Phonetic Symbols using video-based fingertip tracking with a single camera. The rest of the paper is organized as follows. In the next section, the proposed system is explained in detail. In Section 3, we demonstrate the experimental results. In Section 4, conclusions are made and future works are discussed.
Fig. 1. The 37 Mandarin Phonetic Symbols
2 The Proposed System

Figure 2 illustrates the framework of the proposed system. The fingertip is identified and tracked in the input video automatically so that its trajectory can be recorded. The recorded trajectory of the fingertip is then recognized as Mandarin Phonetic Symbols. Finally, the Chinese characters are determined from the input Mandarin Phonetic Symbols and user selection. The fingertip detection and tracking procedures are described in subsection 2.1, and recognition of Mandarin Phonetic Symbols is elaborated in subsection 2.2.

2.1 Fingertip Detection and Tracking

The fingertip is detected by adopting the method described in [2]. First, the hand silhouette is extracted using background subtraction. Then, we employ the inner border tracing algorithm to extract the border of the silhouette; these ordered pixels are called the border-pixel-vector (BPV). As shown in Figure 3(a), we select the middle point P_m of the line segment P_1P_2 as the reference point. Then, we calculate the Euclidean
distance between P_m and each BPV point in counterclockwise order. The resulting distance distribution, shown in Figure 3(b), has its global maximum at the location of the fingertip (Figure 3(c)).
Fig. 2. Proposed System Framework
Fig. 3. Fingertip Detection
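A minimal sketch of this detection rule is given below. The interpretation of P1 and P2 as the two border points where the arm crosses the image boundary is our assumption; the paper only states that their midpoint Pm is used as the reference point.

```python
import numpy as np

def detect_fingertip(bpv, p1, p2):
    """Return the border point farthest from the reference point P_m.

    `bpv` is the border-pixel-vector: an (N, 2) array of border coordinates in
    counterclockwise order; `p1` and `p2` define the reference line segment."""
    bpv = np.asarray(bpv, dtype=float)
    p_m = (np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float)) / 2.0
    distances = np.linalg.norm(bpv - p_m, axis=1)  # distance distribution, Fig. 3(b)
    return bpv[int(np.argmax(distances))], distances
```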
We apply a particle filter to track the target fingertip in this work. Particle filters can handle non-linear motions and non-Gaussian noise [3] and are therefore suitable for targets that are close to the camera and exhibit dramatic motion changes that cannot easily be modeled by Kalman filters. Various designs of particle filters have been applied in different vision-based applications and systems [3]-[6]. Formulating the tracking problem as a dynamic system, the distribution p(x_k | z_{1:k-1}) is obtained via the Chapman–Kolmogorov equation [4] in the prediction stage, under the assumption that the pdf p(x_{k-1} | z_{1:k-1}) at time instance k-1 is known, as shown in Eq. (1). Here x_k denotes the system state at time instance k and z_{1:k-1} denotes the measurements from time instance 1 to k-1. In the update stage,
when the measurement z_k at time instance k is available, it is used to update the required pdf p(x_k | z_{1:k}) using Eq. (2). The particle filter approximates the probability distribution of the system state by a sample set x_k^(i), i = 1, ..., N_s, each associated with a weight w_k^(i). In the proposed framework, we use an ellipse to model the region of the target fingertip and assume that the ratio of the major axis a_k to the minor axis b_k remains constant at every time instance k, so only the position, scale, and rotational angle of the ellipse need to be considered. The system state at time instance k is thus defined as x_k = [u_k, v_k, s_k, θ_k]^T, where u_k and v_k are the coordinates of the target fingertip in the image plane, s_k is the scaling factor, and θ_k is the rotational angle of the ellipse.

$$p(x_k \mid z_{1:k-1}) = \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid z_{1:k-1})\, dx_{k-1} \qquad (1)$$

$$p(x_k \mid z_{1:k}) = \frac{p(z_k \mid x_k)\, p(x_k \mid z_{1:k-1})}{p(z_k \mid z_{1:k-1})} \qquad (2)$$

where p(z_k | z_{1:k-1}) = ∫ p(z_k | x_k) p(x_k | z_{1:k-1}) dx_k.
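A minimal sampling-importance-resampling sketch for the ellipse state [u, v, s, θ] follows. The random-walk motion model and the placeholder likelihood function are assumptions: the paper does not specify the proposal distribution or the exact form of p(z_k | x_k).

```python
import numpy as np

def particle_filter_step(particles, weights, measure_likelihood,
                         motion_noise=(2.0, 2.0, 0.02, 0.05)):
    """One predict/update/resample cycle for the state x = [u, v, s, theta].

    `particles` is an (Ns, 4) array, `weights` an (Ns,) array, and
    `measure_likelihood(state)` returns p(z_k | x_k) for a single state."""
    ns = len(particles)
    # Prediction: a simple random-walk motion model (an assumption).
    particles = particles + np.random.randn(ns, 4) * np.asarray(motion_noise)
    # Update: re-weight the particles by the measurement likelihood.
    weights = weights * np.array([measure_likelihood(p) for p in particles])
    weights = weights / max(weights.sum(), 1e-12)
    # Systematic resampling to avoid weight degeneracy.
    positions = (np.arange(ns) + np.random.rand()) / ns
    indices = np.searchsorted(np.cumsum(weights), positions)
    particles = particles[np.clip(indices, 0, ns - 1)]
    weights = np.full(ns, 1.0 / ns)
    estimate = particles.mean(axis=0)   # point estimate of the fingertip ellipse
    return particles, weights, estimate
```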
2.2 Mandarin Phonetic Symbol Recognition

For recognizing symbols and characters, Optical Character Recognition (OCR) and Online Character Recognition (OLCR) have both made significant advances over the past decades and achieved very high recognition rates. However, the characters obtained by the proposed video-based fingertip input system differ from traditional OCR or OLCR characters obtained by scanners, touch panels, or pen tablets in several key ways. First, in our application the characters or symbols include an entering stroke and a leaving stroke. Second, all virtual strokes are mixed with the actual strokes. Moreover, the trajectories obtained by video object tracking may not be smooth. The challenges of the proposed system are therefore greater than those of traditional OCR and OLCR. As shown in Figure 4, the entering stroke and the leaving stroke are the first and last virtual strokes, respectively, and the starting stroke and ending stroke are the first and last real strokes. There can also be intermediate virtual strokes between the real strokes (as shown in Figure 4(b)). We have observed that the intermediate virtual strokes are fairly stable, whereas the entering and leaving strokes are very unstable because the user's fingertip can enter the scene from anywhere. The unstable entering and leaving strokes severely interfere with the recognition process, so it is necessary to eliminate them to obtain reasonable recognition results. To distinguish the entering and leaving strokes from the other strokes, we detect the turning points of the tracked trajectory. First, the trajectory is encoded using an Inter-Frame Directional (IFD) code with eight directions, as shown in Figure 5. The trajectory of a phonetic symbol generates a sequence of code numbers; for example, one symbol could generate the code sequence 666660002222666666.
Fig. 4. Entering stroke, leaving stroke, starting stroke, and ending stroke

Fig. 5. The inter-frame directional code with eight directions
To detect the turning points, we first take the difference between the code of each frame c_t and the code of its previous frame c_{t-1}. The difference value d_t at frame t is defined in the following equation.
$$d_t = \begin{cases} c_t - c_{t-1} - 8 & \text{if } c_t - c_{t-1} > 4 \\ c_t - c_{t-1} + 8 & \text{if } c_t - c_{t-1} < -4 \\ c_t - c_{t-1} & \text{otherwise} \end{cases} \qquad (3)$$
When d_t equals zero, the direction of the fingertip does not change. When d_t is positive, the fingertip changes direction counterclockwise; conversely, it changes direction clockwise when d_t is negative. Two criteria are used for turning-point detection. More specifically, the fingertip position (u_t, v_t) is regarded as a turning point if
1. both |d_t| and |d_t + d_{t+1}| are above the threshold;
2. |d_t + d_{t-1}| is above the threshold.

In our experiments, the threshold value is 2, which corresponds to about 90 degrees. The detected turning points of a symbol are shown as red circles in Figure 4(a). The trajectory between two turning points can be regarded as a stroke. We check all strokes and eliminate those that are too short; such short strokes are usually noise and contribute little to recognition, so discarding them yields better recognition results. After obtaining the turning points, we can define the entering and leaving strokes, which correspond to the trajectories before the first turning point and after the last turning point. Sometimes, however, there is no leaving stroke, because the user may leave the scene very quickly, or the leaving stroke may have the same direction as the ending stroke. In such cases we should not eliminate the last stroke; thus, we check the length of the last stroke, treat it as an ending stroke and keep it if it is long enough, and otherwise treat it as the leaving stroke and discard it. Figure 4(a) and (b) show two recorded trajectories with entering and leaving strokes, and Figure 4(c) and (d) show the remaining strokes after entering- and leaving-stroke elimination. In recent years, Hidden Markov Models (HMMs) [7] have been extensively applied to speech recognition to handle segmentation, time warping, stochastic randomness, etc., and prominent results have been widely recognized in the open literature. The length of the IFD code sequence varies with writing speed, and HMMs are suitable recognition models for variable-length data; therefore, MPS recognition is performed using HMMs. A Hidden Markov Model is a finite set of states, each associated with a probability distribution; in a particular state, an observation is generated according to the associated distribution. The IFD codes of each sequence serve as the observations of the HMM. Given an HMM λ and a sequence of observations O = {O1, O2, ..., Ot}, the probability p(O|λ), which represents how likely it is that O was generated by λ, can be computed. Each Mandarin Phonetic Symbol s is trained independently to optimize its model parameters λ_s from a set of observation sequences O. Training is achieved with the Baum-Welch algorithm, and the likelihood is calculated using the forward algorithm [7]. A sketch of the IFD coding and turning-point detection is given below.
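The following sketch implements the IFD coding and the turning-point test. The exact direction-to-code layout of Fig. 5 did not survive extraction, so the mapping used here (code 0 pointing right, increasing counterclockwise) is an assumption, and the two criteria above are treated as alternatives.

```python
import numpy as np

def ifd_codes(trajectory):
    """Quantize the inter-frame motion of (u, v) points into eight directions."""
    traj = np.asarray(trajectory, dtype=float)
    codes = []
    for (u0, v0), (u1, v1) in zip(traj[:-1], traj[1:]):
        angle = np.arctan2(-(v1 - v0), u1 - u0)      # image y-axis points downwards
        codes.append(int(np.round((angle % (2 * np.pi)) / (np.pi / 4))) % 8)
    return codes

def code_difference(c_t, c_prev):
    """Eq. (3): wrap the code difference back into the range [-4, 4)."""
    d = c_t - c_prev
    if d > 4:
        return d - 8
    if d < -4:
        return d + 8
    return d

def turning_points(codes, threshold=2):
    """Frame indices regarded as turning points (threshold 2 ~ 90 degrees)."""
    d = [0] + [code_difference(c, p) for p, c in zip(codes[:-1], codes[1:])]
    turns = []
    for t in range(1, len(d) - 1):
        crit1 = abs(d[t]) >= threshold and abs(d[t] + d[t + 1]) >= threshold
        crit2 = abs(d[t] + d[t - 1]) >= threshold
        if crit1 or crit2:
            turns.append(t)
    return turns
```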
3 Experimental Results

In this section, we demonstrate experimental results that verify the feasibility and effectiveness of the proposed system. First, the intermediate tracking trajectories of one symbol are shown in Figure 6(a)-(h), where the entering and leaving strokes of the tracked trajectory are clearly visible. The effects of unstable entering and leaving strokes are demonstrated in Figure 7. Figure 8 shows the effect of the entering- and leaving-stroke elimination procedure: after eliminating these strokes from the original tracked trajectories (Figure 8(a)), the remaining strokes (Figure 8(b)) are much more stable and suitable for recognition.
Fig. 6. Intermediate tracking trajectories of the symbol
Fig. 7. Unstable entering and leaving strokes
Fig. 8. Tracked trajectories. Top row: original tracked trajectories. Bottom row: trajectories after removing entering and leaving strokes.
Twenty Mandarin Phonetic Symbol sets were collected from five different people, including one left-hander and four right-handers. We adopt the leave-one-out cross validation rule in order to obtain unbiased results. Because of confusable symbols, we use the cumulative matching score (CMS) to report the recognition results. A Chinese input system usually involves user selection, so the top-ranked symbols can be provided to the user for selection. Since there are fewer than three confusable symbols, we list only the top three matching results. Moreover, we examine the effect of the number of hidden states in the HMM on the recognition rate. Table 1 lists the recognition rate for different numbers of hidden states; the recognition rate (best 3 matches) is 98.38% with standard deviation 2.39 when the number of states equals 4. Figure 9 gives a clear comparison of the recognition rates for different numbers of states Q, which verifies the robustness of the proposed method. Note that the recognition rates in Table 1 and Figure 9 are before confusion-symbol verification. It is noticeable that some symbol pairs are totally dissimilar in appearance yet have quite similar inter-frame directional code sequences, which confuses the trained HMMs. Overall, the recognition rate using HMMs is
satisfying at rank 3. We believe that with further enhancement of the confusion-symbol verification procedure, recognition rates high enough for practical usage can be achieved.

Table 1. Recognition rate of the 37 phonetic symbols

Number of hidden states   Rank 1              Rank 2             Rank 3
Q=2                       70.41% ± 10.09%     88.92% ± 6.73%     96.49% ± 3.52%
Q=3                       81.22% ± 7.26%      92.97% ± 3.44%     96.89% ± 2.2%
Q=4                       82.84% ± 8.56%      94.59% ± 4.30%     98.38% ± 2.39%
Q=5                       81.35% ± 6.07%      93.92% ± 3.81%     97.57% ± 3.02%
Q=6                       81.49% ± 7.91%      93.51% ± 5.5%      97.30% ± 2.77%
Fig. 9. The recognition rate of 37 phonetic symbols using different numbers of hidden states
4 Conclusions

This paper proposes a video-based Chinese input system. Video-based input using fingertips is a new research topic that has not yet been widely explored in the literature, but it certainly has strong potential and high research value. Although this paper focuses on Mandarin Phonetic Symbols, the concept of the proposed video-based fingertip input method can also be applied to other input systems. Unlike alphabet-based languages, Chinese character input is more challenging because of its logographic nature. In this work, a particle filter is used to track the target fingertip, and the tracked trajectories are encoded by Inter-Frame Directional coding. After locating the turning points and strokes of the input, the entering and leaving strokes can be eliminated so that the recognition process would not
be affected by these unstable virtual strokes. Then, the Inter-Frame Directional code of each trajectory is analyzed by Hidden Markov Models for Mandarin Phonetic Symbol recognition. The experimental results verify the performance of the proposed system. For future work, we will investigate algorithms that can robustly locate and track fingertips against backgrounds of various colors and under different lighting conditions. Developing more sophisticated confusion-symbol verification procedures could also greatly improve recognition accuracy. We believe that the proposed input interface has high potential and can be applied to various devices.
Acknowledgments The authors would like to thank National Science Council of Taiwan for supporting this work. This work is supported by National Science Council under project code 99-2628-E-008-098.
References

[1] Lai, J.T.C., Li, Y., Anderson, R.: Donuts: Chinese Input with Pressure-Sensitive Marking Menu. In: ACM Symposium on User Interface Software and Technology (October 2005)
[2] Han, C.-C., Cheng, H.-L., Lin, C.-L., Fan, K.-C.: Personal Authentication Using Palmprint Features. Pattern Recognition 36(2), 371–381 (2003)
[3] Isard, M., Blake, A.: Condensation – Conditional density propagation for visual tracking. International Journal on Computer Vision 1(29), 5–28 (1998)
[4] Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
[5] Zheng, W., Bhandarkar, S.M.: Face detection and tracking using a Boosted Adaptive Particle Filter. Journal of Visual Communication and Image Representation 20(1), 9–27 (2009)
[6] Gustafsson, F., et al.: Particle filters for positioning, navigation and tracking. IEEE Transactions on Signal Processing 50(2), 425–437 (2002)
[7] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
Automatic Container Code Recognition Using Compressed Sensing Method

Chien-Cheng Tseng1 and Su-Ling Lee2

1 Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
[email protected]
2 Department of Computer Science and Information Engineering, Chung-Jung Christian University, Tainan, Taiwan
[email protected]
Abstract. In this paper, an automatic container code recognition method is presented by using compressed sensing (CS). First, the compressed sensing approach which uses the constrained L1 minimization method is reviewed. Then, a general pattern recognition framework based on CS theory is described. Next, the CS recognition method is applied to construct an automatic container code recognition system. Finally, the real-life images provided by trading port of Kaohsiung are used to evaluate the performance of the proposed method. Keywords: Container code recognition, Compressed sensing, L1 minimization.
1 Introduction

For efficient transportation, it is important to automatically and intelligently manage the containers at the gates of a trading port. Managing the gates by human inspection and manual registration to ensure correct container access is very time-consuming. To increase the competitiveness of the port, it is necessary to build an automatic system that records the exit and entry of containers using techniques from pattern recognition, image processing, and computer vision. Many academic research papers and commercial products have been reported for this purpose [1]-[7]; such systems are called container number recognition, container code recognition, or container identity number recognition systems. Generally speaking, these methods consist of four stages. First, the captured image is preprocessed to enhance its quality and eliminate unwanted distortion or noise. Second, the container code area is extracted to remove the background. Third, the container code is segmented into a character string. Finally, a character recognition engine is applied to recognize each character. These methods are briefly described below. In [1], a neural network model is applied to container number recognition; this method attains a high level of accuracy even when the input patterns are highly distorted. In [2], a vehicle and container number recognition system is developed using techniques of image enhancement, fast focusing, variable thresholding, image
normalization, neural networks, and a domain-knowledge filter. Under outdoor environment tests, this method achieves an accuracy higher than 95%. In [3], the segmentation of the container code is carried out by applying a top-hat morphological operator, thresholding, connected component labeling, and a size filter. In [4][5], the container image is filtered with both adaptive linear and nonlinear filters to reduce noise; tests show that character lines can be properly located at a rate above 98%. In [6], feature-based local intensity gradient and adaptive multi-threshold methods are combined to segment the codes from the background with high performance. In [7], a commercial product called the SeeGate container code recognition system has been developed. On the other hand, a new signal acquisition method has emerged in recent years, named compressed sensing (CS) or compressive sampling. This approach states that a signal can be precisely recovered from only a small set of measurements under certain conditions of sparsity and incoherence [8]. So far, CS theory has been successfully applied to various signal processing areas including image compression, face recognition, speech recognition, radar, astronomy, medical imaging, video coding, and DNA microarrays [9]-[17]. Given the success of compressed sensing, it is interesting to use this method to develop a recognition system that can recognize container codes in various situations; the purpose of this paper is to solve this problem, as shown in Fig. 1. This paper is organized as follows: In Section 2, the compressed sensing approach based on constrained L1 minimization is reviewed. In Section 3, a general recognition framework based on CS theory is described. In Section 4, the CS recognition method is applied to construct an automatic container code recognition system. In Section 5, experimental results demonstrate the effectiveness of the proposed method. Finally, concluding remarks are made.
Fig. 1. The container code recognition problem. The input is the captured image and the output is the container code with 11 characters (e.g., GSTU3601680).
2 Compressed Sensing

In this section, the compressed sensing (CS) theory is reviewed briefly. The CS framework has two parts: measurement and reconstruction. These two parts are described below. Let the signal be x = [x_1, x_2, ..., x_N]^T, where N is the signal length. To acquire the information of the signal x, we choose M basis vectors φ_k = [φ_k1, φ_k2, ..., φ_kN]^T (k = 1, 2, ..., M) to sense this signal. The measured values are given by
$$y_k = \langle x, \phi_k \rangle, \qquad k = 1, 2, \ldots, M \qquad (1)$$
where <,> denotes the inner product. Using matrix representation, (1) can be written in the form

$$y = \Phi x \qquad (2)$$
The matrix Φ and vector y above are

$$\Phi = \begin{bmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{M1} & \phi_{M2} & \cdots & \phi_{MN} \end{bmatrix} \qquad (3a)$$

$$y = [y_1, y_2, \ldots, y_M]^T \qquad (3b)$$
In many signal processing applications, the signal x is very sparse; that is, x contains only K non-zero elements, where K is much less than N. Based on this sparseness assumption, we can reconstruct the signal x = [x_1, x_2, ..., x_N]^T from the measurement y = [y_1, y_2, ..., y_M]^T even though M is much less than N. Thus, y is a sparse representation of the signal x. Next, the details of reconstruction are described. In the literature, two reconstruction cases are studied: the noiseless case and the noisy case. In the noiseless case, the signal x can be recovered from the measurement y by solving the constrained L1 minimization problem

$$\min \|x\|_1 \quad \text{subject to} \quad y = \Phi x \qquad (4)$$

where the L1 norm is defined by

$$\|x\|_1 = \sum_{i=1}^{N} |x_i| = |x_1| + |x_2| + \cdots + |x_N| \qquad (5)$$
The problem in (4) can easily be solved using linear programming. In the noisy case, the measurement is corrupted by noise, that is,

$$y = \Phi x + v \qquad (6)$$

where v is a zero-mean white noise vector of size M × 1. In this case, L1 minimization with a quadratic constraint is used to recover the signal x approximately:

$$\min \|x\|_1 \quad \text{subject to} \quad \|y - \Phi x\|_2 \le \varepsilon \qquad (7)$$

where ε bounds the amount of noise in the measurement. This problem can be solved by convex programming.
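A minimal sketch of the recovery step in Eq. (7) is shown below. The choice of the cvxpy package is ours; the paper only states that the problem is solved by convex programming.

```python
import numpy as np
import cvxpy as cp

def l1_recover(phi, y, eps=1e-3):
    """Solve Eq. (7): minimize ||x||_1 subject to ||y - Phi x||_2 <= eps."""
    x = cp.Variable(phi.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm1(x)),
                         [cp.norm2(phi @ x - y) <= eps])
    problem.solve()
    return x.value

# Small demonstration: recover a 3-sparse signal from M << N random measurements.
rng = np.random.default_rng(0)
N, M = 200, 60
x_true = np.zeros(N)
x_true[rng.choice(N, 3, replace=False)] = rng.standard_normal(3)
phi = rng.standard_normal((M, N)) / np.sqrt(M)
x_hat = l1_recover(phi, phi @ x_true)
```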
3 Pattern Recognition Using Compressed Sensing

In this section, the conventional pattern recognition method based on similarity matching is first reviewed; a pattern recognition approach using compressed sensing is then developed. Given the pre-stored template vectors ψ_1, ψ_2, ..., ψ_L and an input pattern P, the conventional recognition procedure is as follows:

Step 1: Extract the feature vector Y from the input pattern P.
Step 2: Compute the similarity score between the feature vector Y and each template vector ψ_1, ψ_2, ..., ψ_L. One typical choice is the normalized correlation

$$s_i = \frac{\langle Y, \psi_i \rangle}{\|Y\|\,\|\psi_i\|}, \qquad i = 1, 2, \ldots, L \qquad (8)$$
Step 3: Sort the similarity scores and choose the highest one to make the decision. That is, if

$$k = \arg\max_{i}\, s_i \qquad (9)$$
then the input pattern P belongs to the k-th class. From the above procedure, it is clear that the two main tasks of the conventional method are similarity matching and score sorting. Using the sparse representation of compressed sensing, these tasks can be accomplished by solving a constrained L1 minimization, as described below. When compressed sensing theory is applied to the pattern recognition problem, the most important step is to obtain a sparse representation of the problem. From the conventional recognition method, we see that the feature vector Y must be almost equal to one of the template vectors ψ_k if the input pattern P is close to one of the patterns used to construct the templates. Based on this fact, the feature vector Y can be written as

$$Y \approx [\psi_1\ \ \psi_2\ \ \cdots\ \ \psi_k\ \ \cdots\ \ \psi_L]\,[0\ \ 0\ \ \cdots\ \ 1\ \ \cdots\ \ 0]^T \qquad (10)$$

where the single 1 is located at the k-th position.
Define the matrix Φ and vector X as

$$\Phi = [\psi_1\ \ \psi_2\ \ \cdots\ \ \psi_L] \qquad (11a)$$

$$X = [0\ \ 0\ \ \cdots\ \ 1\ \ \cdots\ \ 0]^T \qquad (11b)$$

Then (10) can be rewritten as

$$Y \approx \Phi X \qquad (12)$$
Because there is only one non-zero element in the vector X, this vector is sparse, so (12) is a sparse representation of the input feature vector. Let V be the error between Y and ΦX; then (12) can be further written as

$$Y = \Phi X + V \qquad (13)$$
This means that the feature vector Y is a noisy measurement of ΦX. Using CS theory, the sparse vector X can be recovered from the measurement Y, and the location k of the non-zero element in the recovered vector accomplishes the recognition task. The procedure is summarized below:

Step 1: Extract the feature vector Y from the input pattern P.
Step 2: Solve the following L1 minimization with quadratic constraint to recover the signal X from the feature vector Y:

$$\min \|X\|_1 \quad \text{subject to} \quad \|Y - \Phi X\|_2 \le \varepsilon \qquad (14)$$
Let the optimal solution of this problem be denoted by X_opt.
Step 3: Find the maximum non-zero element of X_opt. If the location of this element is k, then the input pattern P belongs to the k-th class.
In the next section, the above recognition method is used to construct an automatic container code recognition system.
4 Container Code Recognition System

In this section, we develop the container code recognition system. The flow chart of the proposed algorithm is shown in Fig. 2. The main stages are preprocessing, binarization, morphological filtering, location, segmentation, normalization, CS-based character recognition, and post-processing. The details of each stage are described below.

Stage 1. Preprocessing: There are two tasks in the preprocessing stage: reducing the image size and transforming the color input image into a gray-level image. Fig. 3 shows the preprocessing results, including the input and output images.

Stage 2. Binarization and morphological filtering: For binarization, the histogram of the gray-level image is first computed. A threshold T is then chosen such that 15% of the pixels have values greater than T. Finally, the binary image is obtained by the rule: if a pixel value is greater than T, it is set to one; otherwise it is set to zero. For morphological filtering, the binary image is eroded 4 times to remove noise, and the eroded image is then dilated 4 times to restore the object size. Fig. 4 shows the results of binarization and morphological filtering.

Stage 3. Character location and segmentation: Five steps are involved in this stage. First, 8-adjacent connected component labeling is used to extract all possible objects in the binary image. Second, objects whose size and aspect ratio differ greatly from those of container code characters are removed. Third, we find the right-most object and use it to determine the candidate area of the
Fig. 2. The flow chart of the proposed container code recognition method based on compressed sensing (CS): container image → preprocessing → image binarization and morphological filtering → character location, segmentation and normalization → CS-based recognition (with character templates) → post-processing → recognition result
Fig. 3. The preprocessing results. (a) Input color container image. (b) The output image.
Fig. 4. The results of binarization and morphological filtering. (a) Histogram. (b) Binary image. (c) Eroded image. (d) Dilated image.
Fig. 5. The processing results of location and segmentation. (a) The 24 objects after performing 8-adjacent connected component labeling. (b) The 17 objects after performing the size filtering. (c) The candidate area (white highlighted area) after finding the right-most object "0". (d) The final 11 extracted objects.
container code. Fourth, all objects in the candidate area are extracted and arranged in order from left to right. Fifth, each object is normalized to the prescribed size for detailed matching. Fig. 5 shows the processing results of this stage: Fig. 5(a) shows the 24 objects after 8-adjacent connected component labeling; Fig. 5(b) depicts the 17 objects after size filtering; Fig. 5(c) shows the candidate area (the white highlighted area) after finding the right-most object "0"; and Fig. 5(d) shows the final 11 extracted objects. Clearly, the 11 characters of the container code have been segmented successfully.

Stage 4. Character template: In this work, multiple templates are built for each character to handle the multi-font problem. Fig. 6 shows the 60 templates used in our method. Based on statistical data, the English alphabets J, K, O, P, Q, V, Y, and Z are not used in container codes, so templates for them are not necessary. Moreover, since some characters in container images are distorted, several distorted templates are also built to improve the robustness of recognition.
Fig. 6. The multi-font character templates of the container code, each of size 50x50
Stage 5. CS-based recognition: Since there are eleven segmented characters in the container image, the recognition problem is to obtain the content of these eleven characters, which are composed of number digits and English alphabets. The first task of the CS-based method is to construct the matrix Φ from the character images in Fig. 6. Given all template images of size N_r × N_c in the database, the procedure to construct the matrix Φ is described below:
Step 1: Construct a random matrix R of size M × N_rN_c. Each element of R is a zero-mean, unit-variance Gaussian random variable. The Gram-Schmidt orthogonalization method is then applied to orthonormalize the rows of R.
Step 2: Transform the two-dimensional (2-D) i-th binary image in the database into a one-dimensional (1-D) vector B_i by arranging pixel values from left to right and from top to bottom. The size of the vector B_i is N_rN_c × 1.
Step 3: The template vector ψ_i is then given by

$$\psi_i = R B_i \qquad (15)$$

Step 4: Use all template vectors to construct the matrix Φ:

$$\Phi = [\psi_1\ \ \psi_2\ \ \cdots\ \ \psi_k\ \ \cdots\ \ \psi_L]$$
where L is the number of template images. After the template matrix Φ has been constructed, the second task is to recognize an unknown segmented input character image P using Φ. The CS-based recognition procedure is described below:

Step 1: Transform the 2-D input binary image P into a 1-D vector A by arranging pixel values from left to right and from top to bottom. The size of the vector A is N_rN_c × 1.
Step 2: The feature vector Y is then given by

$$Y = RA \qquad (16)$$
Step 3: Solve the following constrained L1 minimization problem to obtain the vector X from the feature vector Y:

$$\min \|X\|_1 \quad \text{subject to} \quad \|Y - \Phi X\|_2 \le \varepsilon$$
Let the optimal solution of this problem be denoted by X_opt.
Step 4: Find the maximum non-zero element of X_opt. If the location of this element is k, then the input image P belongs to the k-th class.
One example is now used to illustrate the effectiveness of the CS-based recognition method. The input image is the character "D" in Fig. 7(a). The parameters are chosen as N_r = N_c = 50, L = 60, M = 100, and ε = 0.001. The optimal solution X_opt of the constrained L1 minimization problem is shown in Fig. 7(b). There is a clear, largest peak at index k = 5, which corresponds to the English alphabet "D" in the template database of Fig. 6; thus, the recognition result is correct.
Fig. 7. A recognition example of the CS-based method. (a) Input character image. (b) The optimal solution X_opt of the L1 minimization.
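The end-to-end recognition step can be sketched as follows, using the parameters quoted above (50x50 templates, M = 100, ε = 0.001). The use of numpy's QR factorization in place of explicit Gram-Schmidt and the cvxpy solver are our own choices; the paper does not prescribe particular libraries.

```python
import numpy as np
import cvxpy as cp

def build_sensing_matrix(m=100, n_pixels=50 * 50, seed=0):
    """Random Gaussian matrix R (M x NrNc) with orthonormalized rows."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((m, n_pixels))
    q, _ = np.linalg.qr(r.T)     # QR plays the role of Gram-Schmidt on the rows
    return q.T                   # shape (M, NrNc), rows orthonormal

def build_template_matrix(r, templates):
    """Stack psi_i = R B_i (Eq. (15)) for every 50x50 binary template image."""
    return np.column_stack([r @ t.reshape(-1).astype(float) for t in templates])

def recognize_character(r, phi, char_image, eps=1e-3):
    """Classify one segmented character via Eqs. (16) and (14)."""
    y = r @ char_image.reshape(-1).astype(float)       # Eq. (16)
    x = cp.Variable(phi.shape[1])
    cp.Problem(cp.Minimize(cp.norm1(x)),
               [cp.norm2(phi @ x - y) <= eps]).solve()
    return int(np.argmax(x.value))                     # index of the matched template
```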
Stage 6. Post-processing: In any of the following three cases, the recognition result must be incorrect. First, if the number of characters in the container code is not equal to eleven, the result is incorrect; this error comes from a failure of the segmentation stage. In our experience, the alphabet "U" often breaks into "11", so that the 8-adjacent connected component labeling cannot group it as one object "U". Second, the fourth character of all container images provided by Kaohsiung port is always the alphabet "U"; thus, when the fourth character of the recognition result is not "U", the result must be incorrect. Third, the check digit (i.e., the last character) can be used to verify whether the recognition result is correct even when the number of characters is 11 and the fourth character is "U". The checking rule can easily be found in the literature; a sketch of these validation rules is given below. This completes the description of the recognition method, and in the next section several experiments are used to evaluate its performance.
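The check-digit computation shown here follows the standard ISO 6346 rule (letter values skipping multiples of 11, weighted by powers of two); the paper does not spell out the rule it uses, so this specific formula is an assumption. Under this rule, the example code GSTU3601680 of Fig. 1 validates.

```python
def iso6346_check_digit(code):
    """Check digit over the first 10 characters (ISO 6346: A=10, B=12, ..., Z=38,
    skipping multiples of 11; weights 2**position; result is (sum mod 11) mod 10)."""
    values, v = {}, 10
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        if v % 11 == 0:
            v += 1
        values[letter] = v
        v += 1
    total = sum((values[c] if c.isalpha() else int(c)) * (2 ** i)
                for i, c in enumerate(code[:10]))
    return (total % 11) % 10

def is_valid_container_code(code):
    """Post-processing rules: 11 characters, fourth character 'U', check digit matches."""
    return (len(code) == 11
            and code[3] == "U"
            and code[10].isdigit()
            and int(code[10]) == iso6346_check_digit(code))
```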
5 Experimental Results

In this section, several experiments are conducted to test the proposed method, and some discussion is provided.

Experiment 1: In this experiment, successful cases are reported. Fig. 8 shows the recognition results. The input container images contain white characters on blue or red backgrounds. Because our algorithm does not use color information, the recognition method is robust to color variation. Inspection confirms that the recognition results are correct.
Fig. 8. The recognition results in experiment 1. (a) White character with blue background. (b) White character with red background.
Experiment 2: In this experiment, failure cases are reported. Fig. 9 shows the recognition results. Some severely distorted characters in the input container images cause the segmentation results to be incorrect. However, the users know the results are wrong from the post-processing stage because the total number of characters is not equal to 11; a remedy procedure may then be started to handle
these failed recognition cases. In our experience, most incorrect results are due to failures of character location and segmentation; enhancing the location and segmentation algorithms is therefore an important piece of future work.
Fig. 9. The recognition results in experiment 2. Both input images are white characters with green background. Many severely distorted characters in the input images cause the segmentation results to be incorrect.
6 Conclusions

In this paper, an automatic container code recognition method has been presented using compressed sensing (CS). First, the compressed sensing approach based on constrained L1 minimization was described. Then, a general pattern recognition framework based on CS theory was studied. Next, the CS recognition method was applied to construct an automatic container code recognition system. Finally, real-life images provided by the trading port of Kaohsiung were used to evaluate the performance of the proposed method.
References

1. Lui, H.C., Lee, C.M., Gao, F.: Neural network application to container number recognition. In: Fourteenth Int. Conference on Computer Software and Application, pp. 190–195 (1990)
2. Lee, C.M., Wong, W.K., Fong, H.S.: Automatic character recognition for moving and stationary vehicles and containers in real-life images. In: Int. Joint Conf. on Neural Networks, pp. 2824–2828 (1999)
3. Igual, I.S., Garcia, G.A., Jimenez, A.P.: Preprocessing and recognition of characters in container codes. In: The 16th Int. Conf. on Pattern Recognition, pp. 143–146 (2002)
4. He, Z.W., Liu, J.L., Ma, H.Q., Li, P.H.: A new localization method for container autorecognition system. In: IEEE Int. Conf. on Neural Networks and Signal Processing, pp. 1170–1172 (2003)
5. He, Z.W., Liu, J.L., Ma, H.Q., Li, P.H.: A new automatic extraction method of container identity codes. IEEE Trans. on Intelligent Transportation Systems, 72–78 (2005)
6. Pan, W., Wang, Y.S., Yang, H.: Robust container code recognition system. In: The 5th World Congress on Intelligent Control and Automation, pp. 4061–4065 (2004)
7. http://www.htsol.com
8. Candes, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Signal Processing Magazine, 21–30 (2008)
9. Romberg, J.: Imaging via compressive sampling. IEEE Signal Processing Magazine, 14–20 (2008)
10. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. on PAMI, 210–227 (2009)
11. Gemmeke, J.F., Cranen, B.: Using sparse representations for missing data imputation in noise robust speech recognition. In: EUSIPCO 2008 (2008)
12. Parvaresh, F., Vikalo, H., Misra, S., Hassibi, B.: Recovering sparse signals using sparse measurement matrices in compressed DNA microarrays. IEEE Journal of Selected Topics in Signal Processing, 275–285 (2008)
13. Donoho, D.L.: Compressed sensing. IEEE Trans. on Information Theory, 1289–1306 (2006)
14. Bobin, J., Starck, J.L., Ottensamer, R.: Compressed sensing in astronomy. IEEE Journal of Selected Topics in Signal Processing, 718–726 (2008)
15. Ye, J.C.: Compressed sensing shape estimation of star-shaped objects in Fourier imaging. IEEE Signal Processing Letters, 750–753 (2007)
16. Herman, M., Strohmer, T.: High-resolution radar via compressed sensing. IEEE Trans. on Signal Processing, 2275–2284 (2009)
17. Provost, J., Lesage, F.: The application of compressed sensing for photo-acoustic tomography. IEEE Trans. on Medical Imaging, 585–594 (2009)
Combining Histograms of Oriented Gradients with Global Feature for Human Detection

Shih-Shinh Huang1,*, Hsin-Ming Tsai1, Pei-Yung Hsiao2, Meng-Qui Tu2, and Er-Liang Jian3

1 Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology
2 Dept. of Electrical Engineering, National University of Kaohsiung
3 Chung-Shan Institute of Science and Technology
Abstract. In this work, we propose an algorithm that combines Histograms of Oriented Gradients (HOGs) with the shape of the head for human detection from a non-static camera. We use the AdaBoost algorithm to learn local characteristics of humans based on HOGs. Since local features are easily affected by complex backgrounds and noise, the idea of this work is to incorporate a global feature to improve detection accuracy. Here, we adopt the head contour as the global feature. The score for evaluating the existence of the head contour is computed via the Chamfer distance. Furthermore, the matching distributions of the head and non-head are modeled by Gaussian and Anova distributions, respectively. The combination of the human detector based on local features with the head contour is achieved through adjustment of the hyperplane of a support vector machine. In the experiments, we show that our proposed human detection method has both a higher detection rate and a lower false positive rate than the state-of-the-art human detector.
1
Introduction
Human detection is an extensive research in the field of computer vision and has many applications, such as surveillance, intelligent user interface, and pedestrian warning system for intelligent vehicle. The work to successfully detect pedestrian is one of the most difficult problems as it challenges the hardness on a broad range of deformable object appearances and poses, various types of human clothes, complex backgrounds, and many varied illumination conditions in the outdoor environments. 1.1
Related Work
To address the aforementioned challenges, many works have been proposed in the literature. In general, previous studies on human or pedestrian detection can be classified into two classes: motion-based approaches and appearance-based approaches.
Corresponding author.
Fig. 1. Overview of the proposed approach for human detection
Motion-Based Approach. The motion-based approach achieves pedestrian detection by observing the motion patterns resulting from human movement in video sequences. The gait of human walking is an apparent pattern for human detection and has been used in the literature [1]. The authors observed that the walking figures at a specific row over a period of time resemble a braid and thus proposed a braid model for gait detection. To describe human gait in more detail, a pendulum model and a twin-pendulum model were proposed in [2] and [3], respectively. Image processing techniques are then adopted to find the features for fitting these models. Instead of using an explicit model, low-level features resulting from human motion can be extracted and learned for pedestrian detection. Little and Boyd [4] took the optical flow of two successive images as feature points of the human's motion and analyzed the motion periodicity to determine the existence of a moving human. More recently, Viola et al. [5] computed the motion differences between two consecutive images as pedestrian features and applied the AdaBoost algorithm to obtain a set of discriminative ones for further detection. However, this kind of approach generally has two limitations. First, it fails to detect pedestrians that are not moving. Second, a period of time is necessary for successfully detecting the motion pattern, so these methods may not be suitable for real-time applications. Appearance-Based Approach. The appearance-based approach uses a set of appearance features of static human images to detect the existence of a pedestrian. Global features, such as the symmetry or shape of the human body, are widely used
features for pedestrian detection in the literature. Broggi et al. [6] applied the vertical symmetry of the human shape to calculate the position and size of a human. Another typical solution using the global shape is based on template matching, which constructs human templates from different viewing angles and poses and detects humans by comparing the extracted shape with the constructed templates. Gavrila et al. [7] and Liu et al. [8] characterized human shapes by silhouettes or edge images and then transferred them into distance-transformed images. However, approaches that purely take a global feature, such as the entire human shape, as the pedestrian descriptor tend to fail in detecting partially occluded humans. Thus, some studies detect humans via parts of the human body and analyze their relations to reconstruct the human shape. Mohan et al. [9] chose an Adaptive Combination of Classifiers (ACC) to detect different body parts and integrated all part-classifiers to categorize humans. Further examples of local features for human detection are edgelets [10] and shapelets [11]. The edgelet feature is a set of predefined edgelet templates, which differ from each other in the number of edges, their orientations, and whether they are single or paired. Similar to an edgelet, a shapelet is a piece of a shape. Histograms of Oriented Gradients (HOGs), first proposed by Dalal and Triggs [12], are a kind of local feature whose success in human detection has been proven in many works [13,14]. Although many sophisticated methods have been proposed to improve the representation ability of the original HOGs, they still suffer from false detections in complex backgrounds due to the lack of semantic features. Accordingly, we propose a framework that combines the HOGs-based approach with one global feature to alleviate this problem.
1.2
Approach Overview
The idea behind the proposed approach is to combine a method using a local feature, HOGs, with a global feature, the head contour. The global feature is introduced to facilitate human detection in a more semantic manner. Figure 1 shows an overview of the proposed approach. The system mainly consists of two phases: training and detection. Given a database of human and non-human samples, the objective of the training phase is to find an effective human detector. HOGs [12] are used to represent parts of the human appearance. Then, an AdaBoost algorithm is applied to learn a classifier for human detection. In parallel, the template matching algorithm using the Chamfer distance is applied to the database to model the matching-score distributions for the head and non-head cases, respectively. At the detection stage, the HOGs classifier and the head detection algorithm are both used to evaluate the confidence that a human is present in the image. If their results are inconsistent, the matching score of the head detection is used to adjust the HOGs classifier. The rest of this paper is organized as follows. In Section 2, we give an overview of how to learn the HOGs classifier for human detection. The head matching algorithm using the Chamfer distance is described in Section 3. Section 4 describes
the way to combine the detection results from the HOGs classifier and the head detector. Section 5 presents the experimental evaluation, and Section 6 concludes this work with some discussion.
2
HOGs-Based Human Detector
In the pedestrian detection literature, histograms of oriented gradients (HOGs) [12] have proven effective and outperform wavelet-based approaches. Accordingly, HOGs are used to describe the human appearance in this work. In this section, we give an overview of how HOGs are used for human detection. 2.1
Histogram of Oriented Gradients
We use only rectangular HOGs for feature description, which means that each block is formed by four rectangular cells. Each cell is represented by a 9-D feature vector. An example of rectangular HOGs is shown in Figure 2.
Fig. 2. The example of rectangular HOGs
For each pixel (x, y), the gradient is computed by applying two discrete derivative kernels G_h = [−1, 0, 1] and G_v = [−1, 0, 1]^T to obtain the horizontal difference d_h(x, y) and vertical difference d_v(x, y), respectively. The gradient magnitude mag(x, y) and orientation θ(x, y) are derived as follows:

mag(x, y) = \sqrt{d_h(x, y)^2 + d_v(x, y)^2}, \quad \theta(x, y) = \tan^{-1}\frac{d_v(x, y)}{d_h(x, y)}   (1)

Then, each cell in the feature block is represented by an orientation histogram. To save computation, we quantize the orientation range from 0 to 360 degrees into 9 bins, each of which covers 40 degrees. The value mag(x, y) × w_G(x, y) is accumulated in the bin corresponding to θ(x, y), where w_G(x, y) is a weighting mask derived from a Gaussian distribution. Therefore, four 9-D orientation histograms are used to represent a feature block.
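As an illustration, the following Python sketch computes the 9-bin orientation histogram of one cell from a grayscale image. The Gaussian sigma and the handling of image borders are our assumptions; the paper only specifies the derivative kernels and the 9-bin quantization of 0 to 360 degrees.

```python
import numpy as np

def cell_histogram(gray, x0, y0, cell_w, cell_h, n_bins=9):
    """Orientation histogram of one cell, following Eq. (1) and the 9-bin
    quantization above. The Gaussian weighting mask (sigma) is an assumption."""
    gray = np.asarray(gray, dtype=float)

    # Discrete derivatives with the [-1, 0, 1] kernels
    dh = np.zeros_like(gray)
    dv = np.zeros_like(gray)
    dh[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    dv[1:-1, :] = gray[2:, :] - gray[:-2, :]

    mag = np.sqrt(dh ** 2 + dv ** 2)
    theta = np.degrees(np.arctan2(dv, dh)) % 360.0   # orientation in [0, 360)

    # Gaussian weighting mask centered on the cell
    ys, xs = np.mgrid[y0:y0 + cell_h, x0:x0 + cell_w]
    cy, cx = y0 + cell_h / 2.0, x0 + cell_w / 2.0
    sigma = 0.5 * max(cell_w, cell_h)                 # assumed value
    w_g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

    # Accumulate magnitude * weight into the bin given by the orientation
    hist = np.zeros(n_bins)
    cell_theta = theta[y0:y0 + cell_h, x0:x0 + cell_w]
    cell_mag = mag[y0:y0 + cell_h, x0:x0 + cell_w]
    bins = (cell_theta // (360.0 / n_bins)).astype(int) % n_bins
    np.add.at(hist, bins.ravel(), (cell_mag * w_g).ravel())
    return hist
```

The 36-D block feature then results from concatenating the histograms of the four cells of a block.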
2.2
Learning of Human Detector
In this work, the block sizes used are 12, 24, 36, 48, and 60 for a detection window of 64 × 128 pixels. The aspect ratio of each block can be one of the following choices: (1 : 1), (1 : 2), and (2 : 1). However, it is time-consuming to use all blocks for pedestrian detection as in [12]. For efficiency, Zhu et al. [14] constructed a cascaded human detector using the AdaBoost algorithm proposed in [15]. The feature representing each block is a 36-D vector, and each weak classifier is a separating hyperplane computed with a linear support vector machine (SVM).
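The following sketch shows one possible enumeration of the candidate blocks inside the 64 × 128 detection window from the listed sizes and aspect ratios. The sampling stride and the exact way size and aspect ratio combine into block width and height are assumptions, as the paper does not state them.

```python
def candidate_blocks(win_w=64, win_h=128, sizes=(12, 24, 36, 48, 60),
                     ratios=((1, 1), (1, 2), (2, 1)), stride=4):
    """Enumerate candidate HOG blocks (x, y, width, height) inside the
    detection window. 'stride' is an assumed sampling step."""
    blocks = []
    for s in sizes:
        for rw, rh in ratios:
            bw, bh = s * rw, s * rh
            if bw > win_w or bh > win_h:
                continue  # block does not fit into the detection window
            for y in range(0, win_h - bh + 1, stride):
                for x in range(0, win_w - bw + 1, stride):
                    blocks.append((x, y, bw, bh))
    return blocks
```

Each such block yields one 36-D feature on which a linear SVM weak classifier can be trained and selected by AdaBoost.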
3
Head Contour Detection
The human detector selected in the previous section is a set of HOGs, and detection is achieved by combining these local features. In general, this approach has a tendency toward false detections in complex environments. To alleviate this problem, we incorporate a more semantic feature, the head, into the detection framework. Due to the high degree of freedom of the human body, no global feature is always completely visible and detectable in all cases. Among the possible global features, the head contour is a comparatively robust one, with low variance in appearance and high visibility from different views. Figure 3 shows some examples of head contours from different views. In some works [6], it has been applied to find pedestrian candidates in the image.
Fig. 3. Examples of head contour from different view angles
The head contour has an Ω-shape, and we apply template matching with the Chamfer distance to measure the likelihood of head presence. Figure 4 (a) shows the head template Ω for matching the head contour. To facilitate the matching computation, we apply the distance transform to the head template, which is defined as:

DT(p) = \min_{q \in \Omega} d(p, q)   (2)
Fig. 4. The detection of Ω-shape. (a) is the template for matching; (b) shows the detection region; (c) and (d) are the correct detection results and the two examples of mis-detection are shown in (e) and (f).
where d(p, q) is the Euclidean distance between two points p and q. For an observed image, a Sobel edge detector is first applied to compute the edge map. Based on the assumption that the head is always located at the top of a human, we only scan the windows inside the detection region R for head matching. The red rectangle in Figure 4 (b) is the detection region defined in this work. The Chamfer distance used to evaluate the matching score of a window w is expressed as:

d(w) = \frac{1}{N} \sum_{p \in e(w)} DT(p)   (3)

where e(w) is the set of edge points of the window w and N is their number. In order to eliminate regions without edge points, we only take windows whose number of edge points exceeds a pre-defined threshold into account. Thus, we define the matching score for head existence as:

s_H = \min_{w \in R} d(w)   (4)
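A minimal sketch of this Chamfer matching step, assuming a boolean edge map, a boolean template contour mask, and SciPy's Euclidean distance transform; the window stride and the edge-point threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(edge_map, template_mask, region, min_edge_points=50, stride=2):
    """Scan the detection region R with the Omega-shaped template and return
    s_H = min_w d(w) as in Eqs. (3)-(4). 'region' is (x, y, width, height)."""
    # DT(p): distance of every template pixel to the nearest contour pixel
    dt = distance_transform_edt(~template_mask)   # template_mask: True on contour
    th, tw = template_mask.shape
    rx, ry, rw, rh = region
    best = np.inf
    for y in range(ry, ry + rh - th + 1, stride):
        for x in range(rx, rx + rw - tw + 1, stride):
            win_edges = edge_map[y:y + th, x:x + tw]   # Sobel edge map of window w
            n = win_edges.sum()
            if n < min_edge_points:                    # skip windows with too few edges
                continue
            d_w = dt[win_edges].sum() / n              # Eq. (3)
            best = min(best, d_w)
    return best                                        # s_H, Eq. (4)
```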
The detection results shown in Figure 4 (c)(d) are the windows with the best matching score. They show that template matching can effectively detect the head in the image. However, some non-head regions may also yield a good matching score, so that head detection using the Ω-shape alone may fail, as shown in Figure 4 (e)(f). This is the reason why we do not detect humans directly through head matching.
4
Feature Combination
In this section, we describe how to combine the matching score from the head detector with the HOGs human classifier. 4.1
Head Distribution Modeling
However, the matching score of the head contour is a relative measure rather than an absolute one and is not directly suitable for judging the existence of a human. Consequently, we model the probability distributions of the matching score with
respect to the head and non-head cases. In this work, we select 924 human and 829 non-human images and compute the matching score of these images. The histograms of the matching scores for the head and non-head images are the blue and red lines shown in Figure 5, respectively. The histogram of the head images is similar to a Gaussian distribution f_G(x), while that of the non-head images approaches a chi-squared distribution f_k(x) with k = 2:

f_k(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} \exp\{-x/2\}   (5)

The green and purple lines are the Gaussian and chi-squared distributions modeling the head and non-head distributions, respectively. Then, the presence of the head in the image is decided according to:

f(s_H) = \begin{cases} +1 & f_G(s_H) > f_k(s_H) \\ 0 & f_G(s_H) = f_k(s_H) \\ -1 & f_G(s_H) < f_k(s_H) \end{cases}   (6)

where +1, 0, and −1 stand for the existence, uncertainty, and non-existence of the head, respectively.
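A possible implementation of the distribution modeling and the decision rule of Eq. (6), using SciPy's normal and chi-squared densities. The way the densities are fitted (moment estimates and a scale parameter for the chi-squared distribution) is an assumption; the paper only states the distribution families.

```python
import numpy as np
from scipy.stats import norm, chi2

def fit_head_models(head_scores, nonhead_scores):
    """Fit a Gaussian to the head scores and a chi-squared (k = 2) model to the
    non-head scores, cf. Fig. 5. Moment-based fitting is an assumption."""
    mu, sigma = np.mean(head_scores), np.std(head_scores)
    chi2_scale = np.mean(nonhead_scores) / 2.0   # chi2(k=2) has mean 2 * scale
    return (mu, sigma), chi2_scale

def f_head(s_h, gauss_params, chi2_scale):
    """Decision function of Eq. (6): +1 head, -1 non-head, 0 uncertain."""
    mu, sigma = gauss_params
    p_head = norm.pdf(s_h, loc=mu, scale=sigma)
    p_nonhead = chi2.pdf(s_h, df=2, scale=chi2_scale)
    if p_head > p_nonhead:
        return +1
    if p_head < p_nonhead:
        return -1
    return 0
```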
Fig. 5. The modeling of the distributions respectively for head and non-head images
4.2
Combination Algorithm
The strategy for combining the results of head detection with the HOGs human classifier is to adjust the learned SVM hyperplane. It mainly consists of two stages: consistency verification and hyperplane adjustment. Let V denote the confidence value obtained from the HOGs human classifier. At the consistency verification stage, we check whether the results from f(s_H) and V are
consistent or not. When V ≥ 0 and f(s_H) ≥ 0, the results are consistent and we claim the presence of a human in the image. Similarly, the detection result after combination is non-human if V < 0 and f(s_H) ≤ 0. In the other cases, the HOGs classifier and the head detector are inconsistent with each other. Since the HOGs classifier has proven its effectiveness in human detection, the decision is based only on V if it is confident enough. In other words, the rule is:

Human if V ≥ 2,  Non-Human if V ≤ −2   (7)

When the HOGs classifier is not confident enough to make a decision, that is, |V| < 2, we adjust the hyperplane according to f(s_H) and apply the HOGs classifier again. The idea of adjusting the hyperplane is as follows. In case of f(s_H) = −1, the SVM hyperplane is shifted toward the positive region, as shown in Figure 6 (a), to make the HOGs classifier stricter. This is because the head detector has indicated the absence of a head in the image. On the contrary, we move the hyperplane toward the negative region in case of f(s_H) = 1, as shown in Figure 6.
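The combination strategy can be summarized in a few lines. Modeling the hyperplane shift as a constant offset added to V is our simplification, and the offset value is an assumption; the paper only states that the shift distance is constant.

```python
def combined_decision(v, f_sh, margin=2.0, shift=0.5):
    """Combine the HOGs confidence V with the head decision f(s_H) in
    {-1, 0, +1}, following the consistency check and Eq. (7).
    'shift' (0.5) is an assumed constant hyperplane offset."""
    if v >= 0 and f_sh >= 0:
        return True                      # consistent: human
    if v < 0 and f_sh <= 0:
        return False                     # consistent: non-human
    # Inconsistent: trust the HOGs classifier when it is confident enough (Eq. 7)
    if v >= margin:
        return True
    if v <= -margin:
        return False
    # Low confidence (|V| < 2): adjust the hyperplane according to f(s_H).
    # f(s_H) = -1 makes the classifier stricter, f(s_H) = +1 more permissive.
    return (v + f_sh * shift) >= 0
```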
Fig. 6. The adjustment of SVM hyper-planes to combine the head contour feature
5
Experiment
Our proposed approach was developed and evaluated on a personal computer with an Intel Core2 Duo CPU at 1.86 GHz. The Intel Open Source Computer Vision Library (OpenCV) is used to facilitate the development of the system. 5.1
Database Description
In order to validate the effectiveness of the proposed algorithm, the databases provided by MIT [16] and INRIA [17] are used for training and testing. The MIT database has 900 human images. The poses of the people in the MIT database are limited to frontal and rear views. Each image is scaled to the size 64×128 and
is aligned so that the person's body is at the center of the image. The 1218 images in the INRIA database are all non-human images, and we select 900 images from this database as negative samples for system validation. The selected negative samples have apparent edges or human-like patterns. In summary, we have 900 human images from the MIT database and 900 non-human images from the INRIA one.
Fig. 7. Analysis of different number of weak classifiers. (a) and (b) are detection rate and false positive rate versus number of weak classifiers.
5.2
Performance Evaluation
In this work, the images are equally divided into two folds, and each fold has 450 positive and 450 negative samples. One fold is used for training and the other for testing. To compare the performance with other human detectors, we implement the approaches using traditional HOGs [12] and augmented HOGs (AHOGs) [18], which incorporate local symmetry, edge density, and contour distance. The approaches combining the head detector with HOGs and AHOGs are also implemented and are referred to as HOGs+ and AHOGs+, respectively. Two criteria used for objective analysis are the detection rate and the false positive rate, expressed as follows:

Detection Rate = \frac{\#\text{Detected Humans}}{\#\text{Total Humans}}, \quad False Positive Rate = \frac{\#\text{Detected Non-Humans}}{\#\text{Total Detected Humans}}   (8)
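For reference, a direct transcription of Eq. (8); reading the denominator of the false positive rate as the total number of windows the detector labels as human is our interpretation.

```python
def detection_metrics(true_positives, total_humans, false_positives):
    """Detection rate and false positive rate as defined in Eq. (8).
    The false positive denominator is read as all reported detections
    (true + false positives), which is an assumption."""
    detection_rate = true_positives / total_humans
    false_positive_rate = false_positives / (true_positives + false_positives)
    return detection_rate, false_positive_rate
```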
To demonstrate the effectiveness of combining the head contour, we train the human detector with different numbers of weak classifiers ranging from 10 to 70 and analyze the detection rate and false positive rate. Figure 7 shows the corresponding experimental results. The combination of the head contour gives HOGs+ and AHOGs+ a higher detection rate than HOGs and AHOGs. In addition, we show the Receiver Operating Characteristic (ROC) curves of the four implemented approaches using 40 and 70 weak classifiers, respectively. In Figure 8, the approaches combining the head contour perform better than those without it.
Fig. 8. Analysis of performance using ROC for (a) with 40 and (b) with 70 weak classifiers
6
Conclusion
An algorithm combining Histograms of Oriented Gradients (HOGs) with the shape of the head for human detection from a non-static camera is proposed in this work. For the head combination, we first use an Ω-shape template for head matching based on the Chamfer distance, and the probability distributions of the head and non-head cases given a matching score are modeled by Gaussian and chi-squared distributions. At the detection stage, the HOGs classifier and the head detector are applied separately. Then, we check the consistency of the two results. If the results are inconsistent, we adjust the SVM hyperplane according to the existence probability of the head in the image to achieve the combination. However, the shift distance of the hyperplane adjustment is a constant in the current work; in future work, we plan to formulate it as a function of the matching score.
References 1. Niyogi, S.A., Adelson, E.H.: Analyzing and Recognizing Walking Figures in XYZ. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 469–474 (1994) 2. Cunado, D., Nixon, M.S., Carter, J.N.: Automatic Extraction and Description of Human Gait Models for Recognition Purposes. Computer Vision and Image Understanding 90, 1–41 (2003) 3. Ran, Y., Weiss, I., Zheng, Q., Davis, L.S.: Pedestrian Detection via Periodic Motion Analysis. International Journal of Computer Vision 71(2), 143–160 (2007) 4. Little, J.J., Boyd, J.E.: Recognition People by Their Gait: The Shape of Motion. Journal of Computer Vision Research, 1–32 (1998) 5. Viola, P., Jones, M.J., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision 63(2), 153–161 (2005) 6. Broggi, A., Bertozzi, M., Fascioli, A., Sechi, M.: Shape-Based Pedestrian Detection. In: IEEE Intelligent Vehicle Symposium, pp. 215–220 (2000) 7. Gavrila, D.M., Munder, S.: Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle. International Journal of Computer Vision 73(1), 41–59 (2007)
8. Liu, C.Y., Fu, L.C.: Computer Vision Based Object Detection and Recognition for Vehicle Driving. In: IEEE International Conference on Robotics and Automation, vol. 3, pp. 2634–2641 (2001) 9. Mohan, A., Papageorgiou, C., Poggio, T.: Example-Based Object Detection in Images by Components. IEEE Transactions on Pattern Analysis and Machine Intelligence (23), 349–361 (2001) 10. Wu, B., Nevatia, R.: Simultaneous Object Detection and Segmentation by Boosting Local Shape Feature based Classifier. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 11. Sabzmeydani, P., Mori, G.: Detecting Pedestrians by Learning Shapelet Features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 12. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893 (2005) 13. Wang, C.-C.R., Lien, J.-J.J.: AdaBoost Learning for Human Detection Based on Histograms of Oriented Gradients. In: Asian Conference on Computer Vision, pp. 885–895 (2007) 14. Zhu, Q., Avidan, S., Yeh, M.-C., Cheng, K.T.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498 (2006) 15. Viola, P., Jones, M.J.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511–518 (2001) 16. MIT: CBCL Pedestrian Database (2008), http://cbcl.mit.edu/software-datasets/PedestrianData.html 17. INRIA: INRIA Person Dataset (2008), http://pascal.inrialpes.fr/data/human/ 18. Chuang, C.H., Huang, S.-S., Fu, L.-C., Hsiao, P.-Y.: Monocular Multi-Human Detection Using Augmented Histograms of Oriented Gradients. In: Asian Conference on Computer Vision, pp. 885–895 (2007)
Video Browsing Using Object Trajectories Felix Lee and Werner Bailer JOANNEUM RESEARCH Forschungsgesellschaft mbH DIGITAL – Institute of Information and Communication Technologies Steyrergasse 17, 8010 Graz, Austria {felix.lee,werner.bailer}@joanneum.at
Abstract. Video browsing methods are complementary to search and retrieval approaches, as they allow for exploration of unknown content sets. Objects and their motion convey important semantics of video content, which is relevant information for video browsing. We propose extending an existing video browsing tool in order to support clustering of objects with similar motion and visualization of the objects' positions and trajectories. This requires the automatic extraction of moving objects and estimation of their trajectories, as well as the ability to group objects with similar trajectories. For the first issue we describe the application of a recently proposed motion trajectory clustering algorithm; for the second we use k-medoids clustering and the dynamic time warping distance. We present evaluation results of both steps on real-world traffic sequences from the Hopkins155 data set. Finally, we describe how the analysis results are represented using MPEG-7 and integrated into the video browsing tool.
1
Introduction
With the increasing amount of multimedia data being produced, there is growing demand for more efficient ways of exploring and navigating multimedia collections. Viewing complete multimedia items in order to locate relevant segments is prohibitive even for relatively small content sets due to the required user time for viewing and the amount of data that needs to be transferred. Video browsing methods are complementary to search and retrieval approaches, as they allow for exploration of unknown content sets, without the requirement to specify a query in advance. This is relevant in cases where only little metadata is available for the content set, and where the user does not know what to expect in the content set, so that she is not able to formulate a query. In order to enable the user to deal with large content sets, the content has to be presented in a form which facilitates its comprehension and allows judging the relevance of segments of the content set. Objects and their motion convey important semantics of the video content, which is useful in video browsing. This requires the automatic extraction of moving objects and the estimation of their trajectories, as well as the ability to group objects with similar trajectories. In this paper we propose an approach for video browsing using object trajectories, which are obtained from clustering feature point trajectories by similarity. The focus of this paper is on the integration of
the object motion information into our video browsing system, using a novel trajectory clustering method that we have recently developed. The rest of this paper is organized as follows. Section 2 describes related work on clustering feature trajectories by similar motion and on the use of object trajectories for video browsing and retrieval. In Sections 3 and 4 we present our algorithm for trajectory clustering, the representation of object positions and trajectories and the method for grouping moving objects with similar trajectories. Section 5 discusses the integration into our video browsing tool and Section 6 concludes the paper.
2
Related Work
In this section we review work on clustering feature trajectories by object motion as well as the use of object trajectories for video browsing and retrieval. Here we only summarize the most relevant approaches for trajectory clustering. A more comprehensive discussion of the state of the art can be found in [2]. The factorization method was originally proposed to solve the shape-from-motion problem by decomposing a matrix containing point trajectories [22]. The approach was later extended to the case of multiple moving objects [7], which includes the problem of finding subsets of trajectories that belong to the same object. The approach proposed in [14] works directly on the measurement data and is based on the same principle as earlier works on the factorization method. As discussed in the literature, the approach is very sensitive to noise. In [20] a clustering method based on the Expectation-Maximization (EM) approach and Gaussian mixture models is proposed. This method also considers only similarity of translatory motion. In contrast to most other approaches, the authors of [10] propose an algorithm that handles trajectories with different life spans and clusters them in a time window by affine motion. The approach uses the J-linkage method for clustering. Indexing moving objects by their motion trajectories has been addressed already in early work in video retrieval. However, information about moving objects has been mainly used for query by example and query by sketch rather than browsing. In [24] a spatio-temporal video retrieval system based on objects and their trajectories is proposed. The system supports query by example and sketch using trajectory similarity. A similar system is presented in [17], but this system extracts trajectories in the compressed domain from macroblock sequences. The object trajectories are described using MPEG-7. The system also supports retrieval by example or sketch. In [5] an approach for indexing trajectories using PCA is proposed. The authors of [11] review different methods to determine trajectory similarity and propose a system that matches trajectories based on curve fitting or by string matching based on symbols assigned to parts of a curve. The authors of [13] describe a system for visualization and annotation of video using mosaic representations generated from sequences with camera motion. Object trajectories can be visualized and are used to assign annotations made in one frame to the object's position throughout the sequence.
An approach for extracting, clustering and retrieving trajectories in surveillance video is proposed in [21]. In [12] a more comprehensive system for surveillance video retrieval is presented. The object trajectories are clustered using spectral clustering and semantic keywords are assigned to clusters. The system supports querying by keyword or sketch. The authors of [8] propose a direct manipulation video player, which allows browsing a video by interacting with object trajectories. However, this approach does not provide means to work with a collection of video. A similar approach has been proposed in [16] for viewing surveillance video. Beyond interacting with objects in a single video the system visualizes trajectories on a floor plan which also can be used for navigation in the video. The authors of [1] propose an alternative visualization of moving objects in video by overlaying pose slices from different times over the video background.
3
Motion Trajectory Clustering
In the following we summarize our algorithm for motion trajectory clustering. The input is the result of tracking feature points throughout an image sequence. For feature point tracking, we use the GPU-accelerated algorithm proposed in [9]. For each frame in which a point is present, the image coordinates of the point are given. Points may appear and disappear at any time. The task is to identify subsets of tracked points that are subject to similar motion according to a 4-parameter motion model (x/y-translations t_x and t_y, scale s, and rotation φ). The reason for clustering trajectories instead of single-frame displacements is to achieve a more stable cluster structure over time. 3.1
Approach
The problem thus becomes one of clustering trajectories, with the following main issues. Not all trajectories may exist throughout the whole time window used for clustering, i.e., trajectories may end when points exit the scene, can no longer be reliably tracked, or are occluded, and new trajectories start when new reliable points are detected. As the number of moving objects in the scene is not known, the number of clusters is also unknown. The number of resulting clusters of the previous time window serves as a hint, but objects may appear or disappear. The feature trajectories contain only the x- and y-displacements of the moving objects. If clustering is performed using a more complex model than a simple translatory one, the feature used for clustering is hidden; a function must be defined which expresses how well a trajectory matches a set of motion parameters. An additional requirement is the capability to perform clustering on a time window without having all data for a shot available. This enables the algorithm to process data incrementally and makes it applicable in near-online scenarios. To deal with these issues, our algorithm uses an Expectation-Maximization approach to solve the clustering problem, consisting of the following steps:
1. Estimate a motion parameter sequence q for a subset of the trajectories.
2. Assign the trajectories to the motion parameter sequence.
3. Cluster motion parameter sequences by similarity of the parameter sequences.
After initialization, these steps are performed iteratively until the cluster structure for the time window stabilizes. Algorithm 1 summarizes the proposed algorithm; more details can be found in [2]. 3.2
Evaluation
To evaluate the clustering algorithm, we use a subset of the Hopkins155 benchmark data set [23]. We use the 20 real-world traffic sequences in the data set and do not use the derived variants with only a subset of the motions present. To make the results comparable we use the trajectories provided with the data set. In contrast to typical tracking results, all of these trajectories exist throughout the whole sequence. As proposed in [23] we use the mean misclassification rate (MCR) as the evaluation measure. Table 1 shows the results of this evaluation. Our algorithm estimates an appropriate number of clusters and thus creates in some cases more clusters than there are in the ground truth. In some of the sequences objects are stationary in a few frames of the video. If this is the case at the beginning of the video, our algorithm merges them and splits them only later. To deal with these issues, we use three different methods for comparing the results with the ground truth. max denotes counting only the correctly assigned trajectories of the single cluster with maximum overlap with the ground truth cluster as correct. all addresses the issue of oversegmentation by counting trajectories of all clusters that predominantly overlap with the ground truth cluster as correct. For target no. we perform a post-processing step that uses agglomerative hierarchical clustering by similarity of the parameter sequences to reduce the number of clusters to the target number given by the ground truth. There is then a 1:1 correspondence between result and ground truth clusters. The table also shows a comparison with the results from the recent work by Fradet et al. [10]; their paper also contains a comparison with further methods. It has to be noted that the parameters of their method have been tuned to reach the target number of clusters in the ground truth in the reported experiments. For the first three experiments we set the clustering time window to the duration of the video. In the last two experiments we use a window size of 20 frames, shifted by a step of 2 frames. The reported results are for a frame in the middle of the sequence. In all experiments the threshold for the trajectory error ε_max has been set to 0.5.
4
Representation and Grouping of Object Trajectories
Each of the clusters determined by our approach is represented by the cluster’s bounding box in the initial frame and the trajectory of the center of the bounding box. In this section we discuss the description of this information using MPEG-7 and the approach used for grouping object trajectories.
Algorithm 1. The motion trajectory clustering algorithm.
Require: p_i, i = 1 ... N, p_i = {x_{i,k}, y_{i,k}}, k = 1 ... M  {N trajectories over M frames}
Require: step, winSize, ε_max
  C ← ∅  {set of clusters}
  Q ← ∅  {motion parameters}
  k ← 0
  while k + winSize ≤ M do  {process next window}
    p̃ ← {p_i | p_{i,l} ≠ ∅, ∀ l = k ... k + winSize}  {trajectories existing in the time window}
    C̃ ← sample({p̃_i | ∄ c_j ∈ C s.t. p̃_i ∈ c_j})  {randomly sample clusters with ≥ 3 members from the unassigned trajectories}
    for all c̃_j ∈ C̃ do  {initialize motion parameters of new clusters}
      q̃_j ← ((0, 0, 1, 0)^T, ..., (0, 0, 1, 0)^T)  {(t_x, t_y, s, φ)^T for each of the winSize frames}
      Q̃ ← Q̃ ∪ q̃_j
    end for
    Q ← Q ∪ Q̃, C ← C ∪ C̃
    C′ ← ∅  {clusters from the previous iteration}
    while C ≠ C′ do
      C′ ← C
      for all c_j ∈ C do
        q_j ← reestimateParameters(c_j, q_j)
      end for
      C ← clusterByMotionSimilarity(C, Q)
      for all p̃_i ∈ p̃ do
        ε_i ← min_j ε(p̃_i, transform(p̃_{i,k}, q_j))  {error between the measured trajectory and the trajectory predicted by the motion parameters}
        c_j ← c_j \ p̃_i, ∀ c_j ∈ C
        if ε_i ≤ ε_max then
          j* ← argmin_j ε(p̃_i, transform(p̃_{i,k}, q_j))
          c_{j*} ← c_{j*} ∪ p̃_i
        end if
      end for
      for all c_j ∈ C do
        if |c_j| ≤ 3 then
          C ← C \ c_j, Q ← Q \ q_j
        end if
      end for
    end while
    k ← k + step
  end while
  return C, Q
Table 1. The misclassification rates of our clustering algorithm on 20 sequences from the Hopkins155 data set

comparison                   max.    all     target no.   max.    target no.   Fradet et al. [10]
window size                  video   video   video        20      20
traffic 2 motions (mean)     0.43    0.04    0.05         0.38    0.07         0.02
traffic 2 motions (median)   0.44    0.01    0.00         0.23    0.01         0.00
traffic 3 motions (mean)     0.35    0.03    0.15         0.37    0.22         0.05
traffic 3 motions (median)   0.23    0.02    0.03         0.24    0.14         0.00
all (mean)                   0.38    0.04    0.09         0.38    0.13         -
all (median)                 0.23    0.01    0.00         0.23    0.02         -

4.1
Description
As described below, our video browsing tool imports MPEG-7 documents conforming to the Detailed Audiovisual Profile (DAVP) [3], which contain the results of content analysis. We thus use the description tools for moving regions defined in [18] to represent the bounding boxes and trajectories of the determined clusters. Each shot may contain a spatio-temporal decomposition with a list of moving regions, each representing a cluster existing throughout the whole or a fraction of the shot. For each moving region, the bounding box in the first frame is specified. For subsequent frames, the translation vector of the bounding box is specified. In order to reduce the size of the description, the position can be specified only every nth frame, and the interpolation function to be used between these frames is stated in the description. This description format allows for efficient calculation of the position of the object bounding box at a key frame position in order to visualize it in the user interface.
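A sketch of how the bounding box position at an arbitrary frame could be recovered from such a sampled description; the (frame, translation) pair layout and the linear interpolation function are assumptions made for illustration.

```python
def bbox_at_frame(first_bbox, samples, frame):
    """Interpolate the bounding box at 'frame' from translation vectors stored
    every n-th frame. 'first_bbox' is (x, y, w, h) in the initial frame;
    'samples' is a sorted list of (frame_no, tx, ty) relative to that frame."""
    x, y, w, h = first_bbox
    frames = [f for f, _, _ in samples]
    if frame <= frames[0]:
        tx, ty = samples[0][1], samples[0][2]
    elif frame >= frames[-1]:
        tx, ty = samples[-1][1], samples[-1][2]
    else:
        # find the two surrounding samples and interpolate linearly between them
        for (f0, tx0, ty0), (f1, tx1, ty1) in zip(samples, samples[1:]):
            if f0 <= frame <= f1:
                a = (frame - f0) / float(f1 - f0)
                tx = tx0 + a * (tx1 - tx0)
                ty = ty0 + a * (ty1 - ty0)
                break
    return (x + tx, y + ty, w, h)
```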
4.2
Object Trajectory Grouping
For video browsing, object trajectories in the current video set need to be grouped by similarity. We use the k-medoids algorithm [15], which is similar to k-means but selects for each cluster the sample with the smallest accumulated distance to all other samples in the cluster as its center. In order to determine the distance between two trajectories we use the dynamic time warping (DTW) approach. DTW [19] tries to align two sequences so that the temporal order of the samples of the sequences is kept but the distance is globally minimized. Each sample of one sequence must be aligned to at least one sample of the other sequence. We apply this method to the normalized coordinates of the trajectories p_1 and p_2 of two bounding boxes. The commonly used implementation builds a distance matrix of size |p_1| × |p_2|, with | · | denoting the length of a trajectory. Then an optimal path through the distance matrix (i.e., from (1, 1) to (|p_1|, |p_2|)) is found by moving one step in the horizontal, vertical, or diagonal direction in order to minimally increase the distance.
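A straightforward sketch of the DTW distance and the k-medoids grouping described above; the random initialization, the iteration count, and the use of a fully precomputed distance matrix are implementation assumptions.

```python
import numpy as np

def dtw_distance(p1, p2):
    """Dynamic time warping distance between two trajectories given as
    (length, 2) arrays of normalized (x, y) coordinates."""
    n, m = len(p1), len(p2)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(p1[i - 1] - p2[j - 1])
            # horizontal, vertical or diagonal step that minimally increases the distance
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def k_medoids(trajectories, k, iters=20, seed=0):
    """Plain k-medoids on a precomputed DTW distance matrix."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    dist = np.array([[dtw_distance(a, b) for b in trajectories] for a in trajectories])
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)          # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) > 0:
                # member with the smallest accumulated distance becomes the new medoid
                new_medoids[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids
```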
Table 2. Mean NMI scores for clustering results with different number of clusters

no. clusters   3      6      8
mean NMI       0.60   0.65   0.71

4.3
Evaluation
In order to evaluate the grouping of the object trajectories we use the same data set as for evaluating the motion trajectory clustering described in Section 3.2. We have manually defined a ground truth specifying the motion direction of each of the objects in the videos of the data set. Based on this ground truth a correct cluster structure is defined. Shots containing several objects are assigned to more than one cluster, thus also increasing the total number of shots. We use normalized mutual information (NMI) as a measure of the quality of the output clusters [6]. As the k-medoids algorithm is randomly initialized we run the algorithm 10 times and report the mean NMI. Table 2 shows the clustering results. Around two thirds of the moving objects are clustered correctly, the result being slightly better with a higher number of clusters. Note that the results include also background regions as objects.
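A possible way to compute the reported mean NMI, using scikit-learn's implementation; note that the exact normalization variant of NMI used in [6] may differ from scikit-learn's default, and cluster_fn is a hypothetical stand-in for the k-medoids grouping with a given random seed.

```python
from sklearn.metrics import normalized_mutual_info_score

def mean_nmi(ground_truth, cluster_fn, runs=10):
    """Average NMI over several runs of the randomly initialized clustering,
    mirroring the evaluation protocol above."""
    scores = [normalized_mutual_info_score(ground_truth, cluster_fn(seed=r))
              for r in range(runs)]
    return sum(scores) / len(scores)
```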
5
Visualization and Browsing
In this section we briefly review our video browsing tool and describe the integration of clustering based on moving object information and the visualization of objects and their trajectories. 5.1
Video Browsing Tool
The application scenario of video browsing is content management in the postproduction phase of audiovisual media production. The goal is to support the user in navigating and organizing audiovisual material, so that unusable material can be discarded, yielding a reduced set of material from one scene or location available for selection in the post-production steps. In the following we briefly describe the existing browsing tool. A more detailed description can be found in [4]. The basic workflow in the browsing tool is as follows. The user starts from the complete content set. By selecting one of the available features (e.g. object motion) the content will be clustered according to this feature. Depending on the current size of the content set, a fraction of the segments (mostly a few percent or even less) is selected to represent a cluster. The user can then decide to select a subset of clusters that seems to be relevant and discard the others, or repeat clustering on the current content set using another feature. In the first case, the reduced content set is the input to the clustering step in the next iteration. The user can select relevant items at any time and drag them into the result list. A screen shot of the browsing tool is shown in Figure 1. The central component is a light table view which shows the current content set and cluster structure
using a number of representative frames for each of the clusters. The clusters are visualized by colored areas around the images, with the cluster label written above the first two images of the cluster. The size of the images in the light table view can be changed dynamically so that the user can choose between the level of detail and the number of visible images without scrolling. The tool bar at the top of the screen contains the controls for selecting the feature for clustering and confirming the selection of a subset. The light table view allows selection of multiple clusters by clicking on one of their member images. By double clicking an image in the light table view, a small video player is opened and plays the segment of video that is represented by that image. The size of the player adjusts relatively to the size of the representative frames. On the left of the application window the history and the result list are displayed. The history window automatically records all clustering and selection actions done by the user. By clicking on one of the entries in the history, the user can set the state of the summarizer (i.e. the content set) back to this point. The user can then choose to discard the subsequent steps and use other cluster/selection operations, or to branch the browsing path and explore the content using alternative cluster features. At any time the user can drag relevant representative frames into the result list, thus adding the corresponding segment of the content to the result set. 5.2
Integrating Moving Object Information
In the media production use case of our browsing tool, information about moving objects is an important selection criterion, especially, when it can be combined with other features supported by the browsing tool, such as camera motion or color properties of the object. Visual grammar and directing rules define how patterns of the motion of the main actors and objects in subsequent shots can be combined. Grouping shots by motion of the salient objects is thus a useful aid for content selection. The information extracted by the algorithms described above is integrated into the browsing tool as follows. As object bounding boxes and their trajectories are described in the MPEG-7 document containing also the other analysis results by the browsing tool, this information is extracted during the indexing process. Using a plugin for handling trajectory information that has been added. The bounding box coordinates and trajectories are normalized in order to make them independent of a specific video frame size and are stored in the browsing tool’s relational database. Also a new clustering plugin for handling object trajectories is added to the browsing tool. It implements the k-medoids based algorithm described above and performs clustering for the current data set on the fly. It can thus be easily used in combination with clustering tools working on other features. Figure 1 shows the result of clustering into 6 groups of object trajectories. A shot may appear in more than one cluster, if several differently moving objects are present. From the dominant horizontal and vertical motion direction meaningful cluster labels are generated.
Fig. 1. User interface of the video browsing tool, showing clustering results by similar object motion
Fig. 2. Examples of key frames with overlaid moving object information
In order to visualize the information related to moving objects, the object bounding boxes and parts of the forward and backward trajectories are overlaid on the key frames. If several objects are visible in the key frame, the one related to the motion represented by the cluster is visualized. Figure 2 shows some examples of key frames with overlaid information.
6
Conclusion and Future Work
Information about moving objects and their trajectories conveys important semantics of video content. In this paper we have extended our video browsing tool to support clustering by similar object motion and visualization of object bounding boxes and trajectories. Object motion is determined from feature point trajectories, which are clustered with an EM-like algorithm we have recently proposed. We use MPEG-7 for the description of moving object information together
with a number of other content analysis results. Objects with similar motion are grouped using the k-medoids algorithm and the dynamic time warping distance. We have presented evaluation results on the real-world traffic sequences from the Hopkins155 data set for both motion trajectory clustering and grouping of moving objects for browsing. The results for motion trajectory clustering show that the main issue is oversegmentation, especially of large background regions. The issue of identifying background regions correctly and compensating for other motions accordingly also impacts the results of grouping objects by trajectory similarity. This needs to be addressed in future work.
Acknowledgments The authors would like to thank Christian Schober, Georg Thallinger and Werner Haas for their feedback and support. The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n◦ 215475, “2020 3D Media – Spatial Sound and Vision” (http://www.20203dmedia.eu/).
References 1. Axelrod, A., Caspi, Y., Gamliel, A., Matsushita, Y.: Interactive video exploration using pose slices. In: ACM SIGGRAPH, p. 132 (2006) 2. Bailer, W., Fassold, H., Lee, F., Rosner, J.: Tracking and clustering salient features in image sequences. In: Proc. 7th European Conference on Visual Media Production, London, UK (November 2010) 3. Bailer, W., Schallauer, P.: The detailed audiovisual profile: Enabling interoperability between MPEG-7 based systems. In: Proc. of 12th Intl. Multi-Media Modeling Conference, Beijing, CN, pp. 217–224 (January 2006) 4. Bailer, W., Weiss, W., Kienast, G., Thallinger, G., Haas, W.: A video browsing tool for content management in post-production. International Journal of Digital Multimedia Broadcasting (March 2010) 5. Bashir, F.I., Khokhar, A.A., Schonfeld, D.: Segmented trajectory based indexing and retrieval of video data. In: International Conference on Image Processing, vol. 2, pp. 623–626 (2003) 6. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59–68 (2004) 7. Costeira, J., Kanade, T.: A multi-body factorization method for motion analysis. In: Proc. of the Fifth Intl. Conf. on Computer Vision, p. 1071 (1995) 8. Dragicevic, P., Ramos, G., Bibliowitcz, J., Nowrouzezahrai, D., Balakrishnan, R., Singh, K.: Video browsing by direct manipulation. In: Proc. SIGCHI Conf. on Human Factors in Computing Systems, pp. 237–246 (2008) 9. Fassold, H., Rosner, J., Schallauer, P., Bailer, W.: Realtime KLT feature point tracking for high definition video. In: GravisMa Workshop (2009) 10. Fradet, M., Robert, P., Perez, P.: Clustering point trajectories with various lifespans. In: Proc. of 6th European Conference on Visual Media Production, pp. 7–14 (November 2009)
11. Hsieh, J.-W., Yu, S.-L., Chen, Y.-S.: Motion-based video retrieval by trajectory matching. IEEE Transactions on Circuits and Systems for Video Technology 16(3), 396–409 (2006) 12. Hu, W., Xie, D., Fu, Z., Zeng, W., Maybank, S.: Semantic-based surveillance video retrieval. IEEE Transactions on Image Processing 16(4), 1168–1181 (2007) 13. Irani, M., Anandan, P.: Video indexing based on mosaic representations. Proceedings of the IEEE 86(5), 905–921 (1998) 14. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: Proc. 8th IEEE Intl. Conf. on Computer Vision, vol. 2, pp. 586–591 (2001) 15. Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Dodge, Y. (ed.) Statistical Data Analysis Based on the L1-Norm and Related Methods, pp. 405–416 (1987) 16. Kimber, D., Dunnigan, T., Girgensohn, A., Shipman, F., Turner, T., Yang, T.: Trailblazing: Video playback control by direct object manipulation. In: IEEE Intl. Conf. on Multimedia and Expo, pp. 1015–1018 (July 2007) 17. Lie, W.-N., Hsiao, W.-C.: Content-based video retrieval based on object motion trajectory. In: Proc. IEEE Workshop on Multimedia Signal Processing, pp. 237–240 (December 2002) 18. MPEG-7. Information Technology—Multimedia Content Description Interface: Part 3: Visual. ISO/IEC 15938-3 (2001) 19. Myers, C.S., Rabiner, L.R.: A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal 60(7), 1389–1409 (1981) 20. Pachoud, S., Maggio, E., Cavallaro, A.: Grouping motion trajectories. In: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 1477–1480 (2009) 21. Sahouria, E., Zakhor, A.: Motion indexing of video. In: International Conference on Image Processing, vol. 2, p. 526. IEEE Computer Society, Los Alamitos (1997) 22. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vision 9(2), 137–154 (1992) 23. Tron, R., Vidal, R.: A benchmark for the comparison of 3-D motion segmentation algorithms. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (June 2007) 24. Zhong, D., Chang, S.-F.: Spatio-temporal video search using the object based video representation. In: International Conference on Image Processing, vol. 1, p. 21 (1997)
Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval
Wolfgang Hürst1, Cees G.M. Snoek2, Willem-Jan Spoel1, and Mate Tomin1
1 Utrecht University, PO Box 80.089, 3508 TB Utrecht, The Netherlands
2 University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
[email protected],
[email protected], {cwjispoe,mtomin}@students.cs.uu.nl
Abstract. Various interfaces for video browsing and retrieval have been proposed that provide improved usability, better retrieval performance, and richer user experience compared to simple result lists that are just sorted by relevance. These browsing interfaces take advantage of the rather large screen estate on desktop and laptop PCs to visualize advanced configurations of thumbnails summarizing the video content. Naturally, the usefulness of such screen-intensive visual browsers can be called into question when applied on small mobile handheld devices, such as smart phones. In this paper, we address the usefulness of thumbnail images for mobile video retrieval interfaces. In particular, we investigate how thumbnail number, size, and motion influence the performance of humans in common recognition tasks. Contrary to the widespread belief that screens of handheld devices are unsuited for visualizing multiple (small) thumbnails simultaneously, our study shows that users are quite able to handle and assess multiple small thumbnails at the same time, especially when they show moving images. Our results give suggestions for appropriate video retrieval interface designs on handheld devices. Keywords: Mobile video, video retrieval interfaces, visual assessment tasks.
1 Introduction
Multimedia services like Internet browsing [3], music management [8], and photo organization [2] have become commonplace and frequently used applications on handheld devices – despite their limited screen sizes. Even for capturing, accessing, and displaying video, many effective mobile interfaces exist [12, 5]. These interfaces take the complete video as the unit for user interaction and offer means to navigate through the timeline with the aid of a finger [14], pen [11,15], or scroll wheel [13]. It has been predicted by many that users will soon demand facilities providing them direct access to the video content of interest without the need for intensive timeline navigation [20, 26]. With the help of social tagging and multimedia content analysis techniques, like speech recognition [27] and visual concept detection [24], textual labels can be added to video segments allowing for interactive retrieval. Although the video retrieval community has proposed several interactive browsers able to support the user in this task on the relatively large screen estate of desktop and laptop
machines [27, 24, 1] or as part of a collaborative (mobile) network [23], only few interfaces exist for single-user interaction on handheld devices [18, 9]. In this paper, we are interested in the question of how advanced video retrieval browsers should be adapted to the constraints imposed by the screen of the handheld device. In order to place this question in perspective, it is important to realize that the basic building blocks of advanced video retrieval interfaces are thumbnails extracted from a short piece of video assumed to be representative of the multimedia content. We distinguish between static thumbnails (i.e. a reduced-size version of a single static image) and dynamic thumbnails (i.e. a set of consecutive moving reduced-size images). State-of-the-art video retrieval browsers display several of these static or dynamic thumbnails simultaneously in response to a user query, for example as a matrix-like storyboard or ordered in a grid [27]. One may be tempted to consider these browsers unsuited for mobiles, as the limited screen size of handheld devices would not allow displaying several thumbnails at the same time without loss of recognition by the user. However, a recent user study by Torralba et al. [25] revealed that human participants were able to outperform computer vision algorithms in an image recognition task on the desktop, even when they were only able to see strongly reduced versions of the original images. In fact, the authors showed that at a size of 32x32 pixels, humans are still able to recognize 80% of the visual content accurately. In [16], we evaluated similar recognition tasks in a video retrieval context on a mobile device, confirming a surprisingly high recognition performance at relatively small sizes, especially when using dynamic thumbnails. These results suggest that despite the limited screen estate of handheld devices, we should still be able to design much more complex thumbnail-based interfaces than commonly assumed. However, our studies have been limited to single thumbnails shown in isolation. Thus, our findings cannot necessarily be generalized to video retrieval applications where multiple thumbnails are shown simultaneously. In this paper, we present a user study that also investigates the number of thumbnails in addition to thumbnail size and type (i.e. static versus moving). In particular, we verify whether users are able to assess video retrieval results on a mobile phone when multiple static and dynamic thumbnails are displayed simultaneously (at different sizes) and whether this will influence their perception and verification performance. In the remainder of this paper we survey related work on (mobile) video retrieval interfaces (section 2), present the experimental methodology used and detail our findings (section 3), and conclude with resulting design suggestions (section 4).
2 Related Work
2.1 Interfaces for Traditional Video Retrieval
Early video retrieval interfaces for desktop PCs and laptops simply presented a video to the user as a sequence of static thumbnails, using metaphors like filmstrips [6]. Alternatively, researchers have merged several static thumbnails into Manga-like collages [4, 7] and storyboards [1]. Static thumbnails are well suited to summarize relatively short video shots. When shots are lengthy and contain a lot of object or camera motion, a single still image might not be able to communicate the factual visual content appropriately. To summarize lengthy shots containing moving
content, dynamic thumbnails such as skims [6] have been proposed. These summaries aim to capture the full information content as compactly and efficiently as possible. Despite the apparent advantage of dynamic thumbnails over static thumbnails [6, 19, 21], we are unaware of user studies other than [16] quantifying this advantage. Recall that on desktop PCs and laptops, the thumbnail has always been the traditional building block for summarizing video content in video retrieval interfaces [27, 24, 1, 10, 19, 21]. The result of a video query is typically represented as a sorted sequence of relevant thumbnails ordered in a grid, or a simple list of results. Since content-based video retrieval is an unsolved problem, it has been shown by many that in addition to visualizing direct video retrieval results in the interface, displaying indirect video retrieval results such as the complete timeline [24] of the retrieved video shot, visually similar video shots [10], semantically related shots [21], or pseudo-relevance feedback results [19] are all beneficial to increase video search performance [22]. These findings have resulted in effective but complex video retrieval interfaces, which maximize usage of screen estate to display as many thumbnails as possible. A good example is the CrossBrowser [24], shown in Figure 1, which displays shot-based video retrieval results on a vertical axis and utilizes the horizontal axis to display the timeline of the complete video for a selected shot. Naturally, it can be called into question whether such advanced video retrieval interfaces exploiting the screen estate of desktop PCs to the max can be transferred to the relatively small display of handheld devices.
Fig. 1. Displaying video retrieval results using an advanced visualization for the desktop (left) and a handheld device (right). The CrossBrowser [24] on the left displays (static or dynamic) thumbnail results on the vertical axis and exploits the horizontal axis to display a thumbnail timeline of the complete video for the selected clip. YouTube for the iPhone orders static thumbnail results in a linear list together with user-provided metadata. Note the waste of screen estate by the CrossBrowser and the limited number of static thumbnails on the mobile.
2.2 Interfaces for Mobile Video Retrieval

Most present-day video retrieval interfaces on handheld devices are based on an ordered list of thumbnails; see Figure 1 for a representative example. It appears that the
size of the thumbnail in current commercial mobile video retrieval interfaces is merely decided by the amount of meta-data that services want to display next to it. This observation also holds for one of the earliest mobile video retrieval interfaces by Lee et al. [18]. In that reference, the authors study the design of a PDA user interface for video browsing, emphasizing the role of user interaction. Although the authors claim that spatial presentation of multiple thumbnails on a mobile display is unsuitable, no quantitative experiments are provided to support the claim. Most existing work on mobile video retrieval has focused on mechanisms for browsing through a single video. The MobileZoomSlider [11], for example, provides the user with several sliders, overlaid on the video on demand, which mimic different granularity levels of the video. The same approach was integrated into Apple's iPhone in iPhoneOS version 3.0. The idea of using sliders has been developed further in [5], where the authors suggest combining a preview slider with markers, representing scene and speech information, on the timeline. In [15] the authors evaluate alternative approaches for browsing along the timeline, such as flicking and elastic interfaces. [13] uses a virtual scroll wheel overlaid on the video. Despite their effectiveness for browsing a single video, none of these interfaces are optimized for browsing through thumbnails originating from several videos, as is typical in retrieval. Exceptions are [14] and [23]. In the iBingo system [23], several users collaborate using different (mobile) devices to retrieve video fragments relevant to an information need. Special attention is placed on the collaborative retrieval backend, which aims to eliminate redundancy and repetition among co-searchers to increase overall retrieval efficiency. Since iBingo emphasizes collaboration by using multiple devices, here also, no special attention is reserved for the role of the thumbnail in the video retrieval interface. In [14], the authors present an interface where thumbnails are used to access points of interest in a larger video. However, their work focuses on interface design for one-handed operation and ignores related video retrieval or browsing questions.

2.3 Thumbnails for Mobile Video Retrieval Interfaces

Although static thumbnails are often used for mobile video browsing in both commercial systems (cf. Fig. 1) and research prototypes (cf. previous subsection), the related interfaces are generally much less complex than their desktop counterparts, and dynamic thumbnails are generally not applied in a mobile context at all. In addition, the size of the thumbnails used is often rather large. Motivated by Torralba et al.'s findings [25] in a desktop context, we presented several experiments in [16] investigating the relevance of thumbnail size and type for human recognition performance in a thumbnail-based video retrieval scenario. In different test runs, both static and dynamic thumbnails were extracted from videos and presented to subjects at increasing as well as random sizes. Participants of the experiments had to solve typical video retrieval tasks solely based on a single thumbnail. In terms of performance, we identified 90 pixels as a reasonable threshold at and above which most tasks were solved successfully based on a static thumbnail. With dynamic thumbnails, however, a similar performance was achieved starting at sizes of 70 pixels.
Most surprisingly, human performance was relatively high even at much lower sizes, with a successful recognition in almost 90% of the cases at the smallest thumbnail size of 30 pixels. These results indicate that we can indeed build much more complex thumbnail-based
interfaces for mobile video retrieval despite the small screen estate of handheld devices. However, all these tests have been done based on a single thumbnail that was presented to the subjects on a black background. In this paper, we verify if and to what degree these findings generalize to more complex interfaces where multiple thumbnails are presented simultaneously. Motivated by the findings that have been observed with single thumbnails displayed in isolation, we present a user study with 24 participants where we investigate the influence of thumbnail size and motion when multiple thumbnails are displayed simultaneously.
3 User Study with Varying Numbers and Sizes of Thumbnails

3.1 Motivation and Setup

Small thumbnails are generally used in video retrieval interfaces because they allow us to present a large amount of information at a time – either from various documents (e.g. the shot-based retrieval results shown in the vertical axis of the CrossBrowser, cf. Fig. 1, left) or from one single video (e.g. the timeline-based representation of selected shots shown in the horizontal axis of the CrossBrowser). Hence, in addition to evaluating thumbnail size and type for a single shot (as done in [16]), it is also important to investigate these characteristics in the context of several thumbnails, since we can expect that the concurrent representation of multiple thumbnails will influence users' perception and verification performance. The main purpose of our user study was therefore to evaluate sizes and the relevance of dynamic versus static thumbnails when multiple shots of one video are represented in a timeline – similar to the horizontal axis in the CrossBrowser.
Experiments in this study have been done with a Motorola Droid/Milestone phone running the Android operating system version 2.0. The phone features a touch screen with a relatively large resolution of 854x480 pixels. Although we expect most phones to increase in screen resolution in the future, this is at the upper end even for smart phones and clearly above the current state of the art for the majority of devices. Therefore, we decided to implement and run all experiments in compatibility mode with older phones and Android OS versions, resulting in a screen resolution of 569x320 pixels that was used in all tests presented in this paper. Based on this resolution, we created three setups: first, a timeline with nine thumbnails at size 60 pixels each; second, one with seven thumbnails at size 75 pixels each; and third, one with five thumbnails at size 110 pixels each (sizes indicate the width of the thumbnails, and heights are adapted according to the video's aspect ratio). The reason for the selection of these three thumbnail sizes is explained below. We show examples of the three interfaces in Figure 2. Videos and thumbnails were taken from the TRECVID benchmark [4], and realistic questions were selected from [17]. Some questions needed to be adapted in order to fit a "yes/no" answer scheme (which was chosen to focus on the independent variables thumbnail size and type) but were similar in spirit to the ones used in the literature. Questions were chosen randomly, but with the aim of covering different retrieval tasks – in particular: object and subject verification (e.g. "Does the clip contain any police car?") versus scene and event verification (e.g. "Does the clip contain any moving black car?").
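As a quick sanity check on the three setups described above, the following sketch (not from the paper; the 2-pixel spacing and the 4:3 aspect ratio are our own assumptions) computes how many thumbnails of each width fit into the 569-pixel-wide timeline and what the corresponding thumbnail heights would be:

```python
# Minimal sketch (assumptions: 2 px spacing between thumbnails, 4:3 aspect
# ratio); the paper only states the 569x320 resolution and the three widths.
SCREEN_WIDTH = 569

def timeline_layout(thumb_width, aspect_ratio=4 / 3, spacing=2):
    """Return how many thumbnails of the given width fit in one row and the
    thumbnail height derived from the aspect ratio."""
    count = (SCREEN_WIDTH + spacing) // (thumb_width + spacing)
    height = round(thumb_width / aspect_ratio)
    return count, height

for width in (60, 75, 110):
    n, h = timeline_layout(width)
    print(f"width {width:3d} px -> up to {n} thumbnails, height ~{h} px")
# -> 9, 7, and 5 thumbnails, matching the three setups described above.
```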
Fig. 2. Interfaces used in the user study: three multiple-thumbnail timelines representing different shots from one video with 9, 7, and 5 thumbnails at widths of 60, 75, and 110 pixels, respectively. Shots are sorted along the timeline.
Overall, we took twelve video clips plus thumbnails from [4] and associated questions from [17] to create twelve different examples – four for each of the three setups shown in Figure 2.

3.2 Experiment

Experiments were done in a quiet place with no distractions and with subjects sitting comfortably on a chair. They involved 24 subjects (22 male, 2 female; ages: 2 aged 15-20, 15 aged 21-30, 5 aged 31-40, and 2 aged 41-50). Human recognition performance can be extremely high when the device is held unnaturally close to one's face (even at very small thumbnail sizes, cf. [16]). Therefore, participants were asked to "hold the device in a natural and comfortable way", for example by resting their arms on a table (cf. Fig. 3). A neutral observer reminded them of this guideline when an awkward position was recognized during the evaluations. Based on the sequence of thumbnails shown in the interfaces, subjects had to perform verification tasks that are typical in common video retrieval situations. For this, the participants had to answer the 12 questions (4 for each interface setup). The order of the interfaces as well as the order of the questions for each individual setup was randomized across the users in order to avoid any related influence on the results. After some informal testing with an initial implementation, it was obvious to us that a version where all dynamic thumbnails are playing at the same time would create too much distraction and a cognitive overload. Hence, we decided that initially only the center thumbnail (which corresponds to the major thumbnail shown in the center of the CrossBrowser's horizontal axis; cf. Fig. 1, left) is playing (in an endless loop) whereas all other thumbnails are static ones.
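The counterbalancing described above (random interface order and random question order per setup) could be generated along the following lines; this is a sketch under our own assumptions, and the question identifiers are placeholders rather than the study's actual material:

```python
# Sketch only: generate, per participant, a random order of the three
# interface setups and of the four questions assigned to each setup.
import random

SETUPS = {"9x60": ["Q1", "Q2", "Q3", "Q4"],     # placeholder question IDs
          "7x75": ["Q5", "Q6", "Q7", "Q8"],
          "5x110": ["Q9", "Q10", "Q11", "Q12"]}

def session_plan(participant_seed):
    rng = random.Random(participant_seed)       # reproducible per participant
    setup_order = list(SETUPS)
    rng.shuffle(setup_order)                    # randomize the setup order
    plan = []
    for setup in setup_order:
        questions = SETUPS[setup][:]
        rng.shuffle(questions)                  # randomize questions within the setup
        plan.append((setup, questions))
    return plan

print(session_plan(participant_seed=7))
```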
Fig. 3. Participants during the user study assessing the relevance of video retrieval results on the mobile phone with varying sizes and numbers of static and dynamic thumbnails
Users could, however, start playing any related dynamic thumbnail by just clicking on its static representation. This also stopped the currently playing thumbnail and replaced it with its static version. Hence, only one dynamic thumbnail was shown at a time.

3.3 Results

The sizes of the thumbnails used in all three setups were motivated by the results of our previous study with single thumbnails shown in isolation [16]. The smallest thumbnail size of 60 pixels was chosen because it proved to be large enough to achieve a good performance, although both static and dynamic thumbnails also produced a large number of errors at this size. 70 pixels had been identified as a good threshold for dynamic thumbnails, i.e., smaller than the size suggested for static thumbnails but large enough for dynamic ones. Finally, 110 pixels is a size at which both static and dynamic thumbnail types should lead to a good human verification performance. The number of thumbnails in each timeline was chosen to take full advantage of the whole width of the mobile's screen at a resolution of 569x320. Figure 4 illustrates the number and correctness of answers with respect to the number of dynamic thumbnails played for each of the three timelines used in the experiments. The diagrams illustrate that displaying 7 or 9 clips at smaller sizes results in a relatively high number of wrong answers – especially compared to displaying 5 clips at a size of 110 pixels. Summed over all questions, we observed 30 and 44 mistakes for thumbnail sizes of 60 and 75, respectively, but only 7 for a size of 110 pixels. Interestingly, this number of mistakes seems inversely related to how often dynamic thumbnails have been played. If we divide the total number of played dynamic thumbnails by the number of clips, we observe an average number of plays per thumbnail of 36 and 37 for the two smallest thumbnail sizes versus 57 plays per thumbnail for a size of 110 pixels. However, in terms of absolute numbers, users clicked far more often on static thumbnails at the smallest size (232 clicks at size 60) than for larger ones (164 and 190 for sizes 75 and 110, respectively). These observations confirm the relevance of dynamic thumbnails compared to static ones that we already identified in our experiments with single thumbnails [16]. In addition, there was a lower average play rate per thumbnail for smaller sizes.
(Figure 4 consists of three bar charts, one per interface setup – the 9-, 7-, and 5-thumbnail timelines – plotting the number of correct and wrong answers against the number of dynamic thumbnails played.)
Fig. 4. Results for each interface setup used in the experiments. Human performance in relation to how many dynamic thumbnails have been played for timelines at different sizes and thumbnail numbers summed over all answers and participants. Note an extremely low number of mistakes for the largest thumbnails in the shortest timeline.
This indicates that users were able to gain some dynamic information from the static representation of the timeline despite their small sizes. For example, if three thumbnails in a row show an airplane in the sky, people can conclude that the related video contains a flying airplane, even without actually playing any dynamic thumbnail.
However, relying on static thumbnails came at the price of lower recognition performance: there was a relatively high number of mistakes despite the context provided by the larger number of static thumbnails. This can be explained by the verification problems with static thumbnails at smaller sizes that were identified in our first study.
(Figure 5 panels, left to right: CrossBrowser; mobile design suggestion 1; mobile design suggestion 2.)
Fig. 5. Design drafts for a mobile video retrieval interface considering the results of our user studies (center and right). By rearranging the vertical and horizontal axes in an appropriate way (and considering the size recommendations from our user study), we are able to visualize about the same information as with the CrossBrowser interface (left) and offer almost the same functionality to the user – something that is generally considered impossible due to the mobile's small screen size. Note that in both cases there is even room for the representation of additional meta information (filename, etc.) at the bottom of the screen.
4 Conclusion and Design Suggestions

In this paper, we investigated the human recognition performance for typical video retrieval tasks based on a time-ordered sequence of multiple thumbnails. In particular, our experiments evaluated the relevance of thumbnail size, number, and type for human retrieval performance on a handheld device. Considering thumbnail size, we compared our results to previously identified thresholds from experiments where single thumbnails are shown in isolation. The observations of our user study confirmed the advantage of dynamic over static thumbnails, especially at very small sizes. However, the most important result of the study presented in this paper is that the previously identified thresholds for optimal thumbnail sizes do not transfer when thumbnails are not shown in isolation but in combination with several other thumbnails (even from the same video). Potential reasons for this are manifold (cognitive overload, distraction and additional clutter, etc.) and worth further investigation. Despite this decrease in performance compared to [16], our new experiments showed that users are able to achieve good verification results – even at small to medium thumbnail sizes. The participants also performed well in recognizing the most promising thumbnails and selecting them for playback as dynamic thumbnails. Contrary to the widespread belief that screens of handheld devices are unsuited for simultaneously visualizing several (small) thumbnails, our results therefore suggest that users are quite able to handle and assess multiple thumbnails. This is especially true when using moving images. These results suggest promising avenues for future research related to the design of advanced video retrieval interfaces on mobile devices.
Let us first reconsider the mobile YouTube interface shown on the right side of Figure 1. Here, independent thumbnails are presented in a sorted list of results. In such an interface, we suggest playing the thumbnails, i.e., using dynamic thumbnails, because our results from [16] and the ones presented in this paper showed an increased human performance when using dynamic thumbnails. For more complex interfaces, as they are common on the desktop, the user study we presented here suggests that it is unwise to display the thumbnails on a mobile as small as suggested by our previous studies. However, a timeline with up to five thumbnails is no problem at all, even for the small display of a handheld device (also remember that we purposely used a rather low resolution for our tests). Reconsidering the CrossBrowser, for example, our results suggest that we could successfully use a mobile version where the vertical axis as well as the horizontal axis are both visualized at the same time (although both horizontally and with fewer thumbnails). Figure 5 illustrates design drafts for an implementation of a mobile video retrieval interface featuring a similar functionality as the CrossBrowser interface. The results of our studies suggest a high human performance in common retrieval tasks with such a setup despite the small screen size. Our current and future work includes implementations of such designs on a smart phone and evaluations comparing human assessment performance with the traditional CrossBrowser interface on a PC. In conclusion, as the title "Keep moving!" of our previous work [16] suggests, moving images have a high relevance in mobile recognition tasks. Thumbnail sizes, on the other hand, are almost negligible – if and only if thumbnails are shown in isolation. The study presented in this paper confirmed the relevance of moving images over stills. However, it also showed that size is important if multiple thumbnails are used simultaneously to create more complex interface designs. Hence, when designing mobile video retrieval interfaces, we not only recommend to "Keep moving!" but also to keep in mind that "Size matters!"
References 1. Adcock, J., Cooper, M., Girgensohn, A., Wilcox, L.: Interactive Video Search Using Multilevel Indexing. In: Proc. CIVR (2005) 2. Ames, M., Eckles, D., Naaman, M., Spasojevic, M., Van House, N.: Requirements for Mobile Photoware. Personal and Ubiquitous Computing 14(2), 95–109 (2010) 3. Björk, S., Holmquist, L.E., Redström, J., Bretan, I., Danielsson, R., Karlgren, J., Franzén, K.: WEST: a Web Browser for Small Terminals. In: Proc. ACM UIST (1999) 4. Boreczky, J., Girgensohn, A., Golovchinsky, G., Uchihashi, S.: An Interactive Comic Book Presentation for Exploring Video. In: Proc. ACM CHI, The Hague, Netherlands, pp. 185–192 (2000) 5. Brachmann, C., Malaka, R.: Keyframe-less Integration of Semantic Information in a Video Player Interface. In: Proc. EITC, Leuven, Belgium (2009) 6. Christel, M.G., Warmack, A., Hauptmann, A.G., Crosby, S.: Adjustable Filmstrips and Skims as Abstractions for a Digital Video Library. In: Proc. IEEE Advances in Digital Libraries Conference 1999, Baltimore, MD, pp. 98–104 (1999) 7. Christel, M.G., Hauptmann, A.G., Wactlar, H.D., Ng, T.D.: Collages as Dynamic Summaries for News Video. In: Proc. ACM Multimedia (2002) 8. Dachselt, R., Frisch, M.: Mambo: A Facet-Based Zoomable Music Browser. In: Proc. ACM Mobile and Ubiquitous Multimedia (2007)
9. Gurrin, C., Brenna, L., Zagorodnov, D., Lee, H., Smeaton, A.F., Johansen, D.: Supporting Mobile Access to Digital Video Archives Without User Queries. In: Proc. MobileHCI (2006) 10. Heesch, D., Rüger, S.: Image Browsing: A semantic analysis of NNk networks. In: Proc. CIVR (2005) 11. Hürst, W., Götz, G., Welte, M.: Interactive Video Browsing on Mobile Devices. In: Proc ACM Multimedia (2007) 12. Hürst, W.: Video Browsing on Handheld Devices—Interface Designs for the Next Generation of Mobile Video Players. IEEE MultiMedia 15(3), 76–83 (2008) 13. Hürst, W., Götz, G.: Interface Designs for Pen-Based Mobile Video Browsing. In: Proc. Designing Interactive Systems, DIS (2008) 14. Hürst, W., Merkle, P.: One-Handed Mobile Video Browsing. In: Proc. uxTV (2008) 15. Hürst, W., Meier, K.: Interfaces for Timeline-based Mobile Video Browsing. In: Proc. ACM Multimedia (2008) 16. Hürst, W., Snoek, C.G.M., Spoel, W.-J., Tomin, M.: Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval. In: Proc. ACM Multimedia (2010) 17. Huurnink, B., Snoek, C.G.M., de Rijke, M., Smeulders, A.W.M.: Today’s and Tomorrow’s Retrieval Practice in the Audiovisual Archive. In: Proc. ACM CIVR (2010) 18. Lee, H., Smeaton, A.F., Murphy, N., O’Connor, N., Marlow, S.: Fischlar on a PDA: Handheld User Interface Design to a Video Indexing, Browsing, and Playback System. In: Proc. UAHCI (2001) 19. Luan, H., Neo, S., Goh, H., Zhang, Y., Lin, S., Chua, T.: Segregated Feedback with Performance-based Adaptive Sampling for Interactive News Video Retrieval. In: Proc. ACM Multimedia (2007) 20. O’Hara, K., Slayden, A., Michell, A.S., Vorbau, A.: Consuming Video on Mobile Devices. In: Proc. CHI (2007) 21. de Rooij, O., Snoek, C.G.M., Worring, M.: Balancing Thread Based Navigation for Targeted Video Search. In: Proc. CIVR (2008) 22. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation Campaigns and TRECVid. In: Proc. ACM Multimedia Information Retrieval, Santa Barbara, USA, pp. 321–330 (2006) 23. Smeaton, A.F., Foley, C., Byrne, D., Jones, G.J.F.: iBingo Mobile Collaborative Search. In: Proc. CIVR (2008) 24. Snoek, C.G.M., Worring, M., Koelma, D.C., Smeulders, A.W.M.: A Learned LexiconDriven Paradigm for Interactive Video Retrieval. IEEE Transactions on Multimedia 9(2), 280–292 (2007) 25. Torralba, A., Fergus, R., Freeman, W.T.: 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11), 1958–1970 (2008) 26. Vaughan-Nichols, S.J.: The Mobile Web Comes of Age. IEEE Computer 41(11), 15–17 (2008) 27. Wactlar, H.D., Christel, M.G., Gong, Y., Hauptmann, A.G.: Lessons Learned from Building a Terabyte Digital Video Library. IEEE Computer 32(2), 66–73 (1999)
An Information Foraging Theory Based User Study of an Adaptive User Interaction Framework for Content-Based Image Retrieval

Haiming Liu1, Paul Mulholland1, Dawei Song2, Victoria Uren3, and Stefan Rüger1

1 Knowledge Media Institute, The Open University, Milton Keynes, MK7 6AA, UK
2 School of Computing, The Robert Gordon University, Aberdeen, AB25 1HG, UK
3 Department of Computer Science, University of Sheffield, Sheffield, S1 4DP, UK
{h.liu,p.mulholland,s.rueger}@open.ac.uk, [email protected], [email protected]
Abstract. This paper presents the design and results of a task-based user study, based on Information Foraging Theory, on a novel user interaction framework - uInteract - for content-based image retrieval (CBIR). The framework includes a four-factor user interaction model and an interactive interface. The user study involves three focused evaluations, 12 simulated real life search tasks with different complexity levels, 12 comparative systems and 50 subjects. Information Foraging Theory is applied to the user study design and the quantitative data analysis. The systematic findings have not only shown how effective and easy to use the uInteract framework is, but also illustrate the value of Information Foraging Theory for interpreting user interaction with CBIR. Keywords: Information Foraging Theory, User interaction, Four-factor user interaction model, uInteract, content-based image retrieval.
1 Introduction
In an effort to improve the interaction between users and search systems, some researchers have focused on developing user interaction models and/or interactive interfaces. Spink et al. (1998) proposed a three-dimensional spatial model to support user interactive search [8]. Campbell (2000) proposed the Ostensive Model (OM), which indicates the degree of relevance relative to when a user selected the evidence from the results set [1]. Ruthven et al. (2003) adapted two dimensions from Spink et al.'s model combined with the OM [7]. Liu et al. (2009) proposed an adaptive four-factor user interaction model (FFUIM) based on the above models for content-based image retrieval (CBIR) [3]. The interaction models need to be delivered by visual interactive interfaces to further improve the user interaction. For instance, Urban et al. (2006) developed a visual image search system based on the OM [10]. Liu et al. (2009)
proposed an interactive CBIR interface that successfully delivered the FFUIM and allowed users to manipulate the model effectively [4]. To date, most of the evaluations of interactive search systems are still system-oriented. For instance, the search results of an automatic pseudo or simulated user evaluation are measured by precision and recall. However, users in real life seek to optimize the entire search process, not just the accuracy of the results. Evaluation of output alone is not enough to explain the effectiveness of the systems or users' search experience [2]. Pirolli (2007) stated in Information Foraging Theory [6] that the two interrelated environments, namely the task environment and the information environment, will affect the information search process. The definition of the task environment "refers to an environment coupled with a goal, problem or task - the one for which the motivation of the subject is assumed". "The information environment is a tributary of knowledge that permits people to more adaptively engage their task environments". In other words, "what we know, or do not know, affects how well we function in the important task environments that we face in life." [5]. We consider that a clear task environment and a rich information environment determine a forager's effective and enjoyable search experience. With respect to Information Foraging Theory, our task-based user study applies simulated search tasks with different complexity levels, and employs users of different ages and with different levels of image search experience. This way, we can investigate how the different task environments and the users' different information environments affect the evaluation results.
2 uInteract Framework
The uInteract framework aims to improve user interaction and users' overall search experience. The framework includes a four-factor user interaction model (FFUIM) and an interactive interface. It employs the HSV colour feature, the City Block dissimilarity measure and the ImageCLEF2007 collection.

2.1 Four-Factor User Interaction Model
The four factors taken into account in the model are relevance region, relevance level, time and frequency [3]. The relevance region comprises two sub-regions: relevant (positive) evidence and non-relevant (negative) evidence. The two sub-regions contain a range of relevance levels. The relevance level is a quantitative level, which indicates how relevant or non-relevant the evidence is. The time factor adapts the OM [1], which indicates the degree of relevance or non-relevance relative to when the evidence was selected. The frequency factor captures the number of appearances of an image in the user-selected evidence, counted separately for positive and negative evidence. The FFUIM works together with two fusion approaches, namely the Vector Space Model (VSM) for the positive-query-only scenario and K-Nearest Neighbours (KNN) for the scenario with both positive and negative queries.
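The exact weighting formulas of the FFUIM are not given in this paper (they are detailed in [3]); purely as an illustration, a per-image weight combining the four factors might be sketched as follows, where the exponential time profile and the multiplicative combination are our own assumptions:

```python
# Illustrative sketch only: the concrete FFUIM weighting is defined in [3],
# not here, so the time profile and the combination rule below are assumed.

def time_weight(age, base=0.5):
    """Time (ostensive) factor: evidence selected more recently weighs more.
    `age` is 0 for the current feedback iteration, 1 for the previous one, ..."""
    return base ** age

def evidence_weight(relevance_level, age, frequency):
    """Combine relevance level (uInteract uses integers 1-20), time and
    frequency into a single weight for one piece of evidence."""
    return relevance_level * time_weight(age) * frequency

# Positive and negative evidence live in separate relevance regions; per the
# paper, positive-only queries are fused with the VSM and queries with both
# positive and negative evidence with KNN.
positive_evidence = [("img_12", 18, 0, 2), ("img_07", 9, 1, 1)]   # hypothetical (id, level, age, freq)
weights = {img: evidence_weight(lvl, age, freq) for img, lvl, age, freq in positive_evidence}
print(weights)   # {'img_12': 36.0, 'img_07': 4.5}
```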
Fig. 1. The uInteract interface (the keys are explained in the main text.)
2.2 uInteract Interface
The key features of the uInteract interface (Figure 1) [4] are as follows:
(1) The query images panel provides a browsing functionality that facilitates the selection of the initial query images.
(2) Users can provide both positive and negative examples to a search query in the positive and negative query panels, and further expand or remove images from that query.
(3) By allowing the user to override the system-generated scores (integers 1-20) of positive and negative query images, users can directly influence the relevance level of the feedback.
(4) The results panel shows not only the best matches but also the worst matches. This functionality can enable users to gain a better understanding of the data set they are searching.
(5) Combining both positive and negative query history functionality has not previously been undertaken in CBIR. The query history not only provides users with the ability to reuse their previous queries, but also enables them to expand future search queries by taking previous queries into account.
3 User Study Methodology
Our user study contained three focused evaluations: evaluation1 (E1) was to evaluate the ease of use and usefulness of the uInteract interface; evaluation2 (E2) was to evaluate the performance of the four profiles of the OM; evaluation3 (E3) was to evaluate the effectiveness of the four different settings of the FFUIM. Fifty subjects were employed for the user study. They were a mixture of males and females, undergraduate and postgraduate students and academic staff from a variety of departments with different ages and levels of image search
experience. The subjects were classified into two categories - inexperienced or experienced - based on their image search experience. We considered that people were experienced subjects if they searched images at least once a week, and otherwise they were inexperienced subjects. The 50 subjects were divided into three groups, 17, 16 and 17 subjects assigned to E1, E2 and E3 respectively. In each evaluation, the subjects were asked to complete four search tasks with different complexity level on four systems randomly (limited to five minutes for each task). The complexity level of each task in E1 was reflected by the task description. Task1 (E1T1) provided both search topic and example images, so we considered it the easiest task in term of the “easiness” of formulating the query and identifying the information need. Task2 (E1T2) gave example images without a topic description, so we considered it harder than E1T1. Task3 (E1T3) had only a topic but no image examples, which was even harder than E1T2. Task4 (E1T4) described a broad search scenario without any specific topic and image examples, so it was the hardest task in our view. The four testing systems of E1 were: system1 (I1) had a baseline interface, where users were allowed to give positive feedback from search results through a simplified interface; system2 (I2) - an interface based on Urban et al.’s [10] model, provided positive query history functionality which was an addition to I1; system3 (I3) - an interface based on Ruthven et al.’s [7] model, enhanced I2 by allowing users to assign a relevance value to the query images; system4 (I4) was the uInteract interface [4], which added negative query, negative result and negative query history functionalities based on I3. The four tasks in E2 and E3 used the same description structure, which had both specific search topic and three example images. The complexity level of each task was based on the search accuracy of the query images of the tasks from our earlier lab-based simulated experiments results. The mean average precision (MAP) of task1 (T1), task2 (T2), task3 (T3) and task4 (T4) was 0.2420, 0.0872, 0.0294, 0.0098 respectively. We considered T1 was the easiest task with the highest precision, and then it was followed by T2 and T3. T4 had lowest precision thus we took it as the hardest task. The four testing systems of E2 were: system1 (OM1) applied the increasing profile of the OM; system2 (OM2) applied the decreasing profile of the OM; system3 (OM3) applied the flat profile of the OM; system4 (OM4) applied the current profile of the OM. The four testing systems of E3 were: system1 (FFUIM1) delivered the relevance region factor and time factor of the FFUIM and here we apply the increasing profile of the OM to both positive and negative queries; system2 (FFUIM2) delivered the relevance region factor, time factor and relevance level factor of the FFUIM, and here we combined the increasing profile of the OM with the relevance scores provided by the users for both positive and negative queries; system3 (FFUIM3) delivered the relevance region factor and time factor and frequency factor of the FFUIM, and here we combined the increasing profile of the OM with the number of times (frequency) images appeared in the feedback
for both positive and negative queries; system4 (FFUIM4) delivered the relevance region factor, time factor, relevance level factor and frequency factor of the FFUIM, and here we combined the increasing profile of the OM, the relevance scores provided by the users and the number of times (frequency) images appeared in the feedback for both positive and negative queries. The data was collected through questionnaires and the actual search results. The questionnaires used five-point Likert scales, and comprised an entry questionnaire, a post-search questionnaire, and an exit questionnaire.
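The four OM profiles applied in E2 (increasing, decreasing, flat and current) are only named above. One way to picture them – our own reading, not the authors' implementation – is as normalised weighting curves over the feedback iterations:

```python
# Sketch of the four ostensive-model profiles (OM1-OM4), interpreted as
# weights over feedback iterations (iteration 0 = oldest, last = current).
# The concrete curve shapes are assumptions; the paper only names the profiles.

def profile_weights(name, n_iterations):
    if name == "increasing":        # OM1: more recent evidence weighs more
        w = [i + 1 for i in range(n_iterations)]
    elif name == "decreasing":      # OM2: older evidence weighs more
        w = [n_iterations - i for i in range(n_iterations)]
    elif name == "flat":            # OM3: all iterations weigh the same
        w = [1] * n_iterations
    elif name == "current":         # OM4: only the current iteration counts
        w = [0] * (n_iterations - 1) + [1]
    else:
        raise ValueError(f"unknown profile: {name}")
    total = sum(w)
    return [x / total for x in w]   # normalise so the weights sum to 1

for name in ("increasing", "decreasing", "flat", "current"):
    print(name, profile_weights(name, 4))
```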
3.1 Main Performance Indicators and Nine Hypotheses of Quantitative Analysis
The main performance indicators (PIs) of the qualitative data are generated from the questionnaires and the actual search results. The main PIs of E1, E2 and E3 are listed in Table 1. The following nine evaluation hypotheses aim to investigate not only the effectiveness and ease of use of the uInteract framework, but also how the different task environments and the users' information environments affect the performance indicators.
– Hypothesis1: Task Order (PI5) and System Order (PI7) will affect the PI8-33 provided by subjects because of familiarity or fatigue;
– Hypothesis2: System (PI6) will affect the PI8-33;
– Hypothesis3: Task (PI4) will affect the PI8-33 provided by subjects because of different complexity levels;
– Hypothesis4: The interaction between Task (PI4) and System (PI6) will influence the scores of the PI8-33;
– Hypothesis5: Person (PI1) will affect the PI8-33, based on individual differences;
– Hypothesis6: The Age (PI2) and prior Image Search Experience (PI3) of the subjects will affect their opinion on the overall search experience (PI8-21);
– Hypothesis7: The Age (PI2) and prior Image Search Experience (PI3) of the subjects will affect their opinion on the functionalities of the interfaces (PI22-33);
– Hypothesis8: System (PI6) and Task (PI4) will have an impact on the Precision (PI34) of the search results;
– Hypothesis9: System (PI6) and Task (PI4) will have an impact on the Recall (PI35) of the search results.
3.2 Quantitative Data Analysis Procedure
Quantitative data analysis is supported by the use of statistical software (SPSS). The analysis procedure is as follows:
1. Identify the so-called precision value and recall for the 12 tasks performed by the 50 subjects;
Table 1. The main performance indicators from the three evaluations for qualitative data analysis
– Get result images: We first take the union (∪) of the result images of one task over all the result images selected by all of the subjects who did this task. Then we do the same for the other 11 tasks (4 tasks in each evaluation) to get 12 result-image union sets;
– Get independent raters to rate the result images: We asked 5 independent raters to rate all images in the 12 result union sets on a 1-to-5 scale (5 is the most relevant). The raters give a relevance
value to every image in the union result set of a task, and the raters do the same for the result images of the other 11 tasks. We tested the reliability of the raters' scores of all the images for the 12 tasks with Cronbach's Alpha in SPSS, taking a reliability of 0.70 or higher as acceptable, and found that this criterion was met for all of the 12 tasks across the three evaluations;
– Get the precision value: The precision value for each result image is the mean rating value provided by the five raters for the image. The precision value of a task is the mean precision value of all the result images of the task;
– Get the recall value: The recall of a task is the number of images selected by a subject to complete the task;
2. Obtain the figures for the performance indicators listed in Table 1 from the questionnaires and the actual search results for the three focused evaluations, and test the nine hypotheses of Section 3.1 by factorial ANOVA statistical tests;
3. Analyze the results obtained from the ANOVA tests based on Information Foraging Theory.
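Step 1 above can be pictured with a small sketch; the rating matrix below is made up purely for illustration, and the recall value simply counts a subject's selected images as defined above:

```python
# Sketch of step 1: Cronbach's alpha over the five raters' scores of one
# task's result images, and the "precision value" as the mean rating.
# The ratings are illustrative numbers, not data from the study.
from statistics import mean, pvariance

ratings = [            # rows = result images of one task, columns = 5 raters (1-5 scale)
    [5, 4, 5, 4, 5],
    [3, 3, 4, 3, 3],
    [2, 1, 2, 2, 1],
]

def cronbach_alpha(rows):
    k = len(rows[0])                                   # number of raters
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])      # variance of summed scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

precision_value = mean(mean(r) for r in ratings)       # mean over images of per-image mean ratings
recall_value = 3                                       # e.g. a subject selected 3 images for the task
print(round(cronbach_alpha(ratings), 2), round(precision_value, 2), recall_value)
```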
4 Evaluation Results and Analysis
Table 2 shows the results of the three evaluations, obtained by ANOVA analysis (with α = 0.05) of the main PIs based on the nine hypotheses.

Table 2. How the nine hypotheses (as stated in Section 3.1) have been supported or rejected in E1, E2 and E3 (partially = part of the PIs have significantly supported the hypotheses)

Hypotheses     E1                   E2                   E3
Hypothesis1    Not supported        Not supported        Not supported
Hypothesis2    Not supported        Not supported        Not supported
Hypothesis3    Partially supported  Partially supported  Partially supported
Hypothesis4    Partially supported  Partially supported  Partially supported
Hypothesis5    Partially supported  Partially supported  Partially supported
Hypothesis6    Partially supported  Partially supported  Partially supported
Hypothesis7    Partially supported  Partially supported  Partially supported
Hypothesis8    Partially supported  Not supported        Partially supported
Hypothesis9    Partially supported  Partially supported  Partially supported

From Table 2 we can see that (1) the different complexity levels of the tasks and the different ages and image search experience of the users have very strong effects on
the PIs, which confirms the importance of the task and information environments stated in Information Foraging Theory; (2) the performing order of the tasks and systems does not affect the PIs, which implies that familiarity or fatigue with the task and the system does not make a difference to the subjects' scores on the indicators; (3) there is no significant difference between the testing systems in the three evaluations. This may be because the Task and Person indicators strongly impinge on the PIs. The following sections report how the different tasks (task environment) and different users – with different ages and image search experience (information environment) – affect the scores of the PIs.
4.1 Effects of the Task Environment
Task (PI4) strongly influences most PIs in all three evaluations. For E1T2, E1T3 and E1T4 in E1, subjective feelings decrease as task difficulty increases. This is the case for a number of PIs, e.g. (Figure 2) TaskGeneralFeeling (PI8), SystemSatisfaction (PI19), etc. However, the E1T1 subjective feelings are relatively low even though the task is easier. This may have been because the image examples used in the task were difficult to interpret, therefore making the task more difficult than intended.
(a) E1 TaskGeneralFeeling
(b) E1 SystemSatisfaction
Fig. 2. E1: examples of effects of Task on performance indicators (8-33)
(a) E2 NextAction
(b) E2 ResultSatisfaction
Fig. 3. E2: examples of effects of Task on performance indicators (8-33)
(a) E3 ResultSatisfaction
(b) E3 MatchedInitialIdea
Fig. 4. E3: examples of effects of Task on performance indicators (8-33)
For T1, T2 and T3 in E2, there is a decrease in subjective feelings as task difficulty increases, e.g. (Figure 3) NextAction (PI11), ResultSatisfaction (PI12), etc. However, subjective feelings were relatively high for T4 even though it was the hardest task. This may be because subjects tended to give an over-generous definition of what images were relevant to the solution, therefore making the task easier for themselves. This was reflected in the low precision scores for this task. One interesting observation from the analysis is that the trend in subjective feelings for T1, T2 and T3 in E3 becomes more negative as the task becomes harder, e.g. (Figure 4) ResultSatisfaction (PI12), MatchedInitialIdea (PI14), etc. As in E2, the subjects are relatively more positive about T4 because they had a generous definition of what images were relevant to the solution.
4.2 Effects of the Information Environment
Table 3 shows how the different users’ information environments - different age and image search experience - relate to the scores of the PIs. Table 3. The relationship between the scores of the main performance indicators and the information environment in E1, E2 and E3 Relationship Correlate with age
E1 E2 PI117, PI18, PI19, PI10, PI11, PI31, PI20 PI32, PI29, PI30
Inversely correlate with age Correlate with age for PI12, PI14 experienced users Inversely correlate with age for experienced users Correlate with age for inexperienced users Inversely correlate with age PI12, PI14 for inexperienced users Higher for experienced users Higher for inexperienced users
PI18, PI19, PI21, PI23, PI33
E3 PI10, PI16, PI18, PI19, PI20, PI24, PI33, PI8 PI21 PI28, PI29, PI30, PI31 PI28, PI29, PI30, PI31 PI21
PI18, PI19, PI21, PI23, PI33 PI22, PI25, PI28, PI31 PI23, PI24, PI26 PI19, PI20
For E1 and E2 we can see that:
– Some PIs tend to correlate with age – i.e. more positive feelings toward the task or system with age.
– Some PIs tend to correlate with age only for experienced users and inversely correlate for inexperienced users – i.e. increasingly positive feelings for experienced users as they get older, and decreasingly positive feelings for inexperienced users as they get older.
– Some PIs tend to be higher for experienced users.
For E3 there are some exceptions, in which some factors (PI28, PI29, PI30 and PI31) correlate with age for inexperienced users, with inexperienced users also having higher scores for PI19 and PI20. This can be inferred from PI20, PI28, PI29, PI30 and PI31 all being related to user perception of the negative functions, i.e. inexperienced users can adapt to the new negative functions more easily than experienced users, and inexperienced users have an increasingly positive perception of the new functions as they get older. In summary, the lessons learnt from the data can be mapped back to the task and information environments, along the lines of the following:
– Task difficulty can affect a range of measures, and the difficulty of the task might differ from expectations depending on how users interpret the materials and instructions.
– Age can affect perception of the task and system, with older subjects perhaps more likely to have a positive perception.
– Experience can affect perception of the task and system, with experienced subjects more likely to have positive feelings for the specific functionalities of the system, and inexperienced subjects likely to have positive feelings for the entire system.
– Age may interact with experience in certain ways depending on the subjects' perception of the functionalities and the search process in general.
These findings have implications for how CBIR evaluations are designed and analysed: the choice of PIs, the selection of tasks and the selection of subjects, etc.
5 Conclusions and Future Work
The quantitative data analysis results of E1, E2 and E3 show that the different tasks and different users have stronger effects on the performance indicators than the different systems. This finding reinforces the importance of the task and information environment concepts of Information Foraging Theory for interactive CBIR studies. A clear trend is found for the influence of the Task (PI4) indicator: the subjects tend to give higher scores to the performance indicators when they perform an easier task, although there are exceptions related to (1) image examples that are not intuitive, and (2) the way the subjects perform the tasks. However, the results of the three evaluations do not show a clear trend for how the Person (PI1) indicator affected the performance indicators. We have tested the effects
of the Age (PI2) and Image Search Experience (PI3) of the subjects, but found varied results across the three evaluations. Therefore, we realize that the simple user classification based on Age (PI2) and Image Search Experience (PI3) is not sufficient, and we will need to investigate in depth how to better classify user types and how the different user types affect the users' search preferences.
Acknowledgments This work was partially supported by AutoAdapt project funded by the UK’s Engineering and Physical Sciences Research Council, grant number EP/F035705/1.
References
1. Campbell, I.: Interactive evaluation of the ostensive model using a new test collection of images with multiple relevance assessments. Journal of Information Retrieval 2(1) (2000)
2. Järvelin, K.: Explaining user performance in information retrieval: Challenges to IR evaluation. In: Azzopardi, L., Kazai, G., Robertson, S., Rüger, S., Shokouhi, M., Song, D., Yilmaz, E. (eds.) ICTIR 2009. LNCS, vol. 5766, pp. 289–296. Springer, Heidelberg (2009)
3. Liu, H., Uren, V., Song, D., Rüger, S.: A four-factor user interaction model for content-based image retrieval. In: Proceedings of the 2nd International Conference on the Theory of Information Retrieval, ICTIR (2009)
4. Liu, H., Zagorac, S., Uren, V., Song, D., Rüger, S.: Enabling effective user interactions in content-based image retrieval. In: Proceedings of the Fifth Asia Information Retrieval Symposium, AIRS (2009)
5. Pirolli, P.: Information Foraging Theory: Adaptive Interaction with Information. Oxford University Press, Inc., Oxford (2007)
6. Pirolli, P., Card, S.K.: Information foraging. Psychological Review 106, 643–675 (1999)
7. Ruthven, I., Lalmas, M., van Rijsbergen, K.: Incorporating user search behaviour into relevance feedback. Journal of the American Society for Information Science and Technology 54(6), 528–548 (2003)
8. Spink, A., Greisdorf, H., Bateman, J.: From highly relevant to not relevant: examining different regions of relevance. Information Processing and Management 34(5), 599–621 (1998)
9. Urban, J., Jose, J.M.: Evaluating a workspace's usefulness for image retrieval. Multimedia Systems Journal 12(4-5), 355–373 (2006)
10. Urban, J., Jose, J.M., van Rijsbergen, K.: An adaptive technique for content-based image retrieval. Multimedia Tools and Applications 31, 1–28 (2006)
Generalized Zigzag Scanning Algorithm for Non-square Blocks

Jian-Jiun Ding, Pao-Yen Lin, and Hsin-Hui Chen

Graduate Institute of Communication Engineering, National Taiwan University, 10617, Taipei, Taiwan
[email protected], [email protected], [email protected]
Abstract. In the shape-adaptive compression algorithm, an image is divided into irregular regions instead of 8×8 blocks to achieve better compression efficiency. However, in this case, it is improper to use the conventional zigzag of JPEG to scan the DCT coefficients. For example, for a region whose height is M and width is N, if N is much larger than M, then the DCT coefficient in the location (0, b) is usually larger than that in (b−1, 0). However, when using the conventional zigzag method, (b−1, 0) is scanned before (0, b). In this paper, we propose new algorithms to perform coefficient scanning for irregular regions. The proposed methods are easy to implement and can obviously improve the coding efficiency of the shape-adaptive compression algorithm. Keywords: image coding, zigzag, object oriented methods, image compression, adaptive coding.
1 Introduction

Instead of segmenting an image into 8×8 blocks, the shape adaptive coding scheme divides an image into several irregular-shaped blocks and encodes the DC and AC coefficients of each block. The shape adaptive coding scheme plays an important role in the MPEG-4 standard [1], [2]. Compared with the conventional image and video coding standards, one of the features of the MPEG-4 standard is its ability to encode arbitrarily shaped video objects in order to support content-based functionalities. Then, together with the shape-adaptive DCT (SA-DCT) [3], [4], the coding efficiency can be much improved and the object can be processed in an efficient way. In addition to MPEG-4, there are many other shape adaptive image compression techniques. For example, in 2000, Yamane, Morikawa, Nairai, and Tsuruhara proposed an image coding method called the skew coordinates DCT (SC-DCT) [5], [6]. They divided an image into rectangle, triangle, and rhombus blocks and compressed the image. In 2008, Zeng and Fu proposed the directional discrete cosine transform [7]. That is, the direction of the processed image block is first predicted by using the prediction techniques in H.264 [8]. After prediction, the 1-D DCT is performed along the direction according to the predefined order.
Furthermore, the two-dimensional DCT expansion in triangular and trapezoid regions was proposed by Pei, Ding, and Lee in 2009 [9]. They found that a subset of the DCT basis can be complete and orthogonal in triangular and trapezoid regions. Thus, an image can be divided into triangular and trapezoid blocks instead of 8×8 square blocks. With the new method, an image can be encoded with a very low bit rate.
(Figure 1 shows the DCT-based encoder: the source image is divided into 8×8 blocks, which pass through the FDCT, the quantizer and the entropy encoder to produce the compressed image data; table specifications feed the quantizer and the entropy encoder.)
Fig. 1. DCT-based encoder diagram
Fig. 2. Conventional zigzag scanning order for 8×8 blocks
In the encoding phase of the JPEG standard [10], an image is divided into 8×8 blocks. As depicted in Fig. 1, each block is transformed by the discrete cosine transform (DCT) into a set of 64 DCT coefficients. The DCT coefficient at the top left corner is the DC coefficient and the others are AC coefficients. All the DC and AC coefficients are then quantized by a quantization table. After quantizing, the DC term is encoded by differential coding. The other 63 quantized AC coefficients are converted into a one-dimensional sequence using zigzag scanning, as shown in Fig. 2.
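For reference, the conventional zigzag order of Fig. 2 can be generated by visiting the anti-diagonals p + q in turn and alternating the traversal direction; a short sketch:

```python
# Generate the conventional JPEG zigzag order for an 8x8 block: entries on
# each anti-diagonal d = p + q are visited in alternating directions.

def zigzag_order(size=8):
    entries = [(p, q) for p in range(size) for q in range(size)]
    # On odd diagonals p increases (moving down-left), on even diagonals
    # p decreases (moving up-right), which reproduces the order of Fig. 2.
    return sorted(entries, key=lambda pq: (pq[0] + pq[1],
                                           pq[0] if (pq[0] + pq[1]) % 2 else -pq[0]))

print(zigzag_order()[:10])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]
```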
In Fig. 2, the zigzag scanning order closely follows the rule that an AC coefficient with higher energy should be scanned before one with lower energy. Since in most conditions the AC coefficients in the high-frequency region are equal to zero after quantization, the zigzag scanning in Fig. 2 groups the zero-valued AC coefficients together. Then, with the technique of zero-run-length coding, the AC coefficients can be encoded in an efficient way. However, for shape adaptive image coding algorithms, an image is no longer divided into 8×8 blocks. In this case, the zigzag algorithm in Fig. 2 is not suitable for image coding. For example, suppose that there is a triangular region as in Fig. 3 whose width is 15 and height is 8. Since its width is much longer than its height, the AC coefficient at the location (0, 6) is usually larger than that at the location (5, 0). However, when using the conventional zigzag, (5, 0) is scanned before (0, 6), which does not match the rule that the AC coefficient with higher energy should be scanned before that with lower energy. Therefore, we should design new zigzag scanning algorithms for shape adaptive image coding schemes.
Fig. 3. A triangular block with width 15 and height 8
In Sections 2 and 3, we propose two rules for the new zigzag scanning algorithms for non-square regions. First, we think that when determining the scanning order, the height and the width of a region should be considered. Second, as in the original zigzag, the next AC coefficient to be scanned should be as close as possible to the previously scanned AC coefficient. With the two rules, in Section 4, we propose a scoring method that can be used for determining the zigzag scanning order. In Section 5, we perform some simulations. The simulation results show that, with the proposed zigzag algorithm, the number of bytes of the compressed image can be reduced by 6% to 12%.
2 First Rule for Scanning – Normalized by Size

Suppose that there is an irregular-shaped block (not constrained to rectangular blocks) whose height is M and width is N. We propose the following formula to determine the scanning order:
Score1 = p/M^k + q/N^k,    (1)
where p and q denote the indices of the vertical and horizontal axes, respectively, and k is a pre-defined constant. The scanning order is determined by the value of Score1. The entry with the smaller value of Score1 has higher priority to be scanned, and vice versa. Note that when k = 0 the result is the same as the conventional zigzag scanning order. When k ≠ 0, we can adjust the scanning order according to the height M and the width N of the irregular-shaped block. In Fig. 4, we show the scanning order determined by (1) for the triangular block in Fig. 3. Here, we choose k = 0.5. Then
Score1 = p/√M + q/√N.    (2)
From (2), the coefficients near the longer side are likely to be scanned earlier, as in Fig. 4. Note that when using the original scanning method, the entry (5, 0) is scanned before the entry (0, 6). However, when using the method in (2), the entry (0, 6) is scanned before (5, 0).
Fig. 4. Zigzag Scanning order determined by (2) for the triangular block in Fig. 3. The number in each entry indicates the order of scanning.
The key reason why the scanning order is defined as in (1) is that, when the width N is much larger than the height M, the coefficient at the location (0, a+d) is usually much larger than the coefficient at the location (a, 0) when d is small. However, when using the conventional zigzag scanning method, the entry (a, 0) is scanned before the entry (0, a+d). This order is not reasonable. Therefore, it is proper to adjust the scanning order according to the size of the block, as in (1) and (2).
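A small sketch of this first rule follows: the entries of an M×N block (or of an irregular region inside it) are simply sorted by Score1 of Eq. (1). The triangle membership test and the tie-breaking by generation order are illustrative assumptions, not taken from the paper:

```python
# Sketch of the first rule (Eq. (1)): sort the entries of an M x N block by
# Score1 = p/M^k + q/N^k; smaller scores are scanned first.

def score1_order(M, N, k=0.5, inside=None):
    """`inside(p, q)` optionally restricts the scan to an irregular-shaped
    region; by default the whole M x N rectangle is scanned. Ties are broken
    by generation order (an assumption; the paper does not specify this)."""
    entries = [(p, q) for p in range(M) for q in range(N)
               if inside is None or inside(p, q)]
    return sorted(entries, key=lambda pq: pq[0] / M**k + pq[1] / N**k)

# A triangular block of height 8 and width 15 (one possible membership test,
# roughly matching Fig. 3): row p keeps the first 15 - 2*p entries.
order = score1_order(8, 15, k=0.5, inside=lambda p, q: q < 15 - 2 * p)
print(order[:12])   # note that (0, 6) now precedes (5, 0), as argued above
```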
Fig. 5. Energy concentration (partial energy sum P[j] versus j) of the modified zigzag scanning scheme (red line) and the conventional zigzag scanning scheme (blue line)

Table 1. Data sizes of the AC coefficients (after quantizing and encoding) in irregular-shaped image blocks when using the conventional and our proposed zigzag scanning methods
Block    Size                          Conventional zigzag (k=0)   Proposed zigzag (k=0.5)   Proposed zigzag (k=1)   Improved ratio
Block 1  M=103, N=127, 7213 pixels     536 bytes                   497 bytes                 509 bytes               7.85%
Block 2  M=86, N=142, 7662 pixels      588 bytes                   561 bytes                 527 bytes               11.57%
Block 3  M=131, N=98, 7853 pixels      561 bytes                   544 bytes                 550 bytes               3.13%
Block 4  M=128, N=141, 8369 pixels     788 bytes                   724 bytes                 732 bytes               8.84%
Block 5  M=151, N=137, 9204 pixels     857 bytes                   853 bytes                 824 bytes               4.00%
A comparison of the coefficient scanning method proposed in this section and the conventional zigzag scanning method is given in Fig. 5 and Table 1. First, we segment the image into triangular blocks, trapezoid blocks, rectangular blocks, and blocks with irregular shapes, as in the work in [7], [8], [9]. Then, the image blocks are transformed using the SA-DCT [5], [6] and the 2-D DCT for triangular and trapezoid regions [9]. In Fig. 5, we show the normalized partial sums of the energies of the largest DCT coefficients after the zigzag scanning of the conventional and proposed scanning methods.
Generalized Zigzag Scanning Algorithm for Non-square Blocks
257
j
∑ s [i] 2
partial energy sum P [ j ] =
i =1
total energy
,
(3)
where s[j] is the re-order of the DCT coefficients C[p, q] according to the conventional zigzag scanning order or the proposed scanning order (Here we use k = 0.5). The test image is the trapezoid region in the background of Lena image. From Fig. 5, we can see that the energy concentration of our proposed method performs better than the conventional zigzag scanning method. The proposed method has a trend of scanning the coefficients with larger energy in the prior order. For the backward encoding phase, the proposed method can reduce the data size significantly. As mentioned, the coefficients with larger energy are scanned earlier in the proposed method. That is, most zero coefficients are arranged at the back of the scanned sequence. This is very helpful for the zero run length coding. Table 1 gives a comparison of the encoded data size when using the conventional zigzag scanning method and our proposed zigzag scanning method. The five image blocks used in Table 1 are extracted from Lena, House, Pepper, Beach, and Fruits images. These image blocks are first transformed by the SADCT [5], [6] before zigzag scanning. By choosing different k in (1), we have different scanning order and resulting to different encoded data size. Grouping the zeros together will reduce the data size. From the results in Table 1, we can see that our proposed method is always superior to the conventional zigzag scanning method.
3 Second Rule for Scanning – Neighboring Coefficient Although from Table 1, using the method in Section 2 (i.e., adjust the scanning order according to the height and the width of the block) can indeed improve the coding efficiency, however there is some drawback. Note that, in Fig. 4, the next entry to be scanned may not be adjacent to the present entry. In Fig. 4, the 10th scanned entry is at the location (0, 4), but the 11th scanned entry is at the location (3, 0). The distance of these two entries is large. However, the DCT coefficients in the adjacent entries usually have high correlation. If the DCT coefficient at the location (p, q) (denoted by C[p, q]) is zero after quantization, then the probability that C[p±d1, q±d2] is zero after quantization is very high when |d1 + d2| is small. Therefore, it is better to choose one of the neighboring entries as the next scanned entry as possible. According to the requirement, we change the Score1 in (1) into Score2 = Score1 + s ( p − pc + q − qc
=
)
p q + + s ( p − pc + q − qc ) , M k Nk
(4)
258
J.-J. Ding, P.-Y. Lin, and H.-H. Chen
(a) Score1 0
(b) s(|p pc| + |q qc|)
0.447 0.894 1.342 1.789
0.5 0.947 1.394 1.842 2.289 1
0.6
0.8
1
0.2
0.4
0.6
0.8
1.447 1.894 2.342 2.789
0.2
0.4
0.6
0.8
1
1.5 1.947 2.394 2.842 3.289
0.4
0.6
0.8
1
1.2
(c) Score2 = (a) + (b)
(d)
1.494 2.142 2.789
1.494 2.142 2.789
1.147 1.794 2.442 3.089
1.147 1.794 2.442 3.089
1.2 1.847 2.494 3.142 3.789
1.2 1.847 2.494 3.142 3.789
1.9 2.547 3.194 3.842 4.489
1.9 2.547 3.194 3.842 4.489
Fig. 6. A simple example that uses Score1 in (1) and Score2 in (4) to determine the scanning order of a 4×5 block (k = 0.5, s = 0.2)
where the definitions of p, q, M, N, k are the same as those in (1), (pc, qc) is the coordinate of the current entry, and s is some constant that reflects the importance of location difference. We use (4) to determine the scanning order and the entry with lower value of Score2 is scanned before that with higher value of Score2. From (4), the entry that is near to the location of the current entry will have higher probability to be chosen as the next scanned entry. We give a simple example of a 4×5 rectangular block in Fig. 6. In Fig. 6(a), we calculate the value of Score1 defined in (1) for each entry. If we determine the scanning order according to Score1, after the entry at the location (1, 0) (Score1 = 0.5) is scanned, the next entry to be scanned is at the location (0, 2) (Score1 = 0.894), since its value of Score1 is smaller than those of the remained entry. However, note that the value of Score1 at the location (1, 1) is 0.947, which is only a little larger than 0.894. Moreover, the entry (1, 1) is adjacent to the current entry (1, 0), but the Hamming distance between (0, 2) and (1, 0) is 3. Therefore, it is unreasonable to choose (0, 2) as the next scanned entry. In Fig. 6(b), we calculate the value of s(|p −pc| + |q − qc|). Here, we choose s = 0.2 and the coordinate of the current entry is (pc, qc) = (1, 0). The summation of Fig. 6(a) and Fig. 6(b) is Score2 defined in (4). Its value is shown in Fig. 6(c). Note that the value of Score2 at (1, 1) is 1.147 and the value of Score2 at (0, 2) is 1.494. Since the entry (1, 1) has lower value of Score2 than other remained entries, we choose (1, 1) as the next scanned entry, as in Fig. 6(d). Therefore, if we use Score2 in (4) to determine the scanning order, the next scanned entry is always nearer to the current entry. In other words, the next AC coefficient will have higher correlation with the current AC coefficient. It is helpful for improving the coding efficiency.
Generalized Zigzag Scanning Algorithm for Non-square Blocks
259
Fig. 7. Zigzag Scanning order determined by (4) for the triangular block in Fig. 3 (k = 0.5, s = 0.2). The number in each entry indicates the order of scanning.
Fig. 8. Using the proposed method in (4) to determine the scanning order of a 9 ×13 rectangular block (k = 0.5, s = 0.2)
In Figs. 7 and 8, we give another two examples that use Score2 in (4) to determine the scanning orders of the triangular block in Fig. 3 and a 9×13 rectangular block. Do to the term p/Mk + q/Nk in (4), the scanning order can be adjusted according to the size of the block. Furthermore, to due the term s(|p −pc| + |q − qc|), the next entry to be scanned is always near to the current entry. Note that, in Fig. 4, after the entry at the
260
J.-J. Ding, P.-Y. Lin, and H.-H. Chen
location (5, 1) is the 30th entry to be scanned and the entry (0, 8) is the 31st scanned entry. Their Hamming distance is 12. By contrast, in Fig. 7, after the entry (5, 1) is scanned, the next entry to be scanned is at (5, 2) and their Hamming distance is 1. Therefore, the method in (4) has higher ability to gather the zero-value quantized AC coefficients. Then, with the technique of zero-run length coding, the compression efficiency can be improved.
4 Simulations In this section, we do some simulations for the performance comparison of the conventional zigzag scanning method and the proposed coefficient scanning methods, including (1) and (4). Then we apply the DCT transform expansion to each block. Then we use the conventional or the proposed zigzag algorithms to scan the DCT coefficients. In Tables 2 and 3, we show the simulation results when using the conventional zigzag scanning method and the proposed scanning methods to scan and compress the images. We use a set of standard tested images, including Lena, Cameraman, Pepper, and Barbara. The tested images are first uniformly segmented into non-square blocks. In Table 2, the images are divided into trapezoid blocks and in Table 3 the images are divided into triangular blocks. We use the same transform (SADCT in [5], [6]), the same quantization method, and the same encoding algorithm (Huffman codes) for all cases. However, the coefficient scanning algorithm is varied. The performances are evaluated by the number of byte of the compressed image. From the results in Tables 2 and 3, we can find that the proposed methods always have better coding performance than the conventional zigzag scanning method. Further, when using the method in (4) (i.e., considering the difference of entry locations), the performance is even better than that of the method in (1). Especially, when we divide an image into trapezoid blocks, from Table, 2, we can obviously reduce the number of bytes of the compressed image by 6% to 12%. Table 2. Number of bytes of the compressed images when using the conventional zigzag algorithm and the proposed coefficient scanning methods. The images are divided into trapezoid blocks.
Image name
Conventional zigzag method
Proposed method, using Score1 in (1)
Proposed method, using Score2 in (4)
Lena
15629 bytes
14370 bytes
14182 bytes
Cameraman
14226 bytes
13228 bytes
13424 bytes
Pepper
17104 bytes
16010 bytes
15915 bytes
Barbara
64649 bytes
58096 bytes
56972 bytes
Generalized Zigzag Scanning Algorithm for Non-square Blocks
261
Table 3. Number of bytes of the compressed images when using the conventional zigzag algorithm and the proposed coefficient scanning methods. The images are divided into triangular blocks.
Conventional zigzag method
Proposed method, using Score1 in (1)
Proposed method, using Score2 in (4)
Lena
15867 bytes
15599 bytes
15376 bytes
Cameraman
13683 bytes
13648 bytes
13510 bytes
Pepper
16521 bytes
16390 bytes
16133 bytes
Barbara
65317 bytes
63547 bytes
62548 bytes
Image name
5 Conclusions In this paper, we propose two coefficient scanning schemes for the irregular-shaped transformation coefficient block. From the simulation results, it is obviously that, when using the proposed method, the energy concentration and the coding efficiency are better than those of the case where the conventional zigzag scanning method is used. It is due to that, for a non-square block with height M and width N, the AC coefficient at the location (m−d, n+d) usually has larger value than the AC coefficient at (m, n) when d > 0. Therefore, it is reasonable that the coefficients near the longer side have higher priority to be scanned. The proposed coefficient scanning methods will be helpful for improving the coding efficiency of the shape-adaptive image and video compression algorithm, which plays an important role in H.264 and MPEG-4.
References 1. Sikora, T.: MPEG-4 Video Standard Verification Model. IEEE Trans. Circuits Syst. Video Technol. 7, 19–31 (1997) 2. Sikora, T.: MPEG-4 Very Low Bit Rate Video. In: IEEE International Symposium on Circuits and Systems, vol. 2, pp. 1440–1443 (1997) 3. Sikora, T., Makai, B.: Shape-Adaptive DCT for Generic Coding of Video. IEEE Trans. CSVT. 5, 59–62 (1995) 4. Kauff, P., Schüür, K.: Shape-Adaptive DCT with Block-Based DC Separation and a DC Correction. IEEE Trans. Circuits Syst. Video Technol. 8, 237–242 (1998) 5. Yamane, N., Morikawa, Y., Nairai, T., Tsuruhara, A.: An Image Coding Method Using DCT in Skew Coordinates. Electron. Commun. Japan 83, 53–62 (2000)
262
J.-J. Ding, P.-Y. Lin, and H.-H. Chen
6. Yamane, N., Morikawa, Y., Nairai, T.: Performance Improving in Skew-Coordinates DCT Method for Images by Entropy Coding Based on Gaussian Mixture Distributed Model. Electron. Commun. Japan 84, 37–44 (2001) 7. Zeng, B., Fu, J.: Directional Discrete Cosine Transforms—A New Framework for Image Coding. IEEE Trans. Circuits Syst. Video Technol. 18, 305–313 (2008) 8. Sullivan, G., Wiegand, T.: Video Compression—From Concepts to the H.264/ AVC Standard. Proceedings of the IEEE, Special Issue on Advances in Video Coding and Delivery (2004) 9. Pei, S.C., Ding, J.J., Lee, T.H.H.: Two-Dimensional Orthogonal DCT Expansion in Triangular and Trapezoid Regions. In: VCIP (2010) 10. Wallace, G.K.: The JPEG Still Picture Compression Standard. IEEE Trans. Consumer Electronics 38, 18–34 (1992)
The Interaction Ontology Model Supporting the Virtual Director Orchestrating Real-Time Group Interaction Rene Kaiser, Claudia Wagner, Martin Hoeffernig, and Harald Mayer Institute of Information and Communication Technologies Joanneum Research Graz, Austria {firstname.lastname}@joanneum.at
Abstract. In a system that enables real-time communication between groups of people via audio/video streams, a component called orchestration intelligently selects appropriate camera views for each participant individually, enabling larger setups and enhancing the social interaction itself. The Interaction Ontology (iO) receives low-level cue input from the audiovisual analysis component and informs the camera view switching component about the social interaction on a higher semantic level. The iO is a software component consisting of both a static ontology model and dynamic event processing logic. In this paper, we elaborate on the design rationale of the model and the intended dynamic behaviour of low-level cue processing. Finally we discuss performance and scalability issues, as well as alternative approaches to low-level event processing in such environments.
1
Introduction and Related Work
The TA2 system1 aims to enhance communication between distant people with a common social background by combining a number of technological innovations. We do not aim to support communication within business meetings (cf. [1,2]) but within casual meetings of spatially and temporally distant friends and family members. In this document we focus only on real-time communication [3] not on recorded audiovisual (AV) content. Synchronous communication scenarios explored in TA2 can be seen as next generation video-conferencing systems, characterized by high-resolution cameras and displays, seamlessly synchronized with low-delay transmission of spatial audio (cf. [4]) and the intelligent selection and presentation of content from other locations. Regarding the latter component, the TA2 system will act like a virtual film director using knowledge about the current situation to orchestrate live interaction on the media level. The multi-location communication will be integrated with other applications such as games with which it will have to share the input and output devices (e.g. split screens). The system aims to stimulate conversations between people, but not 1
http://www.ta2-project.eu
K.-T. Lee et al. (Eds.): MMM 2011, Part II, LNCS 6524, pp. 263–273, 2011. c Springer-Verlag Berlin Heidelberg 2011
264
R. Kaiser et al.
influence or direct humans beyond that, although, as with any mediated communication, the medium itself will influence the conversation. The component responsible for this task is referred to as the Orchestration Engine (OE). In smaller setups the OE’s added-value may be limited, however, its necessity and the benefits of intelligent camera view selection becomes more visible in larger group communication setups with multiple participating locations. Real-time orchestration is a multifaceted research topic. It relies on the effective capturing, representation, processing and interpretation of knowledge. Various applications require intelligent manipulation of AV content in realtime. Examples are systems for educational content personalization, enterprise knowledge management, and assistance in health care. Novel personalized entertainment formats should benefit from advances in such technology. The term interaction is a rather ambiguous one. In this context, we refer to the interaction between people in distant locations and aim to capture the knowledge necessary to enhance (technological goal) and encourage (social goal) their communication. The ultimate ambition is to achieve a level of togetherness as close to face-to-face communication as possible [5,6] but also to explore novel ways of communication and their implications. The media industry is both challenged and stimulated by the emergence of Web based technologies. Propelled by the development of a broadband infrastructure new distribution and consumption models appear. The Interaction Ontology (iO) is an essential part of the OE and serves as its knowledge base. Note that the term iO denotes both the static ontology model and the software that is wrapping it and handles event processing and communication with other components. We refer to the OE’s other subcomponent that takes the actual decisions as the Orchestrator. The iO’s purpose is to process the stream of low-level cues (events) from AV analysis and to inform the Orchestrator about what is happening in the current scene on a higher semantic level. In other words, it continuously builds up a knowledge base (current and past events) and performs knowledge inference to bridge the semantic gap between low level input and higher level assertions needed for decision making. As an example, the AV analysis component could detect whenever participants talk or gesticulate. In a series of such cues, the iO could identify the pattern of a situation where one person is explaining something to the other participants, and inform the Orchestrator so that the video/audio stream selection could be adjusted to enhance the communication immensely. Several related work exists that uses ontology models resembling a statemachine to define events on different abstraction levels. Different yet similar approaches for inferring higher-level concepts based on low-level input (transitions) are implemented. Hakeem and Shah [7] developed a meeting ontology to detect specific human behavior in meeting videos. They represent three different levels of concepts: low-level events (e.g. hand or head movements), mid-level events (e.g. hand raised) and human behavior (voting) as the highest-level events. They use this ontology within a rule-based system to represent state machines. The SNAP (situations, needs, actions, products) ontology [8] is another example for an ontology that defines different types of domain concepts and relations
The Interaction Ontology Model
265
between them. The goal is to enable a recommender system that computes suitable products using an Enhanced Semantic Network. Like in our work, OWL reasoning itself was not expressive enough to derive the desired recommendations. Sebastian et al. [9] define state-machine-based collaborative workflows by modeling states and activities on different abstraction levels. The workflow ontology comprises concepts such as State, Activity and Workflow. Comparable to our approach, state transitions are triggered by events. Using a different but interesting approach, Khan et al. [10] model processes as UML Activity Models and then map it to an ontology using XSLT. It logically resembles a state machine, similar to our model. Already before our project, it was clear that even though an ontology based approach for such a problem has certain benefits and prospects, the runtime performance and scalability potential will be worse compared to other real-time event processing technologies. Still, our goal was to evaluate this hypothesis and draw conclusions to setups larger than we actually investigated. In this paper, we concentrate on the Ontology Engineering aspect of our work, and dedicate less space to dynamic processing. Nevertheless, basic dynamic performance conclusions will be discussed in Section 6. But first, Section 2 will describe the real-time scenarios investigated in TA2. While Section 3 illustrates the architecture and Section 4 outlines the information input, Section 5 will present the actual ontology model.
2
Real-Time Scenarios
TA2’s blue sky thinking goal regarding real-time (also synchronous or live) communication is to have participants losing awareness that the people they are taking to are not in the same room, diminishing the sense of distance. This effect, often referred to as telepresence, is sketched in Figure 1. TA2 is approaching this challenge from a bottom-up approach: a number of interaction scenarios (concept demonstrators) drive the technological developments.
Fig. 1. The TA2 vision of group-to-group communication
266
R. Kaiser et al.
Live orchestration implies several intrinsic technical challenges. In contrast to asynchronous scenarios where knowledge about recorded content can be acquired before processing it, real-time situations only allow to analyze the past. However, we regard the immediate past (in our case events less than roughly 250 milliseconds old) as the present, the current situation. Here is an example issue: in a purely responsive mode (i.e. without prediction), obviously, it is impossible to deterministically zoom to a person before she/he starts talking. Yet, in some cases people might expect a speaker to be shown slightly before the person starts talking, because the experience is intuitively compared to movie entertainment where content is usually very well directed. Predicting what is likely to happen next (e.g. a person answering a question) naturally doesn’t always work. Further, the natural constraints within real-time setups impede the usage of computationally expensive methods. The OE makes use of the events that happen in the space it directs, and is fully dependent on components providing or extracting that information. TA2’s scenarios integrate the social communication with an interchangeable application, for example a game. The Game Engine is another source of information, which can e.g. provide who’s next in turn-based game situations. Details about the scenarios in our research scope are provided on the TA2 website.
3
Black Box View
Figure 2 illustrates the dynamic information flow within the iO, highlighting steps where dynamic knowledge is represented as individuals of the ontology. For processing, incoming low-level information bits are inserted into the ontology as actions. This first triggers updates to person-centered states, which in turn triggers higher-level states detection. As standard ontology reasoning is not expressive enough to phrase the required transformation logic, more sophisticated steps are based on Jena rules2 which comprise built-ins where SPARQL3 queries are executed.
Fig. 2. Dynamic information processing
States are the highest level concepts represented in the iO. State changes are evident candidates for events to be sent to the Orchestrator as relevant 2 3
http://jena.sourceforge.net/inference/ http://www.w3.org/TR/rdf-sparql-query/
The Interaction Ontology Model
267
information. The Orchestrator’s output in turn are commands and playlists that state how (transmitted) AV content is displayed. Playlists in SMIL format [11] are sent to instances of a Video Composition Engine (VCE) which is responsible for screen composition and renders the desired layout.
4
Input Primitives
Reiterating, the iO’s input consists of a static setup configuration and dynamic audio/video analysis cues. A significant amount of knowledge about the dynamic situation is necessary for the Orchestrator to take quality decisions; the metadata’s accuracy is also essential. Scarce cues diminish the added value of the TA2 system and force the OE to fall back on default behavior. Cues are equipped with confidence values and it is up to the iO to deal with uncertain information. The information available to the OE is dictated by what other components are able to gauge. Scalability issues aside, the more information available to the component, the better it performs. Increased cue diversity provides more scope for knowledge inference. By combining different information sources, more complex concepts can be detected. However, real-time performance constraints might prevent employing automatic reasoning techniques to the full possible extent. 4.1
Static Setup Configuration
A number of setup properties will be configured at start-up time of the TA2 system. In the current development iteration, the number of participating locations and participants is fixed. In future iterations, the knowledgebase will be synced dynamically with the Presence Server, allowing participants to join/leave dynamically. The setup configuration are individuals inserted into the ontology at startup time. These facts are unlikely to change over the course of an interaction session. Examples properties are: location definition, number and positions of cameras and screens, information about and ID of the participants, and the TA2 application currently active. 4.2
Audiovisual Analysis Cues
The pivotal input for the iO are cues extracted by AV analysis. They describe the content captured by HD cameras and microphone arrays. Cues both available and planned are listed in Table 1. At the current stage, the person number changed cue is interpreted as a notification that a certain person’s face is currently visible or not to the frontal camera. A person not visible has not necessarily left the room, but is unlikely talking to a person in another location as she/he doesn’t face the screen. The voice activity cue is the most important cue currently available. As the Orchestrator aims to identify the channels of communication to properly take camera selection decisions, the iO tries to infer conversation patterns as a basis. Based on the stream of low-level information events, conversation modes like monologues, dialogues or group discussions are detected.
268
R. Kaiser et al. Table 1. Available and planned cues
Cue
Description
Person number changed Person joins or leaves the scene. Voice activity Person starts or stops speaking. Keyword spotting Speaker-independent ASR; keywords may change dynamically. Laughter (planned) Detection of laughter. Applause (planned) Detection of applause noise. Excitement (planned) Detection of excitement level. Visual activity (planned) Level of visual activity, e.g. caused by person movement.
4.3
Game Engine State
The Game Engine maintains the TA2 application’s state and shares relevant updates with the iO via an API. As the Game Engine is designed as a generic framework for various synchronous scenarios, we identified a common denominator between those applications. The game will share when it is running, paused or stopped, or interrupted so that participants can watch and talk about a piece of media. Further cues state that a certain person is next to take an action or has emerged as the game winner. As an example, whenever the game is interrupted, such as for recounting about a shared piece of media such as a photo, it would apply screen layout templates that enhance the narration of the story behind the asset. Another example would be artificial image interference effects. Whenever the space ship in a game is hit by an asteroid, it could distort the players’ view, simulating technical malfunctions using flickering effects.
5
Interaction Ontology Specification
The iO’s knowledgebase is an OWL2 ontology developed in Prot´eg´e 44 . The state-machine-like model enables dynamic knowledge management. New triples are inserted, queried, or updated triggered by rules, which are stored outside the OWL model. By processing, the information is lifted to a higher semantic level. 5.1
Design Rationale
The design process was driven by a set of competency questions. Figure 3 illustrates the core concepts and their relationships. Two methods exist to describe a collection of values within an ontology: they can be described as partitions of classes (value partition) or as enumerations of individuals (enumeration class). Modeling types of person states, both approaches have been considered: 1. As value partition (union of disjoint subclasses of the PersonState class): Each possible person state (e.g. InLocationPersonState, VoiceActivePersonState) is a subclass of the PersonState class. An individual of the class Person 4
http://www.w3.org/TR/owl2-overview, http://protege.stanford.edu
The Interaction Ontology Model
269
Fig. 3. Core concepts of the Interaction Ontology. All but Keyword and AVCommunication are defined classes.
is related with an individual of a concrete subclass of the PersonState class by the hasState relation. The main advantage of this modeling approach is that we can specialize the PersonState class for specific types of person states. It is easy to model which action changes which type of person state. Therefore, we decided to model the PersonState class as value partition. 2. As individuals of the PersonStateType enumeration class: Each person state is an individual of the PersonStateType enumeration class. The PersonState class has hasStartTime, hasEndTime, isCurrentState and a hasType properties. The hasType property relates a PersonState individual to one concrete type of the PersonStateType class. An individual of the class Person is related to an individual of the PersonState class. Expressing that a certain kind of action (e.g. PersonStartTalkingAction) changes a certain type of person state is difficult in this model. If different person state types are represented as individuals of the class PersonStateType, all types must have the same restrictions and properties. If the set of person state types would be known and fixed this modeling approach would be appropriate. Similar thoughts have driven the definition of the Action classes. Types of actions can be either modelled as value partition (union of disjoint subclass of Action) or as individuals of the ActionType enumeration class. We decided to model the Action class as value partition, mainly because it allows us to model how actions change the states of entities. Individuals have exactly one hasExecuter, hasLocation, hasConfidence, hasExecutionTime, hasInitialState, hasTargteState and isProcessed property. 5.2
Core Concepts
Figure 3 visualizes core concepts in our model. The class AVCommunication describes the set of all AV communications. It is a subclass of the Situation pattern from the Ontology Design Patterns Web portal (cf. [12]). This pattern describes
270
R. Kaiser et al.
different entities of a situation or context and how they are tied together. Currently, the AVCommunication has one of the following states: – – – – –
SilentAVCommunicationState (also the initial state) MonologueAVCommunicationState (1 person talking long-lastingly) DialogueAVCommunicationState (between 2 persons) GroupAVCommunicationState (between 3 or more persons) UnknownAVCommunicationState (fallback)
The class Action describes the set of actions which can be detected within a place of an AVCommunication. An action always occurs in a certain Place and is executed by a Person. Examples for actions are VoiceActivityAction, PersonNumberChangedAction, PersonKeywordMentionAction, InterruptDialogueAction and StartMonologueAction. Each action is either an UnprocessedAction (current event) or a ProcessedAction (history). The class Person describes the set of persons. A person has exactly one location and one state per type, which is an individual of the CurrentState class and of one specific state type class. The most relevant distinctions declare if a person is talking or not, and is visible or not. The State class is the superclass for a number of states abstracted by the classes PersonState, GameState and AVCommunicationState. Each state individual is classified as either a CurrentState or PastState. The five classes just mentioned are modelled as defined classes. Each state has exactly one start time, end time and isCurrentState property. The complete subclass tree is illustrated in Figure 4. The class Place describes a set of places within an AVCommunication context using the Place ontology content design pattern. A place is defined as having something located in it (e.g. a person). A place is an approximate, relative location for an absolute, abstract location. 5.3
Applying Reasoning
Since Pellet25 supports most of the features proposed in OWL2 (e.g. datatype reasoning) we use it as a OWL-DL reasoner to perform the following tasks: Consistency checking verifies whether a given ontology contains contradictions or not. Therefore the consistency of individuals (A-Box) with respect the ontology schema (T-Box) is validated. The result of consistency checking is true or false. Concept satisfiability checks whether all classes are satisfiable or not. A class is satisfiable when it is possible for the class to have individuals. Unsatisfiable classes would lead to an inconsistent model. Classification is used to infer the class hierarchy of an ontology. Therefore subclass relations are computed. Realization is used to compute all classes that a given individual belongs to. The following discusses an example realization task. Assume there is a person currently not talking. The iO receives a voice activity cue from analysis stating that the person started talking, so the state for that person must be updated. It can be inferred that this person is part of the set of all persons who are currently talking. The former talking state individual is updated as well and remains in the iO as history, of type PastState. In detail, the following steps are required to update a certain state: 5
http://clarkparsia.com/pellet
The Interaction Ontology Model
271
Fig. 4. T-Box view of state subclasses in the Interaction Ontology
1. Create a new action individual and add it to the iO: The iO receives a cue and creates a new individual of class Action using Jena. In the above example, that’s an individual of class PersonStartTalkingAction. Additional information about the location, executor, and execution time are added to the individual using the specific properties (hasLocation, hasExecutor and hasExecutionTime). 2. Change the associated state based on the newly added action: The new action causes updates of associated states. A new individual of class State is created and tagged as new current state. Also, the former current state is marked as past state (updated end time) and the action which is responsible for this state change is marked as processed. Logical rules are applied for this task. In our example, a new individual of VoiceActivePersonState is created and denoted as current state using the isCurrentState property with value true. 3. Infer new knowledge based on state changes: Now it is possible to infer new knowledge based on the new state and updates of the initial and associated states. For this task, a DL reasoner is used. Regarding the example, the person is now associated with an individual of class VoiceActivePersonState. Since this individual is also a member of class CurrentState, the DL reasoner infers that the person is now a member of class PersonVoiceActive. Whenever a new individual is added to the ontology, the
272
R. Kaiser et al.
reasoner performs consistency checking. Any individual of class Person with 2 current disjoint states would lead to an inconsistent ontology.
6
Discussion
We have described our approach for low-level cue processing based on a central OWL knowledgebase. While we were well aware of the fact that the approach we decided to investigate is not the most efficient for the problem in a setup of this size, and of performance issues with ontology reasoning in gernal, we nevertheless aimed at investigating the applicability of an OWL based approach and its limitations. For an elaborate discussion of the dynamic behaviour of the system and comprehensive scalability conclusions the interested reader finds more details in [13]. Performing state updates via logical (Jena) rules and classification via OWL reasoning, we aim to detect higher-level, meaningful concepts on a real-time stream of low-level event cues. As early evaluation results indicate, the performance of the approach is just good enough for limited test setups of 2 locations, 2 participants each. Our test data consisted of about 150000 cues (only a fraction of type voice activity) resulted in roughly 3 detected dialogues per minute on average. However, this already challenges the upper limit of cues the iO is able to handle. More cues received would fill up the processing queue and ultimately lead to non-deterministic behavior. Additionally, the cue processing time is monotonically increasing with the size of the OWL model, especially the Jena rule execution and SPARQL queries (c.f. performance discussion in [14]). Even though we considered multiple potential threats, the significance of the ontology size surprised us. Eventually, we found that it won’t scale to size of setups we plan to implement. Alternative approaches will be investigated. In future work, a bunch of more cues will be available, enabling us to detect more than the limited set of person states and conversation types we’ve been working on so far. We will investigate if the combination of multiple information sources enables us to detect interesting and valuable high-level concepts, even if single cues may be unreliable. Options for lifting cues include using Machine Learning and rule-based approaches, and Complex Event Processing6 (CEP) frameworks. While analyzing only a short time window on the event stream with continuous queries should allow much better performance, some frameworks don’t support reasoning with probabilistic knowledge well. A comprehensive summary of further options is given by Lavee et al. [15].
Acknowledgements This work was performed within the Integrated Project TA2, Together Anytime, Together Anywhere. TA2 receives funding from the European Commission under the EU’s Seventh Framework Programme, grant agreement number 214793. The authors gratefully acknowledge the European Commission’s financial support and the productive collaboration with the other TA2 consortium partners. 6
http://www.complexevents.com
The Interaction Ontology Model
273
References 1. Blackburn, T., Nguyen, V., Swatman, P., Vernik, R.: Supporting creative processes in “any time and place” and initial empirical work. In: International Conference on Creativity and Innovation in Decision Making and Decision Support, CIDMDS (2006) 2. Nijholt, A., Akker op den, R., Heylen, D.: Meetings and meeting modeling in smart environments. AI Soc. 20(2), 202–220 (2006) 3. O’Conaill, B., Whittaker, S., Wilbur, S.: Conversations over video conferences: An evaluation of the spoken aspects of video-mediated communication. Human Computer Interaction 8, 389–428 (1993) 4. Abbas, S., Mosbah, M., Zemmari, A.: Itu-t recommendation g.114, one way transmission time. In: International Conference on Dynamics in Logistics 2007, LDIC 2007 (2007) 5. Kock, N.: The psychobiological model: Towards a new theory of computer-mediated communication based on darwinian evolution. Organization Science 15(3), 327–348 (2004) 6. Kock, N.: Designing e-collaboration technologies to facilitate compensatory adaptation. Inf. Sys. Manag. 25(1), 14–19 (2008) 7. Hakeem, A., Shah, M.: Ontology and taxonomy collaborated framework for meeting classification. In: International Conference on Pattern Recognition, vol. 4, pp. 219– 222 (2004) 8. Morgenstern, L., Riecken, D.: Snap: An action-based ontology for e-commerce reasoning. In: Proceedings of the 1st Workshop FOMI 2005 – Formal Ontologies Meet Industry (2005) 9. Sebastian, A., Noy, N.F., Tudorache, T., Musen, M.A.: A generic ontology for collaborative ontology-development workflows. In: Gangemi, A., Euzenat, J. (eds.) EKAW 2008. LNCS (LNAI), vol. 5268, pp. 318–328. Springer, Heidelberg (2008) 10. Khan, A.H., Minhas, A.A., Niazi, M.F.: Representation of uml activity models as ontology. In: Proceedings of 5th International Conference on Innovations in Information Technology (2008) 11. Bulterman, D.C.A., Rutledge, L.W.: SMIL 3.0: Flexible Multimedia for Web, Mobile Devices and Daisy Talking Books, 2nd edn. Springer, Berlin (2009) 12. Presutti, V., Gangemi, A.: Content ontology design patterns as practical building blocks for web ontologies. In: Li, Q., Spaccapietra, S., Yu, E., Oliv´e, A. (eds.) ER 2008. LNCS, vol. 5231, pp. 128–141. Springer, Heidelberg (2008) 13. Kaiser, R., Torres, P., Hoeffernig, M.: The interaction ontology: low-level cue processing in real-time group conversations. In: 2nd ACM International Workshop on Events in Multimedia (EiMM 2010), in conjunction with ACM Multimedia (2010) 14. Bizer, C., Schultz, A.: The berlin sparql benchmark. International Journal On Semantic Web and Information Systems (2009) 15. Lavee, G., Rivlin, E., Rudzsky, M.: Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. Trans. Sys. Man Cyber Part C 39(5), 489–504 (2009)
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks Zhuhua Liao1,2,3, Jing Yang1, Chuan Fu1, and Guoqing Zhang1 1
Institute of Computing Technology, Chinese Academy of Sciences 2 Graduate School of the Chinese Academy of Sciences 3 Key Lab of Knowledge Processing & Networked Manufacturing, Hunan University of Science and Technology, China {liaozhuhua,jingyang,chuanfu,gqzhang}@ict.ac.cn
Abstract. Similar contents, duplicates, and segments of videos abound in social media networks. However, querying and aggregating all these data with high quality and personalized demand present increasingly formidable challenges. In the paper, we propose a novel framework for automatically aggregating semantically similar and contextual videos in social media network, which called CLUENET. We use a proactive method to collect and integrate all-around valuable clues centered a video to improve the quality of aggregation; By use of these clues the CLUENET constructs a clues network for video aggregation which extract sequences and similar contents of videos, and uses dynamic Petri net (DPN) to steer video aggregation and data prefetching for adapted to different user’s personalized demand. The main features of this framework and how it was implemented using state-of-the-art technologies are also introduced. Keywords: Video Aggregation; Clues Network; Dynamic Petri Net.
1 Introduction With the growing multimedia in social media networks, there are similar contents, duplicates, and segments of videos distributed on social media networks. To aggregate these relevant contents of videos can fusion integral videos and enrich contents of videos in distributed environment, but aggregating the videos that lack of detailed annotations or clues presents several increasingly formidable challenges. First, in social media networks, for many videos’ annotations are sparse, there are no accurately semantic-based information extraction and integration tools to automatically aggregate videos and related contents with high recall and precision. Second, since the annotations of many videos loose coupling with videos on social media networks, for the videos with annotation we hardly keep the videos not to miss the connection with their corresponding annotations while these videos be moved or replicated to other places. And keeping all data that closely related to videos to update and return rich contents that user excessively concerned with are arduous tasks. So, today the large volume of multimedia data can not share and reuse effectively in social media networks. That is, it is hardly to retrieve and aggregate all relevant contents centered one video in the open network environment based on the current network infrastructure and content publish paradigm. K.-T. Lee et al. (Eds.): MMM 2011, Part II, LNCS 6524, pp. 274–284, 2011. © Springer-Verlag Berlin Heidelberg 2011
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks
275
Although many applications for management and sharing of multimedia have emerged on social media networks, such as YouTube, Flickr, etc, they did not resolve the problems for multimedia retrieving and aggregating on the network scale. In [1], the authors proposed a multimedia database resource integration and search system over Web. In [2], the authors presented an interactive video retrieval system which gives the user a global view about the similarity relationships among the whole video collection. But they do not consider the dynamic dissemination, the aggregation performance on the network scale and the influences by users’ interaction. For resolving these problems, there are many researchers propose many algorithms, develop tools and even some architectures for video retrieving and aggregating, such as LeeDeo [7], RoadRunner [8], duplicate detecting method [9], and SocialScope [10] which is a video annotator and aggregation architecture. But the performance of aggregation failed to take good effect, such that the recall and precision don’t meet users’ demand. We refer to a collection of social media sites that supporting communication between users, interaction such as reviewing, authoring, disseminating, annotating, and content sharing as social media networks. In the paper, we rethink the information and structure of present social media networks: many social media sites provide rich social features such as posting comments, ratings on specific media objects, uploading new media and establishing share communities for same interest; many duplicates or variants of media objects (including videos) deployed in multiple social media sites. And we try to explore the rich knowledge of media contents and context that distributed diversely social media networks to enable the video aggregation that integrating similar contents, duplicates, and segments of videos automatically. Somewhat surprisingly, our research suggests that most of the necessary changes reside in how information is aggregated and refined. We introduce a novel approach, which is inspired by service-oriented architecture (SOA) [3], for efficiently aggregating videos in social media networks. The SOA, that allows users to combine and reuse large chunks of functionality to form ad hoc applications over internet, has been successfully applied within the internet by the so called Web Services. Many methods based on Web Services related to multimedia have recently been proposed. For instance, E. Sakkopoulos, et al. [4] proposed techniques facilitating semantic discovery and interoperability of web services that manage and deliver media content. However, these methods aimed to discover and compose the services components of multimedia over internet. Our approach is something different from these methods because we focus on data that have decentralized over internet instead of software services. Since video can be easily migrated, duplicated and edited by user, our framework provides a clues center for videos where individual clue reported proactively by client is aggregated to a network, called clues network. In the network, we focus more on the high quality and personalized demand, rather than passively and separately hard mining video’s clues and low level collection.
2 Architecture The architecture of the CLUENET use the lightweight clues reporter and clues center to improve the video aggregation at two levels: collecting all-around valuable video’s clues and extracting videos’ sequences and relevant contents; aggregating videos and
276
Z. Liao et al.
Fig. 1. The overall architecture of CLUENET
closely relevant contents for different users’ demand. The overall architecture is illustrated in Fig. 1, where the clues center takes charge of the sequential discovering, similarity matching and duplicate detecting for clues network construction. 2.1 Clues Collection In SOA, software services are scarcely moved or varied the granularity and function while they are deployed and registered. However, the media objects are quite different. So in social media networks, the aggregation architecture requires clues to glue the loose-coupling of dynamic media objects such as similar contents, duplicates and segments in open network environment. Clues of video can mainly be classified into sequence clues, duplicate clues and similarity clues: over diverse social media networks, sequence clues are the clues for indexing and tracking the sequential videos or segments; duplicate clues are for identifying the features (such as location, state, size) of duplicate videos; and similarity clues are for discovering similar videos or segments. By use of these clues the system can aggregate all related videos and all their related contents (e.g. commenting, rating) for avoiding the upset by the duplicate, move operation on some of videos, segments and related contents over social media networks. In order to track these clues and collect them accurately, we propose a specific encapsulation principle for tracking new publish videos’ clues over social media networks and combining existing ways to retrieve the clues of legacy videos. New publish videos: To the new publish videos, it is simple to track all clues of them and collect into clues center and retaining the link between the clues when relevant videos are moved to other sites. we extend the video file header for each new publish video or segment, and then encapsulate the extended items and original video data as a new video file format, named cmf file. The extended file structure is listed in table 1. In the new file header, the ID item is for overcoming the persistence problem introduced by user renamed or duplicated, address of clues center item is for tracking new publish videos, description item is description text of new publish videos. For implementing proactive tracking the clues, we develop a clues reporter middleware,
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks
277
which is performed on the client when a user is playing cmf file, to report all kinds of clues. While a user is playing a cmf file, the clues reporter extract the cmf file to original video file and push all-around valuable clues, such as comments, ratings added by users to the video, user’s duplicate behavior and all viewing users, to clues center. The duplicate behavior involves the duplicate time, content, and the destination /source. When a user accesses a video in clues reporter at the first time, clues center will assign a unique ID to the video. Through these processes, we can get comprehensive information relevant of every video to build a powerful clues network. Note that, with the file transformation supporting of the clues reporter middleware, the presentation of new publish video in client is not changed regarding users. When a user is playing a new publish video the clues reporter middleware will running in media player. If no one plays the new publish video, the clues center can detect the availability of it over social media networks by the clues. Table 1. The cmf file structure File header: File name File ID Original file format Description Address of clues center File data: original video file data
Legacy videos: up to date, there are millions of videos available over social media networks that we can not encapsulate. Despite the fact that many researchers propose some useful architecture [10], algorithms and develop many tools for the legacy video’s retrieving [7] and aggregating [8], the performance of these tools failed to take good effect for video retrieving and video aggregating on account of a large volume of time is spent but harder to get high recall and precision. However, aggregating the legacy videos/segments with relevant contents can improve the reusability, composability of media objects and enrich the content of them. In the paper, we take in account aggregating clues of legacy videos using existing tools and find similar legacy videos to connect cmf videos. To aggregate similar videos, we use the tools and methods, such as VSM(Vector Space Model [12]), duplicate detecting [9] to computer similarity of videos. We do not introduce here in detail again. We consider aggregating cmf videos and legacy videos from the clues network. After finding the similar legacy videos to a new publish video, the connection between new publish videos, legacy videos and related data is an easy and simple process. It will introduce the video aggregation in the session 2.2. 2.2 Clues Management After the clues are collected in clues center, we need to manage and refine all the clues for video aggregation that includes aggregating sequential videos, duplicates, similar videos, and social information. In there, we call the information including the
278
Z. Liao et al.
comment, rating and annotation that generated by user as social information (or User Generated Content [16]). We first design the data model of clues center, which mainly includes the following interrelated entities: a video element links a set of duplicates, similar videos, and social information such as comments and ratings; and a set of sequences interlinking videos or video’s segments. We use directed graph represent the sequences of videos and a node represent a video while a directed edge represent a video direct to one of its subsequential videos. For discovering video sequences, we use data extracting tools [8] to extract the static relations of videos on web pages, and the sequence mining tools such as Apriori II [5] to mine the Top-N frequent access sequences of videos since the clues reporter can trace access behavior of users. In addition, the publisher can also directly submit the sequences of videos to clues center. To aggregate the similar videos and related contents (i.e comments), one possible way is to use keywords matcher or VSM matcher. Unfortunately, those methods have some limitations and drawbacks when applied to compute similarity of videos, since they only take into account the matching between descriptions of videos and neglect the graphical structure between videos. Moreover, these approaches are prone to be cheated [6]. In this paper, we weigh the similarity between videos by their description and social information (e.g. comments and ratings of the video added by users). And further use the ranking algorithms such as PageRank [6,15] to rank all videos based on the clues network. To the legacy videos that lack of textual description (i.e annotation), we first use the duplicate detecting technique [9] to compute the duplicates, similar videos and then aggregate their textual descriptions. By these various approaches, we can quickly shrink the size of the candidate videos and recommend high-relevant videos for different users. After video aggregation, each video links multiple types of important relevant contents. We present all video’s contents as a d-dimensional space. Each video represents a node in the video content space while it is an abstract tree in whole data space in the architecture. The data structure of video after aggregation is showing figure 2. For example, in the Fig. 2 the related contents of video id123 have duplicates, similar videos, annotation, presenting users and sequence links (the sequence link can be treated as a tree when truncated the loop in graph). The sequence links has two: id123>id109->id110 and id123->id101->id102. The information in the presenting users’ records includes comments, ratings and so on.
Fig. 2. Data structure related to a video
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks
279
Note that the clues center can detect the new publish videos and their duplicates in social media networks all the times according to the video ID. While a user sends out a request for a remote video file in clue reporter, clues center will probe the states of all duplicates and redirect the nearest and available duplicate for the user. By the duplicates and their social information, we can detect fraud duplicates.
3 Automatic Aggregation Model 3.1 User Query In the architecture, a new publishing video aggregates and maintains multiple types of important information, such that duplicate, similar content and social information. The contents strengthen the video’s annotation and the power of users’ query. With the help of the clues center, when user send a query, the system can: (1) Commends sequential and similar videos for users according to its contextual clues; (2) Integrates a video and social information, and strengthens the video sharing in social media networks; (3) Chooses the nearest video duplicate or replace invalid video with the valid duplicate for users which can optimize the transmission in social media networks. 3.2 The Automatic Aggregation Model In general, the video aggregation is depended on user’s query and selection. So in the architecture, we develop an automatic aggregation model by a Dynamic Petri Net (DPN) [11] to automatically generate aggregating multimedia document for presenting videos and closely relevant contents. The generated DPN can be used to deal with user interactions and requests for video aggregation, and used as a guideline to layout the presentation. It also specifies the temporal and spatial formatting for each media object. A Dynamic Petri Net structure, S, is a 10-tuple. S=(P, T, I, O, τ, Pd{F},N, F, P{F}, Oc{F})
(1)
1) P={p1,p2,……,pα}, where α≥0, is a finite set of places; 2) T={t1,t2,……,tβ}, where β≥0, is a finite set of transitions, such that P∩T= φ , i.e the set of places and transitions are disjoint; 3) I: P ∞ →T is the input arc, a mapping from bags of places to transitions; 4) O: T→ P ∞ is the output arc, a mapping from transitions to bags of places; 5) τ={( p1, τ 1 , s1 ),( p2 ,τ 2 , s2 ),……,( pα , τ α , sα )}, where τ i and si represent the video’s playing time and spatial location for the related objects represented by the places pi ;
6) N={ n1 , n2 ,..., nγ }, where γ ≥ 0 ,is a finite set of persistent control variables. There variables are persistent through every marking of the net; 7) F={ f 0 , f1 ,..., fη }, where η ≥ 0 , is a finite set of control functions that perform functions based on any control variable N; 8) P{F}, where P{F} ⊆ P, is a finite set of (static) control places that executes any control function F;
9) Oc{F}, where Oc{F} ⊆ O, is a finite set of (static) control output arcs that may be disabled or enabled according to a control function in F;
10) Pd{F}, where Pd{F} ⊆ P, is a finite set of dynamic places that take their values from a control function in F.

• Temporal Layout

The temporal layout is mainly dominated by the playing video after the user has queried and chosen it. The layout is composed of the playing video and the related contents, which have to be synchronized with the playing video. The temporal layout L_main is represented as {L_v, L_d}: L_v represents the playing video and L_d the contents related to it; together they represent the entire layout of the presentation. L_v is represented as {L_v1, L_v2, ...}, where L_vi = (<B_i^0, E_i^0>, S_i); B_i^0 is the start time of the ith playing video, E_i^0 is its terminating time (1 ≤ i), and S_i is its spatial location. The set L_d stores all related contents and is represented as {L_d1, L_d2, ...}, where L_di = {D_i1, D_i2, ..., D_in} and D_ij = (<B_i^j, E_i^j>, S_j). Here 1 ≤ i, 1 ≤ j ≤ n, B_i^0 ≤ B_i^j < E_i^j ≤ E_i^0, and S_j represents the spatial location of the jth data item.
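As a small illustrative check (ours, not from the paper), the following sketch verifies the synchronization constraint stated above, i.e., that every related data item is scheduled inside the interval of its playing video:

```python
def layout_is_consistent(video_interval, related_items):
    """video_interval: (B0, E0) of the playing video.
    related_items: list of (Bj, Ej) intervals of related contents."""
    b0, e0 = video_interval
    return all(b0 <= bj < ej <= e0 for bj, ej in related_items)

# A comment panel and a similar-video list synchronized with a 0-120 s video.
print(layout_is_consistent((0, 120), [(0, 30), (30, 120)]))   # True
print(layout_is_consistent((0, 120), [(100, 130)]))           # False: overruns the video
```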
• Modeling of Automatic Video Aggregation
Fig. 3. The automatic aggregation model
In video aggregation, the playing video object is the dominant object. The control functions govern the aggregation operations and determine at what time the aggregated media objects are presented. Considering the representation of a video sequence in an interactive manner, the aggregation is modeled as shown in Fig. 3. Media objects are represented as places (P), while synchronization points are modeled as transitions (T) in the Petri net.
Control Functions:
-query(): this function is associated with the output arc that triggers the start of the aggregation.
-has_video: this function checks whether any videos were returned by the query.

Algorithm 1. Function query(q):
    list = query_video(q)
    Seqlist = null
    Simlist = null
    n1 = size(list)

Algorithm 2. Function has_video():
    if n1 > 0 then enable Oac, disable Obc
    else disable Oac, enable Obc end

-choose_a: when the result set contains more than one video, this control function allows the user to choose one video for presentation.
-aggregating: the system aggregates the contents related to the playing video from the clues center, such as similar videos and sequential segments.

Algorithm 3. Function choose_a():
    if user has chosen list[k] OR Seqlist[k] OR Simlist[k]
    then v_chose = the video chosen by the user end

Algorithm 4. Function aggregating():
    Seqlist = Discovery_sequence(v_chose)
    Simlist = Match_similar(v_chose)
    Datalist = Get_RelatedData(v_chose)
    j = 0
    n2 = size(Seqlist)

-is_relatedvideo: assesses whether there are videos related to the playing video according to its similarity clues.
-is_relateddata: assesses whether there is related data (e.g., text, images) for the playing video according to its related-data clues.

Algorithm 5. Function is_relatedvideo():
    if Seqlist > 1 OR Simlist > 1 then enable Occ
    else disable Occ end

Algorithm 6. Function is_relateddata():
    if Datalist > 0 then enable Oec
    else disable Oec end

-auto_a: when the playing video is about to reach its termination time, the system automatically chooses a successor video as the next playing video, which is also proposed to the user.

Algorithm 7. Function auto_a():
    if j ...
-next: when the playing video has terminated, the system arranges the next video to play.
-play_video: plays the video chosen by the user or automatically by the system.
-layout: arranges the layout of the sequential segments and the similar-video list related to the playing video on the user interface.
-presenting: presents the related data on the user interface.
-is_terminated: checks whether the playing video has reached its termination time.

The DPN is generated as the synchronization points are processed by the proposed algorithm. The Petri net is initialized as follows: P = {P_start, P_end}, T = {}, I = {}, O = {}, τ = {}, N = {}, F = {query}, P{F} = {}, Oc{F} = {}, Pd{F} = {}. After the result of the query is not null, the dynamic Petri net is modified as follows:

P = {P_start, P1, P2, P3, P4, P5, P6^d, P_end}
T = {t1^0, t2^0, t3^0, t4^0, t1^1, t1^2, t3^1}
I = {(P_query, t1^0), (P1, t2^0), (P2, t3^0), (P3, t1^1), (P4, t4^0), (P5, t3^1), (P4, t1^2), (P6^d, t2^0)}
O = {(t1^0, P1), (t1^0, P_end), (t2^0, P2), (t2^0, P3), (t3^0, P6^d), (t1^1, P4), (t1^1, P5), (t3^0, P6^d), (t4^0, P6^d), (t1^2, P1), (t1^1, P_end), (t3^1, P_end), (t3^0, P_end)}
τ = {(P2, <B_i^0, E_i^0>, S_i), (P3, <B_j^0, E_j^0>, S_j), (P4, <B_l^0, E_l^0>, S_l), (P5, <B_k^0, E_k^0>, S_k)}
N = {n1, n2}
F = {has_video, is_relatedvideo, is_relateddata, next, is_terminated, is_null, no_relatedvideo}
P{F} = {P1, P2, P3, P4, P5}
Oc{F} = {(t1^0, P1){has_video}, (t1^0, P_end){is_null}, (t1^1, P4){is_relatedvideo}, (t1^1, P5){is_relateddata}, (t1^1, P_end){no_relatedvideo}}
Pd{F} = {P6^d}
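To make the control-function mechanism more concrete, here is a rough Python sketch (our illustration, not the authors' code) of a dynamic Petri net in which control functions enable or disable control output arcs; names such as `DynamicPetriNet` are hypothetical:

```python
class DynamicPetriNet:
    """Minimal sketch: control functions gate which output arcs may fire."""
    def __init__(self):
        self.marking = {"P_start": 1}          # tokens per place
        self.control_arcs = {}                 # (transition, place) -> control function
        self.arcs_out = {}                     # transition -> list of output places

    def add_transition(self, name, outputs):
        self.arcs_out[name] = outputs

    def guard(self, transition, place, fn):
        """Attach a control function to a (static) control output arc."""
        self.control_arcs[(transition, place)] = fn

    def fire(self, transition, state):
        """Move tokens along output arcs whose control function (if any) is true."""
        for place in self.arcs_out.get(transition, []):
            fn = self.control_arcs.get((transition, place))
            if fn is None or fn(state):
                self.marking[place] = self.marking.get(place, 0) + 1

net = DynamicPetriNet()
net.add_transition("t1_0", ["P1", "P_end"])
net.guard("t1_0", "P1", lambda s: s["n1"] > 0)        # has_video
net.guard("t1_0", "P_end", lambda s: s["n1"] == 0)    # is_null
net.fire("t1_0", {"n1": 3})
print(net.marking)   # a token is placed in P1 because the query returned videos
```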
Note that the clues reporter also provides an interface to lay out similar videos, sequential segments and social information while a user is playing a video. The user can forward a new query to the clues center if the recommended contents do not meet his or her needs.
4 Implementation and Measurement

The objective of the implementation was to develop a multimedia aggregation and presentation system that delivers the following capabilities: (1) multimedia aggregation and sharing in social media networks; (2) synchronous presentation of a video and its related data; (3) optimization of video transmission in social media networks. The application also allows prefetching and buffering of videos to obtain a reasonable QoS before playing. We executed a web crawler to collect a set of videos and their descriptions. We collected relevant information about 1,260 video files from Youku (http://www.youku.com) and Tudou (http://www.tudou.com), two popular video-distributing websites in China.
To demonstrate that CLUENET can efficiently aggregate video sequences, similar videos and related contents, we developed the clues reporter within the open-source Ambulant player [13] and built the clues center on the Internet. In addition, we developed a cmf file packager for encapsulating a newly published video file into a cmf file, and an unpackager for extracting the original video file from a cmf file. For aggregating videos and related contents, the system automatically generates a media document based on the scalable MSTI model [14] in places P4 and P5 of the Petri net; the media document is generated from a predefined SMIL template. Based on this video data repository, we conducted a test in two phases. First, we encapsulated the videos into cmf files and tested the clues collected by the clues reporter. The test shows that our system can not only detect user behaviors such as access time and the adding of ratings and comments, but can also track the sequence of videos accessed by a user while visiting a set of videos. Second, to verify the construction of the videos' clues network, we used the multimedia PageRank [15], VSM, and Apriori II algorithms to refine the clues in the clues center. The test shows that the clues center can recommend highly relevant contents and top-N frequent video sequences to the user while a video is playing. After the user sends a query to the system, it can aggregate a set of related videos and data and present them to the user according to the DPN model. In short, CLUENET can not only aggregate relevant contents and synthesize all textual information and video sequences for presentation to the user, but also keep track of newly published videos in social networks and redirect the user to a nearby duplicate. Moreover, aggregating comprehensive related contents enables the system to retrieve more comprehensive knowledge and enhances its dynamic adaptability in social networks.
5 Conclusions

Inspired by SOA, we propose a novel framework for automatically aggregating videos and relevant contents in social media networks. The goal of our framework is to build a world-wide mesh of videos based on the clues network, which can not only aggregate single videos and their context and present related, semantically rich media, but also guide users to retrieve coherent videos that are deeply associated, and enhance the trustworthiness of results through clues in social networks. In the future we plan to retrieve deeper knowledge about videos by analyzing the clues, to explore mechanisms that reduce the extent to which clue collection depends on users, and to further study the scalability and security issues of the framework.
Acknowledgements. We acknowledge the National High-Tech Research and Development Plan of China under Grant No. 2008AA01Z203 for funding our research.
References
1. Murthy, D., Zhang, A.: Webview: A Multimedia Database Resource Integration and Search System over Web. In: WebNet: World Conference of the WWW, Internet and Intranet (1997)
2. Cao, J., Zhang, Y.D., et al.: VideoMap: An Interactive Video Retrieval System of MCG-ICT-CAS. In: CIVR (July 2009)
3. Sprott, D., Wilkes, L.: Understanding Service Oriented Architecture. Microsoft Architecture Journal (1) (2004)
4. Sakkopoulos, E., et al.: Semantic mining and web service discovery techniques for media resources management. Int. J. Metadata, Semantics and Ontologies 1(1) (2006)
5. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Proc. of the 11th International Conference on Data Engineering, Taipei (1995)
6. Bianchini, M., Gori, M., Scarselli, F.: Inside PageRank. ACM Transactions on Internet Technology 5(1) (2005)
7. Dongwon, L., Hung-sik, K., Eun Kyung, K., et al.: LeeDeo: Web-Crawled Academic Video Search Engine. In: Tenth IEEE International Symposium on Multimedia (2008)
8. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards automatic data extraction from large web sites. In: VLDB (2001)
9. Wu, X., Hauptmann, A.G., Ngo, C.W.: Practical elimination of near-duplicates from web video search. In: ACM Multimedia, MM 2007 (2007)
10. Amer-Yahia, S., Lakshmanan, L., Yu, C.: SocialScope: Enabling Information Discovery on Social Content Sites. In: CIDR (2009)
11. Tan, R., Guan, S.U.: A dynamic Petri net model for iterative and interactive distributed multimedia presentation. IEEE Transactions on Multimedia 7(5), 869–879 (2005)
12. Melucci, M.: A basis for information retrieval in context. ACM Transactions on Information Systems (TOIS) 26(3) (June 2008)
13. Bulterman, D.C.A., Rutledge, L.R.: SMIL 3.0: Interactive Multimedia for the Web, Mobile Devices and Daisy Talking Books. Springer, Heidelberg (2008)
14. Pellan, B., Concolato, C.: Authoring of scalable multimedia documents. Multimedia Tools and Applications 43(3) (2009)
15. Yang, C.C., Chan, K.Y.: Retrieving Multimedia Web Objects Based on PageRank Algorithm. In: WWW, May 10-14 (2005)
16. Cha, M., Kwak, H., et al.: I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In: IMC 2007. ACM, New York (2007)
Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain

Peng Zhang 1, Sabu Emmanuel 1, and Mohan Kankanhalli 2

1 Nanyang Technological University, 639798, Singapore
{zh0036ng,asemmanuel}@ntu.edu.sg
2 National University of Singapore, 117417, Singapore
[email protected]
Abstract. Robust, accurate and efficient pedestrian tracking in surveillance scenes is a critical task in many intelligent visual security systems and robotic vision applications. The usual Markov chain based tracking algorithms suffer from error accumulation problem in which the tracking drifts from the objects as time passes. To minimize the accumulation of tracking errors, in this paper we propose to incorporate the semantic information about each observation in the Markov chain model. We thus obtain pedestrian tracking as a temporal Markov chain with two hidden states, called hidden-latent temporal Markov chain (HL-TMC). The hidden state is used to generate the estimated observations during the Markov chain transition process and the latent state represents the semantic information about each observation. The hidden state and the latent state information are then used to obtain the optimum observation, which is the pedestrian. Use of latent states and the probabilistic latent semantic analysis (pLSA) handles the tracking error accumulation problem and improves the accuracy of tracking. Further, the proposed HL-TMC method can effectively track multiple pedestrians in real time. The performance evaluation on standard benchmarking datasets such as CAVIAR, PETS2006 and AVSS2007 shows that the proposed approach minimizes the accumulation of tracking errors and is able to track multiple pedestrians in most of the surveillance situations. Keywords: Tracking, Hidden-Latent, Temporal Markov Chain, Error Accumulation, Surveillance.
1 Introduction
Pedestrian tracking has numerous applications in visual surveillance systems, robotics, assisting systems for the visually impaired, content-based indexing and intelligent transport systems, among others. However, tracking pedestrians robustly, accurately and efficiently is hard because of many challenges such as target appearance changes, non-rigid motion, varying illumination, and occlusions. To achieve successful tracking under these challenges, many works [2][8][4][9][11] employing mechanisms such as more distinctive features, integrated models and subspace decomposition analysis with Markov
chain or Bayesian inference have been proposed. These works approximate a target object by employing random posterior state estimation at every temporal stage during tracking to obtain a more accurate appearance representation of the target object. However, this mechanism introduces approximation error when the 'estimation optimization' is performed. Eventually, the errors accumulated over the temporal stages lead to the 'error accumulation' problem, making the tracking fail as time passes. In this paper, to address the error accumulation problem in tracking, we propose to incorporate probabilistic latent semantic analysis (pLSA) into the temporal Markov chain (TMC) tracking model. The latent information is discovered by training the pLSA with histogram of oriented gradients (HoG) features. The tracking incorporates the pLSA for observation optimization during the maximum a posteriori (MAP) computation stage of the tracking process. The pLSA is a time-independent operation, and thus past errors have no influence on the current result of the optimization. By combining a time-independent operation with the particle filters of the TMC for observation optimization, the proposed method avoids the error accumulation problem by calibrating the errors that occur at each time stage. In this way, we model tracking as a temporal Markov chain with two hidden states, called the hidden-latent temporal Markov chain (HL-TMC). In order to distinguish the two hidden states, we call the hidden state that is part of the pLSA 'latent'. The hidden state is used to generate the estimated observations from the current state during the Markov chain transition process, and the latent state helps to determine the most accurate observation. Further, HL-TMC accurately tracks multiple pedestrians by simultaneously considering the diffusion distance [10] between low-level image patches and the high-level semantic meaning between observations and models. Additionally, the effect of pose, viewpoint and illumination changes can be alleviated by using the semantic information learned from the HoG features [5]. The rest of the paper is organized as follows. In Section 2, we describe the related works. In Section 3, we present the proposed HL-TMC mechanism for pedestrian tracking in detail. The experimental results and analysis are presented in Section 4. Finally, we conclude the paper in Section 5 with our observations and conclusions.
2 Related Works
Black et al. [2] utilized a pre-trained view-based subspace and a robust error norm to model appearance variations and handle the error accumulation problem. However, the robust performance of this algorithm comes at the cost of a large number of off-line training images, which cannot be obtained in many realistic visual surveillance scenarios. Further, the error caused by the template matching process has not been considered in this work, and its accumulation can make the tracking fail. Ramesh et al. proposed a kernel-based object tracking method called 'mean-shift' [4]. Instead of directly matching the templates as
in [2], 'mean-shift'-based tracking improves the tracking performance by performing a gradient-based estimation-optimization process on a histogram feature with an isotropic kernel. However, these mean-shift-based methods still introduce tracking errors under intrinsic variations such as pose changes. When the introduced errors accumulate to some level, they finally make the tracking algorithm fail. Therefore, more flexible on-line learning models are needed to model the appearance variations of the targets to be tracked. To explicitly model the appearance changes during tracking, a mixture model via an online expectation-maximization (EM) algorithm was proposed by Jepson et al. [9]. Its treatment of tracked objects as sets of pixels makes the tracking fail when the background pixels are modeled differently from the foreground during tracking. In order to treat the tracked targets as abstract objects rather than sets of independent pixels, Ross et al. proposed an online incremental subspace learning mechanism (ISL) with temporal Markov chain (TMC) inference for tracking in [14]. The subspace learning-update mechanism helps to reduce the tracking error when varying pose/illumination and partial occlusions occur, and the target appearance can be modeled effectively. However, when the target size is small or there is strong noise in the captured frame, errors still occur because the optimization is based on an image pixel/patch-level (low-level) distance metric using particle filters. Therefore, as the TMC inference proceeds, ISL also suffers from the error accumulation problem, which eventually causes the tracking to fail. Grabner et al. [6] proposed a semi-supervised detector as a 'time-independent' optimization method to calibrate the accumulated errors. However, it performs satisfactorily only in scenarios where the target leaves the surveillance scene [1]. In this paper, we simultaneously consider pixel-level and semantic-level distance metrics (probabilities) in each temporal optimization process, and the error accumulation problem is resolved by maximizing their joint probability.
3 Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain Inference (HL-TMC)
We model the pedestrian tracking problem as an inference task in a temporal Markov chain with two hidden state variables. The proposed model is given in Fig. 1; the details are given below. For a pedestrian P, let X_t describe its affine motion parameters (and thereby its location) at time t. Let I_t = {I_1, ..., I_t}, where I_t = {d_t^1, d_t^2, ..., d_t^n} denotes the collection of estimated image patches of P at time t and n is a predefined number of sample estimates. Let z = {z_1, z_2, ..., z_m} be a collection of m latent semantic topics associated with pedestrians and w = {w_1, w_2, ..., w_k} a collection of k codewords. We consider the temporal Markov chain with hidden states X_t and z. For the HL-TMC, it is assumed that the hidden state X_t is independent of the latent semantic topics z and codewords w [7]. The whole tracking process is accomplished by maximizing the following probability at each time stage:
Fig. 1. Proposed Hidden-Latent Temporal Markov Chain (HL-TMC)
p(X_t | I_t, z, w) · p(I_t | z, w) · p(z | w) ∝ p(X_t | I_t) · p(I_t | z, w).

Based on Bayes' theorem, the posterior probability can be obtained as

p(X_t | I_t) ∝ p(I_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | I_{t−1}) dX_{t−1} = Y_t.

Since the latent semantic analysis process is not a temporal inference, the maximum probability for each time stage t can be obtained as

max p(X_t | I_t, z, w) · p(I_t | z, w) = max Y_t · p(I_t | z, w) = max{ Y_t · p(d_t^1 | z, w), ..., Y_t · p(d_t^n | z, w) }.

Thus, tracking at each time stage t is achieved by maximizing the following quantity for each i ∈ [1, n]:

p(d_t^i | z, w) · p(I_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | I_{t−1}) dX_{t−1}.    (1)
We estimate the three probabilities p(X_t | X_{t−1}), p(I_t | X_t) and p(d_t^i | z, w) in Expression (1) by adopting three probabilistic models: the time-varying motion model describes the state transfer probability p(X_t | X_{t−1}); the temporal-inference observation model estimates the relationship p(I_t | X_t) between the observations I_t and the hidden states X_t; and the probabilistic latent semantic analysis (pLSA) model is employed in the testing phase to find the maximal pedestrian likelihood probability p(d_t^i | z, w), for 1 ≤ i ≤ n. The pLSA model helps to improve the accuracy of tracking, reduces the accumulation of tracking errors and deals with pose, viewpoint and illumination variations. We now discuss these models in detail.
3.1 Time-Varying Motion Model
We represent each pedestrian as an affine image warp, described by the hidden state X_t composed of 6 parameters: X_t = (x_t, y_t, θ_t, s_t, α_t, φ_t), where x_t, y_t, θ_t, s_t, α_t, φ_t denote the x, y translation, rotation angle, scale, aspect ratio and skew direction at time t, respectively. As in [8], the distribution of each parameter of X_t is assumed to be Gaussian centered around X_{t−1}, and the corresponding diagonal covariance matrix Ψ of the Gaussian distribution is made up of 6 parameters denoting the variances of the affine parameters, σ_x², σ_y², σ_θ², σ_s², σ_α² and σ_φ². If we assume that the variance of each affine parameter does not change over time, the time-varying motion model can be formulated as

p(X_t | X_{t−1}) = N(X_t; X_{t−1}, Ψ).    (2)
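As an illustrative sketch (ours, not the authors' code), drawing candidate states from this motion model amounts to adding independent Gaussian noise to each affine parameter of the previous state; the variance values below are hypothetical:

```python
import numpy as np

def sample_motion(x_prev, sigmas, n_particles=200, rng=np.random.default_rng(0)):
    """Draw candidate affine states X_t ~ N(X_{t-1}, diag(sigmas^2)).
    x_prev: previous state (x, y, theta, scale, aspect, skew)."""
    x_prev = np.asarray(x_prev, dtype=float)
    noise = rng.normal(0.0, sigmas, size=(n_particles, x_prev.size))
    return x_prev + noise

# Hypothetical variances: a few pixels of translation, small shape changes.
sigmas = np.array([4.0, 4.0, 0.02, 0.01, 0.005, 0.001])
candidates = sample_motion([120.0, 80.0, 0.0, 1.0, 0.4, 0.0], sigmas)
print(candidates.shape)   # (200, 6)
```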
3.2 Temporal-Inference Observation Model
The relationship p(I_t | X_t) between the observations I_t and the hidden states X_t is estimated using this model. We use I_t to denote the collection of estimated image patches from the hidden state X_t. Here principal component analysis (PCA) is employed in a stochastic manner. Suppose that the sample I_t is drawn from a subspace spanned by U (obtained by SVD of the centered data matrix [14]) and centered at µ, such that the distance from the sample to µ is inversely proportional to the probability of the sample being generated from this subspace. Let Σ denote the matrix of singular values corresponding to the columns of U, let I denote the identity matrix, and let εI denote the additive Gaussian noise in the observation process. Then, as in [14], the probability of a sample drawn from the subspace is estimated as

p(I_t | X_t) = N(I_t; µ, UUᵀ + εI) · N(I_t; µ, U Σ⁻² Uᵀ).    (3)
3.3 Time-Independent Probabilistic Latent Semantic Analysis (pLSA) Model
Using the above two probability models, one can decide which estimated sample has the shortest distance to the hidden states. However, this computation is based on low-level (pixel-level) processing of the samples. This mechanism may cause the estimated sample with the shortest distance not to be exactly what we really want to track, because the background pixels inside the sample can affect the distance heavily. Therefore, we need the tracking method to be more intelligent and to understand what is being tracked by working at the latent semantic level. We therefore employ the pLSA model [7] in the testing phase to find the maximal pedestrian likelihood probability (the likelihood that the estimated sample is a pedestrian from time t−1 to t), p(d_t^i | z, w), for 1 ≤ i ≤ n. For the pLSA model in [7], the variable d_t^i denotes a document, while in our case it denotes an estimated image patch, i.e., an observation. The variable z ∈ z = {z_1, ..., z_m} is an unobserved latent topic associated with each observation, which is defined as 'pedestrian'. The variable
w ∈ w = {w_1, w_2, ..., w_k} represents a codeword, i.e., a clustering center of the HoG feature vectors extracted from the training dataset images. We assume that d and w are independent conditioned on the state of the associated latent variable z. Let w_t^i ⊂ w be the collection of codewords generated from d_t^i; each codeword in w_t^i is obtained from the HoG features extracted from d_t^i by vector quantization. As in [7], during testing the likelihood of each estimated sample (observation) at time t being a pedestrian is obtained as

p(d_t^i | z, w) ∝ p(d_t^i | z) ∝ Σ_{w∈w_t^i} p(z|w) p(w|d_t^i).    (4)
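A rough sketch of this likelihood computation (our illustration; the learned matrix p(z|w), the codeword histogram and the optional topic weights are assumed to come from the training stage described in Section 4):

```python
import numpy as np

def pedestrian_likelihood(codeword_hist, p_z_given_w, topic_weights=None):
    """codeword_hist: length-k histogram p(w|d) of the patch's quantized HoG features.
    p_z_given_w: (m, k) matrix learned by pLSA.
    topic_weights: optional length-m weights lambda_i (uniform if None)."""
    p_w_given_d = codeword_hist / max(codeword_hist.sum(), 1e-12)
    p_z_given_d = p_z_given_w @ p_w_given_d            # sum over codewords w
    if topic_weights is None:
        topic_weights = np.full(len(p_z_given_d), 1.0 / len(p_z_given_d))
    return float(topic_weights @ p_z_given_d)          # sum over topics z

# Example with a hypothetical 20-topic, 300-codeword model.
rng = np.random.default_rng(1)
score = pedestrian_likelihood(rng.integers(0, 5, 300).astype(float),
                              rng.dirichlet(np.ones(300), size=20))
print(score)
```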
Since the latent semantic analysis process is not a temporal inference process, by using all three of the above models in Expression (1), tracking is performed by maximizing that quantity at each time stage.
4 Experiments and Discussion
The implementation of the proposed mechanism consists of two tasks: one is off-line learning and the other is tracking. For feature selection in our experiments, we use the histogram of oriented gradients (HoG) because it contains shape, context and texture information, which is substantially descriptive for representing the characteristics of a pedestrian. The effectiveness of the HoG feature has been verified in [13] and [15]. Another advantage of the HoG feature is its computational efficiency, which is critical for the real-time requirement of a visual surveillance tracking system. For the training dataset, we use the NICTA pedestrian dataset. We chose this dataset because it provides about 25K unique pedestrian images at different resolutions as well as a sample negative set. In addition, each NICTA image has size 64 × 80, making it suitable for efficient generation of HoG features. In the pLSA learning process, we first perform HoG feature extraction on each image of the NICTA training dataset. The generated HoG features are then clustered by the k-means clustering algorithm to obtain the codebook w = {w_1, w_2, ..., w_k} [12]. Next, vector quantization (VQ) is carried out on the HoG features of each training image based on the codebook to obtain its histogram of codewords, which is the 'bag of words' required for learning. The result of learning is the m × k association probability matrix p(z|w). In our experiments, the size of the codebook k and the number of topics m are pre-defined as k = 300 and m = 20. Not all of the learned topics equally denote the meaning of a 'pedestrian'; hence we need to assign a weight to each topic based on its descriptive ability for a pedestrian. To obtain this weights-topics histogram, we used a dataset called Pedestrian Seg. The important characteristic of this dataset is that each image contains only the foreground (pedestrian) without the background. The segmentation of the Pedestrian Seg dataset was done manually using Adobe Photoshop. We first extracted the HoG features
Fig. 2. Comparison between HL-TMC tracking and ISL tracking on CAVIAR dataset
of each image in the Pedestrian Seg dataset. Then another collection of codewords w_ps is obtained by VQ on the codebook w. The weights λ_1, ..., λ_m for the topics are computed as

λ_i = Σ_{w∈w_ps} p(w|z_i),    λ_i = λ_i / Σ_{j=1}^{m} λ_j.
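The offline learning steps described above can be sketched roughly as follows (our illustration; scikit-learn's KMeans stands in for the clustering step, and the pLSA fit that produces p(w|z) is assumed to be done elsewhere):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(hog_features, k=300, seed=0):
    """Cluster HoG descriptors from the training set into k codewords."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(hog_features)

def codeword_histogram(codebook, hog_features, k=300):
    """Vector-quantize one image's HoG descriptors into a bag-of-words histogram."""
    labels = codebook.predict(hog_features)
    return np.bincount(labels, minlength=k).astype(float)

def topic_weights(p_w_given_z, seg_codeword_ids):
    """lambda_i = sum of p(w|z_i) over codewords observed in Pedestrian Seg, normalized."""
    lam = p_w_given_z[:, seg_codeword_ids].sum(axis=1)
    return lam / lam.sum()
```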
As described in Section 3, the tracking process at each temporal stage is done by maximizing p(d_t^i | z, w) · p(I_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | I_{t−1}) dX_{t−1}. For each estimation sample d_t^i, the quantities p(d_t^i | z, w) and p(I_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | I_{t−1}) dX_{t−1} are calculated independently. The quantity p(d_t^i | z, w) denotes a high/semantic-level distance value, which is computed as

p(d_t^i | z, w) = Σ_j Σ_{w∈w_t^i} λ_j p(z_j | w) p(w | d_t^i).
The quantity p(I_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | I_{t−1}) dX_{t−1} represents a low/pixel-level distance value. The quantity p(I_t | X_t) is computed from the probability distributions N(I_t; µ, UUᵀ + εI) and N(I_t; µ, U Σ⁻² Uᵀ) as described in Section 3.2. As in [15], exp(−‖(I_t − µ) − UUᵀ(I_t − µ)‖²) corresponds to the negative exponential distance of I_t to the subspace spanned by U; this low/pixel-level distance is proportional to the Gaussian distribution N(I_t; µ, UUᵀ + εI). The component N(I_t; µ, U Σ⁻² Uᵀ) is modeled using the Mahalanobis distance. Thus, for each d_t^i there is a probability product arising from the high/semantic-level and low/pixel-level distances. The estimation sample d_t^i with the largest product is regarded as the optimum estimate for the next tracking stage.
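Putting the two levels together, the per-frame selection can be sketched as below (our illustration; `pixel_level_likelihood` and the learned quantities are hypothetical stand-ins for the routines of Sections 3.2 and 3.3):

```python
import numpy as np

def select_best_patch(patches, hists, p_z_given_w, lam, pixel_level_likelihood):
    """Return the index of the candidate patch maximizing the product of the
    semantic-level score and the pixel-level (subspace) likelihood."""
    scores = []
    for patch, hist in zip(patches, hists):
        semantic = float(lam @ (p_z_given_w @ (hist / max(hist.sum(), 1e-12))))
        scores.append(semantic * pixel_level_likelihood(patch))
    return int(np.argmax(scores))
```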
Fig. 3. Comparison between HL-TMC tracking and ISL tracking on the PETS2006 dataset
4.1 Tracking Performance Comparison
To verify the effectiveness of the proposed tracking mechanism, we chose surveillance videos with multiple pedestrians for tracking. In addition, the scenes were challenging, with occlusions, illumination variations and scale changes. We compare the proposed work with the ISL tracking of Ross et al. [14], the AMS tracking of Collins [3] and the WSL tracking of Jepson et al. [9]. Since ISL tracking has been demonstrated to be more robust and accurate than other classic tracking approaches [10], due to space limitations we give the visual comparison of the proposed method only against ISL. The quantitative comparison is provided against all three works. Visual Performance Comparison: We ran the proposed tracking method on popular surveillance video datasets such as CAVIAR, PETS2006 and AVSS2007. The tracking results are shown in Figs. 2-4. Fig. 2 presents the performance of the proposed pedestrian tracking in the surveillance scene inside a hall from the CAVIAR dataset. For all the tracked pedestrians, the proposed HL-TMC tracking performs more accurately than the ISL tracking, but both methods can deal with the occlusion cases in this test. Besides halls and malls, visual surveillance systems are widely deployed in subway stations; therefore, we also test the performance of HL-TMC on such scenes. The comparison in Fig. 3 shows that the proposed tracking method can track multiple pedestrians more accurately in the long-distance camera shot scene of the PETS2006 dataset. Even with occlusions occurring in scenes with many moving pedestrians, the proposed method can still robustly perform the tracking. Another test is performed on a short-distance camera shot from the AVSS2007 surveillance dataset, and the results are given in Fig. 4. In this case, since the background is simple compared to the previous cases, both the proposed method and ISL perform well for the pedestrians who are close to the camera. However,
Fig. 4. Comparison between HL-TMC tracking and ISL tracking on AVSS2007 dataset
Fig. 5. Covering rate computation for quantitative analysis
while tracking the pedestrians who are far away, ISL loses track whereas HL-TMC is still able to track the pedestrians when occlusion occurs. Similarly, the proposed method outperformed the AMS and WSL tracking methods on these test videos. The AMS and WSL test results are not included due to space limitations. Quantitative Analysis: For the quantitative analysis of tracking performance, we manually label several key locations on the pedestrians as 'ground truth' that need to be covered by the tracking area. The tracking precision of each pedestrian is defined by the 'covering rate' (CR) and is computed as:
CR = (number of tracked key locations inside the bounding box) / (total number of key locations needed to be tracked).    (5)
In our experiments, we had about 409 pedestrian templates of various poses for marking these key locations for the CR calculation. The CR computations on some sample templates are shown in Fig. 5. During the tracking process, both counts are obtained manually, frame by frame, for the whole video. Then the
Fig. 6. Covering rates of the HL-TMC, ISL, WSL and AMS tracking algorithms
CR for each frame is obtained using Equation (5). In addition, the likelihood computation based on local features (texture & HoG) guarantees that the tracking bounding box does not enlarge too far beyond the boundary of the target pedestrian, which avoids the degenerate case of CR ≡ 1 when the bounding box covers the whole frame. In our experiments, we did not employ Jepson's root mean square (RMS) error calculation for the quantitative analysis, because computing the RMS error is very hard when tracking multiple pedestrians with occlusions and pose changes. Fig. 6(a) shows the comparison of CR between HL-TMC tracking and the ISL, AMS and WSL tracking on the video "EnterExitCrossingPaths1cor" in the CAVIAR dataset. It can be seen that during the whole tracking process, HL-TMC always outperforms the other three tracking mechanisms, and the performance of ISL is better than the other two, as claimed in [14]. Fig. 6(b) shows the performance comparison on the video "PETS2006.S3-T7-A.Scene3" in the PETS2006 dataset. The proposed HL-TMC tracking also outperforms the other three methods in this case. Notice that the WSL tracking outperforms the ISL tracking at the beginning, but it gradually loses track when heavy occlusions of pedestrians occur.
5 Conclusion
In this paper, we proposed a novel pedestrian tracking mechanism based on a temporal Markov chain model with two hidden states to handle the tracking error problem. To minimize the accumulation of tracking errors, we proposed to incorporate the semantic information about each observation into the general temporal Markov chain tracking model. By employing HoG features to find the latent semantic cues denoting pedestrians, the proposed method can use the meaning of a "pedestrian" to find the most likely sample (the "pedestrian" observation) accurately and efficiently for updating the target appearance. The experimental results on different popular surveillance datasets, such as CAVIAR,
Pedestrian Tracking Based on HL-TMC
295
PETS2006 and AVSS2007 demonstrated that the proposed method can robustly and accurately track the pedestrians under various complex surveillance scenarios and it outperforms the existing algorithms.
References 1. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990 (June 2009) 2. Black, M., Jepson, A.: Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision 26(1), 63–84 (1998) 3. Collins, R.: Mean-shift blob tracking through scale space. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II–234–40 (June 2003) 4. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003) 5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886– 893 (June 2005) 6. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised on-line boosting for robust tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008) 7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. International ACM SIGIR Conference, pp. 50–57 (1999) 8. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 9. Jepson, A., Fleet, D., El-Maraghi, T.: Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1296–1311 (2003) 10. Kwon, J., Lee, K.: Visual tracking decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (2010) 11. Lim, J., Ross, D., Lin, R., Yang, M.: Incremental learning for visual tracking. In: Advances in Neural Information Processing Systems (NIPS), vol. 17, pp. 793–800 (2005) 12. Niebles, J., Wang, H., Li, F.: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 79(3), 299– 318 (2008) 13. Zhu, Q., Yeh, M.C., Cheng, K., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498 (2006) 14. Ross, D., Lim, J., Lin, R., Yang, M.: Incremental learning for robust visual tracking. International Journal of Computer Vision 77(1-3), 125–141 (2008) 15. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73(2), 213–238 (2007)
Motion Analysis via Feature Point Tracking Technology

Yu-Shin Lin 1, Shih-Ming Chang 1, Joseph C. Tsai 1, Timothy K. Shih 2, and Hui-Huang Hsu 1

1 Department of Computer Science and Information Engineering, Tamkang University, Taipei County, 25137, Taiwan
[email protected]
2 Department of Computer Science and Information Engineering, National Central University, Taoyuan County, 32001, Taiwan
Abstract. In this paper, we propose a tracking method based on the SIFT algorithm for recording the trajectory of human motion in an image sequence. Instead of using a human model that represents the whole body to analyze motion, we extract only two feature points from the local region of a trunk: one for the joint and one for the limb. We calculate the similarity between the features of two trajectories. The similarity computation is based on the 'motion vector' and the 'angle'. The angle is obtained from the line connecting the joint to the limb in a plane whose center is the core of the object. The proposed method consists of two parts. The first tracks the feature points and outputs a file that records the motion trajectory. The second analyzes the trajectory features and adopts DTW (Dynamic Time Warping) to calculate a score that shows the similarity between two trajectories. Keywords: motion analysis, object tracking, SIFT.
1 Introduction

Motion analysis is a very important approach in several research areas, such as kinematics of the human body or 3D object generation in 3D games. It requires recording the trajectory of body motion or physical gestures with a camera; after obtaining this information, the motion form can be completed. This kind of technology is widely applied to the study of human motion [1]. Before analyzing, we have to know how the object moves. Object tracking is a useful technique in image processing for finding the location of a moving object over time. In order to match the similarity between two sets of videos, we have to extract the features of the objects in the images [2, 3]. For tracking objects, we use the SIFT algorithm [4] to describe the object features. The features are based on local appearance as the characteristic of objects; their scale and orientation are invariant, and feature matching is robust. In recent years, many works on object tracking or recognition have been based on modified SIFT algorithms [5, 6]. For human motion capture and recording the trajectory of body movement [7, 8, 9, 10], such trajectories are important
content to analyze motion. In our approach, DTW [11, 12, 13] is used to match human behavior sequences. Our system combines tracking and trajectory matching. The remainder of this paper is organized as follows. Section 2 gives an overview of the SIFT algorithm and its application. Section 3 describes the representation of trajectories and how they are analyzed. Section 4 explains our system architecture. Experimental results are given in Section 5. Finally, a short conclusion is given in Section 6.
2 SIFT Algorithm and Application

In order to obtain an exact tracking result between two consecutive frames, we choose the SIFT algorithm for this work. Although many approaches can produce a very precise tracking result, we need a result that includes both angle and location information, so we choose the SIFT algorithm in this study. The main stages of SIFT are as follows.

Scale-space extrema detection. In the first stage, tracking points can be extracted from the image over location and scale. The scale space of an image is computed with the Gaussian function
G(x, y, σ) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))    (1)
In each octave of the difference-of-Gaussian scale space, extrema are detected by comparing a pixel to its 26 neighbors in 3×3 regions at the current and adjacent scales.

Orientation assignment. To achieve invariance to image rotation, a consistent orientation is assigned to each keypoint based on local image properties. The scale of the keypoint is used to select the Gaussian-smoothed image L, from which the gradient magnitude m(x, y) and orientation θ(x, y) are computed:
m(x, y) = sqrt((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
θ(x, y) = tan⁻¹((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))    (2)
where L is the smoothed image. An orientation histogram with 36 bins is formed, with each bin covering 10 degrees. The highest peaks in the orientation histogram correspond to the dominant directions of the local gradients.
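A rough NumPy sketch of this gradient and orientation-histogram computation (our illustration, not the authors' implementation):

```python
import numpy as np

def gradient_orientation_histogram(L, bins=36):
    """L: Gaussian-smoothed image patch. Returns gradient magnitude, orientation
    (radians) and a 36-bin orientation histogram weighted by magnitude."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]          # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]          # L(x, y+1) - L(x, y-1)
    mag = np.sqrt(dx**2 + dy**2)
    ori = np.arctan2(dy, dx) % (2 * np.pi)
    hist, _ = np.histogram(ori, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return mag, ori, hist

peak_direction = lambda hist: np.argmax(hist) * (360 // len(hist))  # degrees
```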
Fig. 1. This figure shows the Keypoint descriptor of SIFT
Keypoint descriptor. After finding the keypoint locations at particular scales and assigning orientations to them, this step computes a descriptor vector for each keypoint. A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location. These are weighted by a Gaussian window, indicated by the overlaid circle. The samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions. Each subregion has a histogram with 8 orientations; therefore, the feature vector has 4 × 4 × 8 elements. The descriptor is formed from a vector containing the values of all the orientation histogram entries.

2.1 Application of SIFT Algorithm

The SIFT keypoints of objects are extracted from scale space. In practice, the location of a keypoint is not necessarily inside the region of the object that we want to track. Therefore, we replace the extrema detection method by entering the initial point by hand; in this way, we can also set the region we want to analyze. In the orientation assignment and keypoint descriptor stages, the scale of the keypoint is used to select the Gaussian-smoothed image with the closest scale and to compute the gradient magnitude and orientation. Because the initial point is not detected automatically, we cannot determine which scale level to use, so each standard deviation of the Gaussian function is set to 1. There are two reasons for this setting: one is that this scale of the image is close to the original image; the other is that the Gaussian function can reduce the amount of noise.
3 Trajectory of Object

3.1 Representation of Trajectory

A motion path of an object is recorded from consecutive frames by the method of the previous section; this path is the trajectory of the moving object. In this system, we only record the absolute coordinates of the feature point inside the object region in each frame. Therefore, we cannot directly calculate the similarity of two trajectories. In this paper, we define the 'motion vector' and the 'angle' as the features of the data points belonging to a trajectory.

Angle. A human's continuous movement consists of different postures. These posture transformations are expressed by the angle of transformation between the limbs and a joint in the body region.
Fig. 2. The angle representation in our experiment. We set the neck as the center point of the upper body and the hip as the center point of the lower body. We also use a four-quadrant coordinate axis to record the angle and position of the feature points such as the hands, head and feet.
The representation is based on the concept of a polar coordinate system: with the joint as the pole, the angle of a point in the limb region is measured counterclockwise from the polar axis. An instance is shown in Fig. 2.

Motion Vector. The direction of the moving object is defined as the motion vector. It is represented by 8 directions; Fig. 3 shows the directional representation.
Fig. 3. The motion vectors are shown in the figure. There are eight directions in total, starting from the right. We assign a number to each direction; in this way, the motion direction can be identified very clearly.
A number is assigned to each direction for easy identification. In addition, 0 is assigned to the center, which means the object did not move.
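A small sketch of these two trajectory features (ours; the 8-direction codes follow the numbering idea of Fig. 3, with 0 meaning no movement):

```python
import math

def motion_vector(p_prev, p_curr):
    """Quantize the displacement between two consecutive feature points
    into one of 8 direction codes (1..8), or 0 if the point did not move."""
    dx, dy = p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]
    if dx == 0 and dy == 0:
        return 0
    angle = math.degrees(math.atan2(dy, dx)) % 360
    return int(((angle + 22.5) % 360) // 45) + 1   # 45-degree sectors starting at the right

def joint_limb_angle(joint, limb):
    """Angle of the limb point measured counterclockwise from the polar axis at the joint."""
    return math.degrees(math.atan2(limb[1] - joint[1], limb[0] - joint[0])) % 360

print(motion_vector((10, 10), (15, 10)), joint_limb_angle((0, 0), (1, 1)))  # 1, 45.0
```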
3.2 Similarity between Trajectories

Dynamic time warping is an algorithm for measuring the similarity between two sequences. Based on the 'motion vector' and 'angle', we modify the DTW algorithm into a score path equation:

SP(E_i, C_j) = SP(E_{i−1}, C_{j−1}) + mvS_{i,j} + anS_{i,j}, if θ < 1;
SP(E_i, C_j) = MAX(SP(E_i, C_{j−1}), SP(E_{i−1}, C_j)) + mvS_{i,j} + anS_{i,j}, otherwise.    (3)

where E and C are trajectories and i and j index the data points belonging to the respective trajectories. mvS is the score obtained from the similarity of the 'motion vector' feature, and anS from the 'angle' feature. According to this equation, we obtain a total score, which represents the similarity between the two trajectories.
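A rough dynamic-programming sketch of this modified DTW scoring (our illustration; the per-pair scores `mv_score` and `an_score` and the angle-difference threshold follow the description above, with hypothetical scoring values):

```python
def trajectory_similarity(E, C, angle_tol=1.0):
    """E, C: lists of (direction_code, angle) data points.
    Accumulate motion-vector and angle scores along a DTW-style path."""
    def mv_score(a, b):
        return 1.0 if a[0] == b[0] else 0.0          # identical direction code
    def an_score(a, b):
        return 1.0 if abs(a[1] - b[1]) < angle_tol else 0.0
    n, m = len(E), len(C)
    SP = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = mv_score(E[i-1], C[j-1]) + an_score(E[i-1], C[j-1])
            if abs(E[i-1][1] - C[j-1][1]) < angle_tol:
                SP[i][j] = SP[i-1][j-1] + step
            else:
                SP[i][j] = max(SP[i][j-1], SP[i-1][j]) + step
    return SP[n][m]

print(trajectory_similarity([(1, 45.0), (2, 90.0)], [(1, 45.2), (2, 89.5)]))  # 4.0
```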
4 System Architecture

Our system consists of two parts. The first part is object tracking; its main goal is to track the object and record the moving path. From this step we obtain a file that records the angle and location of the keypoints in the video stream, and the tracking results are then used to analyze two motions. The second part analyzes and compares the trajectories. After analyzing two motions, we assign a score to the evaluated motion video; the score is the similarity between the two motions. Figure 4 illustrates the system architecture diagram. In the tracking process, we first enter the initial points as the feature points to be tracked. A SIFT descriptor is computed for each point via the SIFT operator. We then perform motion estimation by a modified full search: the feature point is taken as the center of a block, and a full search is processed against the next frame. The pixels in this region are keypoint candidates, and the full (block) search finds the candidate area most similar to the feature point; the result is the feature point to be tracked. After the tracking process, the system outputs a file that records the trajectory coordinates of the feature points, i.e., the positions of the target object in the consecutive frames. The trajectory features are formed from the 'motion vector' and the 'angle', and the trajectories are matched by analyzing these two features. Finally, we use Dynamic Time Warping (DTW) to score the similarity of the two trajectories and show the result. In the tracking process, the object location detected by the block matching algorithm may not be as expected; such errors arise when the object rotates or changes size. To solve this problem, an adjustment mechanism is added to the tracking process. The system shows the tracking result of the current frame in real time, and adjustment is required when such a situation occurs. The four steps are as follows:
1. Pause the process immediately.
2. Delete the record of the wrong point.
3. Add the new, correct point.
4. Continue the processing.
Fig. 4. The system flowchart. The left part shows the tracking algorithm. First, we enter the initial keypoints and track them over all frames of the video stream; when this finishes, we obtain a file recording the tracking result. The right part is the matching algorithm: we load the files produced by the first part and analyze them. In the end, we obtain the matching results and the scores of the object motion.
After the adjustment, we obtain a more precise tracking result. We use an evaluation metric named dPSNR to check the accuracy of the result. dPSNR is a modified PSNR in which the pixel error is replaced by the distance. The values are shown in Table 1.

Table 1. A comparison of dPSNR values

trajectory   | dPSNR (without adjusting) | dPSNR (with adjusting)
neck         | 51.5986                   | 52.6901
left hand    | 42.8803                   | 52.1102
right hand   | 46.6695                   | 50.0016
5 Experiments

The videos used in this study are instances of yoga exercises. We cut a clip from a film in which the motion is raising both arms parallel to the floor. The result of the analysis is shown in Fig. 5. The green trajectory represents the standard motion and the blue trajectory represents the imitated motion. The green circle represents the joint, which is on the
neck. The squares are the head, the right foot and the left foot, respectively. The head and feet did not move, so in this case there are no trajectories for them. The two joints are located at the same coordinates, and the similarity between the two trajectories can be seen very clearly in the figure. Red lines are drawn between data points when they are very similar to each other; similar here means that the motion vectors are identical and the angle difference is less than 1 degree.
Fig. 5. The matching result, shown for the two-hands example. The red lines connect the two trajectories; the points joined by each line are compared to compute the score.
6 Conclusion

In this paper, we use the SIFT algorithm and a full search approach to find the trajectory of an object in consecutive images. We represent a trajectory by its 'motion vector' and 'angle' features and then compute the similarity based on a modified Dynamic Time Warping. Finally, we show the data and the image that visualize the difference between trajectories. A trajectory consists of points, and in practice each point is a pixel in the digital image; hence we can display subtle human actions and easily differentiate between motions.
References 1. Hernández, P.C., Czyz, J., Marqués, F., Umeda, T., Marichal, X., Macq, B.: Bayesian Approach for Morphology-Based 2-D Human Motion Capture. IEEE Transactions on Multimedia, 754–765 (June 2007) 2. Zhao, W.-L., Ngo, C.-W.: Scale-Rotation Invariant Pattern Entropy for Keypoint- Based Near-Duplicate Detection. IEEE Transactions on Image Processing, 412–423 (February 2009)
3. Tang, F., Tao, H.: Probabilistic Object Tracking With Dynamic Attributed Relational Feature Graph. In: IEEE Transactions on Circuits and Systems for Video Technology, pp. 1064–1074 (August 2008) 4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 5. Chen, A.-H., Zhu, M., Wang, Y.-h., Xue, C.: Mean Shift Tracking Combining SIFT. In: 9th International Conference on Signal Processing, ICSP 2008, pp. 1532–1535 (2008) 6. Li, Z., Imai, J.-i., Kaneko, M.: Facial Feature Localization Using Statistical Models and SIFT Descriptors. In: The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 961–966 (2009) 7. Lu, Y., Wang, L., Hartley, R., Li, H., Shen, C.: Multi-view Human Motion Capture with An Improved Deformation Skin Model. In: Computing: Techniques and Applications, pp. 420–427 (2008) 8. Huang, C.-H.: Classification and Retrieval on Human Kinematical Movements. Tamkang University (June 2007) 9. Chao, S.-P., Chiu, C.-Y., Chao, J.-H., Yang, S.-N., Lin, T.-K.: Motion Retrieval And Its Application To Motion Synthesis. In: Proceedings. 24th International Conference on Distributed Computing Systems Workshops, pp. 254–259 (2004) 10. Lai, Y.-C., Liao, H.-Y.M., Lin, C.-C., Chen, J.-R., Peter Luo, Y.-F.: A Local Feature-based Human Motion Recognition Framework. In: IEEE International Symposium on Circuits and Systems, May 24-27, pp. 722–725 (2009) 11. Shin, C.-B., Chang, J.-W.: Spatio-temporal Representation and Retrieval Using Moving Object’s Trajectories. In: International Multimedia Conference, Proceedings of the 2000 ACM Workshops on Multimedia, pp. 209–212 (2000) 12. Chen, Y., Wu, Q., He, X.: Using Dynamic Programming to Match Human Behavior Sequences. In: 10th International Conference on Control, Automation, Robotics and Vision, ICARCV 2008, pp. 1498–1503 (2008) 13. Yabe, T., Tanaka, K.: Similarity Retrieval of Human Mot ion as Multi-Stream Time Series Data. In: Proc. International Symposium on Database Applications in Non-Traditional Environments, pp. 279–286 (1999)
Traffic Monitoring and Event Analysis at Intersection Based on Integrated Multi-video and Petri Net Process

Chang-Lung Tsai and Shih-Chao Tai

Department of Computer Science, Chinese Culture University
55, Hwa-Kang Road, Taipei, 1114, Taiwan, R.O.C.
[email protected]
Abstract. Decreasing traffic accidents and events is one of the most significant responsibilities of most governments in the world. Nevertheless, it is hard to precisely predict traffic conditions. To comprehend the root causes of traffic accidents and reconstruct the occurrence of traffic events, a traffic monitoring and event analysis mechanism based on multi-video processing and Petri net analysis techniques is proposed. Traffic information is collected through the deployment of cameras at intersections in heavy-traffic areas. All of the collected information is then used to construct a multi-viewpoint traffic model. Significant features are extracted for traffic analysis through Petri net modeling and motion vector detection. Finally, a decision is output after integrated traffic information and event analysis. Experimental results demonstrate the feasibility and validity of the proposed mechanism, which can be applied as a traffic management system. Keywords: Traffic monitor, video process, intelligent traffic system, Petri net, motion detection.
1 Introduction

As technology emerges, establishing an intelligent life has become popular and strongly advocated. Thus, the physical construction of ITS (intelligent traffic systems) through traffic monitoring and management has become a significant issue in most countries. The goal is to decrease the occurrence of traffic incidents. Therefore, high-resolution video cameras have been mounted on freeways, heavy-traffic highways, tunnels and important traffic intersections of the main cities to monitor and detect the traffic condition. In addition, the related information can be supplied as evidence for traffic enforcement. In the past few years, most traffic conditions have been difficult to predict, but as scientific techniques emerge, traffic flow management has become easier than before. In addition, some vehicles are mounted with high-resolution cameras to provide early warning and avoid accidents. The infrastructure for monitoring and managing traffic on freeways, heavy-traffic highways, main intersections and tunnels has obviously improved and has gradually been completed in most critical cities worldwide. However, regarding early warning systems and traffic restoration systems, the development of related applications and products is still needed.
Life is an invaluable treasure, which is why most drivers drive carefully. However, traffic accidents still occur every day. Therefore, understanding the behavior of drivers is very important. The goal of traffic management is not only to detect traffic events and control traffic flow, but also to detect aggressive and violating drivers in order to decrease traffic accidents. Regarding video surveillance for traffic detection, several researchers have developed different types of video detection systems (VDS) [2]. In [3], Wang et al. proposed a video image vehicle detection system (VIVDS) to detect vehicles of different colors. In [4], Ki et al. proposed a vision-based traffic accident detection system based on an inter-frame difference algorithm for traffic detection at intersections. Although many types of VIVDS have been developed, a single camera mounted on a roadside pole or traffic light is not sufficient to capture the entire intersection and provide complete traffic information. Some evidence shows that even the popular advanced Traficon systems may produce false positive and false negative signals at intersections because of weather and lighting conditions [2]. In order to understand the root causes of traffic accidents and reconstruct the occurrence of traffic events, a traffic monitoring and event analysis mechanism based on multi-viewpoint and 3D video processing techniques is proposed in this paper. The rest of this paper is organized as follows. In Section 2, the rationale of the Petri net and the multi-viewpoint traffic model is introduced. Traffic monitoring and event analysis are addressed in Section 3. Experimental results are demonstrated in Section 4 to verify the feasibility and validity of the proposed mechanism. Finally, concluding remarks are given in Section 5.
2 Rationale of Petri Net and Multi-viewpoint Traffic Model 2.1 Rationale of Petri Net A Petri net [5] is defined by places, transitions, and directed arcs that express their relations, flow, and transition status. An arc represents a variation and only runs from a place to a transition or from a transition to a place; it never connects two places or two transitions. A traditional Petri net graph is shown in Figure 1.
Fig. 1. A traditional graph of Petri Net
The Petri net applied in this paper is a 5-tuple (S, T, W, I0, M), where S is a finite set of places representing where a vehicle is located in each frame of the traffic video, T is a finite set of transitions representing the variation of each vehicle, W is a multiset of arcs defined as W: (S×T) ∪ (T×S) → N, I0 is the initial state of each vehicle, and M is the current state of each vehicle. The 5-tuple is assigned and validated only when a vehicle enters the focused traffic intersection block shown in Figure 2. Petri net analysis of traffic is based on vehicle routes and their interactions. After vehicle shaping, the center of each shape represents the current place of that vehicle, and its driving route is then detected. When a vehicle enters the intersection area (indicated in gray in Figure 2), it is assigned a token, and the token is valid only inside the intersection area. To record the timestamp of each token, token timing is set synchronously with the fps (frames per second) of the video. When the vehicle moves outside the intersection area, the token expires. Under normal conditions, a vehicle passes through the intersection area only when the green light for its driving direction is on; whether it goes straight, turns left, or turns right, it must wait for the corresponding traffic light. Normally, the inner lanes are reserved and marked for vehicles turning left, while the other lanes are used for going straight or turning right.
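The token bookkeeping described above can be made concrete with a short sketch. The following Python fragment is illustrative only and is not the authors' implementation: it assumes that vehicle shape centers per frame are already available from the vehicle-shaping step, and all class and variable names are hypothetical.

```python
# Minimal sketch of the token bookkeeping: a vehicle receives a token when its
# shape center enters the monitored intersection block, the token is
# timestamped in frame units (synchronous with the video fps), and it expires
# when the vehicle leaves the block. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Token:
    vehicle_id: int
    enter_frame: int
    trajectory: list = field(default_factory=list)  # sequence of (x, y) places

class IntersectionPetriNet:
    def __init__(self, area):
        # area = (x_min, y_min, x_max, y_max) of the monitored intersection block
        self.area = area
        self.active = {}    # vehicle_id -> Token (the current marking M)
        self.expired = []   # tokens whose vehicles have left the area

    def inside(self, center):
        x, y = center
        x0, y0, x1, y1 = self.area
        return x0 <= x <= x1 and y0 <= y <= y1

    def update(self, frame_idx, centers):
        """centers: dict vehicle_id -> (x, y) shape center in this frame."""
        for vid, c in centers.items():
            if self.inside(c):
                token = self.active.setdefault(vid, Token(vid, frame_idx))
                token.trajectory.append(c)                   # place-to-place move
            elif vid in self.active:
                self.expired.append(self.active.pop(vid))    # token validity ends
        return self.active

# Example: two vehicles tracked over three frames of a 100x100 intersection block.
net = IntersectionPetriNet(area=(0, 0, 100, 100))
net.update(0, {1: (5, 50), 2: (120, 40)})   # vehicle 2 is still outside
net.update(1, {1: (20, 50), 2: (95, 40)})
net.update(2, {1: (40, 52), 2: (80, 42)})
print(len(net.active), "tokens active")      # -> 2
```

Here a token plays the role of a marked place, and the trajectory list records the sequence of places visited while the token is valid.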
Fig. 2. The focused traffic intersection area used for Petri net analysis
2.2 Rationale of 3D Video and Multi-viewpoint Traffic Model Multi-viewpoint video is an ultimate image medium for recording dynamic visual events in the real world [6]; for example, it can record 3D object shape with high-fidelity surface properties such as color and texture under time-varying sampling. In [7], Matsuyama et al. proposed a 3D video processing model in which the following techniques are developed:
1. Reconstruct dynamic 3D object actions from multi-viewpoint video images in real time.
2. Reconstruct accurate 3D object shapes by deforming a 3D mesh model.
3. Render natural-looking texture on the 3D object surface from the multi-viewpoint video images.
Figure 3 illustrates the deployment of sensors at a road intersection. In the figure, the red spot indicates a virtual target for the cameras to focus on and record with time-varying sampling, and the yellow arrows represent the deployed cameras and their recording directions. How many cameras are optimal for recording the real traffic condition is left implementation-defined, because no two intersections share exactly the same traffic situation; however, at least four to five cameras are preferable for constructing the multi-viewpoint traffic model. In addition, all deployed sensors should share the same specification, such as the same color calibration system and resolution.
Fig. 3. Deployment of sensors at a road intersection
Fig. 4. Frames recorded at one traffic intersection from five different directions, according to the deployment in Figure 3
Figures 4(a) to 4(e) illustrate frames recorded in the same intersection area from different viewpoints according to Figure 3; the same bus, extracted from videos recorded in different directions, is marked by rectangular blocks. To construct a perspective multi-view model from multiple videos, the procedure is as follows:
Step 1: Evaluate the importance of a traffic intersection.
Step 2: Decide how many cameras are sufficient and suitable for collecting the information needed to construct the multi-viewpoint traffic model.
Step 3: Determine the mounting locations.
Step 4: Record the physical traffic information from different directions, such as the front-side viewpoint, the right-hand-side viewpoints at the front and rear positions, and the left-hand-side viewpoints at the front and rear positions.
Step 5: Perform video preprocessing such as noise cleaning.
Step 6: Integrate the videos and construct the 3D traffic video or multi-viewpoint traffic model.
3 Traffic Monitor and Event Analysis 3.1 Preprocessing of Traffic Videos To comprehend the real traffic condition, perform the desired traffic detection correctly, and analyze the driving behavior of each vehicle in the monitored area, the raw traffic video is preprocessed to reduce unwanted interference and noise. The preprocessing diagram is shown in Figure 5, and the complete flowchart and modules of traffic video processing are shown in Figure 6. The detailed procedure is as follows:
Step 1: The input video is decomposed into frames.
Step 2: Noise cleaning is performed to remove unnecessary noise.
Step 3: Background segmentation is performed for further processing.
Step 4: Three tasks are performed: multi-viewpoint traffic model construction and 3D video presentation, motion vector detection, and vehicle detection, in which a Fourier descriptor (FD) is applied for vehicle shaping, representation, recognition, and tracking.
Step 5: A Petri net is used to track the status of each vehicle, and the integrated traffic information is analyzed.
Step 6: Two tasks are performed: violation detection, determined from the information generated in Step 4, and continued monitoring and recording of the traffic. A code sketch of Steps 1 to 3 is given after Figure 5.
Fig. 5. Preprocessing pipeline for traffic video: input videos → frame extraction → noise cleaning → image segmentation
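As a rough illustration of Steps 1 to 3 of the preprocessing pipeline, the following OpenCV sketch extracts frames, smooths them, and segments the moving foreground. The paper does not name the specific denoising or segmentation algorithms, so Gaussian blurring and MOG2 background subtraction are stand-ins here, and the file name is a placeholder.

```python
# Sketch of the preprocessing chain in Figure 5 (frame extraction, noise
# cleaning, background/foreground segmentation) using OpenCV.
import cv2

def preprocess(video_path, max_frames=500):
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)
    results = []
    while len(results) < max_frames:
        ok, frame = cap.read()                      # Step 1: decompose into frames
        if not ok:
            break
        denoised = cv2.GaussianBlur(frame, (5, 5), 0)   # Step 2: noise cleaning
        fg_mask = bg.apply(denoised)                    # Step 3: segmentation
        fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN,
                                   cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
        results.append((denoised, fg_mask))
    cap.release()
    return results

# frames = preprocess("traffic.avi")   # "traffic.avi" is a placeholder path
```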
Fig. 6. Flowchart of traffic video processing: input videos → video preprocessing → multi-view and 3D model construction, motion vector detection, and vehicle shaping → Petri net analysis → violation detection or continued monitoring
Fig. 7. Detected motion vectors of traffic flows
3.2 Combination of Petri Net and Motion Vector Analysis The Petri net information recorded for each vehicle is combined with its motion vectors for integrated analysis. Figure 7(b) shows the motion vectors detected from Figure 7(a), and Figure 7(d) shows the motion vectors detected from Figure 7(c); both are detected inside the intersection area at different times. Vehicles moving in the same direction, turning left or right, or even making U-turns from other directions are also collected and integrated for further analysis. Through this analysis, the interactions of the vehicles inside the same monitored intersection area can be used to determine whether a vehicle should be categorized as aggressive, dangerous, or violating. If the motion of a vehicle is recognized as violating or aggressive, a violation alarm is raised; otherwise, the system keeps monitoring and analyzing the traffic until violating or aggressive driving occurs. The motion vector of each vehicle is measured from the displacement of the macroblock with the most likely color together with the highest similarity of vehicle shape traced by the Fourier descriptor, which avoids false tracking caused by considering color alone; examples are shown in Figure 7(b) and Figure 7(d). Moreover, the state of the traffic light is also considered as a feature when evaluating whether a violation has occurred. A simple block-matching sketch of the motion vector computation is given below.
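The following sketch shows one conventional way to obtain a per-macroblock motion vector by exhaustive block matching. It is a simplified stand-in for the color-plus-Fourier-descriptor matching used in the paper; the function name, block size, and search range are illustrative.

```python
# Illustrative block-matching motion estimation for one macroblock using a
# plain sum-of-absolute-differences (SAD) search.
import numpy as np

def motion_vector(prev, curr, top, left, block=16, search=8):
    """Return (dy, dx) minimizing SAD between a block of `prev` and `curr`."""
    ref = prev[top:top + block, left:left + block].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > curr.shape[0] or x + block > curr.shape[1]:
                continue
            cand = curr[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(ref - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# Toy example: a bright square moves 3 pixels to the right between two frames.
prev = np.zeros((64, 64), np.uint8); prev[20:36, 20:36] = 255
curr = np.zeros((64, 64), np.uint8); curr[20:36, 23:39] = 255
print(motion_vector(prev, curr, 20, 20))   # -> (0, 3)
```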
4 Experimental Results and Discussion 4.1 Experimental Results Figure 8 shows some frames extracted from the traffic videos. Figures 9(a) to 9(h) show sequential frames extracted from the video recorded at the same intersection as in Figure 8 from the right-side direction, while Figure 8(c) is a frame extracted from the front-side video. In Figure 9(a), the traffic light for the straight direction has turned green; however, vehicles turning left from the west to the north direction, such as the bus marked by the red line, are still moving, and several vehicles just behind that bus also keep moving.
Fig. 8. Videos recorded from different viewpoints at a traffic intersection: (a) left side, (b) right side, (c) front side
Fig. 9. Sequential frames extracted from the video recorded at the same intersection as in Figure 8 from the right side
As shown in Figures 9(c) and 9(d), some motorcycles, marked by the green line, are moving straight from north to south because their light has already turned green; however, the left-turning vehicles have not yet finished their turns. Figure 10 shows frames extracted from the video recorded at the same intersection as in Figures 8 and 9 from the front-side direction. 4.2 Discussion By inspecting Figures 9(c) to 9(h), one can infer that some vehicles may have run a red light while turning left from the west to the north direction; such behavior is quite dangerous and aggressive toward vehicles on the straight route. With only a single camera, it is difficult to judge and fully comprehend the real traffic situation. In this work, however, three cameras recorded the traffic from three different directions, so the exact traffic situation can be completely reconstructed and the violating vehicles can be subject to enforcement.
Fig. 10. Frames of the intersection in Figures 8 and 9 recorded from the front side
To help avoid traffic incidents, motion vector processing is adopted in this paper, so the exact route of each vehicle can easily be extracted from the three recorded videos. Constructing a 3D traffic video is much more difficult than constructing 3D images of a static object: because the traffic information is recorded from a time-varying scene, the techniques used for image compensation and interpolation significantly affect the final performance. To construct a 3D traffic model that can be presented from different viewpoints, the following preparations should be made:
1. Many videos should be taped from different viewpoints focusing on the same object.
2. As much information as possible should be collected to avoid insufficient information; the recording directions must at least include the forward, right-side, and left-side viewpoints.
3. The 3D model of an object may have to be constructed from fragile segmented images.
4. Compensation techniques must be applied to remedy insufficient information during 3D model construction.
Since the goal is to construct the multi-viewpoint traffic model and 3D video automatically, the preprocessing of image normalization and compensation for the extracted frames is very important, because the preprocessing results significantly affect the subsequent construction of the 3D video. Good interpolation is another key factor for optimally presenting the real traffic environment. Moreover, deciding how to segment images automatically can be a difficult task, so the quality of preprocessing must be strictly controlled.
5 Conclusion Decreasing traffic events and preventing traffic accidents are among the important responsibilities of governments in all nations. Since most heavy-traffic bottlenecks occur at main intersections, highways, and freeways, many traffic monitoring or early warning systems have been installed in these areas; nevertheless, many traffic events still occur every day, and no agency can yet provide a system that dramatically reduces or eliminates them. In order to find the root causes of traffic accidents, reconstruct the complete course of traffic events, and understand the key factors of traffic management, this paper introduced a traffic monitoring and event analysis mechanism based on multiple videos and 3D video processing. The proposed scheme consists of three modules: information collection, multi-viewpoint traffic model and 3D video processing, and an expert system. Traffic information is recorded by video sensors deployed around each intersection of heavy-traffic areas along five directions (front, right-front, left-front, rear-left, and rear-right), so each vehicle is monitored by several detectors. The collected information is transferred to the 3D processing module to construct the 3D model according to options such as vehicles, directions, and time; the significant features for traffic analysis are also extracted in this module, and the Petri net rationale is adopted for vehicle tracking and interaction analysis. Finally, the expert module integrates the traffic information and event analysis and offers decisions such as early warning or enforcement. Experimental results demonstrate the feasibility and validity of the proposed mechanism. However, constructing a 3D traffic video is much more difficult than constructing 3D images of static objects because the traffic information is recorded from a time-varying scene; future work will therefore focus on improving the image normalization, compensation, and interpolation processes.
References 1. Starck, J., Maki, A., Nobuhara, S., Hilton, A., Matsuyama, T.: The Multiple-Camera 3-D Production Studio. IEEE Transactions on Circuits and Systems for Video Technology 19, 856–869 (2009) 2. Misener, J., et al.: California Intersection Decision Support: A Systems Approach to Achieve Nationally Interoperable Solutions II. California PATH Research Report UCB-ITSPRR-2007-01 (2007)
3. Wang, Y., Zou, Y., Sri, H., Zhao, H.: Video Image Vehicle Detection System for Signaled Traffic Intersection. In: 9th International Conference on Hybrid Intelligent Systems, pp. 222–227 (2009) 4. Ki, Y.K., Lee, D.Y.: A traffic accident recording and reporting model at intersections. IEEE Transactions on Intelligent Transportation Systems 8, 188–194 (2007) 5. Petri Net, http://en.wikipedia.org/wiki/Petri_net (accessed on August 1, 2010) 6. Moezzi, S., Tai, L., Gerard, P.: Virtual view generation for 3d digital video. IEEE Multimedia, 18–26 (1997) 7. Matsuyama, T., Wu, X., Takai, T., Nobuhara, S.: Real-Time Generation and High Fidelity Visualization of 3D Video. In: Proceedings of MIRAG 2003, pp. 1–10 (2003)
Baseball Event Semantic Exploring System Using HMM Wei-Chin Tsai1, Hua-Tsung Chen1, Hui-Zhen Gu1, Suh-Yin Lee1, and Jen-Yu Yu2 1
Department of Computer Science, National Chiao-Tung University, Hsinchu, Taiwan {wagin,huatsung,hcku,sylee}@cs.nctu.edu.tw 2 ICL/Industrial Technology Research Institute, Hsinchu, Taiwan
[email protected]
Abstract. Despite many research efforts in baseball video processing in recent years, little work has been done on detailed semantic baseball event detection. This paper presents an effective and efficient baseball event classification system for broadcast baseball videos. Utilizing the layout of the baseball field and the regularity of shot transitions, the system recognizes highlights in video clips and identifies which semantic baseball event is taking place. First, a video is segmented into highlights that start with a PC (pitcher and catcher) shot and end with certain specific shots. Before each baseball event classifier is designed, several novel schemes are applied, including specific features such as soil percentage and object extraction such as first-base detection. The extracted mid-level cues are used to develop baseball event classifiers based on HMMs (hidden Markov models). Thanks to the detection of these specific features, more hitting events can be detected, and simulation results show that the classification of twelve significant baseball events is very promising. Keywords: Training step, Classification step, Object detection, Play region classification, Semantic event, Hitting baseball event, Hidden Markov Model.
1 Introduction In recent years the amount of multimedia information has grown rapidly, which has driven the development of efficient sports video analysis. Sports video analysis has found applications in almost all sports, among which baseball is very popular; however, a whole game is very long while the highlights form only a small portion of it. Motivated by this, our focus is the development of a semantic event exploring system for baseball games. Because camera positions are fixed during a game and the way the game is presented is similar across TV channels, each category of semantic baseball event usually has a similar shot transition. A baseball highlight starts with a PC shot, so PC shot detection [1] plays an important role in baseball semantic event detection. In addition, a semantic baseball event is composed of a sequence of play regions, so play region classification [2] is also an essential task. Many methods have been applied to semantic baseball event detection, such as HMMs [3][4][5], temporal feature detection [6], BBN (Bayesian Belief
Network), and scoreboard information [7][8]. Chang et al. [3] assume that most highlights in baseball games consist of certain shot types with similar transitions in time. Mochizuki et al. [4] provide a baseball indexing method based on patternizing baseball scenes with a set of rectangles carrying image features and a motion vector. Fleischman et al. [6] record the start and end times of objects and features such as field type, speech, and camera motion to find frequent temporal patterns for highlight classification. Hung et al. [8] combine scoreboard information with a few shot types for event detection. Even though previous work reports good results on highlight classification, it does not address the variety of hitting event types such as left foul balls and ground outs. In this paper we aim at exploring hitting baseball events. Twelve semantic baseball event types are defined and detected in the proposed system: (1) single, (2) double, (3) pop up, (4) fly out, (5) ground out, (6) two-base hit, (7) right foul ball, (8) left foul ball, (9) foul out, (10) double play, (11) home run, and (12) home base out. With the proposed framework, event classification in baseball videos becomes more powerful and practical, since comprehensive, detailed, and explicit events of the game can be presented to users.
2 Proposed Framework In this paper, a novel framework is proposed for classifying the batting content of baseball videos into events. The process is divided into two steps, a training step and a classification step, as shown in Figure 1.
Fig. 1. Overview of the training step and the classification step in the proposed system: indexed baseball clips of each event type (training) or an unknown clip (classification) pass through color conversion, object detection, and rule-table play region classification; the resulting symbol sequences are used either to train the twelve event HMMs or to select the most likely event type
As illustrated in Figure 1, in the training step each type of indexed baseball event listed in Table 2 is input as training data for its baseball event classifier. In the classification step, the observation symbol sequence of an unknown clip is input, and each event classifier evaluates how well its model predicts the given observation sequence. In both steps, using baseball domain knowledge, the spatial patterns of field lines and field objects together with color features are recognized to classify play region types such as infield left, outfield right, and audience. Finally, from each field shot a symbol sequence describing the transition of play regions is generated, either as HMM training data or as input for event classification. Details of the proposed approaches are described in the following sections: Section 3 introduces the color conversion, Section 4 describes object and feature detection, Section 5 describes play region classification, and Sections 6 and 7 describe HMM training and the classification of baseball events.
3 Color Conversion In image processing and analysis, color is an important feature for the proposed object detection and for extracting features such as the percentages of grass and soil. However, the colors in each baseball broadcast may vary because of different viewing angles and lighting conditions. To obtain the color distributions of grass and soil in video frames, several baseball clips containing grass and soil from different video sources are used to produce color histograms in the RGB and HSI color spaces; Figure 2 takes two clips from different sources as examples. Owing to its discriminative power, the hue value in the HSI color space is selected as the color feature, and the dominant color ranges of grass (green) and soil (brown), [Ha1, Hb1] and [Ha2, Hb2], are set accordingly.
Fig. 2. The color space of RGB and HSI of two baseball clips
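The hue-range test described above can be sketched as follows. The hue thresholds below are placeholders; in the paper the ranges [Ha1, Hb1] and [Ha2, Hb2] are learned from training clips, and OpenCV's hue scale (0 to 179) and an HSV conversion are assumed instead of HSI.

```python
# Sketch of the dominant-color step: convert a frame to HSV and measure the
# grass (green) and soil (brown) percentages from hue ranges.
import cv2
import numpy as np

GRASS_HUE = (35, 85)    # placeholder green range
SOIL_HUE = (5, 25)      # placeholder brown range

def field_color_ratios(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s = hsv[..., 0], hsv[..., 1]
    colored = s > 40                      # ignore near-gray pixels (audience, lines)
    grass = colored & (h >= GRASS_HUE[0]) & (h <= GRASS_HUE[1])
    soil = colored & (h >= SOIL_HUE[0]) & (h <= SOIL_HUE[1])
    total = frame_bgr.shape[0] * frame_bgr.shape[1]
    return grass.sum() / total, soil.sum() / total

# Example with a synthetic half-green, half-brown frame:
frame = np.zeros((100, 100, 3), np.uint8)
frame[:, :50] = (40, 160, 60)    # greenish (BGR)
frame[:, 50:] = (40, 90, 150)    # brownish (BGR)
print(field_color_ratios(frame))  # roughly (0.5, 0.5)
```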
4 Object Detection The baseball field is characterized by a well-defined layout as described in Figure 3. Furthermore, important lines and the bases are in white color, and auditorium (AT) is of high texture and no dominant color as shown in Figure 3(b).
Fig. 3. (a) Full view of real baseball field (b) Illustration of objects and features in baseball field
Each object is elaborated as follows. (1) Back auditorium (AT): the top area with high texture and no dominant color is considered the auditorium; it is marked as the black area above the white horizontal line in Figure 4(a). (2) Left auditorium (L-AT) and right auditorium (R-AT): the left and right areas with high texture and no dominant color are considered the left and right auditorium, shown as the black areas beside the white vertical lines in Figures 4(b) and 4(c).
Fig. 4. Illustration of (a) back auditorium (b) left auditorium (c) right auditorium
(3) Left line (LL) and right line (RL): a RANSAC algorithm, which estimates the line parameters of line segments [9], is applied to the line pixels to find the left and right lines. (4) Pitcher's mound (PM): an elliptical soil region surrounded by a grass region is recognized as the pitcher's mound, marked with a red rectangle in Figure 5. (5) First base (1B) and third base (3B): a square region located on the right line within the soil region, if detected, is identified as first base, as shown in Figure 5; similarly, a square region located on the left line within the soil region is identified as third base.
(6) Second base (2B): In a soil region, a white square region on neither field line would be identified as second base as shown in Figure 5. (7) Home base (HB): Home base is located on the region of the intersection between left line and right line as shown in Figure 5.
Fig. 5. Illustration of the objects 1B, 2B, HB, LL, RL, and PM
5 Play Region Classification Sixteen play region types are defined and classified based on the positions or percentages of the objects and features described in Sections 3 and 4. The sixteen typical region types are IL (infield left), IC (infield center), IR (infield right), B1 (first base), B2 (second base), B3 (third base), OL (outfield left), OC (outfield center), OR (outfield right), PS (play in soil), PG (play in grass), AD (audience), RAD (right audience), LAD (left audience), CU (close-up), and TB (touch base), as shown in Figure 6.
Fig. 6. Sixteen typical play region types: IL (infield left), IC (infield center), IR (infield right), PS (play in soil), B3 (third base), B2 (second base), B1 (first base), PG (play in grass), OL (outfield left), OC (outfield center), LAD (left audience), RAD (right audience), OR (outfield right), AD (audience), TB (touch base), CU (close-up)
The rules of play region type classification are listed in Table 1, modified from [2]. The symbols in the first column are the sixteen play region types shown in Figure 6. Wf is the frame width, the function P(Area) returns the percentage of the area Area in a frame, X(Obj) returns the x-coordinate of the center of the field object Obj, and W(Obj) returns true if the object Obj exists. Each play region is classified into one of the sixteen types using the rule table. For example, a field frame is classified as type B1 if it meets the following conditions: the percentage of AT is no more than 10%, the object PM does not exist, the objects RL and 1B exist, and the percentage of soil is more than 30%. After play region classification, each frame in a video clip outputs a symbol representing its play region type, so a video clip is represented as a symbol sequence; for example, a ground-out clip would output the symbol sequence IL→IC→PS→IR→B1. A code sketch of this rule-based classification follows Table 1. Table 1. Rules of play region type classification
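The B1 rule quoted above translates directly into code. The sketch below is illustrative only: the feature dictionary and its keys are assumed to be produced by the object-detection module, and only one of the sixteen rules is written out.

```python
# Sketch of one row of the rule table: the B1 (first-base) rule. The remaining
# fifteen rules would be added in the same style.
def classify_play_region(f):
    """f: dict with keys such as 'AT_pct', 'soil_pct', 'PM', 'RL', 'LL', '1B'."""
    if (f["AT_pct"] <= 0.10 and not f["PM"]
            and f["RL"] and f["1B"] and f["soil_pct"] > 0.30):
        return "B1"
    # ... rules for the other fifteen region types go here ...
    return "CU"   # fall back to close-up when no field rule fires

frame_features = {"AT_pct": 0.05, "soil_pct": 0.45,
                  "PM": False, "RL": True, "LL": False, "1B": True}
print(classify_play_region(frame_features))   # -> "B1"
```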
6 HMM Training for Baseball Events One HMM is created for each baseball event to recognize time-sequential observed symbols. In the proposed method, the twelve baseball events listed in Table 2 are defined, so there are twelve HMMs. Given a set of training data for each event type, we estimate the model parameters λ = (A, B, π) that best describe each event, where the matrix A contains the transition probabilities between states, the matrix B contains the output symbol probabilities of each state, and π contains the initial probabilities of each state. First, the segmental K-means algorithm [10] is used to create an initial HMM parameter set λ, and then the Baum-Welch algorithm [10] is applied to re-estimate the HMM parameters of each baseball event.
Table 2. List of twelve baseball events
Single          Right foul ball
Double          Left foul ball
Pop up          Foul out
Fly out         Double play
Ground out      Home run
Two-base out    Home base out
In the proposed method, two features (grass and soil) and the ten objects shown in Figure 3(b) are used as observations, represented as a 1×12 vector recording whether each item appears. To apply an HMM to time-sequential video, the extracted features, represented as a vector sequence, are transformed into a symbol sequence by the rule table in Table 1 for later event recognition. Conventional implementation issues of HMMs include (1) the number of states, (2) initialization, and (3) the distribution of observations at each state. The number of states is determined empirically and differs for each baseball event; initialization can be done randomly or with the segmental K-means algorithm [10]; and the observation distribution is chosen by trying several models, among which we choose the Gaussian distribution. The essential elements are as follows.
State S: the number of states is selected empirically depending on the baseball event, and each hidden state represents a shot type.
Observation O: the symbol mapped from the rule table.
Observation distribution matrix B: obtained with the K-means algorithm, using a Gaussian distribution at each state [10].
Transition probability matrix A: the state transition probabilities, learned by the Baum-Welch algorithm [10].
Initial state probability matrix π: the probability of the first state, initialized by the segmental K-means algorithm [10] after the number of states is determined.
After determining the number of states and setting the initial tuple λ, the Baum-Welch algorithm [10] is used to re-estimate the HMM parameters so as to maximize the probability of the observation sequence given the model.
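For readers who want to reproduce the training loop, the following sketch uses the third-party hmmlearn package (not used by the authors) to fit one Gaussian-emission HMM per event with EM (Baum-Welch); the number of states and the synthetic data are placeholders.

```python
# One way to realize the per-event training: each event gets its own
# Gaussian-emission HMM over the 1x12 observation vectors, initialized by
# k-means inside hmmlearn and re-estimated with Baum-Welch (EM).
import numpy as np
from hmmlearn import hmm

def train_event_hmm(clips, n_states):
    """clips: list of (T_i x 12) arrays, one per training clip of this event."""
    X = np.vstack(clips)
    lengths = [len(c) for c in clips]
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=30)
    model.fit(X, lengths)          # EM re-estimation of (A, B, pi)
    return model

# Toy usage with random data standing in for real observation sequences:
rng = np.random.default_rng(0)
fake_clips = [rng.random((40, 12)) for _ in range(5)]
ground_out_hmm = train_event_hmm(fake_clips, n_states=5)
```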
7 Baseball Event Classification The idea behind using HMMs is to construct a model for each baseball event to be recognized; the HMMs give a state-based representation of each highlight. After training each event model, we calculate the probability P(O | λi) of a given unknown symbol sequence O for each event model λi (index i runs over the baseball event HMMs), and recognize the event as the one whose HMM gives the highest probability.
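The maximum-likelihood decision can then be sketched as follows, assuming `event_models` holds the twelve HMMs fitted as in the previous sketch; the function and variable names are illustrative.

```python
# Classification: score an unknown clip against every event model and keep
# the most likely one.
def classify_clip(observations, event_models):
    """observations: (T x 12) array; event_models: dict name -> fitted HMM."""
    scores = {name: m.score(observations) for name, m in event_models.items()}
    return max(scores, key=scores.get), scores

# event, scores = classify_clip(unknown_clip, event_models)
```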
8 Experiments To test the performance of baseball event classification, we implemented a system capable of recognizing the twelve types of baseball events. All video sources are Major League Baseball (MLB) broadcasts: 120 baseball clips from three MLB sources are used as training data and 122 clips from two other MLB sources as test data. Each video source is digitized at 352×240 pixel resolution. The experimental results are shown in Table 3. Table 3. Recognition of baseball events
Both precision and recall are about 80%, except for the precision of double and double play and the recall of double and two-base out. The low recall of double and two-base out may result from missed detection of the field object 2B. The low precision of double may be because the shot transitions of double and home run are similar when the batter hits the ball to the outfield wall, and the low precision of double play may be because the transitions of double play and ground out are similar when the batter hits the ball near second base, as shown in Figure 7. Figure 8 shows confusion between right foul ball and home run caused by similar shot transitions. Figure 9 shows ambiguities inherent in some events, such as ground out versus left foul ball, which are hard to resolve even for human observers. Figure 10 shows confusion between single and ground out that occurs because the first baseman does not catch the ball and the ball object is not detected.
Fig. 7. Comparison between (a) ground out and (b) double play
Fig. 8. Comparison between (a) right foul ball and (b) home run
Fig. 9. Ambiguity of (a) left foul ball (b) replay of left foul ball
Fig. 10. Ambiguity of ground out and single
The missed detections of highlights can be attributed to four reasons: (1) similar shot transitions, (2) missed object detections, (3) an insufficient number of detected objects, and (4) ambiguity inherent in the events. These could be improved by detecting the ball and the players, or by adding information such as the scoreboard. Overall, the system still achieves good performance.
9 Conclusions In this paper, a novel framework is proposed for baseball event classification. In the training step, the spatial patterns of field objects and lines in each field frame of each event-type video clip are recognized based on the distributions of dominant colors and white pixels. With baseball domain knowledge, each field frame is classified into one of sixteen typical play region types using rules on these spatial patterns. After play region classification, the output symbol sequences of each event type are used as training data for the corresponding event HMM. In the classification step, the observation symbol sequence generated by play region classification of a video clip is input to every event HMM, each classifier evaluates how well its model predicts the sequence, and the clip is recognized as the event whose HMM gives the highest probability.
References 1. Kumano, M., Ariki, Y., Tsukada, K., Hamaguchi, S., Kiyose, H.: Automatic Extraction of PC Scenes Based on Feature Mining for a Real Time Delivery System of Baseball Highlight Scenes. In: IEEE international Conference on Multimedia and Expo, vol. 1, pp. 277– 280 (2004) 2. Chen, H.T., Hsiao, M.H., Chen, H.S., Tsai, W.J., Lee, S.Y.: A baseball exploration system using spatial pattern recognition. In: Proc. IEEE International Symposium on Circuits and Systems, pp. 3522–3525 (May 2008) 3. Chang, P., Han, M., Gong, Y.: Extract Highlight From Baseball Game Video With Hidden Markov Models. In: International Conference on Image Processing, vol. 1, pp. 609–612 (2002) 4. Mochizuki, T., Tadenuma, M., Yagi, N.: Baseball Video Indexing Using Patternization Of scenes and Hidden Markov Model. In: IEEE International Conference on Image Processing, vol. 3, pp. III -1212-15 (2005) 5. Bach, H., Shinoda, K., Furui, S.: Robust Highlight Extraction Using Multi-stream Hidden Markov Model For Baseball Video. In: IEEE International Conference on Image Processing, vol. 3, pp. III- 173-6 (2005)
6. Fleischman, M., Roy, B., Roy, D.: Temporal Feature Induction for Baseball Highlight Classification. In: Proc. ACM Multimedia Conference, pp. 333–336 (2007) 7. Hung, H., Hsieh, C.H., Kuo, C.M.: Rule-based Event Detection of Broadcast Baseball Videos Using Mid-level Cues. In: Proceedings of IEEE International Conference on Innovative Computing Information and Control, pp. 240–244 (2007) 8. Hung, H., Hsieh, C.H.: Event Detection of Broadcast Baseball Videos. IEEE Trans. on Circuits and Systems for Video Technology 18(12), 1713–1726 (2008) 9. Farin, D., Han, J., Peter, H.N.: Fast Camera Calibration for the Analysis of Sport Sequences. In: IEEE International Conference on Multimedia & Expo, pp. 482–485 (2005) 10. Rabiner, R., Juang, B.H.: An Introduction to Hidden Markov Models. IEEE Signal Processing Magazine 3(1), 4–16 (1986)
Robust Face Recognition under Different Facial Expressions, Illumination Variations and Partial Occlusions Shih-Ming Huang and Jar-Ferr Yang Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan
[email protected],
[email protected]
Abstract. In this paper, a robust face recognition system is presented, which can perform precise face recognition under facial expression variations, illumination changes, and partial occlusions. The embedded hidden Markov model based face classifier is applied for identity recognition in which the proposed observation extraction is presented by performing local binary patterns prior to performing delta operation on the discrete cosine transform coefficients of consecutive blocks. Experimental results show that the proposed face recognition system achieves high recognition accuracy of 99%, 96.6% and 98% under neutral face, expression variations, and illumination changes respectively. Particularly, under partial occlusions, the system achieves recognition rate of 81.6% and 86.6% for wearing sunglasses and scarf respectively. Keywords: Robust face recognition, Delta discrete cosine transform coefficient, Local binary pattern, Embedded hidden Markov model.
1 Introduction Face recognition [1] aims to distinguish a specific identity among unknown subjects characterized by face images. In realistic situations such as video surveillance, face recognition encounters great challenges, including different facial expressions, illumination variations, and even partial occlusions, which can make recognition systems unreliable [2-3]. Under partial occlusion in particular, parts of the face may be covered or modified, for instance by a pair of sunglasses or a scarf; this can severely degrade recognition performance because facial features such as the eyes or mouth disappear. A robust face recognition system that can identify faces correctly not only under different facial expressions and illumination variations but also under partial occlusions is therefore necessary for practical applications and reliable performance. In the literature, numerous approaches have been proposed to achieve face recognition under certain conditions. They can generally be categorized into holistic and local feature approaches: in holistic approaches one observation signal describes the entire face, whereas in local feature approaches a face is represented by a set of
observation vectors, each describing a part of the face. Conventionally, the appearance-based approaches [4-6], including principal component analysis (PCA), linear discriminant analysis (LDA), and two-dimensional PCA (2D-PCA), are popular for face recognition. Many other methods have also been introduced, including hidden Markov model (HMM) [7] based approaches [8-13]. The HMM is a stochastic modeling technique that has been widely and effectively used in speech recognition [7] and handwritten word recognition [14]. Samaria and Young [8] first introduced a luminance-based 1D-HMM to solve face recognition problems. Nefian and Hayes III [9] then introduced 1D-HMMs with 2D-DCT coefficients to reduce the computational cost, and further proposed the embedded HMM (EHMM) and embedded Bayesian network (EBN) to enhance face recognition performance [10-12]. Furthermore, a low-complexity 2D HMM (LC-2D HMM) [13] was presented to process face images in a 2D manner without forming the observation signals into 1D sequences. We therefore aim at developing an HMM-based face recognition system that can achieve identity recognition under facial expression variations, illumination changes, and even partial occlusions. To resolve partial occlusion problems, several approaches [15-16] have recently shown excellent results on the AR face database [17]: a local probabilistic approach proposed by Martinez [15] analyzes separated local regions in isolation, and, inspired by compressed sensing, Wright et al. proposed sparse representation-based classification (SRC), which exploits the sparse nature of occluded images for face recognition [16].
Fig. 1. The proposed robust face recognition system based on the embedded HMM classifier: face image → local binary pattern → sliding block → 2D-DCT → delta → embedded HMM → identity recognition
In this study, we present a robust face recognition system based on an embedded HMM with the proposed robust observation vectors, as depicted in Fig. 1. First, the input face image is processed by the local binary pattern (LBP) [18] operation. Next, the LBP image is transformed block by block with the 2D-DCT, whose coefficients are the features most commonly used in HMM-based face recognition [9, 19]. We further suggest performing a delta operation on consecutive blocks to construct delta DCT observation vectors [20]. Finally, the extracted observation vectors are used for model training and testing in the EHMM classifier. The rest of the paper is organized as follows: Section 2 introduces the observation extraction method, Section 3 introduces the embedded HMM, Section 4 shows the experimental results and discussion, and Section 5 draws conclusions.
2 Observation Vector Extraction 2.1 Local Binary Patterns (LBP) The illumination variation is one of the important impacts to realistic face recognition applications. In the literature, to eliminate the lighting impact, Heusch et al. [21] suggested the use of local binary pattern (LBP) [18, 22] to be an image preprocessing for face authentication. The LBP achieves better performance than the histogram equalization (HE) by Wang, et al. [23] and the illumination normalization approach (GROSS) proposed by Gross and Brajovic [24]. Accordingly, in our study, we also adopt the LBP as a preprocessing step prior to generating observation vectors.
Fig. 2. Illustration of LBP operation flow
The LBP is a local texture descriptor and is computationally simple. Moreover, it is invariant to monotonic grayscale transformations, so the LBP representation is less sensitive to illumination variations. As illustrated in Fig. 2, a typical LBP operation takes the pixels in a 3×3 block, in which the center pixel is used as the reference: a binary 1 is generated if a neighbor is not smaller than the reference and a binary 0 otherwise. The eight neighbors of the reference can then be represented as an 8-bit unsigned integer; in other words, an element-wise multiplication with a weighting block is performed. The LBP value of the pixel (x, y) is calculated as

LBP(x, y) = \sum_{n=0}^{7} s(i_n - i_c) \cdot 2^n,   (1)

where i_c is the grey value of the center pixel of the 3×3 block, i_n is the grey value of the n-th surrounding pixel, and the function s(x) is defined as

s(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0. \end{cases}   (2)
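Equations (1) and (2) can be implemented directly; the sketch below uses plain NumPy, fixes an arbitrary clockwise neighbour order (the paper's Figure 2 defines the actual weighting layout), and leaves image borders at zero.

```python
import numpy as np

def lbp_image(gray):
    """Compute the LBP code of Eq. (1) for every interior pixel."""
    gray = gray.astype(np.int32)
    out = np.zeros(gray.shape, dtype=np.int32)
    # neighbour offsets in an assumed clockwise order starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for n, (dy, dx) in enumerate(offsets):
        # neighbour[y, x] = gray[y + dy, x + dx]
        neighbour = np.roll(np.roll(gray, -dy, axis=0), -dx, axis=1)
        out += (neighbour >= gray).astype(np.int32) * (2 ** n)  # s(i_n - i_c) * 2^n
    out[0, :] = out[-1, :] = 0      # borders have no full 3x3 neighbourhood
    out[:, 0] = out[:, -1] = 0
    return out.astype(np.uint8)

example = np.array([[10, 20, 30],
                    [40, 50, 60],
                    [70, 80, 90]], dtype=np.uint8)
print(lbp_image(example)[1, 1])    # -> 120 with this neighbour ordering
```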
2.2 Sliding Block The observation vectors o_{t_0,t_1} are extracted in a sliding-block manner, as depicted in Fig. 3, where o_{t_0,t_1} is the observation vector at row t_0 and column t_1 extracted from a face image.
Fig. 3. Observation vectors extraction by sliding block manners
2.3 2D Discrete Cosine Transform (2D-DCT) In our system, the observation vectors are extracted from the 2D-DCT coefficients of each sliding block because of the energy compaction property and computational efficiency of the DCT and its closeness to the Karhunen-Loeve transform (KLT) [19]. The 2D N×N DCT (here N = 8) is defined as

C(k_1, k_2) = c(k_1) c(k_2) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} p(n_1, n_2) \cos\frac{(2n_1+1)k_1\pi}{2N} \cos\frac{(2n_2+1)k_2\pi}{2N},   (3)

where n_1, n_2, k_1, k_2 = 0, 1, ..., N-1,

c(n) = \begin{cases} \sqrt{1/N}, & n = 0 \\ \sqrt{2/N}, & n = 1, 2, ..., N-1, \end{cases}

and p(n_1, n_2) is the pixel value in the LBP image. The observation vector o_{t_0,t_1} is extracted from each N×N (8×8) sliding block with shift step N_q×N_q (here N_q = 2) in a left-to-right, top-to-bottom manner, as shown in Fig. 3. Each observation vector consists of the (N_r)^2 low-frequency coefficients within an N_r×N_r block (here N_r = 3) of the 2D-DCT coefficient matrix, collected in a zig-zag scan pattern as

o_{t_0,t_1} = [c_0, c_1, c_2, c_3, ..., c_8],   (4)
where c_n is the n-th coefficient of the 2D-DCT coefficient matrix, ordered in a zig-zag scan within the N_r×N_r (3×3) block. With this DCT operation, a 9-dimensional observation vector is constructed for each block.
2.4 Delta DCT Coefficients In [20], it has been shown that principal component analysis (PCA), 2D-DCT, and Gabor-based features are sensitive to changes in the illumination direction. A novel observation vector, Delta DCT (called DCTmod2 in [20]), was therefore proposed and is more robust than PCA, PCA with histogram equalization preprocessing, 2D-DCT, and 2D Gabor wavelets for face-based identity verification. In Delta DCT observation extraction [20], a face image is analyzed in a sliding-block manner; each block is N×N (i.e., 8×8) and overlaps its neighboring blocks by N_q (i.e., 2) pixels. Each block is transformed into the 2D-DCT domain [19], and an observation vector for each block is then constructed as
o_{t_0,t_1} = [\Delta^h c_0, \Delta^v c_0, \Delta^h c_1, \Delta^v c_1, \Delta^h c_2, \Delta^v c_2, c_3, ..., c_8],   (5)

where \Delta^h c_n and \Delta^v c_n are the horizontal and vertical delta coefficients, respectively, computed from the DCT coefficients of neighboring blocks. Compared with the DCT feature extraction of Subsection 2.3, the first three DCT coefficients (c_0, c_1, c_2) are replaced by their horizontal and vertical deltas in order to reduce the effect of illumination variations. With the Delta DCT operation, a 12-dimensional observation vector is constructed for each block, as shown in (5).
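The observation-vector construction of Sections 2.3 and 2.4 can be sketched as follows. Equation (3) is implemented directly for an 8×8 block, the nine zig-zag coefficients of Eq. (4) are kept, and the deltas of Eq. (5) are formed from two neighbouring blocks; the zig-zag order and the choice of left/top neighbours are illustrative assumptions rather than the exact scheme of [20].

```python
import numpy as np

N = 8
ZIGZAG_3x3 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (1, 2), (2, 1), (2, 2)]

def dct2(block):
    """Eq. (3): 2D N x N DCT of one block."""
    n = np.arange(N)
    c = np.where(n == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    basis = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / (2 * N))  # [k, n]
    return (c[:, None] * c[None, :]) * (basis @ block @ basis.T)

def dct_feature(block):
    C = dct2(block.astype(np.float64))
    return np.array([C[i, j] for i, j in ZIGZAG_3x3])        # Eq. (4), 9-dim

def delta_dct_feature(block, left_block, top_block):
    f, fl, ft = dct_feature(block), dct_feature(left_block), dct_feature(top_block)
    deltas = []
    for i in range(3):                        # replace c0, c1, c2 by their deltas
        deltas += [f[i] - fl[i], f[i] - ft[i]]    # assumed horizontal, vertical
    return np.array(deltas + list(f[3:]))         # Eq. (5), 12-dim

rng = np.random.default_rng(1)
blocks = rng.integers(0, 256, size=(3, N, N))
print(delta_dct_feature(blocks[0], blocks[1], blocks[2]).shape)   # (12,)
```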
3 Embedded HMM Based Face Recognition In this study, an embedded HMM based classifier is adopted for the face recognition task. The observation vectors are obtained with the robust observation vector extraction method described above. In an embedded HMM, the emission probabilities of the main HMM are modeled via secondary HMMs, whose states are in turn modeled by mixtures of Gaussians. A formal definition of an embedded HMM, as shown in Fig. 4, is as follows:
─ A set of N_0 super states, S_0 = {s_{0,1}, s_{0,2}, ..., s_{0,N_0}}.
─ The initial state probability π_0 = {π_{0,i}}, where π_{0,i} is the probability of being in super state i.
─ The state transition probability A_0 = {a_{0,ij}} between the super states, where a_{0,ij} is the transition probability from super state i to super state j.
─ The emission probability of super state k, estimated through a standard left-to-right HMM defined by: the number of states N_1^k of the HMM in super state k and its set of states S_1^k = {s_{1,i}^k}; the initial state probability π_1^k = {π_{1,i}^k}; the state transition probability A_1^k = {a_{1,ij}^k}; and the state output probability B^k = {b_i^k(o_{t_0,t_1})} of the observation vector. Each state is characterized by a mixture of Gaussian distributions of the form
b_i^k(o_{t_0,t_1}) = \sum_{m=1}^{M_i^k} c_{im}^k \, G(o_{t_0,t_1}, \mu_{im}^k, \Sigma_{im}^k),   (6)

where M_i^k is the number of Gaussian mixtures in state i of the HMM in super state k, c_{im}^k is the mixture weight of the m-th mixture in state i of the HMM in super state k, G is a multivariate Gaussian density function with mean \mu_{im}^k and diagonal covariance matrix \Sigma_{im}^k, and o_{t_0,t_1} is the observation vector at row t_0 and column t_1 extracted from a face image in a sliding-block manner, as illustrated in Fig. 3.
─ The parameter set Λ^k = {π_1^k, A_1^k, B^k} defines the HMM in super state k.
Using a compact representation, an embedded HMM can be defined as a triplet λ = (π_0, A_0, Λ), where Λ = {Λ^1, ..., Λ^k, ..., Λ^{N_0}}. In the training phase, one EHMM is trained for each individual with the 2D segmental K-means algorithm [10]. In the recognition phase, the likelihood of the observation vectors is computed with a doubly embedded Viterbi algorithm [10], and the identity is estimated according to the highest likelihood score given by

r^* = \arg\max_{1 \le r \le R} P(O | λ_r),   (7)

where O denotes the observation vectors of an input face image, λ_r denotes the embedded HMM of the r-th individual (assuming there are R individuals), and r^* denotes the recognized identity under the maximum likelihood criterion. The embedded HMM is appropriate for a face image because a face consists of forehead, eyes, nose, mouth, and chin regions in a natural order. The embedded HMM characterizes this trait with five super states from top to bottom (forehead, eyes, nose, mouth, and chin), and within each super state a secondary left-to-right HMM characterizes the corresponding facial region, as shown in Fig. 4.
332
S.-M. Huang and J.-F. Yang
4 Experimental Results 4.1 AR Face Database We have conducted experiments on a well known face database, the AR face database [17]. The AR database, built by Martinez and Benavente, totally contains 3510 mug shots of 135 subjects (76 males and 59 females) with different facial expressions, lighting changes and partially occlusions. As shown in Fig. 5, each subject contains 26 pictures in two sessions, named from AR01 to AR26. The first session, containing 13 pictures, includes neutral expression (AR01), smile (AR02), anger (AR03), screaming (AR04), different lighting changes (AR05~AR07), and two realistic partial occlusions with lighting changes (AR08~AR13). The second session duplicates the first session in the same way two weeks later, i.e. AR14~AR26. 4.2 Experimental Setup In the experiment, 100 subjects consisting of 50 males and 50 females were chosen for the identity recognition. All face images were manually cropped with size of 120 × 165 and resized to size of 90 × 120 and converted to gray level. Seven face images without partial occlusion from first session, including neutral face and faces under expression variations and illumination variations, were chosen for model training (AR01~AR07). Thirteen face images from second session were used for testing to examine conditions under neutral face (AR14), facial expression variations (AR15~AR17), illumination variations (AR18~AR20), and partial occlusions with a pair of sunglasses (AR21~AR23) and with a scarf (AR24~AR26).
Fig. 5. Sample images of one subject from AR facial database, including neutral face (AR01 and AR14), expression variations (AR02~AR04 and AR15~AR17), illumination changes (AR05~AR07 and AR18~AR20), and partial occlusions (AR11~AR13 and AR24~AR26)
For each individual, the embedded HMM contains totally five super states, as shown in Fig. 4: 3 states for forehead super state, 6 states for eyes super state, 6 states for nose super state, 6 states for mouth super state, and 3 states for chin super state. In observation vector extraction phase, four observation vectors are investigated to face recognition: DCT, LBP+DCT, Delta DCT, and LBP+Delta DCT. 4.3 Results and Discussion Results in Fig. 6 show the performance as a function of the number of Gaussian mixture under different conditions. As shown in Fig. 6(a) to Fig. 6(e), the systems with
Robust Face Recognition under Different Facial Expressions, Illumination Variations
(a)
(b)
(c)
(d)
(e)
(f)
333
Fig. 6. Recognition results on AR face database. (a) Results under neutral face. (b) Results under expression variations. (c) Results under illumination variations. (d) Results under partial occlusion with a pair of sunglasses. (e) Results under partial occlusion with a scarf. (f) Average recognition rate.
334
S.-M. Huang and J.-F. Yang
LBP perform better than that without LBP significantly, whatever under different expressions, illumination variations, and partial occlusions. Overall, from the average recognition accuracies in Fig. 6(f), Delta DCT with LBP performs better than others, i.e. DCT, LBP+DCT, and Delta DCT. Under conditions without partial occlusions, the four observation vector extraction methods can achieve good performance. However, for partial occlusion problems, only LBP+DCT and LBP+Delta DCT could overcome the challenge of partial occlusion. From the experiments, we found that adopting LBP as a preprocessing can not only eliminate illumination effect but also improve robustness to different facial expressions. Besides, adopting LBP as a preprocessing also significantly can help improve performance of embedded HMM based face classifier under partial occlusions.
5 Conclusions In this paper, we have developed an embedded HMM based face recognition system with a robust observation vector. Four observation vectors are investigated: DCT, LBP+DCT, Delta DCT, and LBP+Delta DCT. Our experiments have demonstrated that the proposed face recognition system is able to estimate identity robustly under different facial expressions, illumination variations and partial occlusions, with a robust observation vectors, i.e. LBP+Delta DCT, which can help improve face recognition performance significantly. Finally, the best system, i.e. EHMM with LBP+Delta DCT, is also compared with the eighenface (PCA) and fisherface (FLD) [2, 4] and two distance measures (L1 and L2 norm) with the same experimental protocol. As shown in Table 1, the proposed face recognition system achieves better average accuracy. Particularly, under partial occlusion conditions, our system outperforms PCA and FLD. Table 1. Recognition rate on AR face database
Method Neutral Expression Illumination Sunglasses Scarf Average PCA+L1 97 93 97.6 72 61.3 84.18 PCA+L2 93 95.3 96 71.3 58.3 82.78 FLD+L1 98 95.6 96.3 77 72 87.78 FLD+L2 98 97.3 99 79 72.3 89.12 Ours 99 96.6 98 81.6 86.6 92.36
Acknowledgement This research is partially supported by the National Science Council of Taiwan, R.O.C., under Grant. NSC 99-2218-E-006-001 and AverMedia Information, Inc, Taipei, Taiwan.
Robust Face Recognition under Different Facial Expressions, Illumination Variations
335
References 1. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys 35(4), 399–458 (2003) 2. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001) 3. Ekenel, H.K., Stiefelhagen, R.: Why Is Facial Occlusion a Challenging Problem? In: Tistarelli, M., Nixon, M.S. (eds.) ICB 2009. LNCS, vol. 5558, pp. 299–308. Springer, Heidelberg (2009) 4. Belhumeur, P., Hespanha, J., Kriegman, D.: EigenFaces vs. Fisherfaces: Recognition using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7) (1997) 5. Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Science, 71–86 (1991) 6. Yang, J., Zhang, D., Frangi, A.F., Yang, J.-y.: Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition. IEEE Trans on Pattern Analysis and Machine Intelligence 26(1), 131–137 (2004) 7. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2), 257–286 (1989) 8. Samaria, F.S., Young, S.: HMM-based Architecture for Face Identification. Image and Vision Computing 12(8), 537–543 (1994) 9. Nefian, A.V., Hayes III, M.H.: Face Detection and Recognition using Hidden Markov Models. In: Proc. International Conference on Image Processing, vol. 1, pp. 141–145 (1998) 10. Nefian, A.V., Hayes III, M.H.: An Embedded HMM-based Approach for Face Detection and Recognition. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, pp. 3553–3556 (1999) 11. Nefian, A.V., Hayes III, M.H.: Maximum Likelihood Training of the Embedded HMM for Face Detection and Recognition. In: Proc. International Conference on Image Processing, vol. 1, pp. 33–36 (2000) 12. Nefian, A.V.: Embedded Bayesian Networks for Face Recognition. In: Proc. IEEE International Conference on Multimedia and Exp., vol. 2, pp. 133–136 (2002) 13. Othman, H., Aboulnasr, T.: A Separable Low Complexity 2D HMM with Application to Face Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(10), 1229–1238 (2003) 14. Chen, M.-Y., Kundu, A., Zhou, J.: Offline Handwritten Word Recognition using A Hidden Markov Model Type Stochastic Network. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(5), 481–496 (1994) 15. Martinez, A.M.: Recognizing Imprecisely Localized, Partially Occluded, and Expression Variant Faces from a Single Sample per Class. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(6), 748–763 (2002) 16. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 31(2), 210–227 (2009) 17. Martinez, A., Benavente, R.: The AR Face Database. CVC Technical Report 24 (1998) 18. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
Localization and Recognition of the Scoreboard in Sports Video Based on SIFT Point Matching

Jinlin Guo1,2,4, Cathal Gurrin1,4, Songyang Lao2, Colum Foley1,4, and Alan F. Smeaton1,3,4

1 Center for Digital Video Processing, Dublin City University, Ireland
2 School of Information System & Management, National University of Defense Technology, China
3 CLARITY: Centre for Sensor Web Technologies, Dublin City University, Ireland
4 School of Computing, Dublin City University, Ireland
{jguo,cgurrin,cfoley,asmeaton}@computing.dcu.ie, [email protected]
Abstract. In broadcast sports video, the scoreboard is attached at a fixed location and is generally present in all video frames in order to help viewers follow the match's progression quickly. Based on these observations, this paper presents a new localization and recognition method for scoreboard text in sports video. The method first matches Scale Invariant Feature Transform (SIFT) points between two frames extracted from a video clip, using a modified matching technique, and then localizes the scoreboard by computing a robust estimate of the matched point cloud in a two-stage non-scoreboard filtering process based on domain rules. Next, enhancement operations are performed on the localized scoreboard and a Multi-frame Voting Decision is applied; both aim to increase the OCR rate. Experimental results demonstrate the effectiveness and efficiency of the proposed method. Keywords: Localization and Recognition of Scoreboard, SIFT Point Matching, Sports Video.
1 Introduction
With the development of high-speed broadband networks and digital video technology (including generation, compression, storage and processing), the amount of sports video that viewers can access is increasing drastically. It is often not possible for even the most avid sports fan to watch more than a small fraction of the available coverage of a complete event, such as the World Cup. Furthermore, for many sports, much of the time during a game is not significant to its progression or outcome. Therefore, automatic sports video indexing and retrieval techniques have attracted a lot of research interest. With automatic indexing of sports content, users can retrieve their preferred clips, such as goals in soccer.
In broadcast sports videos, a superimposed scoreboard is used to display game status such as team names and score, to increase the audience's understanding of the game's progression. Furthermore, the scoreboard changes after a goal event occurs. Therefore, localization and recognition of the scoreboard is very useful for sports video analysis and processing, for example as a method for detecting score events or as a source of evidence for a score or event detection technique. In this paper, we present an effective and efficient method to localize and recognize scoreboards, based on the observations that the location of the scoreboard is static and that it is present on-screen for the duration of the game. Firstly, a bag of matched points obtained by a modified SIFT matching technique is used to represent candidate scoreboards. Then the exact area of the scoreboard is localized by computing a robust estimate of the matched point cloud in a two-stage non-scoreboard filtering process. In the recognition step, text enhancement operations and a Multi-frame Voting Decision are performed before applying a commercial OCR engine, in order to increase the OCR rate. Experimental results demonstrate the effectiveness and efficiency of the proposed method. The rest of the paper is organized as follows. Section 2 provides an overview of the state of the art in localization and recognition of superimposed text in video. Section 3 describes the localization and recognition of the scoreboard in detail. Section 4 presents the experimental results. Finally, conclusions are drawn in Section 5, together with an outlook for further research.
2 Related Work
Localization and recognition of superimposed text in video is a major task in video content analysis and processing. A number of algorithms to localize and recognize superimposed text in still images and video sequences have been published in recent years [1]∼[10]. They can be categorized into two types: one localizes text in individual images [1]∼[4], while the other exploits the temporal redundancy of video sequences [5]∼[10]. Jain et al. [1] first employed color reduction by bit dropping and color clustering quantization, after which a multi-value image decomposition algorithm was applied to decompose the input image into multiple foreground and background images; connected component analysis was then performed on each of them to localize text candidates. Ngo et al. [2] presented a text detection and segmentation method based on background complexity, in which video frames were classified into four types according to edge density, and edges of non-text regions were gradually removed by repeated shifting and smoothing operators. In [3] and [4], the authors treated text detection as a classification problem. Li et al. [3] used an SVM to obtain text regions based on features extracted by stroke filter calculation on stroke maps. Chen et al. [4] compared the SVM-based method with multilayer perceptrons (MLP) for text verification over four independent features, namely the distance map feature, the gray-scale
spatial derivative feature, the constant gradient variance feature and the DCT coefficient feature, and found that better detection results were obtained with the SVM than with the MLP. In [5], Lienhart et al. exploited the redundant information of video frames to refine the coarse text regions detected by a pre-trained feed-forward network. Wang et al. [6] employed a multi-frame integration method, i.e. time-based minimum (or maximum) pixel value search, to obtain integrated images that minimize (or maximize) the variation of the image background. Tang et al. [7] proposed a universal caption detection and recognition method based on a fuzzy-clustering neural network technique. These general methods, however, are either too complicated, and hence time-consuming, or sensitive to the selection of thresholds, and are not suitable for scoreboard localization in sports video frames. Recently, text localization and recognition in sports video has attracted some research interest. In [8], Zhang et al. proposed general and domain-specific techniques: they first presented a general algorithm to detect and locate captions, and then employed a domain model of specific sports, e.g. baseball and basketball, in the text recognition step to improve its rate from 78% to 87%. Su and Hsieh [9] first detected and localized the text region using an iterative temporal averaging technique over a series of sports video frames, then performed an accurate extraction of text content based on text identification and model-based segmentation, and finally recognized the characters using a commercial OCR technique. Hsieh et al. [10] proposed a detection and recognition method for the scoreboard in baseball video: they first identified the scoreboard type using template matching, extracted the caption region of each type, and finally recognized the digits in the scoreboard with a neural network classifier. A scoreboard can be localized and recognized effectively and efficiently according to its characteristics in sports video frames: the scoreboard is fixed or changes only slightly during the course of the game, namely, the font type and relative location of each field are kept the same over the whole video. Based on this observation, we present an effective and efficient method to localize and recognize scoreboards in sports videos.
3 Proposed Scoreboard Localization and Recognition Method
The method proposed in this paper consists of two processes, a localization process and a recognition process, as shown in Figure 1. In the localization process, the method first matches SIFT points in two frames extracted from the input sports video clip; the scoreboard is then localized based on robust estimation within a two-stage filtering of non-scoreboard matched points. In the recognition process, the method identifies the scoreboard text after some text enhancement operations are performed on the scoreboard. The details are described in the following sections.
Fig. 1. Flowchart of the proposed approach
3.1 Localization Process
SIFT Points Detection and Matching: Recently, it has been shown that region-based approaches are effective for object detection and recognition because they can cope with occlusion and geometric transformation [12]. These approaches are commonly based on the idea of modeling an object by a collection of local salient points. Each of these local features describes a small region around the interest point and is therefore robust against occlusion and clutter. In particular, the 128-dimensional SIFT feature proposed in [11] has proven effective for detecting objects, because it is designed to be invariant to the relatively small spatial shifts of region positions that often occur in real images. We therefore use the SIFT feature as the descriptor of local salient points. By combining the results of local point-based matching we are able to match an entire scoreboard.

The input video clip can be denoted as $Clip = \{f_1, f_2, \dots, f_N\}$, where $f_i$ denotes a frame and $N$ denotes the number of frames in the clip. Two frames, $f_p$ and $f_q$, are extracted from the input clip. It should be noted that these two frames are chosen arbitrarily for the demonstration of this method; no claim is made for any optimal frame selection. However, the two frames should be extracted from different shots, because, due to temporal redundancy, two frames from the same shot would lead to too many matched points. SIFT points are detected on $f_p$ and $f_q$ using the four steps in [11], denoted respectively as

$$T_p = \{(x_k^p, y_k^p, s_k^p, d_k^p, o_k^p)\} \quad \text{for } k \in \{1, 2, 3, \dots, N_p\}$$
$$T_q = \{(x_k^q, y_k^q, s_k^q, d_k^q, o_k^q)\} \quad \text{for } k \in \{1, 2, 3, \dots, N_q\}$$

Here $x_k^c$, $y_k^c$, $s_k^c$, $d_k^c$ ($c \in \{p, q\}$) are the $x$-position, $y$-position, scale, and dominant direction of the $k$-th SIFT point, respectively, and $o_k^c$ is its 128-dimensional feature vector. So every extracted frame is represented as a bag of SIFT points. The next step is to find the matched points between the two frames, i.e. point matching. The performance of point matching greatly affects the localization of the scoreboard.
First we review the matching technique in [11]. Denote by $P$ and $Q$ the sets of SIFT points of the two images, respectively. For any point $p_i$ in $P$, let $q_j$ and $q_{j'}$ be the points in $Q$ with the closest and second-closest Euclidean distances to $p_i$; the corresponding distances are $d_{ij}$ and $d_{ij'}$, with $d_{ij} \le d_{ij'}$. If $d_{ij} \le d_{ij'} \cdot \alpha$, then $p_i$ and $q_j$ are matched points. Here $\alpha$ is a predefined threshold representing the point's discrimination; in [11] the authors set $\alpha = 0.8$. Under this rule, some mismatches exist in the initial point matching between the two feature point sets, so algorithms such as RANSAC can be used to eliminate them.

For a similarity measure $S$, if $S(p_i, q_j) = \min_{q_l \in Q} S(p_i, q_l)$, then $q_j$ is the closest point in $Q$ to $p_i$. However, if $S(p_i, q_j) \ne \min_{p_t \in P} S(p_t, q_j)$, then $p_i$ is not the closest point in $P$ to $q_j$, so it is not reasonable to set $p_i$ and $q_j$ as matched points. A robust point matching technique should have the following property: if $p_i$ and $q_j$ are matched points, then $S(p_i, q_j) = \min_{q_l \in Q} S(p_i, q_l)$ and $S(p_i, q_j) = \min_{p_t \in P} S(p_t, q_j)$, and vice versa. Obviously, for the method in [11], $d_{ij} \le d_{ij'} \cdot \alpha$ implies $S(p_i, q_j) = \min_{q_l \in Q} S(p_i, q_l)$, but not always $S(p_i, q_j) = \min_{p_t \in P} S(p_t, q_j)$, so $(p_i, q_j)$ may be mismatched.

Based on this analysis, we set $p_i$ and $q_j$ as matched points if they satisfy

$$d(p_i, q_j) = \min_{q_l \in Q} d(p_i, q_l) = \min_{p_t \in P} d(p_t, q_j)$$
$$d(p_i, q_j) \le \min_{q_l \in Q,\ l \ne j} d(p_i, q_l) \cdot \alpha$$
$$d(p_i, q_j) \le \min_{p_t \in P,\ t \ne i} d(p_t, q_j) \cdot \alpha$$

Here $d(p_i, q_j)$ is the Euclidean distance between $p_i$ and $q_j$, and $\alpha$ is set to 0.8 experimentally. Based on this modified matching technique, matched points between frame $f_p$ and frame $f_q$ are obtained (as shown in Figure 2). The next step is to filter out non-scoreboard matched points according to some domain-specific rules.
Fig. 2. SIFT points detection and matching
Initial Filtering: Certain characteristics of how a scoreboard is shown in a video frame can be used to remove some of the non-scoreboard matched points:
• For the convenience of viewers, the scoreboard always appears in the lower or upper area of a video frame. We assume that the scoreboard always appears in the upper 1/4 or lower 1/4 of the frame; matched points outside these two areas are therefore discarded.
• The distance between a matched point and any boundary (top, bottom, left or right) of the frame should be greater than T, a threshold set to 15 pixels in our experiments based on observation.
As shown in Figure 3, the scoreboard always appears in either the R1 or R2 area. After this filtering, most of the non-scoreboard matched points are removed (as shown in Figure 4(a)).
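A possible implementation of these two filtering rules is sketched below; the point format, frame dimensions and function name are our assumptions, while the quarter-frame areas and the 15-pixel border follow the text.

```python
def filter_matched_points(points, frame_w, frame_h, border=15):
    """Keep only matched points that fall in the top or bottom quarter of the
    frame (candidate scoreboard areas R1/R2) and are at least `border` pixels
    away from every frame boundary."""
    kept = []
    for (x, y) in points:
        in_top = y < frame_h / 4.0
        in_bottom = y > 3.0 * frame_h / 4.0
        away_from_border = (border <= x <= frame_w - border and
                            border <= y <= frame_h - border)
        if (in_top or in_bottom) and away_from_border:
            kept.append((x, y))
    return kept
```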
Fig. 3. Area where scoreboard is shown
Clustering and Robust Estimate of the Matched Point Cloud: After the first filtering, some non-scoreboard matched points still exist, caused by the constant presence of a TV logo or other static objects. However, all these matched points can be clustered into one or several clusters in terms of the proximity of matched points generated by the same object. Clustering in this two-dimensional space is performed using X-means [13]; unlike K-means, X-means does not require the number of clusters to be predefined. A robust estimate is then performed on each of these clusters, after which several robust centroids are localized; in this way the exact area of each cluster is obtained. In order to localize the centroid of each cluster in frame $f_p$ and approximate its area, we compute a robust estimate on each matched point cluster. A matched point cluster is denoted as $P = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. The robust centroid estimate is computed by iteratively solving for $(\mu_x, \mu_y)$ in

$$\sum_{i=1}^{n} \Psi(x_i; \mu_x) = 0, \qquad \sum_{i=1}^{n} \Psi(y_i; \mu_y) = 0$$

Here the influence function $\Psi$ is the Tukey biweight, and the scale parameter $c$ is estimated using the Median Absolute Deviation (MAD) from the median: $MAD_x = \mathrm{median}_i(|x_i - \mathrm{median}_j(x_j)|)$.

Refinement: After the robust estimate, the area (represented by a rectangle, as shown in Figure 4(b)) of each cluster is localized.
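One common way to realize the robust centroid estimate described above is iteratively re-weighted averaging with Tukey biweight weights and a MAD-based scale. The sketch below follows that scheme; the tuning constant c = 4.685 and the stopping tolerance are our assumptions, not values from the paper.

```python
import numpy as np

def robust_centroid_1d(x, c=4.685, tol=1e-6, max_iter=100):
    """Solve sum_i Psi(x_i; mu) = 0 for one coordinate using the Tukey
    biweight, with the scale estimated from the MAD of the data."""
    x = np.asarray(x, dtype=float)
    scale = np.median(np.abs(x - np.median(x)))   # MAD
    if scale == 0:
        scale = 1.0                               # guard against degenerate clusters
    mu = np.median(x)
    for _ in range(max_iter):
        u = (x - mu) / (c * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # biweight weights
        if w.sum() == 0:
            break
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu

def robust_centroid(points):
    """Apply the 1-D estimate to x and y independently, as in the text."""
    xs, ys = zip(*points)
    return robust_centroid_1d(xs), robust_centroid_1d(ys)
```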
Fig. 4. SIFT matched points filtering
Fig. 5. Preprocessing and OCR of scoreboard text
Rectangles whose width is smaller than T pixels are considered non-scoreboard and removed; in our experiments, T = 20 pixels based on observation. After this filtering, the scoreboard bounding box is obtained (as shown in Figure 4(c)). Furthermore, because the scoreboard is attached at a fixed location in every frame, the localization of the scoreboard is performed only once per video of an entire match.

3.2 Scoreboard Text Recognition
Current optical character recognition (OCR) packages such as ABBYY OCR [15] or ReadIRIS [17] perform rather well and give good accuracy for text printed on a clear background, and can recognize multiple languages by adding source character libraries. However, we are interested in recognizing text printed against shaded and textured backgrounds, which OCR technology cannot easily handle. Hence we need to preprocess the extracted scoreboard before OCR so that the scoreboard text can be recognized correctly and easily. The scoreboard image cropped out in the localization process is relatively simple in nature: it contains only team, score and other text on a uniformly colored background (as shown in Figure 4(c)), of which the team and score information is the most important. Some operations are performed on the scoreboard image before using OCR software to recognize the text in the scoreboard. Details are provided in the following section.
Preprocessing:
Step 1: Size enlargement. The size of the scoreboard image is doubled using bicubic interpolation [16]. The characters in the scoreboard are small and compact and need to be enlarged to increase the OCR rate. We choose bicubic interpolation because the interpolated surface it produces is smoother than the corresponding surfaces obtained by bilinear or nearest-neighbor interpolation, and it has fewer interpolation artifacts.
Step 2: Binarization using a threshold T obtained by the Otsu method [14]. The area of the localized scoreboard mainly contains two classes of pixels, background and text; therefore, binarizing the scoreboard with the threshold T obtained by the Otsu method is viable for OCR.
Step 3: Morphological erosion [16]. A morphological erosion operation can effectively remove noise and decrease the blur of text edges.
OCR: The commercial ABBYY OCR software, which handles alphabetic characters, digits and symbols, is used in our experiments. It is applied to recognize all the text in the scoreboards (as shown in Figure 5).
Multi-frame Voting Decision: After text recognition, one result is obtained from each single frame. Although the data of a field may change after a new score event occurs, the text of the same field generally stays the same for a relatively long time (at least 5 seconds). This characteristic can be employed to further improve the recognition rate. In this work, we use majority voting over several consecutive frames to correct the recognition errors of a few frames. Note that a vote is taken over the results of consecutive frames belonging to the same shot.
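The preprocessing chain and the multi-frame vote could be implemented as sketched below with OpenCV. The paper uses the commercial ABBYY engine; pytesseract appears here only as a stand-in OCR call, and the 3×3 erosion kernel is our assumption.

```python
import cv2
import numpy as np
from collections import Counter
import pytesseract  # stand-in for the commercial ABBYY OCR used in the paper

def preprocess_scoreboard(crop_bgr):
    """Enlarge (bicubic, 2x), binarize with Otsu's threshold, then erode."""
    big = cv2.resize(crop_bgr, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(big, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)            # assumed structuring element
    return cv2.erode(binary, kernel, iterations=1)

def recognize_scoreboard(crops_from_one_shot):
    """OCR each preprocessed crop and take a majority vote over the
    consecutive frames of the same shot (Multi-frame Voting Decision)."""
    texts = [pytesseract.image_to_string(preprocess_scoreboard(c)).strip()
             for c in crops_from_one_shot]
    return Counter(texts).most_common(1)[0][0] if texts else ""
```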
4 Experimental Results
In our experiments, a total of 172 video clips (approximately 484 minutes) captured from three kinds of sports game were collected to demonstrate the performance of the proposed approach. Further details of the video clips are listed in Table 1. The localization of a scoreboard is performed only once per video of a whole match, so short clips are sufficient for the experiments. Performance is evaluated separately for the scoreboard localization and scoreboard text recognition modules.

Table 1. Details of tested video clips

Sport type   Frame size  Frame rate (f/s)  Amount  Average duration (mins)
Soccer       352 × 288   25                72      3.4
Basketball   352 × 288   25                45      1.2
Rugby        352 × 288   25                55      3.3
Table 2. Results of scoreboard text localization

Pixel-based                             Text box-based
avmatchrate   avmiss   avfalse          recall
91.4%         8.6%     7.9%             92.3%
Scoreboard Text Localization: For each video clip, the ground truth of the scoreboard bounding box (which mainly contains the score and team information) was created manually. Two kinds of evaluation for scoreboard localization are used: pixel-based and text box-based performance numbers. The pixel-based performance numbers calculate the match rate, miss rate and false rate from the number of pixels that the ground truth and the detected scoreboard bounding box have in common (as shown in Figure 6). For the detected scoreboard bounding box $D_i$ of the $i$-th video clip:

$$\mathrm{matchrate}_{\text{pixel-based},i} = \frac{\mathrm{card}(D_i \cap G_i)}{\mathrm{card}(G_i)}, \quad \mathrm{miss}_{\text{pixel-based},i} = 1 - \mathrm{matchrate}_{\text{pixel-based},i}, \quad \mathrm{false}_{\text{pixel-based},i} = 1 - \frac{\mathrm{card}(D_i \cap G_i)}{\mathrm{card}(D_i)}$$
Fig. 6. Diagram of pixel-based evaluation
Here $D_i = \{d_1, d_2, \dots, d_{n_i}\}$ and $G_i = \{g_1, g_2, \dots, g_{m_i}\}$ are the pixel sets representing the detected scoreboard bounding box and the ground-truth scoreboard bounding box, of sizes $n_i$ and $m_i$, for the $i$-th video clip, respectively. $N$ is the number of tested video clips, and the operator $\mathrm{card}(\cdot)$ counts the number of elements in a set. The average match rate, average miss rate and average false rate are calculated as

$$\mathrm{avmatchrate} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{matchrate}_{\text{pixel-based},i}, \quad \mathrm{avmiss} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{miss}_{\text{pixel-based},i}, \quad \mathrm{avfalse} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{false}_{\text{pixel-based},i}$$

In contrast, the text box-based performance is evaluated by recall, which refers to the fraction of detected boxes that match the ground truth. The detected scoreboard text bounding box $D_i$ is regarded as correctly localized if and only if the two boxes $D_i$ and $G_i$ overlap by at least 85% for the $i$-th video clip:
$$\mathrm{recall} = \frac{\sum_{i=1}^{N} \delta(D_i, G_i)}{N}$$

Here:

$$\delta(D_i, G_i) = \begin{cases} 1, & \text{if } \min(\mathrm{ComD}, \mathrm{ComG}) \ge 0.85 \\ 0, & \text{otherwise} \end{cases}$$
$$\mathrm{ComD} = \mathrm{card}(D_i \cap G_i)/\mathrm{card}(D_i), \qquad \mathrm{ComG} = \mathrm{card}(D_i \cap G_i)/\mathrm{card}(G_i)$$
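Treating the detected and ground-truth boxes as pixel sets, the pixel-based rates and the box-based recall reduce to a few set operations, as in the sketch below; the 0.85 overlap threshold follows the text, while the function names are ours.

```python
def pixel_rates(detected, ground_truth):
    """detected / ground_truth are sets of (x, y) pixels for one video clip.
    Returns (match rate, miss rate, false rate) as defined above."""
    common = len(detected & ground_truth)
    match = common / len(ground_truth)
    return match, 1.0 - match, 1.0 - common / len(detected)

def box_recall(det_boxes, gt_boxes, overlap=0.85):
    """Fraction of clips whose detected box overlaps its ground truth by >= 85%."""
    hits = 0
    for d, g in zip(det_boxes, gt_boxes):
        common = len(d & g)
        if min(common / len(d), common / len(g)) >= overlap:
            hits += 1
    return hits / len(det_boxes)
```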
Experimental results for localization performance are given in Table 2. The localization approach correctly found 92.3% of all scoreboard boxes, and the average match rate reaches 91.4% with a miss rate of 8.6%. Our experiments show that most scoreboard text boxes generated by the proposed approach are slightly smaller than their corresponding ground-truth boxes, which is why the average false rate (7.9%) is relatively small and the average match rate is close to the recall.

Scoreboard Text Recognition: Scoreboard text recognition is performed as described in Section 3.2 on all correctly localized scoreboards. If the score and team information can be obtained, we consider the scoreboard to be correctly recognized. In our experiments, 88.1% of the correctly localized scoreboards were also recognized correctly. Over all stages, 81.4% (0.881 × 0.923 = 0.814) of all scoreboards were recognized correctly.
5 Conclusions
The scoreboard in sports video is an important semantic clue, and its localization and recognition is very useful for sports video analysis and processing. Based on the observation that scoreboards are attached at fixed locations and are present in all frames of a sports video, we propose an approach for localizing and recognizing scoreboards based on SIFT point matching. In our experiments on a total of 172 sports video clips (approximately 484 minutes), an average of 91.4% of the scoreboard bounding box pixels are correctly matched with a 7.9% false rate. For localization and recognition, 92.3% of all scoreboard boxes are correctly localized, and 81.4% of all scoreboards can be recognized. Furthermore, the localization of a scoreboard is performed only once per video of an entire match, which makes the method efficient. In future work, we will extend our study to detect score events of sports games from the recognized scoreboard text.
Acknowledgment This paper is supported by the Information Access Disruptions (iAD) Project (Norwegian Research Council), the China Scholarship Council, the National Natural Science Foundation of China (No. 60902094) and by Science Foundation Ireland under grant 07/CE/I1147.
References 1. Jain, A.K., Yu, B.: Automatic Text Location in Images and Video Frames. Pattern Recognition 31(12), 315–333 (1998) 2. Ngo, C.-W., Chan, C.K.: Video Text Detection and Segmentation for Optical Character Recognition. ACM Multimedia Systems 10(3), 261–272 (2005) 3. Li, X.J., Wang, W.Q., Jiang, S.Q., Huang, Q.M., Gao, W.: Fast and Effective Text Detection. In: IEEE International Conference on Image Processing (ICIP), pp. 969–972 (2008) 4. Chen, D.T., Odobez, M.J., Bourlard, H.: Text Detection and Recognition in Images and Videos. Pattern Recognition 37(3), 595–608 (2004) 5. Lienhart, R., Wernicke, A.: Localizing and Segmenting Text in Images and Videos. IEEE Transact. on Circuits and Systems for Video Technology 12(4), 256–268 (2002) 6. Wang, R.R., Wan, J.J., Wu, L.D.: A novel video caption detection approach using Multi-Frame Integration. In: IEEE Proceeding of the 17th International Conference on Pattern Recognition (ICPR), vol. 10(3), pp. 449–452 (2004) 7. Tang, X., Gao, X., Liu, J., Zhang, H.-Z.: A Spatial-temporal Approach for Video Caption Detection and Recognition. IEEE Trans. on Neural Networks 13(4), 961– 971 (2002) 8. Zhang, D., Rajendran, R.K., Chang, S.-F.: General and Domain-Specific Techniques for Detecting and Recognizing Superimposed Text in Video. In: IEEE International Conference on Image Processing (ICIP), pp. 22–25 (2002) 9. Su, Y.M., Hsieh, C.H.: A Novel Model-based Segmentation Approach to Extract Caption Contents on Sports Videos. In: International Conference on Multimedia & Expo. (ICME), pp. 1829–1832 (2006) 10. Hsieh, C.H., Huang, C.P., Hung, M.H.: Detection and Recognition of Scoreboard for Baseball Videos. In: International Conference on Intelligent Computing (ICIC), pp. 337–346 (2008) 11. Lowe, D.: Distinctive image features from scale-invariant keypoints. Computer Vision 60(2), 91–110 (2004) 12. Ballan, L., Bertini, M., Del Bimbo, A., Serra, G.: Video Event Classification Using Bag of Words and String Kernels. In: International Conference on Image Analysis and Processing, pp. 170–178 (2009) 13. Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number of clusters. In: International Conference on Machine Learning, pp. 727–734 (2000) 14. Otsu, N.: A Threshold Selection Method from Gray-Level Histograms. IEEE Transact. On Systems, Man and Cybernetics 9(1), 62–66 (1979) 15. ABBYY FineReader, http://www.abbyy.com/ 16. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. (2007) 17. ReadIRIS, http://www.irislink.com/
3D Model Search Using Stochastic Attributed Relational Tree Matching Naoto Nakamura, Shigeru Takano, and Yoshihiro Okada Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, 819-0395, Japan {n-naka,takano,okada}@i.kyushu-u.ac.jp
Abstract. Recent advances in computer hardware technology enable us to handle 3D multimedia data more easily, and many 3D Computer Graphics (CG) contents have been created and stored for various application fields. In this situation, we need a 3D multimedia data search system that allows us to retrieve required data efficiently. The authors have previously proposed a 3D multimedia data search system using stochastic Attributed Relational Graph (ARG) matching. However, stochastic ARG matching requires a huge calculation cost. To reduce this cost, the authors propose a new matching algorithm, called stochastic Attributed Relational Tree (ART) matching, because the calculation cost of tree matching is less than that of graph matching. The authors applied the stochastic ART matching method to 3D model search and obtained better performance than conventional matching methods, including the stochastic ARG matching method. Keywords: ARG, ART, 3D model search, Graph matching, Tree matching.
1 Introduction Recently, most PC users can enjoy 3D multimedia contents without difficulty because of the high performance of PCs. Many 3D Computer Graphics (CG) contents have been created and stored for various application fields, including the movie and game industries, Web services, and so on. The management of 3D multimedia data has become important because, when a content creator wants to reuse existing data, he/she must find it among a huge number of data items in a database. We need tools that help us retrieve required data efficiently from such a data pool. Using the Attributed Relational Graph (ARG) [11] as a feature of multimedia data, an efficient search system can be realized in terms of the structural similarity of components. We previously developed a 3D model search system using a stochastic ARG matching algorithm and achieved better results than conventional 3D model search methods, but stochastic ARG matching requires a huge calculation cost [6]. We observe that the structure of the contents included in multimedia data is often a tree structure. Therefore, in this paper, we propose a stochastic ART matching algorithm, because, in general, the calculation cost of tree matching is less than that of graph matching. The concept of edit distance is often used in tree matching. The calculation of edit
distance consists of cost calculation steps for three operations, i.e., deletion, insertion and relabeling. We define a new similarity measure between two ARTs based on these three operations. In this paper, we describe the new matching algorithm and discuss its effectiveness. The remainder of this paper is divided into four sections. Section 2 introduces related work, and Section 3 describes the stochastic ARG matching algorithm and the stochastic ART matching algorithm. Section 4 shows, through experiments on 3D model search, that the proposed method performs better than the D2 method and the stochastic ARG matching method. Finally, we conclude the paper in Section 5.
2 Related Works There has been considerable research on 3D model search. Paquet and Rioux proposed a 3D model database system that employs many popular 3D model matching algorithms [8]. Vranic and Saupe proposed a search method based on a characterization of the spatial properties of 3D objects used as feature vectors [10]; it extracts the feature vector from a coarse voxelization of a 3D model using the 3D discrete Fourier transform. Osada et al. proposed a 3D model matching algorithm using the distribution of distances between random point pairs on the surface of a 3D model [7]. This method, called D2, shows good evaluation results in their paper; our 3D model search system achieves better evaluation results than D2 in our experiments. Laga et al. proposed the spherical wavelet transform as a tool for the analysis of 3D shapes represented by functions on the unit sphere [5]. Sundar et al. use as a shape descriptor a skeletal graph that encodes geometric and topological information [9]: after voxelization of a shape, the skeletal points are obtained by a distance transform-based thinning algorithm developed by Gagvani using a thinness parameter [2]. Recently, the topology information of 3D models has attracted attention from many researchers. Hiraga et al. proposed a topology matching technique for 3D models using Multiresolutional Reeb Graphs (MRGs) [3]. Bespalov et al. [1] investigated the application of Hilaga's method to solid models; they found that, for solid models, minor changes in topology may result in significant differences in similarity, and since topological insensitivity is important for solid models, they conclude that the Reeb graph technique requires some improvements. Kaku et al. proposed a similarity measure between 3D models using their OBBTrees [4]. MRGs and OBBTrees have a graph and a tree structure, respectively, but they are not sufficient to represent the structural features of content.
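For reference, the D2 descriptor of Osada et al. [7] mentioned above can be sketched as follows: sample point pairs uniformly on the mesh surface (triangles weighted by area, uniform barycentric coordinates within a triangle) and histogram the pairwise distances. The sample and bin counts are illustrative choices, not the values used in [7].

```python
import numpy as np

def sample_surface_points(vertices, faces, n):
    """Sample n points uniformly on a triangle mesh surface."""
    v = np.asarray(vertices, dtype=float)
    tri = v[np.asarray(faces)]                            # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = np.random.choice(len(tri), size=n, p=areas / areas.sum())
    r1, r2 = np.random.rand(n, 1), np.random.rand(n, 1)
    s = np.sqrt(r1)                                       # uniform barycentric sampling
    a, b, c = tri[idx, 0], tri[idx, 1], tri[idx, 2]
    return (1 - s) * a + s * (1 - r2) * b + s * r2 * c

def d2_descriptor(vertices, faces, n_pairs=10240, bins=64):
    """Normalized histogram of distances between random surface point pairs."""
    p = sample_surface_points(vertices, faces, n_pairs)
    q = sample_surface_points(vertices, faces, n_pairs)
    d = np.linalg.norm(p - q, axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0, d.max()))
    return hist / hist.sum()
```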
3 Stochastic ART Matching The ARG is one representation of the structural features of multimedia content. The stochastic ARG matching algorithm requires a huge calculation cost due to graph matching [11], but in many cases the structure of multimedia content can be represented as a tree. In our ARG construction process, the ARGs extracted from 3D models in fact have a tree structure. We therefore modified the stochastic ARG matching method to make it suitable for trees, i.e., the "Attributed Relational Tree"
(ART), and we define a stochastic ART matching method. This section introduces it together with the stochastic ARG matching method.

3.1 Stochastic ARG Matching

In our previous work, we employed the stochastic ARG matching method [6] to measure the similarity between two 3D models. This section starts with the definition of an ARG.

Definition. An ARG is a triple $G = (V, E, A)$, where $V$ is the vertex set, $E$ is the edge set, and $A$ is the attribute set that contains a unary attribute $a_i$ assigned to each vertex $n_i \in V$ and a binary attribute $a_{ij}$ assigned to each edge $e_{ij} = (n_i, n_j) \in E$.

To define the similarity measure between two graphs $G^s$ and $G^t$, we introduce some notation. Let $H_p$ denote a binary random variable according to two hypotheses: $H_p = 1$ means that $G^t$ is similar to $G^s$, and $H_p = 0$ implies that $G^t$ is not similar to $G^s$. The unary features of $G^s$ and $G^t$ are denoted by $A_i^s$, $1 \le i \le N$, and $A_k^t$, $1 \le k \le M$, respectively, and the binary features of $G^s$ and $G^t$ are denoted by $A_{ij}^s$, $1 \le i, j \le N$, and $A_{kl}^t$, $1 \le k, l \le M$, respectively. We define two vectors $A^s = (A_1^s, \dots, A_N^s, A_{12}^s, \dots, A_{N,N-1}^s)$ and $A^t = (A_1^t, \dots, A_M^t, A_{12}^t, \dots, A_{M,M-1}^t)$. Let $p(A^t \mid A^s, H_p = h)$ be the probability of transforming $G^s$ into $G^t$. The similarity measure of $G^s$ and $G^t$ is defined by

$$S(G^s, G^t) = \frac{p(A^t \mid A^s, H_p = 1)}{p(A^t \mid A^s, H_p = 0)}. \quad (1)$$

To calculate the probability $p(A^t \mid A^s, H_p = h)$, we need two transformation processes: the vertex copy process (VCP) and the attribute transformation process (ATP).

In VCP, a copy $\tilde{G}^t$ of $G^t$ is made. The vertices of $G^s$ are mapped into the vertices of $\tilde{G}^t$ only when $H_p = 1$. We denote this mapping by $X$, referred to as the correspondence matrix, whose elements are 0 or 1. Let $x_{ik}$ be an element of $X$; $x_{ik} = 1$ means that the $i$-th vertex of $G^s$ is mapped into the $k$-th vertex of $\tilde{G}^t$. To get a one-to-one correspondence between $G^s$ and $\tilde{G}^t$, the constraints $\sum_i x_{ik} \le 1$ and $\sum_k x_{ik} \le 1$ are required.

ATP is a process that changes the attributes on the vertices of $\tilde{G}^t$. An example of VCP, ATP and $X$ is shown in Fig. 1.
Fig. 1. ARG Matching Process
$p(A^t \mid A^s, H_p = h)$ can be decomposed as

$$p(A^t \mid A^s, H_p = h) = \sum_{X \in \chi} p(A^t \mid A^s, X, H_p = h)\, p(X \mid A^s, H_p = h).$$

Here $\chi$ denotes the set of correspondence matrices $X$ of VCP, $p(X \mid A^s, H_p = h)$ denotes the probability of VCP, and $p(A^t \mid A^s, X, H_p = h)$ denotes the probability of ATP.

We assume that $X$ is statistically independent of $A^s$ for the given $H_p$. Then we have

$$p(X \mid A^s, H_p = h) = \frac{1}{Z(h)} \prod_{(i,k)} \phi_h(x_{ik}) \prod_{(i,k),(j,l)} \psi_{ik,jl}(x_{ik}, x_{jl}),$$

with the one-vertex potential function

$$\phi_h(x_{ik}) = \begin{cases} 0, & x_{ik} = 0, \\ q_0, & x_{ik} = 1,\ h = 0, \\ q_1, & x_{ik} = 1,\ h = 1, \end{cases}$$

and the two-vertex potential function

$$\psi_{ik,jl}(x_{ik}, x_{jl}) = \begin{cases} 0, & (i = j \text{ or } k = l) \text{ and } x_{ik} = x_{jl} = 1, \\ 1, & \text{otherwise.} \end{cases}$$

Here the partition function $Z(h)$ has the form $Z(h) = \sum_{i=1}^{N} \binom{N}{i}\binom{M}{i}\, i!\, q_h^i$, where $q_0$ and $q_1$ are parameters that control the probability and are learned from training data.

$p(A^t \mid A^s, X, H_p = h)$ can be written as

$$p(A^t \mid A^s, X, H_p = h) = \prod_{(i,k)} p(a_k^t \mid a_i^s, x_{ik}) \prod_{(i,k),(j,l)} p(a_{kl}^t \mid a_{ij}^s, x_{ik}, x_{jl}).$$

We also assume that the probability of attributes transforming into other attributes follows the normal distribution $N(y, \Sigma)$, i.e.,
$$p(a_k^t \mid a_i^s, x_{ik} = 0) = N(a_i^s, \Sigma_0), \qquad p(a_k^t \mid a_i^s, x_{ik} = 1) = N(a_i^s, \Sigma_1),$$
$$p(a_{kl}^t \mid a_{ij}^s, x_{ik} \cap x_{jl} = 0) = N(a_{ij}^s, \Sigma_{00}), \qquad p(a_{kl}^t \mid a_{ij}^s, x_{ik} = x_{jl} = 1) = N(a_{ij}^s, \Sigma_{11}),$$

where the covariance matrices $\Sigma_0$, $\Sigma_1$, $\Sigma_{00}$, and $\Sigma_{11}$ are learned from training data.

3.2 Stochastic ART Matching

An ART is a kind of ARG. First, we define an ART and introduce stochastic ART matching derived from stochastic ARG matching.

Definition. An ART is a triple $T = (V, E, A)$, where $V$ is the vertex set, $E$ is the edge set, and $A$ is the attribute set that contains a unary attribute $a_i$ assigned to each vertex $n_i \in V$ and a binary attribute $a_{ij}$ assigned to each edge $e_{ij} = (n_i, n_j) \in E$. A vertex $v_i \in V$ has zero or more child vertices, every vertex has at most one parent vertex, and an ART has a root vertex.

In tree matching, the concept of edit distance is often used. The edit cost between two trees, where one tree is transformed into the other, is calculated as the total cost of deletion, insertion and relabeling operations. As described in [11], the stochastic ARG matching process consists of a vertex copy process (VCP) and an attribute transformation process (ATP). In VCP, the system tries to match vertices in all possible ways. For ART matching, VCP is realized by deletion and insertion operations. Fig. 2 illustrates the overview of VCP for ART matching.
Fig. 2. Vertex copy process (VCP)
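To make the ART definition concrete, a minimal tree representation might look like the sketch below. The class layout is ours, and the attribute names (surface area, sphericity, relative position) are only examples borrowed from the experiments section, not the paper's exact feature encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ARTNode:
    """A vertex of an Attributed Relational Tree: a unary attribute vector,
    an optional binary attribute on the edge to its parent, and children."""
    unary: Dict[str, float]                                  # e.g. area, sphericity
    edge_to_parent: Optional[Dict[str, float]] = None        # binary attribute a_ij
    children: List["ARTNode"] = field(default_factory=list)

    def add_child(self, child: "ARTNode") -> None:
        self.children.append(child)

# Example: a toy ART with a root component and two sub-components.
root = ARTNode(unary={"area": 12.0, "sphericity": 0.4})
root.add_child(ARTNode(unary={"area": 3.5, "sphericity": 0.8},
                       edge_to_parent={"relative_position": 1.2}))
root.add_child(ARTNode(unary={"area": 2.1, "sphericity": 0.6},
                       edge_to_parent={"relative_position": 0.7}))
```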
Generally, when many nodes are copied, the insertion and deletion costs become small and the probability of VCP becomes large. ATP is considered a relabeling process. In this matching method, we adopt the concept of the three edit operations so that the similarity of stochastic ART matching can be calculated easily. Using the matching process described above, we define a new similarity measure between ARTs, based on the similarity measure of ARGs. The similarity $S$ between two ARTs $T^s$ and $T^t$ is defined as follows:

$$S(T^s, T^t) = \frac{p(A^t \mid A^s, H_p = 1)}{p(A^t \mid A^s, H_p = 0)}$$

$$p(A^t \mid A^s, H_p = h) = \sum_{X \in \chi} p(A^t \mid A^s, X, H_p = h)\, p(X \mid A^s, H_p = h)$$

where $A$ is an attribute set, $X$ describes the node mapping, and $H_p$ is the switching variable for node copying. $p(X \mid A^s, H_p = h)$ is the transformation probability of VCP and $p(A^t \mid A^s, X, H_p = h)$ is that of ATP. VCP consists of deletion and insertion operations. Using the concept of edit distance, the costs of insertion, deletion and relabeling are defined and used to calculate the similarity measure between two trees. We define the deletion and insertion costs experimentally. It is obvious that the cost of deleting a leaf node differs from that of deleting a trunk node. We define nodes that are located on the path between non-deletion candidates as internal nodes, and the remaining nodes as external nodes. Generally, the deletion cost of an external node is lower than that of an internal node, because internal nodes are closely linked with deletion candidates.
Fig. 3. VCP calculation
By the same reasoning, the insertion cost of an external node is lower than that of an internal node. $p(X \mid A^s, H_p = h)$ is defined as follows:

$$p(X \mid A^s, H_p = h) = \frac{C^{-1}(A^s, X)}{\sum_{X' \in \chi} C^{-1}(A^s, X')}, \qquad C(A, X) = 1 + \mathrm{Cost}(A, X)$$

where $\mathrm{Cost}(A, X)$ is the summation of the insertion and deletion costs. The relabeling cost is determined by the normal distribution learned from the training data, so the transformation probability of ATP is unchanged:

$$p(A^t \mid A^s, X, H_p = h) = \prod_{(i,k)} p(a_k^t \mid a_i^s, x_{ik}) \prod_{(i,k),(j,l)} p(a_{kl}^t \mid a_{ij}^s, x_{ik}, x_{jl}).$$
In the previous ARG matching algorithm, the probability of VCP depends only on $Z$, the number of possible mappings $X$ (as shown in Fig. 3), because the transformation probability obtained by copying nodes is assumed to be the same for every $X$. In ART matching, deletion and insertion costs are considered, so the transformation probabilities of VCP differ from one another, as shown in Fig. 3. Generally, when many nodes are copied, the insertion and deletion costs become small and the probability of VCP becomes large, because copying many nodes does not require many deletion and insertion operations. Thus, the structural information of $T^s$ affects the calculation of the probability of VCP.
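The VCP prior above can be evaluated by scoring each candidate mapping with its total insertion and deletion cost and normalizing, as in this sketch; the candidate enumeration and cost values are assumed given, and the numbers below are placeholders rather than learned costs.

```python
def vcp_probability(costs):
    """Given Cost(A, X) for each candidate mapping X (sum of insertion and
    deletion costs), return p(X | A^s, H_p) = C^{-1}(X) / sum_X' C^{-1}(X')
    with C(A, X) = 1 + Cost(A, X), as defined above."""
    inv = [1.0 / (1.0 + c) for c in costs]
    total = sum(inv)
    return [v / total for v in inv]

# Example: three candidate mappings; the one that copies more nodes has a
# smaller insertion/deletion cost and therefore a higher VCP probability.
print(vcp_probability([0.5, 2.0, 4.0]))
```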
4 Experiments We implemented a 3D model search system using the stochastic ART matching method and performed experiments. In this system, we construct a tree from each 3D model by treating the components of the 3D model as vertices and the parent-child relationships between components as edges. We employed the D2 method and our previous 3D model search system as the algorithms for comparison. D2 uses a histogram of distances between random point pairs on a 3D model surface and gives good search results, especially for models whose shape is simple and typical; however, D2 performs poorly when searching for complicated models. Our previous 3D model search system uses the stochastic ARG matching method. In the comparison with D2, we employed the D2 histogram as the vertex attribute of an ART, and in the comparison with our previous method, we employed the same features as the previous method (surface area, position, flatness and sphericity). We built a 3D model database containing 149 models classified into the 21 classes shown in Fig. 4. The parent-child relationships of these 3D models were defined manually. We used 100 attribute pairs obtained from the database for learning and constructed the covariance matrices. We used three evaluation measures, "First tier", "Second tier" and "Top match", defined as follows.

First tier: this criterion is the percentage of the top $(k-1)$ matches (excluding the query) that are from the query's class, where $k$ is the number of 3D models in that class.
Fig. 4. Classes of 3D Model Database
$$\text{First tier} = \frac{\text{top } (k-1) \text{ matches from the query's class}}{k-1}$$

Second tier: this criterion is of the same type as "First tier", but for the top $2(k-1)$ matches:

$$\text{Second tier} = \frac{\text{top } 2(k-1) \text{ matches from the query's class}}{k-1}$$

Top match: this criterion is the percentage of tests in which the top match was from the query's class.

The results of the comparison with D2 are shown in Figs. 5 and 6 and Table 1. In Figs. 5 and 6, the numbers on the horizontal axes are the class numbers of Fig. 4.
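Given a ranked result list (query excluded) and the class size k, the three measures can be computed as sketched below; the function names are ours.

```python
def first_tier(ranked_classes, query_class, k):
    """Fraction of the top (k-1) retrieved models that share the query's class."""
    top = ranked_classes[:k - 1]
    return sum(c == query_class for c in top) / (k - 1)

def second_tier(ranked_classes, query_class, k):
    """Same as first tier, but over the top 2(k-1) retrieved models."""
    top = ranked_classes[:2 * (k - 1)]
    return sum(c == query_class for c in top) / (k - 1)

def top_match(ranked_classes, query_class):
    """1 if the very first retrieved model is from the query's class, else 0."""
    return 1.0 if ranked_classes and ranked_classes[0] == query_class else 0.0
```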
Fig. 5. Results of 3D model search (Stochastic ART matching)
Fig. 6. Results of 3D model search (D2)

Table 1. Results of 3D model search (comparison with D2)

Method                    First tier  Second tier  Top match  Search time
Stochastic ART matching   0.41        0.57         0.47       5.2 sec.
D2                        0.38        0.49         0.61       2.3 sec.
As for both "First tier" and "Second tier", the averages of our system's results are better than those of D2. Although the average search time is worse than that of D2, this is not a serious problem because parallel processing techniques can be used to reduce the calculation cost. The results of the comparison with stochastic ARG matching are shown in Figs. 7 and 8 and Table 2. The averages of the stochastic ART matching results are better than those of stochastic ARG matching, and its search time is also better: we succeeded in reducing the calculation time to 7% of that of stochastic ARG matching.
Fig. 7. Results of 3D model search (Stochastic ART matching)
Fig. 8. Results of 3D model search (Stochastic ARG matching)

Table 2. Results of 3D model search (comparison with stochastic ARG matching)

Method                    First tier  Second tier  Top match  Search time
Stochastic ART matching   0.46        0.62         0.47       23.2 sec.
Stochastic ARG matching   0.41        0.56         0.41       330.3 sec.
5 Conclusion and Future Works In this paper, we defined the Attributed Relational Tree (ART) instead of the ARG and proposed a stochastic ART matching method using the concept of edit distance. We also applied it to 3D model search. In the experiments, the proposed matching method achieved better search results than the D2 method and the stochastic ARG matching method. In terms of calculation cost, we succeeded in reducing the calculation time to 7% of that of stochastic ARG matching. As future work, we will study attribute selection to obtain better results. We will also apply the stochastic ART matching method to the video scene search system and the motion data search system that we have been developing using the stochastic ARG matching method [6].
References 1. Bespalov, D., Shokoufandeh, A., Regli, W.: Reeb graph based shape retrieval for CAD. In: Proceedings of the 2003 ASME Design Engineering Technical Conferences (DETC 2003), Chicago, vol. 1, pp. 229–238 (September 2003) 2. Gagvani, N., Silver, D.: Parameter controlled volume thinning. Graphical models and image processing archive 61(3), 149–164 (1999) 3. Hiraga, M., et al.: Topology matching for fully automatic similarity estimation of 3D shapes. In: Proceedings of SIGGRAPH 2001, pp. 203–212 (2001)
4. Kaku, K., Okada, Y., Niijima, K.: Similarity measure based on OBBTree for 3D model search. In: Proceedings of International Conference on Computer Graphics Imaging and Visualization (CGIV 2004), pp. 46–51. IEEE CS Press, Los Alamitos (2004) 5. Laga, H., Takahashi, H., Nakajima, M.: Spherical wavelet descriptors for contentbased 3D model retrieval. In: Proceedings of IEEE International Conference Shape Modeling and Applications, pp. 15–25 (2006) 6. Nakamura, N., Takano, S., Okada, Y.: 3D multimedia data search system based on stochastic ARG matching method. In: Huet, B., Smeaton, A., Mayer-Patel, K., Avrithis, Y. (eds.) MMM 2009. LNCS, vol. 5371, pp. 379–389. Springer, Heidelberg (2009) 7. Osada, R., et al.: Matching 3D models with shape distributions. In: Proceedings of International Conference on Shape Modeling and Applications, pp. 154–165 (2001) 8. Paquet, E., Rioux, M.: Nefertiti: a query by content system for three-dimensional model and image databases management. Image and Vision Computing 17(2), 157–166 (1999) 9. Sundar, H., Silver, D., Gagvani, N., Dickenson, S.: Skeleton based shape matching and retrieval. In: Proceedings of the Shape Modeling International 2003 (SMI 2003), pp. 130– 139 (2003) 10. Vranic, D., Saupe, D.: 3D shape descriptor based on 3D fourier transform. In: Proceedings of the EURASIP Conference on Digital Signal Processing for Multimedia Communications and Services (ECMCS 2001), pp. 271–274 (2001) 11. Zhang, D., Chang, S.: Stochastic attributed relational graph matching for image nearduplicate detection. Columbia university ADVENT technical report #206-2004-6, Columbia university (2004)
A Novel Horror Scene Detection Scheme on Revised Multiple Instance Learning Model

Bin Wu1, Xinghao Jiang1,2,*, Tanfeng Sun1,2, Shanfeng Zhang1, Xiqing Chu1, Chuxiong Shen1, and Jingwen Fan1

1 School of Information Security Engineering, Shanghai Jiao Tong University
2 Shanghai Information Security Management and Technology Research Key Lab
{benleader,xhjiang,tfsun,may3feng,xqchu,bear0811,DENISEFAN}@sjtu.edu.cn
Abstract. Horror scene detection is a research problem with much practical use. Supervised methods require the training data to be labeled manually, which can be tedious and onerous. In this paper, a more challenging setting of the problem, without complete information on data labels, is investigated. In particular, as horror scenes are characterized by multiple features, the problem is formulated as a special multiple instance learning (MIL) problem – Multiple Grouped Instance Learning (MGIL) – which requires only partially labeled training data. To solve the MGIL problem, a learning method is proposed – Multiple Distance Expectation Maximization Diverse Density (MD-EMDD). Additionally, a survey is conducted to collect people's opinions based on the definition of horror scenes. Combined with the survey results, Labeled with Ranking – MD-EMDD is proposed and demonstrates better results when compared to traditional MIL algorithms, close to the performance achieved by supervised methods. Keywords: Horror Scene Detection; Multi-Instance Learning; Machine Learning.
1 Introduction With the ever more rapid development of the Internet and the prevalent use of video capture devices, it is now much more convenient for people to upload their videos or movies. However, some of them may contain horror content, and parents are concerned about the negative influence of such scenes on children. Therefore, an effective automatic filtering algorithm for horror scenes is needed. There has been some related work on scene detection. W.J. Gillespie et al. used an RBF network to classify videos [1]. Hana applied the conventional MLP (Multiple Layer Perceptron) neural network and SVM (Support Vector Machine) to classify crime scenes [2]. Moncrieff et al. studied affect computing in film through sound energy dynamics [3]. Xinghao Jiang et al. adopted a Second-Prediction strategy for video pattern recognition [4]. Zhiwei Gu et al. [5] proposed a Multi-Layer Multi-Instance Learning method for video concept detection.

* Corresponding author.
However, there are some deficiencies in previous work. First, in [1][2][3][4], a supervised method was adopted. In order to achieve high accuracy, supervised methods require every instance to be labeled, so the labeling work demands considerable effort before the training process. Moreover, since horror scenes are generally not distributed evenly in videos, a supervised method that treats the whole video as an instance does not work effectively in such circumstances. Second, [3] mainly discussed the sound effects of horror scenes and did not take visual features into consideration; in addition, the dataset contained only 4 films. Third, the method proposed in [5] was designed to detect video concepts and is not directly applicable to horror detection (as will be demonstrated in Section 2). Fourth, the relationship among the multiple features of horror scenes has seldom been discussed. Therefore, in order to make the labeling work more efficient, this paper investigates a more challenging setting of the problem, in which the information on data labels is incomplete – Multiple Instance Learning (MIL). In addition, the relationship among the features of horror scenes is discussed. Since different features probably differ in their contribution to characterizing a horrifying atmosphere, we revise MIL and propose the Multiple Grouped Instance Learning (MGIL) problem. Then, a new learning method – Multiple Distance EMDD (MD-EMDD) – is proposed based on EM-DD [6] to tackle the MGIL problem, and especially the horror scene detection problem. Besides, a survey is conducted to collect surveyees' scores on various videos. MD-EMDD is then combined with the score ranking, and Labeled with Ranking – MD-EMDD is proposed. The rest of this paper is organized as follows. Section 2 discusses horror scene detection as an MIL problem. Section 3 proposes two learning methods based on EM-DD for the horror scene detection problem: Multiple Distance EMDD (MD-EMDD) and Labeled with Ranking – MD-EMDD. Finally, the proposed methods are applied to the horror scene detection problem, and the experiments in Section 4 show that the performance of our method is better than previous MIL methods and close to the performance achieved by supervised methods.
2 Horror Scene Detection as an MIL Problem MIL is a set of algorithms designed to solve multiple instance problems, where instances are packaged into bags. A bag is defined as positive if there is at least one positive instance in it, and negative if all instances in it are negative. MIL was first proposed by Dietterich et al. [6] in order to solve the drug activity prediction problem. More recent studies have extended MIL from the drug problem to other areas. For example, Zhiwei Gu et al. [5] proposed a Multi-Layer Multi-Instance Learning method for video concept detection, and Rouhollah Rahmani et al. [7] presented a localized CBIR system using MIL. Many horror videos consist mainly of non-horror scenes, while the rest are very horror-intensive and may have a profound negative influence on children. Figure 1 gives an example: it is a typical video in which the effect takes place only in the last 2-3 seconds. This video was selected in our survey (refer to Section 5.2) and rated first by the surveyed subjects. Thus, the goal of detecting horror videos is to find the part that has the highest matching probability with the definition of horror; the video can then be labeled according to that probability.
Fig. 1. The three screen shots are from a typical horror video. From the beginning to 00:00:17, there is a lovely girl dancing gracefully in the middle of the screen (as the first two screen shots show). However, at 00:00:18, a horrible figure suddenly jumps into the screen, with a sudden scream. Below the screen shots is the corresponding audio track wave; three orange arrowheads indicate the corresponding positions of the three screen shots in the track wave.
In practice, videos are segmented into scenes. Suppose a video has been segmented into K scenes. The probability that the video contains horror content can be expressed as

$$P(L = 1 \mid V, \Theta) = \max_{i = 1, \dots, K} P(l_i = 1 \mid s_i, \Theta) \quad (1)$$

where $V = \{s_1, \dots, s_K\}$ denotes the K scenes of the video, $L \in \{0, 1\}$ is a binary label indicating whether the video is horror or not, $l_i$ is the label of scene $s_i$, and $\Theta$ is the hypothesis, which consists of a set of model parameters to be determined. The above is the analysis of the "video-scene" structure of a horror video. Next, a single scene will be discussed.
Fig. 2. The three screen shots illustrate some typical features of horror scenes: (a) a horror figure that suddenly appears on the screen, which leads to a high motion intensity; (b) a screen shot whose dominant color is blood-red, one of the colors commonly used in horror scenes; (c) the audio track wave of a scene, where the sudden increase is caused by a woman's scream.
There is no authoritative definition of horror scenes. However, some points of view are widely accepted: 1) The sound track of a horror scene often has a direct bearing on the impact of its visual component, e.g. screaming, sudden changes of volume and changes in sound energy intensity over time. 2) Color is also distinctive: in horror scenes the dominant colors are usually dark green, blood-red, etc. 3) Motion intensity is another feature; directors usually create horror effects with a set of shots of high motion intensity. 4) Horror figures are also important in horror scenes, for example Regan MacNeil (The Exorcist), Chucky (Child's Play) and Jigsaw (Saw).
Figure 2 gives three examples of typical features of horror scenes. As shown, each feature helps characterize a horror scene. Therefore, labeling horror scenes is a process of training and predicting over these features. With the features extracted, horror scene labeling can be formulated as a machine learning problem: estimating the probability that a given scene is rated horror. Suppose $\{f_1, \dots, f_n\}$ is the set of features extracted from a scene. The probability that a feature matches the horror genre can be expressed as (for the $j$-th feature):

$$P(l = 1 \mid f_j, \theta_j) \quad (2)$$

Suppose there exists a mapping $F$ over the per-feature probabilities, and define an operation "+" which satisfies:

$$P(l = 1 \mid s, \Theta) = F\big(P(l = 1 \mid f_1, \theta_1), \dots, P(l = 1 \mid f_n, \theta_n)\big) = P(l = 1 \mid f_1, \theta_1) + \dots + P(l = 1 \mid f_n, \theta_n) \quad (3)$$
As long as an F and a "+" can be found that decrease the prediction error, that F and "+" can be considered reasonable. Therefore, the core issues of the above formulation are: 1) the learning of Pr(L = 1 | f_i, h); 2) finding a suitable F and "+". As for the first point, the supervised method is most commonly adopted, which requires each scene to be labeled as accurately as possible instead of labeling the whole video. This process costs much effort and time. In addition, since labeling scenes is more ambiguous than labeling videos (a video can be labeled positive if it contains any horror scene), labeling scenes by users' own judgments may also introduce ambiguity. Therefore, if videos can be labeled directly, the detection technique becomes more practical and the ambiguity introduced by labeling instances is reduced. Suppose video V consists of K scenes, and {l_1, …, l_K} represents the K scenes' labels. According to the analysis above, V's label can be expressed as:

L = l_1 ∨ l_2 ∨ … ∨ l_K    (4)
Intuitively, this matches the multiple instance learning (MIL) problem: a set of scenes is grouped into a bag (the video), and each scene is an instance. Therefore, the horror scene detection problem can be formulated as the bag-labeling problem in MIL. Since MIL can be applied to solve the "video-scene" level problem, now consider the possibility of solving the "scene-feature" level problem, as stated in [4]. Thus, consider:

Pr(L = 1 | S, h) = Pr(L = 1 | f_1, h) ∨ … ∨ Pr(L = 1 | f_N, h)    (5)
However, as stated above, horror scenes are characterized by multiple features; therefore, applying "∨" may not achieve a good result (as demonstrated in Section 5). One solution is to concatenate the features into a single new feature, in which case there is only one feature in (2). However, this method ignores the possibility that each feature may contribute differently to the horror effect. According to the discussion above and the idea of (4), F and "+" can be expressed as:

F = ∪,  Pr(L = 1 | S, h) = Σ_{i=1}^{N} w_i · Pr(L = 1 | f_i, h),  with Σ_{i=1}^{N} w_i = 1    (6)
In the discussion above, a three-level "video-scene-feature" model has been set up. However, in practice, a scene consists of features and no single entity exists as a "scene". Therefore, the horror scene detection model should be "video-feature", which can be expressed as:

Pr(L = 1 | V, h) = Σ_{i=1}^{N} w_i · Pr(L = 1 | f_i, h)    (7)
Fig. 3. The formulation of horror scene detection as an MIL problem
Apparently, if traditional MIL were adopted, a video would be detected as soon as one type of feature matched the horror definition, which may introduce a large error by ignoring the other instances in the same group. Here, an instance in the video is a type of feature, and a scene consists of a group of instances. Figure 3 shows this "bag-group-instance" model. Thus, we believe the proposed problem is a special MIL problem: there are a number of instance groups in a bag, and a bag is labeled positive only if at least one positive group exists, i.e., its instances as a group match the horror definition. Therefore, this special MIL problem is coined Multiple Grouped Instance Learning (MGIL). In view of this, a method is proposed based on EM-DD: Multiple Distance EM-DD (MD-EMDD). In particular, when w_i = 1/N, MD-EMDD reduces to non-weighted MD-EMDD. Furthermore, the horror scores collected in the survey can be integrated into the training data and combined with MD-EMDD, yielding Labeled with Ranking MD-EMDD (LR-MD-EMDD).
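To make the weighted feature-level combination in (6)/(7) concrete, the following is a minimal sketch in Python. The per-feature probabilities and the weights are purely illustrative values, not results from the paper; only the weighted-sum form itself comes from the formulation above.

```python
import numpy as np

def fuse_feature_probabilities(probs, weights):
    """Weighted fusion of per-feature horror probabilities, cf. (6)/(7).

    probs   : Pr(L=1 | f_i, h) for the N feature types of one scene/video
    weights : non-negative weights w_i that sum to 1
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"
    return float(np.dot(weights, probs))

# Illustrative values only: per-feature probabilities for one scene and
# hand-picked weights (uniform weights give the non-weighted variant).
p_features = [0.9, 0.4, 0.7, 0.2, 0.6]   # e.g. color, audio, motion, ...
w = [0.25, 0.20, 0.15, 0.10, 0.30]
print(fuse_feature_probabilities(p_features, w))  # fused horror probability
```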
3 Learning Method

3.1 Multiple Distance – EMDD (MD-EMDD)

EM-DD [8] is a multiple-instance (MI) learning technique that combines EM with the Diverse Density (DD) [9] algorithm. It is relatively insensitive to the number of relevant attributes in the data set, and its running time does not change much as bag sizes grow. EM-DD operates in a single feature space for the traditional MIL problem, so it cannot be directly used for the MGIL problem. Therefore, a special method based on EM-DD, Multiple Distance – EMDD (MD-EMDD), is proposed.
In MD-EMDD, the maxDD point of each feature space is first calculated by EM-DD. Then, since each instance group consists of different feature values corresponding to different feature spaces, the Euclidean distance cannot be computed as in [8]. In order to find the decision threshold, three steps are followed:

1. Suppose there are N feature spaces and every feature is equally important. For the j-th instance group and the i-th feature, the Euclidean distance from the i-th feature of the j-th instance group to the i-th feature space's maxDD point t_i is computed:

d_ij = ED(f_ij, t_i)    (8)

Here, ED is the normalized Euclidean distance function. Then define the distance from the j-th instance group to the maxDD points as:

D_j = Σ_{i=1}^{N} w_i · d_ij,  with Σ_{i=1}^{N} w_i = 1    (9)

2. Under the weights {w_1, …, w_N}, a corresponding threshold T is obtained by 10-fold cross validation, which minimizes the average error (the number of wrongly predicted labels) between the training bag labels {L_1, …, L_K} and the predicted training bag labels {L'_1, …, L'_K}:

T = arg min_T Σ_{k=1}^{K} |L_k − L'_k|    (10)

L'_k = 1 if the bag distance of the k-th training bag is below T, and 0 otherwise    (11)

This threshold is used to predict the training bags' labels and obtain an error, denoted as:

e = (1/K) Σ_{k=1}^{K} |L_k − L'_k|    (12)

3. Optimize {w_1, …, w_N} and go back to step 2. When the average error (12) reaches its minimum, the corresponding weights are the desired result; name them {w*_1, …, w*_N} and the corresponding threshold T*.

To predict a bag, as in step 1, the Euclidean distance is computed from each feature of each instance group to every feature space's maxDD point; then the weighted distance of each instance group is obtained. Suppose there are M instance groups in a test bag; since MGIL considers only the most likely instance group, the bag distance is

D = min_{j=1,…,M} Σ_{i=1}^{N} w*_i · d_ij    (13)

This distance is then compared with the threshold T* found in step 3 in order to predict the test bag's label:

L = 1 if D < T*, and 0 otherwise    (14)
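A minimal sketch of the MD-EMDD prediction stage, following the reconstruction of steps 1-3 above. The data layout, the per-dimension scales used for normalization, and the use of the minimum group distance to represent a bag are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def md_emdd_predict(bag, maxdd_points, scales, weights, threshold):
    """Predict the label of one bag (video) under the MD-EMDD sketch above.

    bag          : list of instance groups; each group is a dict {feature_name: vector}
    maxdd_points : {feature_name: maxDD point vector} learned per feature space by EM-DD
    scales       : {feature_name: per-dimension scale vector} used to normalize distances
    weights      : {feature_name: w_i}, with the w_i summing to 1
    threshold    : decision threshold found on the training bags
    """
    def normalized_ed(x, t, s):
        # normalized Euclidean distance between an instance feature and a maxDD point
        diff = (np.asarray(x) - np.asarray(t)) / np.asarray(s)
        return float(np.sqrt(np.sum(diff ** 2)))

    group_dists = []
    for group in bag:
        # weighted distance of one instance group, cf. (9)
        d = sum(weights[f] * normalized_ed(group[f], maxdd_points[f], scales[f])
                for f in weights)
        group_dists.append(d)
    bag_dist = min(group_dists)   # assumption: the most horror-like group represents the bag
    return (1 if bag_dist < threshold else 0), bag_dist
```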
3.2 Labeled with Ranking – MD-EMDD (LR-MD-EMDD)

If the bags carry more information, such as horror scores, the distances and the score ranking can be connected. First, for the K videos, {r_1, …, r_K} is the ranking sequence of the videos' scores, where r_k denotes the rank of the k-th video. Second, for all the videos, apply steps 1 and 2 of MD-EMDD and denote the distance set obtained from (9) as {D_1, …, D_K}; according to these distances, a new ranking sequence {r'_1, …, r'_K} is obtained. Third, denote by q the number of videos whose rank in {r'_k} differs from that in {r_k}. The error e is then redefined as:

e = (1/K) (Σ_{k=1}^{K} |L_k − L'_k| + q)    (15)

Fourth, apply step 3 of MD-EMDD to obtain the optimized weights. Since bags are labeled not only with binary "is/not" labels but also with a ranking of scores, the method is called Labeled with Ranking – MD-EMDD (LR-MD-EMDD).
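Under the reconstruction above, the ranking term q in (15) only requires comparing two rank orders. The sketch below is illustrative; it assumes that smaller bag distances correspond to higher surveyed horror scores, and the sample numbers are made up.

```python
import numpy as np

def ranking_disagreement(scores, distances):
    """Number of videos whose rank by surveyed horror score differs from
    their rank by MD-EMDD bag distance (smaller distance = more horror-like)."""
    score_rank = np.argsort(np.argsort(-np.asarray(scores)))   # highest score -> rank 0
    dist_rank = np.argsort(np.argsort(np.asarray(distances)))  # smallest distance -> rank 0
    return int(np.sum(score_rank != dist_rank))

# Illustrative 5-video group from the survey (scores) and their bag distances.
scores = [4.5, 2.0, 3.8, 1.2, 2.9]
dists  = [0.21, 0.55, 0.30, 0.80, 0.47]
q = ranking_disagreement(scores, dists)   # 0 here, since the two orders agree
```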
4 Experiments

In this section, the training and testing dataset and the experimental set-up for the horror scene detection problem, including how features are selected, are described. Then, the performance of the proposed learning methods is presented and compared with previous methods.

4.1 Dataset

In order to generate positive bags, comments on horror movies and the "Top 50 horror movies" ranked by a famous movie website [10] were taken into consideration. Fifty classic horror movies, such as "The Exorcist", "Saw", and "Kill Bill", were selected and segmented into 1200 pieces of video. In fact, MGIL only considers the most likely instance group in a bag, so as long as a video (bag) contains a number of scenes (instance groups) it will be detected; therefore, the detection result does not change much as a video's length increases. Furthermore, in order to mine humans' sense of horror, a survey on horror video was conducted. First, 300 videos were selected, all of which were labeled "horror" or "not horror" by the surveyed subjects. Moreover, 50 of the 300 videos were chosen to be scored by audiences for their "horror" level. Specifically, for the practicability of the survey, the 50 videos were divided into 10 groups of 5 videos each (see Sections 3.2 and 4.2). Each subject was randomly given one group of videos and was required to score all the videos in that group, which guarantees that the same subject scored all videos in a group; only under this prerequisite can the videos in the same group be compared with each other. The survey lasted about 4 months and received 6,920 valid questionnaires. As for the negative bags, 300 videos of multiple topics were chosen, for example talk shows, news broadcasts, sports, games, humor shows, and music TV.
40 out of the 300 were selected as training data, while the others were used as test data. The total of 600 videos was automatically segmented into 5825 scenes (instance groups).

4.2 Experiment Set-Up

First, the EM-DD method was used to test the video descriptors' accuracy. 19 visual and audio MPEG-7 descriptors were adopted, as well as 1 motion level descriptor, which is designed to examine the contents of consecutive images. For each of the 20 descriptors, data were trained and tested in the corresponding feature space; all descriptors used the same training and testing data. Table 1 summarizes the result.

Table 1. Accuracy of each descriptor using EM-DD (including 19 MPEG-7 descriptors and 1 motion level descriptor). The five adopted descriptors are marked with an asterisk.

Descriptor                        Accuracy
Dominant Color                    0.540
Color Layout                      0.575
Color Structure*                  0.737
Scalable Color                    0.471
Homogeneous Texture               0.534
Edge Histogram                    0.579
Audio Fundamental Frequency*      0.602
Audio Harmonicity                 0.542
Audio Signature*                  0.591
Audio Spectrum Centroid           0.591
Audio Spectrum Distribution       0.483
Audio Spectrum Spread*            0.598
Background Noise Level            0.510
Band Width                        0.532
Dc Offset                         0.575
Harmonic Spectral Centroid        0.537
Harmonic Spectral Deviation       0.571
Harmonic Spectral Spread          0.552
Harmonic Spectral Variation       0.526
Motion Level*                     0.591
As shown in Table 1, 5 descriptors were adopted in the following experiments. For comparison purposes, the performances of 5 algorithms were evaluated: non-weighted MD-EMDD, MD-EMDD, and LR-MD-EMDD are proposed for the MGIL problem; EM-DD is a widely used traditional MIL algorithm; and the support vector machine (SVM) is a supervised learning algorithm. The first three algorithms work only with the bag labels, while the SVM works in a supervised setting with all instances labeled; its training dataset was selected from the videos' scenes and labeled by us. EM-DD, non-weighted MD-EMDD, MD-EMDD, and LR-MD-EMDD were run with 20 randomly chosen starting points, and the hypotheses resulting from the multiple runs were combined by averaging. For the SVM, the dataset was labeled at the scene level and 2751 instances were selected from the same dataset as the previous 4 methods. Each feature's accuracy was tested separately and the top 5 were selected; the 5 features' weights were then set according to their accuracies, and the 5 features were finally fused with their weights.
Fig. 4. Comparison of accuracies of horror scene detection achieved by 5 methods. C: accuracy over correctly detected bags (MIL)/scenes (SVM); P: accuracy over positive bags (MIL)/scenes (SVM); N: accuracy over negative bags (MIL)/scenes (SVM).

Table 2. Detailed accuracies achieved by the 5 methods. N_c: number of correctly detected bags (MIL)/scenes (SVM); N_fn: number of positive bags (MIL)/scenes (SVM) detected as negative; N_fp: number of negative bags (MIL)/scenes (SVM) detected as positive.

Algorithm                N_c     N_fn    N_fp    Accuracy
LR-MD-EMDD               487     24      89      0.812
MD-EMDD                  478     28      94      0.797
Non-weighted MD-EMDD     436     36      128     0.727
EM-DD                    379     22      199     0.632
SVM (supervised)         2256    132     363     0.820
The SVM method was implemented with LIBSVM¹. We use the RBF kernel and select the best parameters c and γ by cross validation.

4.3 Results

Figure 4 and Table 2 summarize the comparison of the accuracies achieved by the 5 methods. First, non-weighted MD-EMDD outperforms the traditional EM-DD by 9.5%. Second, MD-EMDD outperforms non-weighted MD-EMDD by 7%, which indicates the importance of exploiting the MGIL problem. Third, LR-MD-EMDD performs 1.5% better than MD-EMDD, which indicates the significance of combining score information with MD-EMDD. The accuracy of LR-MD-EMDD is close to that of the SVM, with a margin of 0.8%.
1 Chih-Chung Chang, Chih-Jen Lin. LIBSVM: a library for support vector machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Furthermore, the 4 MIL methods differ little in the number of positive bags detected as negative, but differ greatly in the number of negative bags detected as positive. Intuitively, in EM-DD a video is labeled positive if any single feature is predicted positive; therefore it only achieves 63.2%. In contrast, the 3 proposed methods constrain each feature and take all features into consideration, and thus perform much better on the detection of negative scenes. Moreover, the more accurately the importance of the features is mined, the better the performance.
Fig. 5. Comparison of labeling time per bag (in multiples of the video length, for MD-EMDD, LR-MD-EMDD, and SVM) and running time per bag (in seconds, for LR-MD-EMDD and EM-DD) for horror scene detection
Figure 5 compares the training-data labeling time of MD-EMDD, LR-MD-EMDD, and SVM. Thirty videos with a total length of 150 minutes were tested. The SVM requires 334 minutes of labeling, while MD-EMDD requires a total of 29 minutes, saving 91% of the labeling time, and LR-MD-EMDD requires 115 minutes, saving 50%. To explain the results: MD-EMDD only needs to know whether a video contains a positive scene, so users do not need to sort through the whole video; for LR-MD-EMDD, users only need to go through the video once; for the SVM, however, users must first watch the video and then remove the negative parts. Note that these times do not include the video segmentation process. Also, since EM-DD performs well in running time, EM-DD and LR-MD-EMDD are compared: LR-MD-EMDD costs 62 seconds per bag, only 5 seconds more than EM-DD. The reason is that the cost of LR-MD-EMDD is dominated by EM-DD, and the different features can be trained and tested simultaneously.

Table 3. Weights w of the five adopted descriptors obtained by MD-EMDD

Descriptor                       w
Color Structure                  0.234
Audio Spectrum Spread            0.212
Audio Fundamental Frequency      0.150
Audio Signature                  0.103
Motion Level                     0.301

Table 4. Weights w of the five adopted descriptors obtained by LR-MD-EMDD

Descriptor                       w
Color Structure                  0.251
Audio Spectrum Spread            0.211
Audio Fundamental Frequency      0.131
Audio Signature                  0.235
Motion Level                     0.172
Finally, Table 3 and Table 4 give the weights of the five adopted descriptors obtained by MD-EMDD and LR-MD-EMDD. As shown, the visual features account for about 43% of the total weight, which indicates the significance of introducing visual features.
5 Conclusions

In this paper, the horror scene detection problem is investigated with incomplete information on training data labels. Specifically, the problem is formulated as a Multiple Grouped Instance Learning (MGIL) problem, and a discriminative learning method is proposed to solve it. A method to effectively make use of the "scores" of bags is also demonstrated, which offers a new approach to ambiguous concept detection. The newly devised method is shown to be more effective than traditional MIL algorithms and close to the performance achieved by the supervised method. As for future work, we plan to: 1) discuss the possible co-effects of different features; 2) exploit the multiple-layer structure that is special to video and combine it with our method; 3) extract high-level features to achieve higher accuracy; 4) optimize our method by combining LR-MD-EMDD with SVM to solve MGIL problems.
Acknowledgements Project supported by The National Natural Science Foundation of China (No.60802057, No.61071153), Shanghai Rising-Star Program (10QA1403700) and Shanghai College Students Innovation Project (IAP3027).
References

1. Gillespie, W.J., Nguyen, D.T.: Video Classification Using a Tree-Based RBF Network. In: IEEE International Conference on Image Processing, vol. 3, pp. 465–468 (2005)
2. Hana, R.O.A., Freitas, C.O.A., Oliveira, L.S., Bortolozzi, F.: Crime scene classification. In: Proceedings of the ACM Symposium on Applied Computing, pp. 419–423 (2008)
3. Moncrieff, S., Venkatesh, S.: Horror Film Genre Typing and Scene Labeling via Audio Analysis. In: ICME 2003, pp. 193–197 (2003)
4. Jiang, X., Sun, T., Chen, B.: Automatic Video Pattern Recognition Based on Combination of MPEG-7 Descriptors and Second-Prediction Strategy. In: Second International Symposium on Electronic Commerce and Security, pp. 199–202 (2009)
5. Gu, Z., Mei, T., Hua, X.-S.: Multi-Layer Multi-Instance Learning for Video Concept Detection. IEEE Transactions on Multimedia 10(8), 1605–1616 (2008)
6. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence 89(1-2), 31–71 (1997)
7. Rahmani, R., Goldman, S.A., Zhang, H., Cholleti, S.R., Fritts, J.E.: Localized Content Based Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue, 1–10 (November 2008)
8. Zhang, Q., Goldman, S.A.: EM-DD: An Improved Multiple-instance Learning Technique. In: NIPS, vol. 14, pp. 1073–1080 (2002)
9. Maron, O., Lozano-Pérez, T.: A Framework for Multiple-instance Learning. In: Advances in Neural Information Processing Systems, vol. 10, pp. 570–576. MIT Press, Cambridge (1998)
10. Top Rated "Horror" Titles (2010), http://www.imdb.com/chart/horror
Randomly Projected KD-Trees with Distance Metric Learning for Image Retrieval Pengcheng Wu, Steven C.H. Hoi, Duc Dung Nguyen, and Ying He School of Computer Engineering, Nanyang Technological University, Singapore {wupe0003,chhoi,nguy0051,yhe}@ntu.edu.sg
Abstract. Efficient nearest neighbor (NN) search techniques for high-dimensional data are crucial to content-based image retrieval (CBIR). Traditional data structures (e.g., the kd-tree) are usually only efficient for low-dimensional data and often perform no better than a simple exhaustive linear search when the number of dimensions is large enough. Recently, approximate NN search techniques have been proposed for high-dimensional search, such as Locality-Sensitive Hashing (LSH), which adopts a random projection approach. Motivated by a similar idea, in this paper we propose a new high-dimensional NN search method, called Randomly Projected kd-Trees (RP-kd-Trees), which projects data points into a lower-dimensional space so as to exploit the advantage of multiple kd-trees over low-dimensional data. Based on the proposed framework, we present an enhanced RP-kd-Trees scheme obtained by applying distance metric learning techniques. We conducted extensive empirical studies on CBIR, which showed that our technique achieves faster search with better retrieval quality than regular LSH algorithms.
1
Introduction
Similarity search plays an important role in content-based image retrieval (CBIR) systems. The images in CBIR are often represented in a high-dimensional space, and the number of images can easily be over millions or even billions for web-scale applications. These challenges have kept CBIR an open problem although it has been extensively studied for several decades. The NN search problem for CBIR has been extensively studied in the literature. A variety of data structures have been proposed for indexing data points in a low-dimensional space [15,5,4,14]. For example, if data points lie in a plane, it can be shown that traditional data structures, such as the kd-tree, can exactly solve the NN search problem in O(log n) time using only O(n) space [15]. However, when the number of dimensions grows, these conventional approaches often become less efficient, a phenomenon known as the curse of dimensionality. Specifically, the time or space requirements of these approaches often grow exponentially with the dimensionality. For example, the approach in [5] has a nice query time of O(d^O(1) log n); however, it costs about O(n^O(d)) space, making it impractical for large applications. While there exist some efficient data structures using only linear or sublinear space [4,14], the best query time of these approaches is
O(min(2^O(d), dn)), which is no better than a simple exhaustive linear search even for moderate dimension d. Until now, researchers have yet to find an efficient solution to the exact high-dimensional NN search problem that avoids the exponential dependence on the dimensionality. Recently, instead of pursuing the exact NN search, researchers have attempted to adopt approximation approaches [10,8,13,2] that remove the exponential dependence on dimensionality. The basic idea is that, instead of finding the nearest point p to the query point q, the approximate NN search algorithm is allowed to return any point within (1 + ε) times the distance from q to p. Recent studies have shown that by adopting this approximation, the high-dimensional NN search problem can be efficiently resolved by reducing the dependence on the dimensionality from exponential to polynomial complexity. Several recent studies, such as Locality Sensitive Hashing (LSH) [10,8], have successfully applied the random projection idea to approximate NN search over high-dimensional data. Motivated by the above results, in this paper we propose a new method for approximate NN search in high-dimensional space using the random projection principle, called Randomly Projected kd-Trees (RP-kd-Trees). The basic idea is to project the high-dimensional data points into a lower-dimensional space and integrate multiple kd-trees, utilizing the advantage of kd-trees for low-dimensional NN search. Besides, to further improve the performance, we also present a machine learning approach to enhancing RP-kd-Trees by applying distance metric learning techniques. The rest of this paper is organized as follows. Section 2 presents the proposed RP-kd-Trees method, which integrates the random projection technique and kd-tree structures in a unified framework to efficiently resolve the approximate nearest neighbor search problem. Section 3 discusses the enhanced RP-kd-Trees scheme obtained by applying distance metric learning techniques. Section 4 discusses the experimental results of applying the proposed RP-kd-Trees technique to content-based image retrieval. Section 5 concludes this work.
2
Randomly Projected KD-Trees
We now present the framework of Randomly Projected kd-Trees (RP-kd-Trees) for approximate NN search on high-dimensional data. We first introduce some relevant techniques, including the basic concept of random projection and the kd-tree data structure, and then present the proposed indexing algorithms.

2.1 Random Projection
Random projection is a technique to reduce the curse of dimensionality while losing little information about the distances between pairs of points in a high-dimensional vector space. In order to project the given points onto a lower-dimensional space, we multiply X by a random matrix M ∈ R^{d'×d}, whose elements are often
drawn from the normal distribution N(0, 1). By doing this, we speed up the computation and make it possible to use existing data structures to handle the image similarity search problem. We expect that random projection approximately preserves pairwise distances, and we describe a technique to overcome the accuracy loss caused by random projections later in this paper. Achlioptas [1] proposed sparse random projections that use, instead of N(0, 1) elements in M, elements in {+1, 0, −1} with probabilities {1/6, 2/3, 1/6}, attaining a threefold speedup in projection processing time. The following theorem gives the performance assurance [1].

Theorem 1. Let X be an arbitrary set of n data points in R^d, represented as a matrix X ∈ R^{d×n}. Given ε, β > 0, let d_0 = (4 + 2β) / (ε²/2 − ε³/3) · log n. For integer d' ≥ d_0, let M be a d' × d random matrix with M(i, j) = M_ij, where the M_ij are independent random variables from the following probability distribution:

M_ij = √3 · ( +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 )

Let E = (1/√d') M X, and define f : R^d → R^{d'} to map the i-th column of X to the i-th column of E. With probability at least 1 − n^{−β}, for all u, v ∈ X we then have

(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²    (1)

Remark. In [12], the authors recommended the use of probabilities {1/(2√d), 1 − 1/√d, 1/(2√d)} for a significant √d-fold speedup, with a slight loss of accuracy.
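A minimal sketch of the sparse projection in Theorem 1: generating M with entries √3·{+1, 0, −1} under probabilities {1/6, 2/3, 1/6} and projecting a data matrix. The data sizes below are illustrative only.

```python
import numpy as np

def sparse_projection_matrix(d_prime, d, rng=None):
    """Achlioptas-style sparse random projection matrix M in R^{d' x d}:
    entries are sqrt(3) * {+1, 0, -1} with probabilities {1/6, 2/3, 1/6}."""
    rng = np.random.default_rng() if rng is None else rng
    signs = rng.choice([1.0, 0.0, -1.0], size=(d_prime, d), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * signs

# Project n points in R^d down to R^{d'} (the 1/sqrt(d') scale is dropped,
# since only relative distances are compared).
d, d_prime, n = 297, 10, 1000
X = np.random.rand(d, n)              # columns are data points
M = sparse_projection_matrix(d_prime, d)
X_proj = M @ X                        # projected points, shape (d', n)
```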
2.2
The KD-Tree Data Structure
The kd-tree [14] is a binary tree structure for storing a finite set of points in k-dimensional space. Every internal node of a kd-tree has a splitting hyper-plane that divides the space into two subspaces: the points to the left of the hyper-plane are represented by the left sub-tree of that node, while the points to the right are represented by the right sub-tree. Thus, each node contains information about all its descendants within a hyper-rectangle. The details of building a kd-tree structure can be found in [14]. The NN search process for a kd-tree is conducted recursively. It starts from the root node and moves down the tree, trying to prune the candidate hyper-rectangles that definitely do not contain nearest neighbors of the query point; a candidate hyper-rectangle is inspected only if some part of it lies within the current best distance to the query point. The number of points inspected is reasonable in low-dimensional space, but usually grows rapidly as the dimensionality of the data points increases. This is what prohibits the use of the traditional kd-tree for indexing high-dimensional data.
2.3
Algorithm
We now proceed to introduce our algorithm, Randomly Projected kd-Trees (RP-kd-Trees), for solving the approximate k-NN problem, which takes advantage of random projection for efficient dimension reduction and the kd-tree data structure for efficient low-dimensional data indexing. First, we generate m projection matrices using Achlioptas's technique [1]. These projection matrices are used to generate m different copies of the projected data set in the lower-dimensional space R^{d'}. These projected data sets are then stored in m corresponding d'-dimensional kd-trees. By performing projections, we make it possible for the kd-tree to handle our data points. For each projection process, we choose a matrix M ∈ R^{d'×d} where M_ij is:

M_ij = √3 · ( +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 )

The matrix M represents a random projection from R^d to R^{d'}. Then, multiplying X, consisting of n vectors in d dimensions, by the matrix M leads to the set of projected points X' ∈ R^{d'×n}. Note that the scale factor 1/√d' in the projection from Theorem 1 can be ignored because we only need to compare pairwise distances.

Algorithm 1. Preprocessing and Indexing of RP-kd-Trees
Input: X - a set of data points, m - number of kd-trees used
Output: RP-kd-Trees T_u, u = 1, ..., m
procedure Preprocessing_Indexing
  for u ← 1, m do
    Initialize kd-tree T_u
    Generate projection matrix M_u
  end for
  for u ← 1, m do
    for i ← 1, n do
      Compute the projected point of x_i with matrix M_u
      Store it into kd-tree T_u
    end for
  end for
end procedure
To find the k nearest neighbors of a given query point q ∈ R^d, we iterate over each of the m structures and proceed as follows. First, the query point is projected into the d'-dimensional subspace corresponding to the kd-tree. We then use this projected query point and the standard nearest neighbor search in the kd-tree to find k nearest neighbors. The answer provided by each kd-tree is only an approximate result for the NN problem, and alone may not be very accurate. To improve the accuracy, we integrate the answers from all the kd-trees by ranking the union set by the distance to the query point q ∈ R^d and returning the top k nearest neighbors. The preprocessing, indexing and querying algorithms are summarized in Algorithms 1 and 2. To attain the final result, one priority queue is maintained; it
keeps the k current best candidate neighbors and is updated whenever a nearer candidate is found. The insert and update operations in the priority queue are very fast, with complexity O(log k). The querying operation over a kd-tree in low-dimensional space is very efficient, with complexity logarithmic in the number of points. Besides, our method is also easy to parallelize by querying multiple kd-trees simultaneously using emerging parallel computing techniques.

Algorithm 2. Approximate Nearest Neighbor Query in RP-kd-Trees
Input: q - a query point, k - number of nearest neighbors
Access: RP-kd-Trees T_u, u = 1, ..., m
Output: k (or fewer) approximate nearest neighbors
procedure ANN-Query
  S ← ∅
  for u ← 1, m do
    Let q' ← M_u q, the projection of the point q onto the d'-dimensional subspace given by M_u
    Let S ← S ∪ {k neighbors returned from T_u with query q'}
  end for
  Rank points in S by the distance to the query point q
  Return the top k nearest neighbors
end procedure
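The following is a hedged sketch of Algorithms 1 and 2 using SciPy's kd-tree as the low-dimensional index. The row-major data layout, the use of a candidate set instead of a priority queue, and the toy sizes are implementation conveniences, not the authors' exact code.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_rp_kd_trees(X, d_prime, m, rng=None):
    """Algorithm 1 sketch: build m kd-trees over m random projections of X (n x d)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    trees, matrices = [], []
    for _ in range(m):
        M = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(d, d_prime), p=[1/6, 2/3, 1/6])
        trees.append(cKDTree(X @ M))     # store projected points in a kd-tree
        matrices.append(M)
    return trees, matrices

def rp_kd_query(q, k, X, trees, matrices):
    """Algorithm 2 sketch: query each tree, pool candidates, re-rank in the original space."""
    candidates = set()
    for tree, M in zip(trees, matrices):
        _, idx = tree.query(q @ M, k=k)          # k approximate neighbors from this tree
        candidates.update(np.atleast_1d(idx).tolist())
    cand = np.array(sorted(candidates))
    dists = np.linalg.norm(X[cand] - q, axis=1)  # exact distances in the original space
    order = np.argsort(dists)[:k]
    return cand[order], dists[order]

# Usage with toy data: n = 10000 points in d = 297 dimensions, m = 10 trees, d' = 10.
X = np.random.rand(10000, 297)
trees, mats = build_rp_kd_trees(X, d_prime=10, m=10)
nn_idx, nn_dist = rp_kd_query(X[0], 5, X, trees, mats)
```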
By performing random projection, we lose some information about the data set. However, with m projection matrices and kd-trees we expect the accuracy to be highly boosted. From Theorem 1, Achlioptas's random projection method preserves pairwise distances within a factor of (1 ± ε) with probability at least γ = 1 − n^{−β}. That means the failure probability of each structure is (1 − γ) = n^{−β}. Then, by integrating the results from m structures, we lower this probability to n^{−βm}. Hence, to achieve the desired probability (1 − δ), the following inequality must hold: 1 − n^{−βm} ≥ 1 − δ. In other words, one can choose β by: β ≥ (−log δ) / (m log n). Therefore, choosing the value of d' as

d' ≥ (4 + 2(−log δ)/(m log n)) / (ε²/2 − ε³/3) · log n

should suffice to provide the quality guarantee.
2.4 Complexity Analysis
Empirically, RP-kd-Trees provides its best performance with a relatively small projected dimension d' (d' = 10 when the original dimension is d = 297 in our experiments). The projection time complexity O(dd') is not significant because both d and d' are not very large. Similarly, ranking the objects in the result sets is very quick because there is usually only a small number of candidates. Thus, the time consumed mostly falls into the process of querying the kd-trees, which is expected to be O(d' log n) for one kd-tree. Besides, it needs space complexity O(nd') to store one kd-tree. The RP-kd-Trees method makes use of m trees, so it has expected time complexity O(md' log n) and space complexity O(mnd'). For the enhanced
RP-kd-Trees by distance metric learning, to be discussed in the subsequent section, the distance metric learning process (when used; see below for details) can usually be performed quite efficiently because the number of items in the training data set is often not large. The only additional computation for the enhanced RP-kd-Trees with distance metric learning is projecting the original data set once with the linear transformation W ∈ R^{d×d} learnt from the training data set, which is O(nd²).
3
Enhancing RP-KD-Trees by Distance Metric Learning
In this section, we consider a machine learning approach to enhancing the indexing performance of RP-kd-Trees for CBIR. In particular, given a training set with side information (pairwise constraints indicating whether image pairs are similar/dissimilar), a well-known technique for improving the distance measure is to explore Distance Metric Learning (DML) techniques, which can improve the performance of RP-kd-Trees by finding more effective distance metrics. In a DML task, we are given a set of n data points in a d-dimensional vector space C = {x_i}_{i=1}^{n} ⊆ R^d, and some side information which is typically provided in the form of two sets of pairwise constraints among the data points. Each pairwise constraint (x_i, x_j) indicates whether two images x_i and x_j are similar ("must-link") or dissimilar ("cannot-link") as judged by users. For image retrieval, such information can easily be collected from real-world systems, such as users' relevance feedback logs in CBIR systems. One key issue of CBIR is to define an appropriate distance measure f(x_i, x_j) to calculate the distance/dissimilarity between any two images x_i and x_j. Specifically, assuming images are represented in a vector space, by specifying a distance metric A ∈ R^{d×d} we can express the general Mahalanobis distance as:

f_A(x_i, x_j) = ||x_i − x_j||²_A = (x_i − x_j)^T A (x_i − x_j) = tr(A (x_i − x_j)(x_i − x_j)^T)    (2)

where A is a symmetric d × d matrix and tr stands for the trace operator. In general, A is a valid metric if and only if it satisfies the non-negativity and triangle inequality properties; in other words, the matrix A must be positive semidefinite (PSD), i.e., A ⪰ 0. In general, A parameterizes a family of Mahalanobis distances on the vector space R^d. As a special case, setting A to the identity matrix I_{d×d} reduces Eqn. (2) to the regular Euclidean distance. Despite its simplicity, the Euclidean distance has some critical limitations: all variables are assumed independent, the variance across all dimensions is 1, and the covariances among all variables are 0. Such a scenario is seldom satisfied in practice. Instead of using the Euclidean distance, it is more desirable to learn an optimal metric from real data. This motivates us to study DML to optimize the matrix/metric A for distance measurement in real applications. In this paper, our goal is to apply DML techniques to improve the performance of RP-kd-Trees. The structure of the proposed RP-kd-Trees makes it feasible to exploit DML techniques in a simple and effective way. In particular, different distance metrics can be learned separately from different projected
data sets in the d'-dimensional space. The learned metrics can be applied to the RP-kd-Trees by a simple projection. Specifically, each Mahalanobis matrix A can be decomposed as A = W^T W. As a result, the distance f(x_1, x_2) is computed as:

f(x_1, x_2) = (x_1 − x_2)^T A (x_1 − x_2) = (x_1 − x_2)^T W^T W (x_1 − x_2) = (W(x_1 − x_2))^T (W(x_1 − x_2))    (3)
Thus, applying a metric A to RP-kd-Trees is equivalent to conducting a projection with the matrix W. It is important to note that this approach does not increase the online query time or require any additional memory. In this paper, we apply several popular DML algorithms, including relevant component analysis [3], discriminative component analysis [9], neighbourhood components analysis [11], and large margin nearest neighbor metric learning [16]. Due to limited space, we skip the discussion of their details.
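A small sketch of how a learned metric A = WᵀW is applied as a one-off linear transform of the data before indexing, per (3). The RCA-style whitening estimate of W from must-link chunklets shown here is only one simple choice among the DML methods listed above, not the authors' exact training procedure; the toy dimensions and data are illustrative.

```python
import numpy as np

def rca_transform(chunklets, eps=1e-6):
    """RCA-style whitening: estimate the average within-chunklet covariance C
    from 'must-link' groups and return W = C^{-1/2}, so that A = W^T W."""
    d = chunklets[0].shape[1]
    C = np.zeros((d, d))
    n = 0
    for ch in chunklets:                      # each chunklet: array of similar images (k x d)
        centered = ch - ch.mean(axis=0)
        C += centered.T @ centered
        n += ch.shape[0]
    C = C / n + eps * np.eye(d)               # regularize for invertibility
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T   # C^{-1/2}

# Applying the metric is a single projection of the data set before building
# the RP-kd-Trees; querying is unchanged.
chunklets = [np.random.rand(5, 20) for _ in range(30)]   # toy must-link groups in R^20
W = rca_transform(chunklets)
X = np.random.rand(1000, 20)
X_metric = X @ W.T     # Euclidean distance on X_metric equals the learned metric on X
```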
4
Experiments
This section evaluates the performance of the proposed RP-kd-Trees to identify its advantages and limitations from different aspects.

4.1 Data Sets and Experimental Settings
We experimented with real-world image data sets: (1) the COREL data set consists of 5,000 images, classified into 50 categories based on their semantic concepts, each category having 100 images; and (2) the Flickr data set contains 500,000 photos crawled from www.Flickr.com. In the experiments, we split the 5,000 COREL images into 2 sets of 2,000 and 3,000 images (i.e., 20 and 30 categories). The 2000-image set was used to test the performance of the methods without DML (RP-kd-Trees, LSH [6], Multi-probe LSH [13]), while the 3000-image set was only used as the training set for the enhanced RP-kd-Trees with DML. Finally, the query set was created by randomly choosing 100 query images from the 2,000 classified images. The image data set to be queried was the combination of the 500k Flickr images and the 2000 COREL images, in total 502,000 images, referred to as FlickrCOREL. Low-level features were extracted from the images, including grid color moment, local binary pattern, Gabor wavelets texture, and edge direction histogram features; all together, a 297-dimensional feature vector was used to represent an image. Practically, storing the FlickrCOREL data set in this way took about 1,137 MB (using a double-precision floating point type, i.e., 8 bytes per coordinate). One trick was exploited to speed up the RP-kd-Trees algorithm with little loss of accuracy: we terminated the search process after performing distance checking for a certain number of points. Finally, all experiments were conducted on a Linux machine with a 2.8 GHz CPU and 16 GB memory. For performance assessment, we adopted a fairly standard metric widely used in multimedia retrieval, i.e., the Average Precision over the top returned images, defined as

AveragePrecision(t) = Σ_{i=1}^{n} precision(i) · Δrecall(i),

where
precision(i) is the precision of the first i returned images and Δrecall(i) is the change in recall from i − 1 to i returned images. A returned image was considered a hit if it belonged to the same category as the query image. All methods were required to retrieve 100 relevant images in the later experiments.
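A minimal sketch of the Average Precision measure as defined above, assuming a binary hit/miss list for the returned images and taking recall relative to the hits within the returned list.

```python
def average_precision(relevant_flags):
    """AveragePrecision = sum_i precision(i) * delta_recall(i) over the returned list,
    where relevant_flags[i] is 1 if the i-th returned image is a hit."""
    total_relevant = sum(relevant_flags)
    if total_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for i, rel in enumerate(relevant_flags, start=1):
        if rel:                       # recall only changes when a hit occurs
            hits += 1
            precision_i = hits / i
            ap += precision_i * (1.0 / total_relevant)   # delta recall = 1/total_relevant
    return ap

# e.g. the first, third and fourth of five returned images are in the query's category
print(average_precision([1, 0, 1, 1, 0]))   # about 0.806
```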
4.2 Performance Evaluation of RP-KD-Trees
RP-kd-Trees is expected to provide high-accuracy search results by using multiple kd-trees. This experiment is designed to understand the behavior of RP-kd-Trees under different parameters; through it, we also want to find out what value the projected dimension should take. We varied the number of trees m from 1 to 20 and reported the Average Precision, querying time (s), and memory consumed (MB) for projected dimensions d' = 5, 10, and 15. The experiments were done against the FlickrCOREL data set (d = 297, n = 502,000).
Fig. 1. Evaluation of Average Precision on FlickrCOREL (d = 297, n = 502,000): (a) basic kd-tree, Average Precision vs. the projected dimension d'; (b) RP-kd-Trees with d' = 5, 10, 15, Average Precision vs. the number of trees; the exact linear search is shown for reference in both charts
Figure 1 shows a comparison between the basic kd-tree and RP-kd-Trees. Specifically, Figure 1(b) illustrates how accurate the returned images are for different d'. As reflected in the figure, the Average Precision increased significantly as the number of trees m increased. When m = 20, the average precision of RP-kd-Trees with d' = 10 was 0.123, close to the optimal result of the exact linear search. In contrast, the accuracy of the basic kd-tree in Figure 1(a) was much lower than that of the exact linear search. This result indicates that by employing multiple kd-trees the accuracy is highly boosted, even though the trick introduced in the preceding section decreases the accuracy slightly. Figure 2 shows the time and memory evaluation of RP-kd-Trees. We found that the querying time and memory costs increased linearly with m. When m = 10 and d' = 10, RP-kd-Trees needed 0.006 seconds on average to answer one query, which was 67 times faster than linear search. At this configuration, it also needed 655 MB and achieved an Average Precision of 0.11. To provide higher accuracy, e.g., 0.123, the structures required about 1300 MB and answered a query in about 0.013 seconds. Thus, if the available memory is large enough, this method is able to deliver the desired accuracy for the application.
Fig. 2. Evaluation of RP-kd-Trees on FlickrCOREL (d = 297, n = 502,000) for d' = 5, 10, 15: (a) number of trees vs. query time (s); (b) number of trees vs. memory (MB)
Besides, we notice that the Average Precision did not improve much when the projected dimension d' increased from 10 to 15, while the querying time and required memory increased considerably. Thus, for this data set, RP-kd-Trees performed well with projected dimension d' = 10; in the later experiments, d' was set to 10 when RP-kd-Trees was compared with other methods.

4.3 Comparison against Other Methods
We compared RP-kd-Trees with two state-of-the-art methods, LSH and Multi-Probe LSH. All compared methods were required to return the top k = 100 relevant images from the FlickrCOREL dataset for each query image. Parameters were selected to reflect the best performance of each method on the training set. The compared methods are listed below:
– RP-kd-Trees: we set the projected dimension d' = 10 and varied the number of kd-trees m from 1 to 20.
– LSH [6]: we adopted the E2LSH package¹. It has two key parameters: L, the number of hash tables, and k, the number of elements in the LSH functions. In this experiment, k was set to its typical value of 10 while L ranged from 2 to 12.
– Multi-probe LSH [13]: we adopted the LSHKIT library [7]. In the library, the following parameters must be specified: L, the number of hash tables, and T, the number of bins probed in each hash table. With L = 10 and T = 10 the search accuracy is very good, nearly matching the exact search result, but the query time is intensive, more than 0.4 seconds per query. Thus, in this experiment we used lower values, i.e., T = 2, and varied L from 1 to 5.
We ran each of the compared methods on the data set 10 times and report their average performance. Figure 3 shows the comparison of the different methods in terms of accuracy, querying time, and memory cost.
1 http://www.mit.edu/~andoni/LSH/. The package solves the R-near neighbor problem (finding the neighbors within a radius R of the query); to find k nearest neighbors, we follow the suggestion of the E2LSH manual, i.e., we solve the R-near neighbor problem for several increasing values of R.
Fig. 3. Comparison of different approximate NN search methods (RP-kd-Trees, LSH, Multi-probe LSH) on FlickrCOREL (d = 297, n = 502,000): (a) Average Precision vs. query time (s); (b) Average Precision vs. memory (MB)
From Figure 3(a), in terms of processing time, Multi-probe LSH ran almost as fast as LSH on FlickrCOREL; however, it needed a much smaller number of hash tables, i.e., only 10 MB to store the indexing structure. Meanwhile, LSH needed a lot of memory in order to provide good results: to reach an average precision of 0.120, it needed 1588 MB (40% more than the data set itself). From Figure 3(b), we clearly found that the RP-kd-Trees method consistently outperformed the other methods on this FlickrCOREL data set. At an average precision of 0.114, RP-kd-Trees was up to 7 times faster than LSH and 10 times faster than Multi-probe LSH, and it was still faster when the average precision was higher, say 0.123. Thus, the RP-kd-Trees method requires less space while returning nearest neighbors much faster than LSH. In summary, RP-kd-Trees is an efficient approximate NN search method for high-dimensional data, which can return highly accurate results very efficiently when memory is sufficiently large. The results show that RP-kd-Trees is promising and more effective than the competing LSH techniques for this challenge.

4.4 Evaluation of Enhanced RP-kd-Trees with DML
This experiment evaluates the performance of the enhanced RP-kd-Trees obtained by applying DML techniques. In particular, we formed a collection of 3,000 COREL images as the training data set for learning the distance metrics. In our experiments, we implemented the enhanced RP-kd-Trees with four different DML algorithms: RCA, DCA, NCA, and LMNN. Figure 4 shows the performance of the enhanced RP-kd-Trees for the four DML algorithms, from which we can draw several observations. First, all the enhanced RP-kd-Trees methods with DML achieved consistently better retrieval accuracy than the original RP-kd-Trees without DML. Second, the performance of the enhanced RP-kd-Trees improved monotonically as the number of kd-trees increased. In particular, when the number of kd-trees m was greater than 10, all the enhanced RP-kd-Trees algorithms achieved a significant improvement, outperforming the exhaustive Euclidean linear search. All these results show that applying DML techniques to boost the performance of the RP-kd-Trees technique is effective and promising. Moreover, by examining the different DML techniques, we found that when the number of kd-trees m is small (m ≤ 10), LMNN tends to perform consistently better than the others, both NCA and DCA perform quite comparably, while RCA seems to be the worst. Further, when m increases, most algorithms tend to converge to similar performance. Despite their slight differences, we can clearly observe the consistent improvements by the enhanced RP-kd-Trees with DML, validating the efficacy of our technique.

Fig. 4. Evaluation of the enhanced RP-kd-Trees by distance metric learning methods (RCA, DCA, NCA, LMNN) on FlickrCOREL (d = 297, n = 502,000): Average Precision vs. the number of trees, compared with the original RP-kd-Trees and the exhaustive Euclidean linear search
5
Conclusions
This paper presented a new approximate NN search method for high-dimensional data, called Randomly Projected kd-Trees (RP-kd-Trees), with application to CBIR. Our results showed that the proposed method requires less memory and can be up to 7 times faster than the original LSH. Further, we showed that our method can easily be extended by applying distance metric learning techniques. By employing a variety of distance metric learning algorithms, we showed that the extended method provides consistent improvements in retrieval accuracy, even exceeding the accuracy of the exhaustive Euclidean linear search, with no additional querying time or memory cost. For its efficacy and ease of implementation, we believe RP-kd-Trees can be a practically effective technique for high-dimensional indexing in multimedia applications. Future work will further reduce the space complexity and increase the speed via parallel techniques.
Acknowledgement The work was supported by Singapore National Research Foundation Interactive Digital Media R&D Program under research grant NRF2008IDM-IDM004-006.
References

1. Achlioptas, D.: Database-friendly random projections. In: ACM Symp. on the Principles of Database Systems, pp. 274–281 (2001)
2. Arya, S., Malamatos, T., Mount, D.M.: Space-time tradeoffs for approximate nearest neighbor searching. J. ACM 57(1), 1–54 (2009), doi:10.1145/1613676.1613677
3. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a Mahalanobis metric from equivalence constraints. JMLR 6, 937–965 (2005)
4. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975), doi:10.1145/361002.361007
5. Clarkson, K.L.: A randomized algorithm for closest-point queries. SIAM J. Comput. 17(4), 830–847 (1988), doi:10.1137/0217052
6. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proc. 20th Annual Symposium on Computational Geometry (SCG 2004), New York, NY, pp. 253–262 (2004)
7. Dong, W., Wang, Z., Josephson, W., Charikar, M., Li, K.: Modeling LSH for performance tuning. In: ACM CIKM Conference, USA (October 2008)
8. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB (1999)
9. Hoi, S.C., Liu, W., Lyu, M.R., Ma, W.Y.: Learning distance metrics with contextual constraints for image retrieval. In: CVPR (June 17–22, 2006)
10. Indyk, P., Motwani, R.: Approximate nearest neighbor: Towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)
11. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood components analysis. In: NIPS 17 (2005)
12. Li, P., Hastie, T.J., Church, K.W.: Very sparse random projections. In: ACM International Conference on Knowledge Discovery and Data Mining, KDD (2006)
13. Lv, Q., Josephson, W., Wang, Z., Charikar, M.S., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: VLDB, Vienna, Austria (2007)
14. Robinson, J.T.: The k-d-B-tree: A search structure for large multi-dimensional dynamic indexes. In: SIGMOD, pp. 10–18 (1981)
15. Shamos, M., Hoey, D.: Closest-point problems. In: Proc. 16th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 151–162 (1975)
16. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. JMLR 10, 207–244 (2009)
A SAQD-Domain Source Model Unified Rate Control Algorithm for H.264 Video Coding Mingjing Ai and Lili Zhao State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
[email protected],
[email protected]
Abstract. Rate control aims to adapt the coding bit rate to the fluctuating network bandwidth. Since the bit rate at the encoder is determined by the final residual produced after quantizing the DCT coefficients, we set up exact relations among the coding bit rate, the sum of absolute quantized differences (SAQD), and the quantization step, propose a novel SAQD-domain source model with an analytic justification, and unify it with an easy rate control algorithm for H.264 video coding. The proposed model shows a quite pleasing goodness-of-fit. Experimental results demonstrate that, compared with the rate control mechanism suggested in the H.264 proposal, the proposed model-unified rate control algorithm yields the desired rate with much smaller deviations (less than 0.22% of the target rate) and provides much better reconstructed visual quality, giving a maximum coding gain of up to 0.84 dB and an average of 0.59 dB with smaller PSNR variation.

Keywords: Rate control, SAQD-domain, H.264, Video coding.
1 Introduction

Better video quality is required at the receiver under all kinds of network conditions, and the eventual purpose of rate control is to maximize the reconstructed picture quality while adaptively changing the coding bits according to the bandwidth-fluctuating network conditions. Hence, accurate rate control has always been an essential part of video coding standards and applications. Based on the specific network bandwidth, a rate control algorithm adaptively computes the target bits for the current picture and then computes a proper quantization parameter (QP), which is used to quantize the transformed coefficients at the encoder. Two types of rate and distortion models, based on the q-domain and the ρ-domain respectively, have been used in rate control to produce the QP [1]. In the q-domain approach, the source rate is modelled as a function of the quantization step size and the residual signal energy [1],[2],[3],[4]. The q-domain model was employed as the non-normative rate control mechanism in several coding standards such as MPEG-4, H.263 and H.264/AVC. In the ρ-domain approach, where ρ is the percentage of zero DCT coefficients, the number of source bits is modelled as a linear function of (1 − ρ) [5][6]. Although the ρ-domain rate models have been employed successfully for
various conventional video encoders, such as JPEG, MPEG-2, H.263, and MPEG-4, which produce a proper QP using a one-to-one correspondence given a certain nonzero percentage (1 − ρ) [5], they are difficult to apply to the H.264 encoder, since it is hardly possible to find a one-to-one correspondence between ρ and QP in H.264 due to the complicated coefficient quantization expression [1]. S. Milani et al. proposed a (ρ, Eq)-domain based rate control algorithm for H.264 [7]. An equation is established between the parameters ρ and Eq, where Eq represents the combined effect of the QP and the residue energy between the original image and the predicted image; the quantization step is then computed according to the ρ-Eq relation, given a certain zero percentage ρ of DCT coefficients. L. Liu et al. proposed in [8] to determine the QP by establishing an equation between (1 − ρ) and the SATD (sum of absolute transformed differences). The square root model proposed in [8] somewhat improves rate control performance; however, the technique needs complex computation due to its square-rooting operation and relatively numerous model parameters. This paper proposes a novel SAQD-domain based source model with an analytic justification, and unifies it with an easy rate control scheme for H.264. As is known, the coding bits generated at the encoder are directly determined by the transmitted residual, which is produced by quantizing the DCT coefficients of the prediction error with the quantization step q. Consequently, it is quite reasonable to derive a more direct and accurate equation that connects the bit rate R, the quantization step size, and the SAQD (sum of absolute quantized differences). The rest of this paper is organized as follows. Section 2 demonstrates the proposed SAQD-domain model with an analytic justification. Section 3 describes the easy rate control algorithm using the proposed model. Experimental results are shown in Section 4, and Section 5 concludes this paper.
2 Proposed R-SAQD Model

As mentioned in Section 1, although the ρ-domain rate models have proved to deliver successful rate control performance for conventional video encoders, such as JPEG, MPEG-2, H.263, and MPEG-4, it is difficult to employ ρ-domain models in rate control algorithms for H.264. Inspired by ρ-domain source modeling, we propose a novel SAQD-domain based rate model. In this section, we give an analytic justification of the proposed rate control model and demonstrate how to link SAQD with the quantization step size q of the current basic unit. Note that in the remainder of this paper, both SAQD and SATD are normalized by the number of coded luma and chroma coefficients of each frame.

2.1 Model Justification

It has been observed that zeros play a key role in transform coding and directly determine the output coding bit rate. Z. He et al. have proved that there is approximately a linear relation R(ρ) between the number of nonzero DCT coefficients (1 − ρ) and the coding bit rate R [5]. In other words, R(ρ) has the following expression:
R(ρ) = θ · (1 − ρ)    (1)
where θ is constant. According to [9], the DCT coefficients have a Laplacian distribution given by
p_l(x) = (λ/2) · e^{−λ|x|}    (2)

where λ can be approximated using the standard deviation s of the DCT coefficients as λ = √2 / s. The corresponding zero DCT coefficient percentage is given by

ρ = ∫_{−q}^{q} p_l(x) dx = 2 ∫_{0}^{q} (λ/2) e^{−λx} dx = 1 − e^{−λq}    (3)
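A small numerical sketch of (2)-(3): estimating λ from the residual standard deviation (λ = √2/s, as reconstructed above) and computing the predicted zero fraction for a given quantization step. The sample numbers are illustrative only.

```python
import numpy as np

def zero_fraction(s, q):
    """Predicted fraction of zero quantized DCT coefficients, rho = 1 - exp(-lambda*q),
    with lambda estimated from the coefficient standard deviation s as sqrt(2)/s."""
    lam = np.sqrt(2.0) / s
    return 1.0 - np.exp(-lam * q)

# e.g. residual DCT coefficients with standard deviation 12 quantized with step 16
print(zero_fraction(s=12.0, q=16.0))   # roughly 0.85
```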
where q denotes the current quantization step size. The SAQD, which is generated by quantizing the DCT coefficients of the prediction error with step q, can be deduced as:

SAQD = Σ_{i=−∞}^{+∞} Σ_{j=iq}^{iq+q−1} |i| · p(j) = 2 Σ_{i=0}^{+∞} Σ_{j=iq}^{iq+q−1} i · p(j)    (4)
where p(j) denotes the probability of coefficient j, which is computed as:

p(j) = ∫_{j−1}^{j} p_l(x) dx = ((e^λ − 1)/2) · e^{−λj}    (5)
With (5), (4) can be further deduced as:

SAQD = (e^λ − 1) · Σ_{i=1}^{+∞} i Σ_{j=iq}^{iq+q−1} e^{−λj}
     = (e^λ − 1) · Σ_{i=1}^{+∞} i Σ_{j=iq}^{iq+q−1} e^{−λ(j−iq)} · e^{−λiq}
     = (e^λ − 1) · Σ_{i=1}^{+∞} i · e^{−λiq} Σ_{j=0}^{q−1} e^{−λj}    (6)
The terms in Σ_{j=0}^{q−1} e^{−λj} form a geometric sequence, and by applying the summation formula of a geometric progression we have Σ_{j=0}^{q−1} e^{−λj} ≈ (1 − e^{−λq}) / (e^λ − 1). Using this result, (6) can be updated as:

SAQD = (e^λ − 1) · Σ_{i=1}^{+∞} i · e^{−λiq} · (1 − e^{−λq})/(e^λ − 1) = (1 − e^{−λq}) · Σ_{i=1}^{+∞} i · e^{−λiq}    (7)

From the Appendix, we know that the expression Σ_{i=1}^{+∞} i · e^{−λiq} converges to e^{−λq} / (1 − e^{−λq})², hence (7) can be further approximated as:
SAQD = e^{−λq} / (1 − e^{−λq})    (8)
Combining (8) and (3), we have:

SAQD = 1/ρ − 1    (9)
By taking the second-order Taylor expansion of the right-hand side of (9), the following is obtained:

SAQD = 1/ρ − 1 ≈ −(ρ − 1) + (ρ − 1)²    (10)
The proposed R-SAQD model is ultimately derived by combining (1) and (10):

SAQD = θ_1 · R² + θ_2 · R    (11)
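A minimal sketch of fitting the quadratic model (11) to measured (R, SAQD) pairs by least squares, which is how curves like those in Fig. 1 can be produced. The sample values below are illustrative, not the paper's data.

```python
import numpy as np

def fit_r_saqd(rates, saqd):
    """Least-squares fit of SAQD = theta1 * R^2 + theta2 * R, cf. model (11)."""
    R = np.asarray(rates, dtype=float)
    A = np.column_stack([R ** 2, R])           # design matrix with columns R^2 and R
    theta, *_ = np.linalg.lstsq(A, np.asarray(saqd, dtype=float), rcond=None)
    return theta                               # (theta1, theta2)

# Illustrative samples of bits/pixel vs. SAQD collected over a range of QPs.
rates = [0.05, 0.10, 0.20, 0.30, 0.40]
saqd  = [40.0, 95.0, 230.0, 400.0, 610.0]
theta1, theta2 = fit_r_saqd(rates, saqd)
```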
where R denotes the target bit rate, and both θ_1 and θ_2 are positive constants. To better demonstrate the accuracy of the proposed R-SAQD model defined in (11), the charts in Fig. 1 illustrate the goodness-of-fit for several sequences in QCIF format. The x and y axes in each chart represent, respectively, the target bit rate R and the sum of absolute quantized differences SAQD.
Fig. 1. Relationship between bit rate (bits/pixel) and SAQD with the quantization parameter ranging from 20 to 48: (a) frame 12 of "Soccer"; (b) frame 23 of "Ice"; (c) frame 31 of "Bus"; (d) frame 12 of "Harbour". Each chart compares the actual data curve with the fitted quadratic curve.
It can be observed from the charts that the actual data curve matches the quadratic curve depicted by the model defined in (11) quite closely, confirming that the proposed model is effective.

2.2 Link SAQD with Quantization Step
Since the quantization step q does not appear in the model defined in (11), a connection must be established between SAQD and q. As is known, SAQD can be estimated by:

SAQD · q = SATD − Δd    (12)
where SATD is the sum of absolute transformed differences of the current basic unit, and Δd denotes the distortion between the original and reconstructed images, which is relatively small compared to the value of SATD.
Therefore, SAQD in the proposed algorithm is approximated, for simplicity, as:

SAQD ≈ SATD / q    (13)
Combining (11) with (13), we have:

SATD / q = θ1 · R² + θ2 · R    (14)
SATD must be known in order to compute the quantization step using the model defined in (14). The computation of SATD poses a chicken-and-egg dilemma, since SATD cannot be obtained unless the current block has been encoded. Here we adopt a strategy similar to the one used to resolve this dilemma in the rate control algorithm [2] proposed for H.264/AVC: a linear model is employed to predict the SATD of the current basic unit. Let SATD_c and SATD_p denote, respectively, the predicted SATD of the current basic unit and the actual SATD of the co-located basic unit in the previous frame. The linear prediction model is then given by

SATD_c = α · SATD_p + β    (15)
where α and β are the two parameters of the prediction model, with initial values 1.0 and 0.0 respectively, which are updated in the post-encoding stage. Since the derivation of the R–SAQD relationship depends only on the distribution of the image residual generated after the transform and quantization operations, the proposed model can be employed in the rate control schemes of various coding standards that perform transformation and quantization, such as MPEG-2, MPEG-4, H.263, and H.264. Nonetheless, in this paper we apply the derived SAQD-domain model to a simple rate control algorithm designed for H.264, and we show how to unify this novel rate model with that algorithm in Section 3.
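For concreteness, a minimal sketch of the SATD prediction in (15) and the SAQD estimate in (13) is given below; the function and variable names are our own, and the initial values α = 1.0 and β = 0.0 follow the text.

    def predict_satd(satd_prev, alpha=1.0, beta=0.0):
        """Linear prediction of the current basic unit's SATD from the
        co-located basic unit of the previous frame, as in Eq. (15)."""
        return alpha * satd_prev + beta

    def estimate_saqd(satd, q):
        """Approximate SAQD from SATD and the quantization step, Eq. (13)."""
        return satd / q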
3 Rate Control Scheme

A rate control scheme normally consists of two stages: pre-encoding and post-encoding. The objectives of these two stages are, respectively, to determine a quantization step for the current unit and to update the model parameters. In the following, we demonstrate in three steps the basic unit layer rate control scheme unified with the proposed rate control model discussed in Section 2. The basic unit in the proposed rate control scheme can be selected as either a frame or a fraction of N_mbpic, the number of macroblocks in a frame. Note that the bit budget allocation for each GOP and each basic unit in the proposed algorithm employs a bit allocation strategy similar to that described in the G012 rate control algorithm proposed for H.264/AVC [2], and is not described here in detail.
1) Determine Quantization Step
The quantization step for the basic units of the first I frame and the first P frame in the current GOP is predefined based on the available channel bandwidth, as described in [2]. The quantization steps of the remaining basic units are computed as follows. Suppose the total bit budget for the current basic unit, R_all, has been calculated using the methods in [2]. The bit budget R_t for the texture of the current basic unit is computed by:

R_t = R_all − R_h    (16)
where R_h is the number of bits occupied by the coding unit header information, which consists of MB types and prediction modes for I frames, as well as additional motion vectors and reference frames for P/B frames. In the proposed algorithm, we simply assign the R_h value generated by the previous unit's header information to the R_h of the current unit. Note that the final target bit rate R̂ used in model (11) is computed by dividing R_t by N_c, the total number of luminance and chroma coefficients of the current basic unit. The SATD of the current basic unit is computed using the linear prediction model described in Section 2. With R̂ and SATD_c, the ultimate quantization step for the current coding unit is computed by:

q = SATD_c / (θ1 · R̂² + θ2 · R̂)    (17)
Here, θ1 and θ2 are initially assigned the values 0.0 and 1.0, respectively, for simplicity of computation, and are updated in step 2) using a sliding window mechanism of size ten. Note that the values of θ1 and θ2 converge after a number of updates.

2) Update Model Parameters
After encoding the i-th basic unit, the data points S_i, R_i and q_i are stored in a sliding window, where S_i, R_i and q_i denote, respectively, the SAQD, the actual bit rate and the quantization step size produced by the i-th unit. The model parameters α, β, θ1 and θ2 are all computed using the least squares method (LSM). α and β are updated as follows:

α = S_p − β · S_c ,   β = ( Σ_{i=1}^{n} S_{i+1} · S_i − n · S_p · S_c ) / ( Σ_{i=1}^{n} S_{i+1}² − n · S_c² )    (18)

Here, S_p and S_c represent, respectively, the average SAQD of the previous one-to-n and two-to-(n+1) basic units, with 0 ≤ n ≤ 10, n ∈ N+.
θ1 and θ2 in (11) are computed by:

θ1 = D1 / D ,   θ2 = D2 / D    (19)

where D, D1 and D2 are computed as follows:

D  = Σ_{i=1}^{n} R_i² · Σ_{i=1}^{n} R_i⁴ − ( Σ_{i=1}^{n} R_i³ )²
D1 = Σ_{i=1}^{n} R_i² · Σ_{i=1}^{n} (S_i/q_i) R_i² − Σ_{i=1}^{n} R_i³ · Σ_{i=1}^{n} (S_i/q_i) R_i    (20)
D2 = Σ_{i=1}^{n} R_i⁴ · Σ_{i=1}^{n} (S_i/q_i) R_i − Σ_{i=1}^{n} R_i³ · Σ_{i=1}^{n} (S_i/q_i) R_i²
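A hedged sketch of the per-unit computation in step 1) and the least-squares update in step 2) is given below, assuming the sliding window is maintained elsewhere as a list of (S_i, R_i, q_i) triples; the names and the fallback to the initial values in the degenerate case are our own choices, not the authors'.

    def compute_q(satd_c, r_hat, theta1, theta2):
        """Quantization step for the current basic unit, Eq. (17)."""
        return satd_c / (theta1 * r_hat ** 2 + theta2 * r_hat)

    def update_theta(window):
        """Least-squares update of theta1, theta2 over the sliding window,
        following Eqs. (19)-(20).  `window` holds (S_i, R_i, q_i) triples."""
        R = [r for s, r, q in window]
        y = [s / q for s, r, q in window]              # S_i / q_i
        r2 = sum(r ** 2 for r in R)
        r3 = sum(r ** 3 for r in R)
        r4 = sum(r ** 4 for r in R)
        yr = sum(yi * r for yi, r in zip(y, R))        # sum of (S_i/q_i) R_i
        yr2 = sum(yi * r ** 2 for yi, r in zip(y, R))  # sum of (S_i/q_i) R_i^2
        D = r2 * r4 - r3 ** 2
        if D == 0:
            return 0.0, 1.0                            # keep the initial values
        theta1 = (r2 * yr2 - r3 * yr) / D              # D1 / D
        theta2 = (r4 * yr - r3 * yr2) / D              # D2 / D
        return theta1, theta2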
3) Loop
Repeat steps 1) and 2) for the next basic unit until all the basic units in the current frame are encoded.
4 Experimental Results

In order to verify the control efficiency of the proposed SAQD-domain source model unified rate control algorithm, we incorporated it into the H.264/AVC reference software JM10.2 [10] and compared it with the rate control scheme described in [2]. Note that the basic unit in this experiment was selected as a frame. The same settings were applied to both the proposed algorithm and the standard reference for a fair comparison. Sequences in both QCIF and CIF formats were tested. Each sequence was encoded at 30 fps for 100 frames using a GOP of 15, with an IPP structure and no B frames. Ten reference frames are used for inter prediction, and the skip mode for P frames is disabled for the tests.

Table 1. Rate control performance comparison between the algorithm in JM10.2 and the proposed algorithm

Format  Target rate (kbps)  Seq.        JM10.2: Rate (kbps) / Δrate / PSNR (dB)   Proposed: Rate (kbps) / Δrate / PSNR (dB)
QCIF    64.00               City        64.21 / +0.21 / 30.63                     64.05 / +0.05 / 31.29 (+0.66)
QCIF    64.00               Highway     64.42 / +0.42 / 37.16                     64.10 / +0.10 / 37.74 (+0.58)
QCIF    96.00               Carphone    96.27 / +0.27 / 37.38                     96.21 / +0.21 / 37.87 (+0.49)
QCIF    96.00               Coastguard  95.98 / -0.02 / 30.62                     96.10 / +0.10 / 31.10 (+0.48)
QCIF    128.00              Football    129.98 / +1.98 / 26.86                    128.00 / 0.00 / 27.47 (+0.61)
QCIF    128.00              Harbour     128.08 / +0.08 / 28.38                    127.79 / -0.21 / 28.83 (+0.45)
CIF     512.00              Bus         514.16 / +2.16 / 30.59                    512.52 / +0.52 / 31.07 (+0.48)
CIF     512.00              Waterfall   514.38 / +2.38 / 36.72                    511.92 / -0.08 / 37.56 (+0.84)
CIF     768.00              Flower      769.43 / +1.43 / 30.97                    767.95 / -0.05 / 31.35 (+0.38)
CIF     768.00              Stefan      770.92 / +2.92 / 33.97                    768.46 / +0.46 / 34.55 (+0.58)
CIF     1024.00             Soccer      1024.76 / +0.76 / 38.27                   1023.20 / -0.80 / 38.99 (+0.72)
CIF     1024.00             Tempete     1025.84 / +1.84 / 34.86                   1024.62 / +0.62 / 35.54 (+0.68)
[Fig. 2: four charts of PSNR (dB) versus frame number comparing the proposed algorithm and JM10.2: (a) City, (b) Coastguard, (c) Soccer, (d) Bus]
Fig. 2. Comparison of PSNR variations of the algorithm in JM10.2 and the proposed algorithm
Bit rate and PSNR (peak signal-to-noise ratio) were computed to compare the performance of the algorithms. To better compare the reconstructed visual quality of the two rate control schemes, we also plot the frame-to-frame PSNR variations for several sequences, since a displayed sequence with strong PSNR variations looks unnatural and visually unpleasant. Table 1 compares the rate control performance of the algorithm in JM10.2 and the proposed algorithm in terms of average luminance PSNR gains and coding bit rate accuracy. Here, Δrate in the table denotes the bit rate deviation from the target bit rate; a positive/negative value means the generated bit rate is higher/lower than the target bit rate. As can be seen from the table, compared with the rate control algorithm in JM10.2, the proposed SAQD-domain model unified rate control algorithm delivers better visual quality (higher PSNR) and a smaller control error (more accurate bit rate). The results in Table 1 show that the maximum and average luminance PSNR gains are 0.84 dB and 0.59 dB, respectively. Moreover, for most sequences the proposed rate control scheme yields the desired rate with much smaller deviations (less than 0.22% of the target rate) than the rate control algorithm in JM10.2. Reconstructed visual quality variations are compared in Fig. 2. As mentioned in Section 3, the quantization step for the basic units of the first I frame and first P frame in the current GOP is predefined using the same method as the standard reference [2], so the PSNR of the first two frames (first I and first P) is identical for the two rate control algorithms in the charts. We can see from the charts that the PSNR variations of the proposed scheme are smaller than those of JM10.2 for each sequence, with higher PSNR gains overall.
5 Conclusions

A novel SAQD-domain source model was proposed in this paper to provide a direct and accurate correspondence between bit rate and quantization parameter, and it was unified with a simple rate control algorithm for H.264. The proposed model shows very good goodness-of-fit and can also be used in rate control algorithms for other coding standards. Experimental results showed that the proposed algorithm outperforms the rate control algorithm suggested in JM10.2, giving more accurate rate control with both better reconstructed visual quality and a bit rate closer to the target for most sequences.
As shown in this paper, the proposed rate control scheme employs the bit allocation mechanism of H.264. We expect that the SAQD-domain source model can also be applied to an H.264-based stereo video encoder with a more efficient bit allocation mechanism, which is our future research direction.

Acknowledgments. We thank Zhijie Di and Mingyang Tan for their help in collecting experimental data. This work is supported by the National High-tech R&D Program (863 Program) of China.
References

1. Kwon, D., Shen, M.Y., Kuo, C.J.: Rate Control for H.264 Video With Enhanced Rate and Distortion Models. IEEE Trans. Circuits Syst. Video Technol. 17, 517–529 (2007)
2. Li, Z.G., Pan, F., Lim, K.P., Feng, G., Lin, X., Rahardja, S.: Adaptive Basic Unit Layer Rate Control for JVT. Doc. JVT-G012, ISO/IEC JTC1/SC29/WG11 and ITU-T Q6/SG16 (2003)
3. Chang, C.Y., Chou, C.F., Chan, D.Y., Lin, T., Chen, M.H.: A q-Domain Characteristic-Based Bit-Rate Model for Video Transmission. IEEE Trans. Circuits Syst. Video Technol. 18, 1307–1311 (2008)
4. Xu, Q., Liu, Y.L., Lu, X.A., Gomila, C.: A New Source Model and Accurate Rate Control Algorithm with QP and Rounding Offset Adaptation. In: IEEE International Conference on Image Processing, pp. 2496–2499 (2008)
5. He, Z., Mitra, S.K.: A Linear Source Model and a Unified Rate Control Algorithm for DCT Video Coding. IEEE Trans. Circuits Syst. Video Technol. 12, 970–982 (2002)
6. He, Z., Mitra, S.K.: Optimum Bit Allocation and Accurate Rate Control for Video Coding via ρ-domain Source Modelling. IEEE Trans. Circuits Syst. Video Technol. 12, 840–848 (2002)
7. Milani, S., Celetto, L., Mian, G.A.: An Accurate Low-Complexity Rate Control Algorithm Based on (ρ, Eq)-Domain. IEEE Trans. Circuits Syst. Video Technol. 18, 257–262 (2008)
8. Liu, L., Zhuang, X.H.: A Novel Square Root Rate Control Algorithm for H.264/AVC Encoding. In: IEEE International Conference on Multimedia and Expo, pp. 814–817 (2009)
9. Lam, E.Y., Goodman, J.W.: A Mathematical Analysis of the DCT Coefficient Distributions for Images. IEEE Trans. Image Processing 9, 1661–1666 (2000)
10. H.264/MPEG4 AVC Reference Software, http://iphome.hhi.de/suehring/tml
Appendix +∞
Let Σ = Σ_{i=1}^{+∞} i · e^(−λiq); then we have:

Σ = e^(−λq) + 2·e^(−2λq) + 3·e^(−3λq) + … + k·e^(−kλq) + …    (21)

We multiply both sides of (21) by e^(−λq) to obtain:

e^(−λq) · Σ = e^(−2λq) + 2·e^(−3λq) + 3·e^(−4λq) + … + k·e^(−(k+1)λq) + …    (22)

With (21) and (22), subtracting e^(−λq)·Σ from Σ, we have:

(1 − e^(−λq)) · Σ = e^(−λq) + e^(−2λq) + e^(−3λq) + … + e^(−kλq) + …    (23)

The terms on the right-hand side of (23) form a geometric sequence, and by applying the summation formula of a geometric progression, we have:

(1 − e^(−λq)) · Σ = e^(−λq) · lim_{n→+∞} (1 − e^(−nλq)) / (1 − e^(−λq)) ≈ e^(−λq) / (1 − e^(−λq))    (24)

i.e.:

Σ ≈ e^(−λq) / (1 − e^(−λq))²    (25)
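As a quick numerical sanity check of the closed form in (25), the following snippet compares a truncated version of the series with the closed-form expression; the values of λ and q are arbitrary and chosen here only for illustration.

    import math

    lam, q = 0.3, 4.0
    series = sum(i * math.exp(-lam * i * q) for i in range(1, 10000))
    closed = math.exp(-lam * q) / (1.0 - math.exp(-lam * q)) ** 2
    print(series, closed)   # the two values agree to high precision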
A Bi-objective Optimization Model for Interactive Face Retrieval Yuchun Fang, Qiyun Cai, Jie Luo, Wang Dai, and Chengsheng Lou School of Computer Engineering and Science, Shanghai 200072, China
[email protected]
Abstract. In this paper, based on Bayesian relevance feedback methods, we propose a novel interactive face retrieval model based on two objective functions: one is the maximum a posteriori (MAP) criterion and the other is maximization of mutual information. The proposed bi-objective optimization model aims at minimizing both the number of interactive iterations and the average length of the iterations. Moreover, we deduce a top-bottom search algorithm to solve the proposed model. Experiments with real testers show that the proposed algorithm can largely improve the efficiency of interactive search in face databases. Keywords: Face retrieval, Relevance feedback, Face recognition, Interactive image retrieval.
1 Introduction

Face retrieval aims at searching for target face images in a given large face database. In the traditional application, example-based face retrieval searches for images of the target by comparing images in the face database with the query images, which are real images of the target [1-4]. However, in many application domains such queries are unavailable as references. One way to find the target is to build up queries, as in sketch-based face retrieval. Yuen and Man adopted line-drawing features to represent face images [5]. Wang and Tang proposed a Markov field model for rendering sketches from grey images for retrieval [6]. Xu et al. used a hierarchical graphical model to combine local structural facial features [7]. Sketch-based face retrieval suffers from the difficulty of building a universal query-reconstruction model and risks losing the target information during reconstruction. Another way is to find targets through an interactive retrieval consisting of a series of queries and answers between users and computers, realized with a relevance feedback model. During the iterations of query and answer, the model incrementally learns the target from the answers provided by the users through comparing several candidates, until the target appears among the candidates. For learning the target with a relevance feedback model, one solution is to train a binary classifier to identify target and non-target. Navarrete et al. applied Self-Organizing Maps (SOM) as the relevance feedback model [8]. Yang and Laaksonen realized a PicSOM system based on long-term learning and multiple features [9]. He et al. adopted
a support vector machine to learn user feedback [10]. Since there are few positive examples in interactive face retrieval for each class, i.e., each subject, the trained classifier may over-fit to inaccurate answers and thus fail to search for the target in a large database. Another difficulty with the relevance feedback model is the so-called semantic gap between the subjective user feedback and the metrics of the machine in low-level feature spaces. Introducing semantic knowledge into face retrieval is a common way to handle this problem. Sridharan et al. realized interactive face retrieval with the semantic description provided by users [11]. Ito and Koshimizu also assumed that the face databases already carry semantic labels [12]. Zhang et al. designed a hierarchical semantic feature system that allows users to add personal inclination information to the databases [13]. Up to now, face retrieval with semantic knowledge still requires labeling the face databases manually, and it is hard to apply to many available databases. To handle the above two problems, i.e., the lack of training samples and the semantic gap, a more reliable solution is to incrementally establish a statistical model for interactive retrieval based on Bayesian rules. The Bayesian model proposed by Cox et al. [14] is a benchmark for interactive image retrieval, in which the displayed candidates are selected with the MAP rule. If the posterior is correctly updated, a bump will appear in the probability mass distribution around the target, and the target will be found most quickly with the MAP rule. Instead of the MAP rule, Fang and Geman proposed to select the candidates that maximize the mutual information between the target and the answer, based on a Bayesian model for interactive face retrieval [15]. The candidates are selected to grab the greatest amount of information about the target, which leads to a minimum average number of iterations. For a successful interactive retrieval system, we desire both the minimum iteration number in any single test and the minimum average iteration number over all tests. Hence, we propose a bi-objective optimization model, which is a combination of the MAP criterion and the maximization of mutual information. Experiments prove its effectiveness in interactive face retrieval. The remainder of the paper is organized as follows. The basic Bayesian relevance feedback model is formulated in Section 2. Section 3 describes the bi-objective retrieval model, and the top-bottom search solution to this model is deduced in Section 4. In Section 5, we present the experimental results with real testers. Conclusions are drawn in Section 6.
2 The Relevance Feedback Model

Interactive face retrieval consists of several major steps, as illustrated inside the ellipse of Figure 1. The user provides answers to the candidates, and the relevance feedback model provides candidates as the query to the user based on the updated knowledge of the target. Such iterations continue until the target appears or the user abandons the search. The part of Figure 1 outside the ellipse is the feature space in which the relevance feedback model measures image similarity. We formulate the relevance feedback process with a probabilistic model as in [15]. Suppose there are N images in the face database F.
[Fig. 1: flowchart of interactive face retrieval — the loop of target, answer, relevance feedback, candidates and query between the user and the system, supported by the feature database; the loop terminates when the target appears or the user abandons the search]
Fig. 1. Illustration of Interactive Face Retrieval
Each image is denoted by its integer label i ∈ {1, ..., N}. The candidates in the t-th iteration, i.e., the query, form a subset of F denoted as Q_t with n (<< N) elements, and the answer of the user is denoted as A(Q_t). The two major random variables involved are the target Y (Y = i, i = 1, ..., N) and the answer X_{Q_t} (X_{Q_t} = A(Q_t)). Under the assumption that the user provides answers independently between any two iterations, the posterior probability of Y after t iterations can be calculated with Bayesian rules as in Equation (1):

p_i^(t) = P(Y = i | A(Q_1), ..., A(Q_t)) = (1/K) · p_i^(t−1) · P(X_{Q_t} = A(Q_t) | Y = i)    (1)

where i ∈ {1, ..., N} and K is a normalization coefficient. When t = 0, p_i^(0) is the prior probability of Y. As to the conditional probability term in Equation (1), we define A(Q_t) = a_t ∈ Q_t and

P(X_{Q_t} = A(Q_t) | Y = i) = K′ · θ · d(i, a_t) / d(i, Q_t \ {a_t})    (2)

where θ is the parameter controlling the proportion of the contribution of d(i, a_t), Q_t \ {a_t} is the complementary set of {a_t} with respect to the universe Q_t, d(·) is the metric in the feature database illustrated in Figure 1, and K′ is a normalization coefficient. The conditional probability is directly proportional to the distance between image i and the user answer a_t, and inversely proportional to that of the other elements in Q_t. So, as the iterations go on, the posterior bumps up around the answers of the user and recedes around the non-answers among the candidates. This model proves simple to adapt to real user responses and easy to realize [15]. As shown by Equation (1), the accumulation of information about the target evolves incrementally with the update of the posterior probability. Even if the answer of the user is incoherent with that of the machine (according to the metrics in the feature space) in a few iterations, the model can still drive the posterior in the right direction if the coherent answers are dominant. In this way, a probabilistic model can diminish the problems of over-fitting and the semantic gap.
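A minimal sketch of one relevance-feedback iteration of (1)–(2) is given below. It assumes a precomputed N × N distance matrix d over the feature space, and it takes the set distance d(i, Q_t \ {a_t}) as the summed distance to the remaining candidates, which is our reading rather than a detail specified in the text; the function and variable names are our own.

    import numpy as np

    def update_posterior(p, Q_t, a_t, d, theta=1.0):
        """One relevance-feedback iteration of Eqs. (1)-(2).

        p   : posterior over the N database images from the previous iteration
        Q_t : indices of the n displayed candidates
        a_t : index of the candidate the user picked as closest to the target
        d   : N x N matrix of pairwise distances in the feature space
        """
        others = [k for k in Q_t if k != a_t]
        # Eq. (2): likelihood proportional to d(i, a_t) / d(i, Q_t \ {a_t})
        lik = theta * d[:, a_t] / (d[:, others].sum(axis=1) + 1e-12)
        lik /= lik.sum()                      # normalization by K'
        post = p * lik                        # Eq. (1), up to normalization
        return post / post.sum()              # normalization by K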
3 Bi-objective Optimization Model

With any relevance feedback model, the primary challenge is the selection of candidates as the query, though normally the candidates are images from the database ordered according to their likelihood of being the target. For a Bayesian model, the MAP rule is a commonly adopted solution in image retrieval [14], as in Equation (3):

Q_t = { k | p_k^(t−1) ≥ max_{i ∈ F \ Q_t} p_i^(t−1) }    (3)
In other words, the candidates for the current iteration are the images with the largest n posteriors from the previous iteration. If the estimated Bayesian model follows the correct distribution, then the candidates drawn with the MAP rule should lead to the fastest retrieval according to the Bayesian decision rule. With the accumulation of information during the iterations of retrieval, the entropy of the posterior decreases accordingly. So another method to pick candidates is to maximize the mutual information between Y and X_{Q_t} [15]. The objective function is shown in Equation (4); it aims at removing the greatest amount of uncertainty about the target, so its solution should lead to the minimum average number of iterations of interactive retrieval.

Q_t = argmax_{∀F′ ⊂ F, card(F′) = n} I(Y | A(Q_1), ..., A(Q_{t−1}); X_{F′})    (4)
Since the posterior accumulates around the images close to the answers, the interactive iteration with the pure MAP rule converges fast only if the posterior is correctly tuned to have its mass around the target. Nevertheless, incoherence between user and machine may inhibit the correct adjustment of the posterior and hence lead to slow convergence in real tests. With the maximization of mutual information, on the other hand, it can happen that the candidates exclude the target even when the posterior around it has bumped up, because the candidates are those that lower the uncertainty about the target the most. We propose to combine both criteria to form a new bi-objective model in which they complement each other's drawbacks, as formulated in Equation (5). It can
weaken the effect of wrong choices, which cause degradation of the MAP method, while the MAP constraint can lead the mutual-information maximization to convergence.
Q_t = argmax_{∀F′ ⊂ F, card(F′) = n} I(Y | A(Q_1), ..., A(Q_{t−1}); X_{F′})
s.t.   Σ_{i ∈ Q_t} p_i^(t−1) = max_{∀F′ ⊂ F, card(F′) = n} Σ_{k ∈ F′} p_k^(t−1)    (5)
4 Top-Bottom Search for Candidates

As the formulation in Equation (5) is a constrained combinatorial optimization problem, i.e., picking n (<< N) elements in F (card(F) = N) that maximize the mutual information while having the posterior as large as possible, no closed-form solution can be deduced except the time-consuming method of exhaustive search, which is not applicable during interactive retrieval when the image database is large. We propose to solve the problem with a top-bottom search. Instead of finding all candidates simultaneously, the candidates are selected one by one until all n elements are fixed. As mentioned in [15], if the answer of the user is coherent with the machine, the objective function in Equation (4) is equivalent to picking candidates that maximize the entropy, as shown in Equation (6):

Q_t = argmax_{∀F′ ⊂ F, card(F′) = n} H(X_{F′} | A(Q_1), ..., A(Q_{t−1}))    (6)
Since

P(X_{F′} | A(Q_1), ..., A(Q_{t−1})) = Σ_{i=1}^{N} P(X_{F′} | Y = i) · P(Y = i | A(Q_1), ..., A(Q_{t−1}))    (7)
the solution to the objective function in Equation (6) is a subset of F whose elements form a uniform partition (in the sense of the posterior) of F according to the metric. Each element in this subset can be viewed as a representative of the images in its partition. In comparison with the MAP rule, which picks the images most likely to be the target as candidates, the rule of maximizing mutual information picks the most representative images of the database to lower the uncertainty about the target. Based on the above analysis, we deduce a solution to the top-bottom search consisting of the following two steps.
Step 1: According to Equation (5), the first element is randomly picked from the images with the largest posterior.
Step 2: With the first n′ (1 ≤ n′ < n) candidates fixed, the (n′+1)-th candidate is the image with the maximum posterior in F, excluding the already picked n′ candidates and the neighboring images assigned to their uniform partitions according to the posterior.
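The two steps above might be realized as in the sketch below; the deterministic arg-max in place of a random pick among the largest posteriors, and the approximation of a "uniform partition" by accumulating roughly 1/n of the posterior mass around each picked candidate, are simplifications of our own.

    import numpy as np

    def pick_candidates(post, d, n=8):
        """Top-bottom search for the n candidates of the next iteration.

        post : current posterior over the N images (numpy array)
        d    : N x N pairwise distance matrix in the feature space
        """
        N = len(post)
        available = np.ones(N, dtype=bool)
        candidates = []
        for _ in range(n):
            # Pick the image with the largest posterior among the images
            # not yet assigned to a partition (Step 1 / Step 2).
            masked = np.where(available, post, -np.inf)
            c = int(np.argmax(masked))
            candidates.append(c)
            available[c] = False
            # Assign the nearest neighbours of c to its partition until the
            # partition holds roughly 1/n of the total posterior mass.
            for j in np.argsort(d[c]):
                if not available[j]:
                    continue
                available[j] = False
                if post[candidates[-1]] + post[j] >= post.sum() / n:
                    break
        return candidates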
5 Experimental Analysis

We test the proposed interactive face retrieval algorithm on a face database containing 849 subjects, with one image per subject. All subjects are Asian undergraduate students, so they have little variation in age. One reason for using such a database is to have a relatively uniform distribution of semantics (in the sense of age and race) so as to avoid cognitive limitations across semantic categories; normally, humans are good at distinguishing people from the same semantic category. A second reason is that it is hard to find a large number of subjects of the same semantic category in available open face databases such as FERET, even though they may contain more images in total. As mentioned in [16], "The number of people tested is more significant than the total number of attempts in determining test accuracy." Most face databases used in the FRVT 2006 tests, except one, contain around 300 subjects [16]; ours is nearly three times that size. The third reason is that, on the student database, we can easily find enough real users with their classmates as the assumed targets. The images are represented by a fusion of multi-directional Local Binary Pattern features [17]. For the number of candidates, we take the same value n = 8 as in [15]. In our experiment, 35 students performed 177 tests in total; the users were allowed to give up a test when the iteration number exceeded 40, so there are 154 full tests among them and the abandonment rate is less than 13%. A statistical graph of the 154 full tests is shown in Figure 2, which contains curves of cumulative precision with respect to the iteration number. The cumulative precision is the percentage of successful retrievals among all tests within a given iteration number. For comparison, we also plot, as a reference, the curve for the case in which candidates are randomly picked, which is a straight line [15]. From the curve in Figure 2, the cumulative precision is over 95% within 55 iterations with our model, while for the random display strategy the mean iteration number is 54.4. We also compare the proposed model with the one in [15], and the results are summarized in Table 1. The experiments in [15] were performed on a database
[Fig. 2: cumulative precision versus number of iterations (5–60) for the real user tests, compared with random display]

Fig. 2. The statistical result with the proposed model
containing 531 subjects with similar numbers of Asians, Caucasians and Blacks and without limitation of age. Our database is larger and harder, because all 849 subjects are 17–22-year-old Asians. With semantic information as a prior, the target would be easier to search for, so our database is closer to a possible real application. The average iteration number reported in [15] is 14.7 and the speedup is 2.2 in comparison with random display, whose average iteration number is 32.5. With our model, the speedup grows to 2.5 (54.4/21.4).

Table 1. Comparison with the algorithm in [15]
                  Test number   Subject number   Average iteration number
Method in [15]         78            531                 14.7
Our method            154            849                 21.4
6 Conclusions

We propose a bi-objective optimization model for interactive face retrieval. Based on the framework of a Bayesian relevance feedback model, the MAP objective serves to find the target as fast as possible in a single retrieval test, while the other objective, maximizing the mutual information between the target and the user answer, serves to minimize the average iteration number. To solve the model, a top-bottom search algorithm is deduced to pick candidates quickly. Comparison experiments demonstrate that the proposed model can largely improve interactive retrieval efficiency. Our work also shows that it is possible to realize interactive retrieval in larger face databases in combination with prior semantic knowledge.

Acknowledgments. This work is supported by the National Natural Science Foundation Project (60605012), the Natural Science Foundation of Shanghai, China (08ZR1408200), the Open Project Program of the National Laboratory of Pattern Recognition of Chinese Academy of Sciences (08-2-16), the Shanghai Leading Academic Discipline Project (J50103) and the Innovation Fund for the Graduate Students of Shanghai University (SHUCX102176).
References

1. Liu, C.: Enhanced Independent Component Analysis and its Application to Content Based Face Image Retrieval. IEEE Transactions on Systems, Man, and Cybernetics (Part B) 34(2), 1117–1127 (2004)
2. Kim, T.-K., Kim, H., Hwang, W., Kittler, J.: Component-based LDA Face Description for Image Retrieval and MPEG-7 Standardisation. Image and Vision Computing 23(7), 631–642 (2005)
3. Gao, Y., Qi, Y.: Robust Visual Similarity Retrieval in Single Model Face Databases. Pattern Recognition 38, 1009–1020 (2005)
4. Vikram, T.N., Chidananda Gowda, K., Guru, D.S., Urs, S.R.: Face Indexing and Retrieval by Spatial Similarity. Congress on Image and Signal Processing 1, 543–547 (2008)
5. Yuen, P.C., Man, C.H.: Human Face Image Searching System Using Sketches. IEEE Transactions on Systems, Man and Cybernetics (Part A) 37(4), 493–504 (2007)
6. Wang, X., Tang, X.: Face Photo-Sketch Synthesis and Recognition. IEEE Transactions on PAMI 31(11), 1955–1967 (2009)
7. Xu, Z., Chen, H., Zhu, S.-C., Luo, J.: A Hierarchical Compositional Model for Face Representation and Sketching. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(6), 955–969 (2008)
8. Ruiz-del Solar, J., Navarrete, P.: FACERET: An Interactive Face Retrieval System Based on Self-Organizing Maps. In: Lew, M., Sebe, N., Eakins, J.P. (eds.) CIVR 2002. LNCS, vol. 2383, pp. 157–164. Springer, Heidelberg (2002)
9. Yang, Z., Laaksonen, J.: Interactive Retrieval in Facial Image Database Using Self-Organizing Maps. In: Proc. of IAPR Conference on Machine Vision Applications (2005)
10. He, R., Zheng, W.-S., Ao, M., Li, S.Z.: Reducing Impact of Inaccurate User Feedback in Face Retrieval. In: Chinese Conference on Pattern Recognition, pp. 1–6 (2008)
11. Sridharan, K., Nayak, S., Chikkerur, S., Govindaraju, V.: A probabilistic approach to semantic face retrieval system. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 977–986. Springer, Heidelberg (2005)
12. Ito, H., Koshimizu, H.: Face Image Retrieval and Annotation Based on Two Latent Semantic Spaces in FIARS. In: Eighth IEEE International Symposium on Multimedia, pp. 831–836 (2006)
13. Zhang, L., Yang, Q., Bao, T., Vronay, D., Tang, X.: Imlooking: Image-based Face Retrieval in Online Dating Profile Search. In: CHI Extended Abstracts, pp. 1577–1582 (2006)
14. Cox, I., Miller, M., Minka, T., Papathomas, T., Yianilos, P.: The Bayesian Image Retrieval System, Pichunter: Theory, Implementation and Psychological Experiments. IEEE Trans. Image Processing 9, 20–37 (2000)
15. Fang, Y., Geman, D.: Experiments in Mental Face Retrieval. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 637–646. Springer, Heidelberg (2005)
16. Phillips, P.J., Scruggs, W.T., O'Toole, A.J., et al.: FRVT 2006 and ICE 2006 Large-Scale Experimental Results. IEEE Trans. Pattern Analysis and Machine Intelligence 32(5), 831–846 (2010)
17. Fang, Y., Luo, J., Lou, C.: Fusion of Multi-directional Rotation Invariant Uniform LBP Features for Face Recognition. In: International Symposium on Intelligent Information Technology Application, vol. 2, pp. 332–335 (2009)
Multi-symbology and Multiple 1D/2D Barcodes Extraction Framework Daw-Tung Lin and Chin-Lin Lin Department of Computer Science and Information Engineering National Taipei University No. 151, University Rd., Sansia, Taipei County 23741, Taiwan
[email protected],
[email protected]
Abstract. Image-based barcode recognition is a robust and extendable approach for reading versatile 1D/2D barcodes. Most methods discussed in the literature either work for a single 1D/2D barcode or rely on finding a unique finder pattern. Multi-symbology barcode extraction is a practical yet challenging issue. Extending our preliminary investigation and with realistic use in mind, this work proposes a general segmentation framework to extract real barcodes from a complex background when multiple types of symbology appear in the same snapshot, whether 1D barcodes, 2D barcodes, or both co-exist. The proposed algorithm has three main steps: background small-clutter elimination, potential barcode segmentation and barcode verification. The whole algorithm combines several image processing methods, namely image subtraction, Gaussian smoothing filtering, morphological operations, connected component labeling and iterative thresholding. Experimental results indicate that the proposed approach can segment multiple barcodes from complex backgrounds with acceptable accuracy. Keywords: Multiple symbology; 1D/2D barcode segmentation; background clutter elimination; barcode verification; automatic barcode extraction.
1 Introduction

Barcode technology has been widely applied in numerous fields such as daily goods labeling, industrial products, pharmacy descriptions, automatic identification, inventory inspection, postal services, library management, and banking systems [1–5]. Two barcode symbology types have been standardized according to the way data are represented: one-dimensional (1D) barcodes and two-dimensional (2D) barcodes. A one-dimensional barcode consists of black and white parallel lines of different widths. A two-dimensional barcode is a two-dimensional black-and-white pattern; however, different 2D symbols have their own morphological structure, including stacked rows of 1D barcodes (e.g., PDF417), special patterns embedding one or several locating target marks (e.g., QR code and Maxicode), and 2D patterns delimited by specific edges (e.g., Datamatrix). 2D barcode methodology has advantages over linear barcodes in terms of large information storage and robust error-correcting capability. In Japan, the QR code has been widely used to exchange messages
in daily life. Taiwan High Speed Rail prints a QR code on the train ticket to prevent ticket forgery. The PDF417 barcode is utilized by the Ministry of Finance in Taiwan for income tax records. Besides, 2D barcodes are being deployed extensively in several tagging systems, such as life sciences, agricultural product portfolios, semiconductors, and electronics. Laser barcode scanners are commonly employed to read 1D barcodes. However, a laser barcode scanner has reading constraints and can only read one barcode at a time. An image-based barcode scanner is more practical and has advantages over a laser
Fig. 1. Flowchart of the proposed general barcode extraction framework
scanner. For instance, an image-based barcode reading system can read multiple 1D and 2D barcodes at the same time. Many image-based methods for barcode identification have been proposed. Chen et al. presented a two-stage approach, involving connecting the contours of the orientation-based region and locating the contour-connected component-based target, to segment diversified barcodes [6]. Zhang et al. proposed another two-stage approach with two down-sampled resolutions, in which the barcode is identified by region-based analysis [7]. Chandler and Batterman developed an omnidirectional barcode reader that computes the accumulated sum of the products of the derivatives of respective first and second lines to locate the barcode image, and then calculates the cross-correlation of interpolated scan line data to obtain the orientation of the located barcode [8]. Fang et al. presented a projection method to extract linear code (Code 39) features for recognition [9]. Ando and Hontani extended their feature extraction method to extract and read barcodes in 3D scenes using categorization and projection of edges, ridges, corners and vertices [10]. Ouaviani et al. adopted several image processing techniques to segment the most common 2D barcodes, including QR code, Maxicode, Datamatrix, and PDF417 [11]. Hu et al. presented a 2D barcode extraction system based on texture direction analysis [12]. Chin et al. used integral images to detect 2D barcodes at generic angles [13]. Liang et al. combined three stages of image processing algorithms to segment barcodes from the original image with acceptable accuracy [14]. Zafar et al. utilized two existing software tools to detect and decode multiple Datamatrix barcodes [15]. Most other techniques in the literature either work for a single 1D/2D barcode, rely on finding a unique finder pattern, or are based on naïve assumptions such as a known code type or a specific starting location. Extending our preliminary investigation [16] and with realistic use in mind, we propose a general segmentation framework to segment out the locations of real barcodes in a complex background when multiple types of symbology appear in the same snapshot, whether 1D barcodes, 2D barcodes, or both co-exist. The rest of this paper is organized as follows. Section 2 describes the detailed issues involved in the automatic barcode segmentation procedure of our proposed system. Section 3 presents the testing results, and Section 4 draws conclusions and discusses possible future work.
2 Automatic Barcodes Segmentation

The proposed system can be divided into three parts: background small-clutter reduction, candidate barcode segmentation and barcode verification. Figure 1 delineates the flowchart of the proposed barcode extraction system. First, the input image is converted to a gray-level image. Second, the max-min differencing operation is utilized to reduce small background clutter. Then, several digital image processing techniques are used to segment the candidate barcodes. After the candidate barcodes are obtained, the system classifies whether each candidate is a real barcode or not. Finally, the system rotates a genuine barcode if it is not in the correct orientation.

2.1 Background Small Clutters Elimination

In general, the input image may contain many objects, such as text, logos, lines, and so on. Furthermore, 1D and 2D barcodes may appear in the same snapshot, as shown
in Fig. 2(a). A max-min differencing operation is proposed to reduce the small clutter. This operation removes thin and small background clutter in the input image and enhances the barcode regions, where black and white elements interleave. Two reduced-size images are generated by over-sampling and under-sampling operations that take, respectively, the maximum and minimum pixel values of each 4-by-4 sub-image of the original image. Subtracting the under-sampled image from the over-sampled image then gives the differencing image D(x, y), calculated by the following equation:
D(x, y) = max{ f(4x − i, 4y − j) } − min{ f(4x − i, 4y − j) }    (1)

where 1 ≤ x ≤ H/4, 1 ≤ y ≤ W/4, and H and W denote the height and width of the original image, respectively. The term "max" indicates the maximum pixel value of a 4-by-4 sub-image f(4x − i, 4y − j), where 0 ≤ i ≤ 3 and 0 ≤ j ≤ 3. Similarly, the term "min" indicates the minimum pixel value of a 4-by-4 sub-image of the original image. By this operation, the barcodes and solid printed regions are retained, while most of the thin and small texts and lines are removed. The method captures object regions with high variation between neighboring pixels, which is characteristic of black-and-white barcodes. Figure 2(b) displays the result of applying the background clutter elimination method to the input barcode image in Fig. 2(a).
Fig. 2. (a) The input image captured by CCD camera containing versatile 1D/2D barcodes. (b) The differencing image processed by max-min differencing.
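A minimal sketch of the max-min differencing in Eq. (1) is given below, assuming f is a grayscale image stored as a 2-D array; the trimming of rows and columns that do not fill a complete 4 × 4 block is our own handling.

    import numpy as np

    def max_min_difference(f):
        """Max-min differencing of Eq. (1): for every 4x4 block, subtract the
        block minimum (under-sampled image) from the block maximum
        (over-sampled image)."""
        h, w = f.shape
        blocks = f[:h - h % 4, :w - w % 4].reshape(h // 4, 4, w // 4, 4)
        over = blocks.max(axis=(1, 3))    # block-wise maximum
        under = blocks.min(axis=(1, 3))   # block-wise minimum
        return over - under               # differencing image D(x, y)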
2.2 Potential Barcodes Segmentation
The differencing of the over-sampled and under-sampled images preserves the barcode regions, yet some noise and other objects remain, as shown in Fig. 2(b). A Gaussian smoothing filter is used to remove some of the noise. Furthermore, morphological operations are used to fill the vacancies of the potential barcode regions, and a connected component labeling algorithm is utilized to segment the potential barcode regions. To remove the noise in the differencing image, a 5-by-5 smoothing filter is used. This operation removes significant noise and small regions by thresholding, as given in Equation (2):
C(x, y) = 1,  if Σ_{i=−2}^{2} Σ_{j=−2}^{2} w(i, j) · D(x + i, y + j) ≥ T1 ;   C(x, y) = 0,  otherwise    (2)
where w(i, j) is the mask coefficient of a smoothing filter with equal weights 1/25, D(x, y) denotes the differencing image described in the previous stage, and T1 represents a threshold, which is set to 33. After the Gaussian smoothing filter is applied, some vacancies or holes remain in the potential barcode regions. To fill these vacancies or holes, morphological operations are utilized. First, the smoothed image C(x, y) is shrunk by eroding it with a 5-by-5 square structuring element. Then, the resulting image is expanded by dilating it with a 5-by-5 circular structuring element. As expected, the proposed method successfully extracts candidate barcode regions, as shown in Fig. 3.
Fig. 3. Potential barcode regions obtained by morphology operation
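The smoothing, thresholding and morphological steps could be sketched as follows; T1 = 33 follows the text, but the use of a 5 × 5 square structuring element for both erosion and dilation (the paper uses a circular element for dilation) and the scipy-based implementation are simplifications of our own.

    import numpy as np
    from scipy import ndimage

    def candidate_mask(D, T1=33):
        """Eq. (2) followed by the morphological clean-up described above."""
        smoothed = ndimage.uniform_filter(D.astype(float), size=5)  # 5x5 mean filter
        C = smoothed >= T1                                          # Eq. (2)
        C = ndimage.binary_erosion(C, structure=np.ones((5, 5)))    # shrink
        C = ndimage.binary_dilation(C, structure=np.ones((5, 5)))   # expand
        return C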
The final step of candidate barcode segmentation is to determine the connected areas of the remaining objects by connected component labeling. Instead of using a conventional progressive approach to connected component labeling, this work adopts an array to deal with equivalent labels, as shown in Algorithm 1. Pass one records equivalent labels and assigns temporary labels. Pass two replaces each temporary label by its equivalent label in the equivalence list. The proposed two-pass connected component labeling algorithm is more efficient than Chang, Chen, and Lu's algorithm [17].

2.3 Barcode Verification
When the potential barcodes are detected, the real barcodes need to be distinguished from background clutter. Identifying real barcodes effectively is important to avoid false positives. This work classifies candidate barcodes according to histogram examination and the size of the segmented region. The histogram of a real barcode tends to concentrate at low gray levels, whereas the histogram of background clutter tends to scatter over the whole grayscale range. Additionally, the size of a real barcode should be larger than a fixed constant, set to 725. These conditions are used to verify the candidate barcodes. Figure 4 demonstrates that the proposed system achieves multiple omnidirectional barcode segmentation for the example barcode image shown in Fig. 2(a).
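The verification stage might look like the sketch below; the minimum region size of 725 follows the text, while the way the low-gray-level concentration is measured (fraction of pixels below mid-gray) and its threshold are illustrative assumptions of ours.

    import numpy as np

    def is_real_barcode(gray_region, min_size=725, dark_fraction=0.5):
        """Classify a candidate region using its gray-level histogram and size."""
        if gray_region.size < min_size:
            return False
        hist, _ = np.histogram(gray_region, bins=256, range=(0, 256))
        low = hist[:128].sum() / float(gray_region.size)  # mass at low gray levels
        return low >= dark_fraction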
Algorithm 1: Connected Component Labeling Algorithm
f: input image
r: output image
Create the equivalent list L[]
Label = 0
NumComponent = 0

// pass one
for row = 1 to height do
  for col = 1 to width do
    if f(row, col) is not background then
      if f(row-1, col) and f(row, col-1) are background then
        Label += 1; r(row, col) = Label; L[Label] = Label;
      else if f(row-1, col) is background then
        r(row, col) = r(row, col-1);
      else if f(row, col-1) is background then
        r(row, col) = r(row-1, col);
      else
        m = min(r(row-1, col), r(row, col-1));
        r(row, col) = m;
        L[max(r(row-1, col), r(row, col-1))] = m;
      end if
    end if
  end for
end for

// process the equivalent list
for index = 1 to Label do
  if L[index] == index then
    NumComponent += 1;
    L[index] = NumComponent;
  else
    L[index] = L[L[index]];
  end if
end for

// pass two
for row = 1 to height do
  for col = 1 to width do
    r(row, col) = L[r(row, col)];
  end for
end for
Fig. 4. The extraction result of the proposed system for test image in Fig. 2(a)
3 Experimental Results

The experiments were performed on a mixture of 1D and 2D barcodes. Multiple types of symbology are tested, including seven 1D barcode types: EAN-13, EAN-8, UPC-E, UPC-A, Code 39, Code 128, and Interleaved 2 of 5 (I25), and nine 2D barcode types: PDF417, Datamatrix, QR Code, Micro PDF417, Micro QR Code, Maxicode, Codablock, Aztec Code, and Composite Code. We used a SONY DSC-P10 auto-focus digital camera to capture several types of barcodes at VGA resolution (640 × 480). The distance between the camera and the barcodes is not fixed. In total, there are 627 images of various objects with complex backgrounds, containing 415 1D barcodes and 492 2D barcodes. Figure 5 illustrates some examples of the test images. The barcode extraction results of Figs. 5(a), 5(b) and 5(c) are presented in Figs. 6(a), 6(b), and 6(c), respectively.
Fig. 5. Some examples of barcode test images
Fig. 6. The barcode extraction results of the test images shown in Figs. 5(a), 5(b) and 5(c), respectively
Our previous study focused on single-type (1D) barcode recognition and achieved satisfactory segmentation performance [16]. Table 1 reprints the experimental results of the proposed method and compares them with two other works. The developed framework works well for pure 1D barcode segmentation and achieves a 95.62% correct segmentation rate. The proposed system outperforms the other methods, including those of Chen et al. [6] and Zhang et al. [7], with much less false positive segmentation. Extending our previous study, we aim at multi-symbology barcode segmentation, i.e., 1D and 2D barcodes may co-exist in the same input image. The characteristic of 1D barcodes is that they contain many parallel lines; however, 2D barcodes do not have a common characteristic.

Table 1. Correct extraction rate of the proposed method, and those of Chen et al. and Zhang et al., applied to various objects and multiple barcodes [16]

Code type   # of barcodes   Proposed method   Chen et al. [6]   Zhang et al. [7]
Code 39          56            100%               96.43%            94.96%
I25              47            95.74%             93.62%            91.49%
UPC-E            61            95.08%             93.44%            85.25%
UPC-A            43            95.35%             90.70%            83.72%
EAN-8            70            91.43%             90.00%            72.86%
EAN-13           55            98.18%             94.55%            76.36%
Code 128         56            94.64%             91.07%            83.93%
Overall         388            95.62%             92.78%            83.51%
Table 2. Correct extraction rate of the 415 1D barcodes contained in the test images

Code type   # of barcodes   # of correct segmentations   Correct segmentation rate
EAN-13           86                 76                        88.37%
EAN-8            63                 56                        88.89%
UPC-E            61                 55                        90.16%
UPC-A            45                 44                        97.78%
Code 39          57                 55                        96.49%
Code 128         56                 51                        91.07%
I25              47                 44                        93.62%
Overall         415                367                        92.34%
Table 3. Correct extraction rate of the 492 2D barcodes contained in the test images

Code type        # of barcodes   # of correct segmentations   Correct segmentation rate
PDF417                67                 58                        86.57%
Datamatrix            64                 53                        82.81%
QR Code               57                 48                        84.21%
Micro PDF417          55                 52                        94.55%
Micro QR Code         62                 54                        87.10%
Maxicode              43                 37                        86.05%
Codablock             48                 46                        95.83%
Aztec Code            50                 43                        86.00%
Composite Code        46                 40                        86.96%
Overall              492                424                        87.79%
Thus, it is challenging to segment 2D barcodes and 1D barcodes at the same time. We have tested 627 images, each of which may contain different combinations of 1D and 2D barcodes. Although this is a difficult task, our proposed method still achieves an acceptable correct segmentation rate. Table 2 and Table 3 list the experimental results for each type of barcode contained in the test images. For 1D barcodes, the proposed system achieves an average successful segmentation rate of 92.34%, while the average correct extraction rate for 2D barcodes is 87.79%.
4 Conclusion

This study proposes a general segmentation framework to extract real barcodes from a complex background, especially when multiple types of symbology appear in the same snapshot, whether 1D barcodes, 2D barcodes, or both co-exist. The proposed method is divided into three main parts, namely background small-clutter elimination, potential barcode segmentation and barcode verification. The experimental results indicate that the proposed framework achieves an acceptable segmentation rate. Some tasks remain to be considered. To improve segmentation performance, we should review the incorrect segmentation cases and revise the barcode verification approach. Additionally, the proposed system will be further tested on barcodes at varying scales. Furthermore, we will continue to develop the decoding mechanism for 2D barcodes, which will assist and improve the segmentation method.
References

1. Sriram, T., Vishwanatha Rao, K., Biswas, S., Ahmed, B.: Applications of Barcode Technology in Automated Storage & Retrieval Systems. In: Proceedings of the IEEE International Conference on Industrial Electronics, Control, and Instrumentation, vol. 1, pp. 641–646 (1996)
2. Youssef, S.M., Salem, R.M.: Automated Barcode Recognition for Smart Identification and Inspection Automation. Expert Systems with Applications 33, 968–977 (2007)
3. Lu, X., Fan, G., Wang, Y.: A Robust Barcode Reading Method Based on Image Analysis of a Hierarchical Feature Classification. In: International Conference on Intelligent Robots and Systems, pp. 3358–3362 (2006)
4. Ohbuchi, E., Hanaizumi, H., Hock, L.A.: Barcode Readers Using the Camera Device in Mobile Phones. In: Proceedings of the IEEE International Conference on Cyberworlds, pp. 260–265 (2004)
5. Kato, H., Tan, K.T.: 2D Barcodes for Mobile Phones. In: International Conference on Mobile Technology, Applications and Systems, pp. 1–8 (2005)
6. Chen, Y., Yang, Z., Bai, Z., Wu, J.: Simultaneous real-time segmentation of diversified barcode symbols in complex background. In: First International Workshop on Intelligent Networks and Intelligent Systems, ICiNIS 2008, pp. 527–530 (2008)
7. Zhang, C., Wang, J., Han, S., Yi, M., Zhang, Z.: Automatic real-time barcode localization in complex scenes. In: IEEE International Conference on Image Processing, pp. 497–500 (2006)
8. Chandler, D.G., Batterman, E.P.: Omnidirectional barcode reader with method and apparatus for detecting and scanning a bar code symbol. US Patent 5,155,343 (1992)
9. Fang, X., Wu, F., Luo, B., Zhao, H., Wang, P.: Automatic recognition of noisy code-39 barcode. In: 16th International Conference on Artificial Reality and Telexistence, pp. 79–82 (2006)
10. Ando, S., Hontani, H.: Automatic visual searching and reading of barcodes in 3-D scene. In: Proceedings of the IEEE International Vehicle Electronics Conference, pp. 49–54 (2001)
11. Ouaviani, E., Pavan, A., Bottazzi, M., Brunelli, E., Caselli, F., Guerrero, M.: A Common Image Processing Framework for 2D Barcode Reading. In: 7th International Conference on Image Processing and Its Applications, vol. 2, pp. 652–655 (1999)
12. Hu, H.Q., Xu, W.H., Huang, Q.: A 2D Barcode Extraction Method Based on Texture Direction Analysis. In: Fifth International Conference on Image and Graphics, pp. 759–762 (2009)
13. Chin, T.J., Goh, H., Tan, N.M.: Exact Integral Images at Generic Angles for 2D Barcode Detection. In: ICPR 2008, 19th International Conference on Pattern Recognition, pp. 1–4 (2008)
14. Liang, Y., Wang, Z., Cao, X., Xu, X.: Real Time Recognition of 2D Bar Codes in Complex Image Conditions. In: International Conference on Machine Learning and Cybernetics, pp. 1699–1704 (2007)
15. Zafar, I., Zakir, U., Edirisinghe, E.A.: Real Time Multiple Two Dimensional Barcode Reader. In: IEEE International Conference on Industrial Electronics and Applications, pp. 427–432 (2010)
16. Lin, D.T., Lin, M.C., Huang, K.Y.: Real-time Automatic Recognition of Omnidirectional Multiple Barcodes and DSP Implementation. Journal of Machine Vision and Applications, in revision (2010)
17. Chang, F., Chen, C.J., Lu, C.J.: A Linear-Time Component Labeling Algorithm Using Contour Tracing Technique. In: The 7th International Conference on Document Analysis and Recognition, pp. 741–745 (2003)
Wikipedia Based News Video Topic Modeling for Information Extraction Sujoy Roy, Mun-Thye Mak, and Kong Wah Wan Institute for Infocomm Research, A*STAR, Singapore
Abstract. Determining the topic of a news video story (NVS) from its audio-visual footage is an important part of meta-data generation. In this paper we propose a news story topic modeling approach that takes advantage of online knowledge resources like Wikipedia to model the topic of a news story. An NVS is modeled as a distribution over several Wikipedia pages related to the story. The mapping of the NVS to a Wikipedia page table-of-contents (TOC) is also determined. The specific advantages of this topic modeling approach are: (1) the topic is interpretable as a weighted distribution over a set of semantically meaningful story title phrases instead of just being a collection of words; (2) it facilitates organizing news video stories as a taxonomy that captures several perspectives on the story; (3) the taxonomy facilitates exploration and non-linear search. Performance evaluations from an information extraction perspective validate the efficacy of the proposed topic modeling approach compared to TF-IDF and LDA based approaches on a large news video corpus. Keywords: Video Topic Modeling, Non-linear Search, Wikipedia.
1 Introduction
An essential component of any multimedia content based recommendation system is discriminative metadata in a human-expressible and interpretable form. A valuable piece of metadata for news videos is the audio transcript (ASR-t). Given a video of a news story, its ASR-t can be used to infer the topic of the story. The most straightforward way of getting some idea about the topic is to find discriminative words inside the ASR-t. While this has been quite effective in enabling information extraction by simple word matching, it can fail when the information extraction intent involves higher-level semantics. For example, a search query "Sichuan earthquake relief" will find videos that contain the words "Sichuan", "earthquake" and "relief". It will also return videos just on "relief" or just "Sichuan", while the query intent was probably about "relief efforts" in the context of the "Sichuan earthquake". The lack of understanding of the semantic structure in the query and in the way we model the topic of news stories leads to this ambiguity. So the question raised herein is: can we design a topic modeling approach that carries semantic structure and is human-interpretable?
Fig. 1. Information Extraction Framework enabled by Wikipedia based topic modeling approach
This will facilitate semantic queries better. That is, given a query, the IR system can understand the user's expressed intent and recommend content to serve that intent.
2
State of the Art
One of the tasks of the TDT competition organized in 2001 by NIST was topic detection in news videos. The approach adopted by most techniques is to view an audio transcript as a bag of words with some tf-idf based weighting for each word. This gives a term (word) vector in word space. Given a training set of transcripts represented as word vectors in word space, topics are determined by the centers of the clusters of word vectors. This gives a discriminative model of the topics in the training set, with the cluster centers as the latent topics. A new transcript is assigned a topic according to its nearness to a cluster center. Although some good results have been reported, this approach is lacking in several ways: (1) it does not have a theoretical basis; (2) a topic is described as a collection of words whose interpretability is limited by the lack of higher semantic structure; (3) stories with a short lifespan are hard to identify. In recent times several probabilistic topic modeling approaches (pLSI [7], LDA [3], CTM [2]) have been proposed in which a document is expressed as a mixture of topics. The topic proportions are drawn once per document and the topics are shared across the training corpus. This can also be applied to news audio transcripts: given a collection of web news articles within a time interval around the concerned story, latent topics can be learnt, from which the story is assigned a topic. Nevertheless, the problem of not detecting stories with a short lifespan remains, and the interpretability of the topic is still unclear, as the topic is still a collection of words. Recent work by Chang et al. [5] shows that although these complicated probabilistic topic modeling approaches do well at finding collections of words as topics in large collections of unstructured text, the resulting topics are less interpretable than those of simpler models, as assessed by human evaluation tasks.
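As a concrete illustration of the LDA baseline discussed above (this is not the method proposed in the paper), the following sketch expresses each ASR transcript as a mixture of latent topics; the transcript strings, the topic count and the choice of scikit-learn are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical ASR-t strings standing in for a real transcript corpus
transcripts = ["sichuan earthquake relief effort continues in the region",
               "obama wins the presidential election after long campaign"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(transcripts)                    # bag-of-words counts

lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(X)                          # per-document topic mixture

# a "topic" here is still only a ranked word list, which is exactly the
# interpretability limitation pointed out in the text
words = vec.get_feature_names_out()                   # sklearn >= 1.0
top_words = [words[i] for i in lda.components_[0].argsort()[::-1][:10]]
print(theta[0], top_words)
```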
Moreover, from an IR perspective, because most search techniques involve simple tf-idf weighting based word matching and do not account for any semantic structure in the query or in the documents searched, complicated topic modeling based information extraction will not necessarily perform significantly better than information extraction on the collection of ASR-t's themselves. This is also confirmed by Xing et al. [12]. This points at the importance of developing topic models that carry semantic structure beyond textual word semantics.
3
Wikipedia-Based Topic Modeling
Wikipedia is the largest community-developed online encyclopedia and is very up to date in its coverage of most news stories. Unlike other standard ontologies, such as WordNet, Wikipedia itself is not a structured thesaurus. However, it is much more comprehensive and well-formed. The English Wikipedia has over 3 million pages to date and is growing every day. Wikipedia is also a huge graph whose links between pages give the relationships between them. In Wikipedia, each article is identified by its distinct title, which can be considered a topic. The title of each article is a succinct phrase that resembles an ontology term. Equivalent concepts are grouped together by redirect links. Several Wikipedia articles (particularly those on specific stories) contain a table-of-contents structure that divides the text into several perspectives on the topic. Wikipedia also contains a hierarchical categorization system, in which each article belongs to at least one category. In this paper we propose a topic modeling algorithm where a news story topic is modeled as a distribution over several Wikipedia pages. For example, a news story on "The re-capture of Mas Salemat Bin Kastari" will be modeled by a weighted distribution over related Wikipedia articles like "Mas Salemat Bin Kastari", "JI", "Internal Security Act", "Royal Malaysian Police", "Internal Security Department", "Pasukan Gerakan Khas", etc. Figure 2 presents an example topic model for an ASR-t. The weights of the distribution give the relevance of the ASR-t to the Wikipedia articles. The model also contains the mapping of the ASR-t into each wiki page's table of contents. The first observation we make is that the topic of an ASR-t can convey different perspectives on the same story. Hence, modeling the topic as a distribution over different perspectives of the same story seems natural. Our topic modeling approach realizes this fact and presents a simple way of representing it. The use of Wikipedia as the space of reference articles gives us the added advantage of tapping into a resource that is almost complete and up-to-date. Secondly, the mapping of the ASR-t into the TOC of each Wikipedia page gives us the perspective that the concerned ASR-t is dealing with. Thirdly, we get to know how the different video stories are related to each other based on the TOC (taxonomy) of each Wikipedia page. This is a simple way of organizing video stories based on relationships between pairs of video stories, which are no longer mere numbers but carry semantic meaning. Note that the TOC organization is a taxonomy that
Fig. 2. Proposed topic modeling approach. (a) A news story transcript with important words highlighted and the perspective underlined. (b) Topic of (a) as a distribution over Wikipedia pages. (c) Top 15 Wikipedia pages that model the news story topic.
allows non-linear search and hence facilitates exploration and learning. Figure 1 depicts the complete flow of our proposed topic modeling approach. In the rest of the paper we describe each component of this figure and put together the final system in the context of information extraction. Although some works [8,9] have looked at using Wikipedia as an external information resource for understanding video content, most either use Wikipedia categories to organize content, which is limited because of the sheer size and generality of the Wikipedia category system, or use Wikipedia as a text dump from which contextual supervisory information is learnt for other purposes. This work shows an effective way of modeling videos using the Wikipedia articles themselves in an unsupervised way and of using the TOC structure of each Wikipedia article to organize them.
3.1 Topic Modeling
This section details the proposed method for modeling the topic of an ASR-t. Given the ASR-t A of a video story and a set of n web news articles D = {T1, ..., Tn} published within a window of time around the time of broadcast of A, our goal is to identify a distribution over a set of Wikipedia articles that are related to A. The weights of the distribution give the degree of relevance of A to each Wikipedia article. Note that the set of web news articles D is used to compensate for the fact that A is noisy. First, the web news articles in D are ranked according to their degree of relevance to A using a modified tf-idf based ranking scheme. This gives a score S = {s1, ..., si, ..., sn} for every web news article in relation to A. Next, the headline of each web news article Ti ∈ D is used to search for Wikipedia articles through Google Search.¹ This gives a rank list Ri = {r1, ..., rj, ..., rm} of Wikipedia articles Wi = {w1, ..., wj, ..., wm} corresponding to each Ti. The relevance of each wj to Ti is computed as si/rj. This gives a table W vs. D wherein a column contains the relevance of Wikipedia articles to each Ti.
¹ The reason for going through Google search, instead of directly searching Wikipedia, is based on the observation that Google handles different kinds of search queries, accounting for spelling mistakes and acronym mapping, very robustly.
The relevance of all the Wikipedia articles W = {w1, ..., wm×n} to A is computed by averaging the scores along each row of this table. This gives a relevance score for every Wikipedia article in W. We note that this can give low scores to relevant Wikipedia articles whose relevance to A is small yet significant. To circumvent this problem, we observe that related Wikipedia articles generally form a connected sub-graph in the Wikipedia graph. Mining this sub-graph from the Wikipedia graph itself is a difficult task. In our framework, we observe that we can limit the search for the relevant sub-graphs to the set of Wikipedia articles in W and thus leverage the human-annotated relations between stories. This essentially gives a better re-ranking of the Wikipedia articles. The connected subgraphs of Wikipedia articles in W are identified by processing the text and hyperlinks inside each Wikipedia article which point to other related articles; articles not in W are discarded. This gives us the connected components of Wikipedia articles. Next, we find the average score of each subgraph in S and rank the Wikipedia articles based on that score. A node common to two subgraphs is assigned to the subgraph with the higher score. This gives a re-ranked list, or a distribution, of Wikipedia articles Ŵ = {w1, ..., wt} that are relevant to A. Next, we apply a state-of-the-art near-duplicate detection routine to lower the weight of near-duplicate videos in the collection of video stories mapped to a Wikipedia article. The weights over the set of Wikipedia articles are normalized, where the weight of each article signifies the relevance of the article to the ASR-t. Applying this topic modeling approach to every ASR-t gives a unique set of t = T Wikipedia articles which defines the space of Wikipedia articles relevant to a set of ASR-t's. Note that we did not directly use the ASR-t to retrieve documents from Wikipedia itself because the ASR-t is too long to use as a query; instead, the news article headlines are used to search Wikipedia. Apart from Ŵ, we also find the mapping of each A to the table-of-contents (TOC) of each component Wikipedia article wj ∈ Ŵ (M : A → k, where k is a leaf index in the TOC of wj), using a tf-idf based matching method. We generate search results using triplets of words in A as queries and collate the ranked results to find the mapping of the ASR-t to the TOC.
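The core scoring step just described can be summarized in a short sketch. This is a simplified reading of the procedure, assuming the tf-idf scores of the news articles and the Google/Wikipedia rank lists are already available; since the paper does not say how missing table entries are treated, each Wikipedia article is averaged only over the news articles that actually retrieved it.

```python
import numpy as np
from collections import defaultdict

def wiki_distribution(article_scores, wiki_rank_lists):
    """article_scores[i] = s_i, relevance of news article T_i to the ASR-t A.
    wiki_rank_lists[i]  = ranked Wikipedia titles retrieved for T_i's headline.
    Returns a normalized weight per Wikipedia title (the raw topic model,
    before the sub-graph based re-ranking described above)."""
    table = defaultdict(list)
    for s_i, ranked in zip(article_scores, wiki_rank_lists):
        for r_j, title in enumerate(ranked, start=1):
            table[title].append(s_i / r_j)          # relevance of w_j w.r.t. T_i
    scores = {t: float(np.mean(v)) for t, v in table.items()}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}
```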
3.2 Video Story Similarity Matching
The next concern is finding the similarity/dissimilarity between two news video stories based on matching their topic models. Note that the topic model allows matching at different levels of abstraction. Given a pair of video story ASR-t's A_1 and A_2, with topic models Ŵ_1 and Ŵ_2, at the first level of abstraction the similarity score between them is computed as

S_{1,2} = \sum_{i=1}^{T} \frac{d(\hat{W}_1(1:i),\ \hat{W}_2(1:i))}{i},

where Ŵ_1(1:i) are the i elements in Ŵ_1 sequentially ordered according to their relevance to A_1 based on weight, and d(A, B) is the cardinality of the set A ∩ B.
At a second level of abstraction, the dissimilarity score between them can be computed at the TOC level. The first level of abstraction gives a list of Wikipedia articles that form the intersection of their topic models. Hence, given that the pair of video ASR-t's A_1 and A_2 map to the Wikipedia article w_j, the dissimilarity score D_{1,2→j} between them w.r.t. w_j is the distance between the nodes of the w_j-TOC that these stories map to. The total dissimilarity score is given by

D_{1,2} = \sum_{\forall w_j \in \hat{W}_1 \cap \hat{W}_2} D_{1,2 \to j}
Both S_{1,2} and D_{1,2} are used to compute the final matching score. This scoring technique can be used to rank the video stories when presenting them in a linear search form. It is also used to construct a relationship network (graph) between the different news video stories. Note that the ability to match at two levels of abstraction makes it possible to mine semantic relationships between videos more effectively. The search framework is detailed in the next section.
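A minimal sketch of the two scores defined above, assuming the topic models are given as ranked lists of Wikipedia titles and that a TOC-node distance function is available (the paper does not spell out how that distance is measured, so it is left as an assumed helper):

```python
def first_level_similarity(W1, W2):
    """S_{1,2}: overlap of the top-i elements of the two ranked lists,
    accumulated over i = 1..T and down-weighted by i."""
    T = min(len(W1), len(W2))
    return sum(len(set(W1[:i]) & set(W2[:i])) / i for i in range(1, T + 1))

def second_level_dissimilarity(toc_map1, toc_map2, toc_distance):
    """D_{1,2}: sum of TOC-node distances over the Wikipedia pages shared by
    the two topic models. toc_map maps a page title to the assigned TOC leaf;
    toc_distance(page, leaf_a, leaf_b) is an assumed distance function."""
    shared = set(toc_map1) & set(toc_map2)
    return sum(toc_distance(w, toc_map1[w], toc_map2[w]) for w in shared)
```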
3.3 Search Framework
The ability to compute matching scores between video story pairs at different levels of abstraction enables both horizontal and vertical search. Horizontal search refers to conventional linear search where the results are presented as a ranked list of ASR-t's relevant to a query; the relationships between the ASR-t's themselves need not be accounted for in such a presentation. Vertical search, on the other hand, refers to semantic search where not only the relationship between the query and the articles is considered, but the relationships between the articles themselves are used to present them as a taxonomy (or another semantic structure) that enables exploration. In this work we present a search framework that is essentially horizontal but leverages the ASR-t to TOC mapping information to present search results as a taxonomy, thus enabling exploration. Note that our proposed topic modeling approach generates an index from all ASR-t's to a unique set of Wikipedia articles. The elements of the index contain the weight of relevance of an ASR-t to a Wikipedia article and the mapping of the ASR-t to the TOC of that Wikipedia article. To enable search, we compute another term-document index over the unique set of Wikipedia pages, where the elements of the index contain the tf-idf relationship between terms in the Wikipedia articles and the articles themselves. A more sophisticated index could be computed in which we keep track of where the terms appear in each article, but we have not implemented that and chose to go with a simpler implementation [6]. Given a search query and the above two indices, we first find the set of Wikipedia articles W relevant to the query by simple tf-idf based word matching. Next, we use these articles to look up the ASR-t vs. Wikipedia articles index. This gives the list of ASR-t's that are most relevant to the query and their positions in the TOC of each Wikipedia article in W.
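One possible realization of this two-index lookup, using cosine similarity over tf-idf vectors as a stand-in for the Lucene-style term matching mentioned above; `wiki_texts` and `asr_index` are assumed placeholder structures, not the authors' actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# placeholder data: text of the unique Wikipedia articles, and for each page
# a list of (asr_id, relevance weight, TOC leaf) entries built offline
wiki_texts = ["sichuan earthquake relief effort donations",
              "2008 summer olympics opening ceremony beijing"]
asr_index = {0: [("asr_12", 0.31, "Relief efforts")],
             1: [("asr_07", 0.22, "Opening ceremony")]}

vec = TfidfVectorizer(stop_words="english")
W = vec.fit_transform(wiki_texts)

def search(query, top_k=2):
    sims = cosine_similarity(vec.transform([query]), W).ravel()
    pages = sims.argsort()[::-1][:top_k]            # relevant Wikipedia pages
    # for each page, return its ASR-t's ordered by their relevance weight
    return [(p, sorted(asr_index[p], key=lambda e: -e[1])) for p in pages]

print(search("sichuan earthquake relief"))
```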
Fig. 3. Search User Interface
Figure 3 depicts the search user interface of our system. The interface presents the search results in two panels: (1) the list of ranked relevant Wikipedia articles (bottom panel in Figure 3), and (2) a taxonomy of video stories based on the mapping of ASR-t's to the TOC of the Wikipedia article at the top of the list in (1) (side panel in Figure 3). Note that the interface allows non-linear search and exploration of a story topic using the taxonomy, although the underlying search is a simple linear search. Hence, without explicitly finding the relationships between the videos themselves, mapping the videos to a fixed taxonomy gives a convenient way of organizing and categorizing news story videos.
4 Experimental Analysis
4.1 Data
Although several standard audio-visual datasets exist for research in multimedia retrieval (e.g., TRECVID), because of the additional requirement for a set of time-aligned news articles, in this paper we created a new evaluation dataset using a subset of videos from our in-house news video retrieval system. This dataset comprises 3650 hours of video collected from nine channels of English broadcast news over a period of 2 years from August 2007 to July 2009, totaling 127K news ASR-t's and 202K news articles. On this dataset, we perform two sets of experiments to evaluate the information retrieval performance of our methods on broad and specific queries,
Table 1. Broad queries

1. Air France AF447            9. H1N1 flu
2. Caspian Air Crash           10. mas selamat
3. Environment Issues          11. mumbai attack
4. Italy Earthquake            12. myanmar protest
5. Pakistan Taliban            13. cyclone nargis
6. US Airways Hudson River     14. obama presidential election
7. Yemenia Air Crash           15. sichuan earthquake
8. beijing olympics            16. tamil tigers
respectively. The former concerns retrieval of broad, general queries, and our intent is to compare retrieval performance with existing methods, e.g., TF-IDF [6] and LDA [3]. Our second experiment looks at the case when users are more specific about their information needs. Together, both experiments are intended to cover the wide spectrum of user querying scenarios and thus validate the performance of the proposed topic modeling approach from an information extraction perspective.
4.2 Broad Queries
In this experiment, we compare the retrieval performance of our method with term-matching (TF-IDF) [6] and Bayesian topic modeling (LDA) [3] based topic representations. The main idea is to evaluate how a difference in topic modeling based representation influences information extraction. To constrain the computation load in LDA topic modeling, we use a variant of the method in [1] in the learning process. For LDA-based retrieval, we adopt the language modeling adaptation in [11]. As the evaluation metric, we use the Mean Average Precision (MAP) over the 16 broad queries in Table 1. An advantage of LDA topics is that they are generated from unstructured text, whereas the proposed approach depends on the availability of a human-generated taxonomy of organized information like Wikipedia. From a search perspective, the availability of a taxonomy enables understanding of the contextual and actual words in the query. This is facilitated in the search framework outlined in Section 3.3. Table 2 shows the comparative retrieval results. As expected, the TF-IDF method, which relies on exact term matching, has the worst IR results. This is exacerbated by the noisy ASR-t documents. The LDA-retrieval method of [11] copes better with noise and achieves a minor improvement over the TF-IDF baseline. The Wikipedia based approach outperforms the other two approaches on the majority of the 16 queries and hence has the highest MAP score.
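For reference, the MAP figures reported in Table 2 follow the standard definition of average precision; a small sketch of that computation (not the authors' evaluation code) is shown below.

```python
def average_precision(ranked_ids, relevant):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, ap = 0, 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / k
    return ap / max(len(relevant), 1)

def mean_average_precision(runs):
    """runs: list of (ranked result list, set of relevant ids), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# toy usage: one query with relevant documents {"a", "c"}
print(mean_average_precision([(["a", "b", "c"], {"a", "c"})]))
```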
4.3 Specific Queries
Given the tables-of-contents (TOCs) of a set of Wikipedia pages, our second experiment looks at how well we can match each ASR-t into the TOCs. Note that the subtopics enumerated by the TOCs represent a kind of perspective,
Table 2. Comparative MAP results

Query                      TF-IDF   LDA [11]   Wiki
1. Air France AF447        0.6885   0.7151     0.7154
2. Caspian Air Crash       0.5833   0.4022     0.6109
3. cyclone nargis          0.3970   0.5035     0.5454
4. Environment Issues      0.9309   0.9107     0.5623
5. H1N1 flu                0.5733   0.7136     0.7355
6. US Air Hudson River     0.8957   0.8273     0.7794
7. Italy Earthquake        0.6292   0.6432     0.4131
8. mas selamat             0.5539   0.5316     0.5859
9. mumbai attack           0.3180   0.2662     0.8280
10. myanmar protest        0.7516   0.7606     0.6834
11. obama election         0.7546   0.7801     0.7478
12. Pakistan Taliban       0.7953   0.8128     0.6604
13. sichuan earthquake     0.7855   0.7994     0.8000
14. Yemenia Air Crash      0.5415   0.7835     0.8305
15. beijing olympics       0.4501   0.3562     0.6799
16. tamil tigers           0.3981   0.4272     0.4265
MAP                        0.6279   0.6396     0.6628
Table 3. Recall@K versus K

K          1        3        5        10
Recall@K   0.3412   0.6382   0.7422   0.8478
or aspects, or facets [4], of the information need. By mapping ASR-t's into the subtopics in the TOCs, we introduce a new class of faceted retrieval models [4,10], as follows (a small sketch of the assignment step is given below):
• Given an ASR-t, compute its text vector-based cosine distance to all subtopics in the TOCs.
• Assign the ASR-t to each of the top-K nearest subtopics.
• Given a query, similarly compute its text vector-based cosine distance to all subtopics in the TOCs.
• As the user navigates over the list of top-K nearest subtopics, he can browse the ASR-t's that have been assigned to that subtopic.
We use a set of 25 queries randomly taken from the titles of the subtopics in a list of Wikipedia TOCs relevant to the 16 broad queries used in the first experiment. For example, for the broad query "Sichuan Earthquake", a specific query is taken from the title of one of its subtopics, say "Relief Effort"; this creates a new specific query "Sichuan Earthquake Relief Effort". Table 3 shows the Recall@K over varying K, averaged over the 25 queries. A larger value of K indicates the willingness of the user to browse up to K ASR-t's assigned to a subtopic. Note that this model is useful when query expansion is applied to the original query, producing a longer, articulated query with greater specificity.
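The sketch referenced above, using tf-idf cosine similarity between a text (ASR-t or query) and the TOC subtopic texts; the subtopic strings and variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# placeholder subtopic texts gathered from the Wikipedia TOCs
subtopic_texts = ["relief effort donations rescue teams",
                  "casualties and damage assessment",
                  "reconstruction and rebuilding plans"]

vec = TfidfVectorizer(stop_words="english")
S = vec.fit_transform(subtopic_texts)

def top_k_subtopics(text, k=2):
    """Indices of the K nearest TOC subtopics for an ASR-t or an expanded query."""
    sims = cosine_similarity(vec.transform([text]), S).ravel()
    return np.argsort(-sims)[:k]

print(top_k_subtopics("sichuan earthquake relief effort"))
```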
5
Discussions
In this work we have presented a Wikipedia based topic modeling approach and validated its efficacy for information extraction tasks. The proposed approach outperforms standard TF-IDF and LDA topic modeling based approaches in terms of MAP performance. We have also demonstrated the additional advantage of enabling specific search under the proposed framework. One clear advantage of the proposed topic modeling approach is the ability to map videos to the table-of-contents based taxonomy of Wikipedia. Figure 1 depicts the complete flow of our system. One limitation of the Wikipedia based approach is the underlying assumption that there will be at least one Wikipedia article that mentions the news video story, which may not be the case for all news stories. Our observation, however, is that Wikipedia is such a well-covered repository that it contains references to practically all kinds of news stories.
References
1. AlSumait, L., Barbar, D., Domeniconi, C.: On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 3–12. Springer, Heidelberg (2008)
2. Blei, D.M., Lafferty, J.D.: Correlated topic models. In: NIPS (2005)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Carterette, B., Chandar, P.: Probabilistic models of ranking novel documents for faceted topic retrieval. In: CIKM, pp. 1287–1296 (2009)
5. Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., Blei, D.M.: Reading tea leaves: How humans interpret topic models. In: Neural Information Processing Systems, Vancouver, British Columbia (2009)
6. Hatcher, E., Gospodnetic, O.: Lucene in Action. In Action series (2004)
7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of ACM SIGIR Conference, pp. 50–57 (1999)
8. Kürsten, J., Richter, D., Eibl, M.: VideoCLEF 2008: ASR classification with Wikipedia categories. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 931–934. Springer, Heidelberg (2009)
9. de Wit, J., Raaijmakers, S., Versloot, C.: A cocktail approach to the VideoCLEF 2009 linking task. In: Peters, C., Caputo, B., Gonzalo, J., Jones, G.J.F., Kalpathy-Cramer, J., Müller, H., Tsikrika, T. (eds.) CLEF 2009. LNCS, vol. 6242, pp. 401–408. Springer, Heidelberg (2010)
10. Wan, K., Tan, A., Lim, J., Chia, L.: Faceted topic retrieval of news video using joint topic modeling of visual features and speech transcripts. To appear in ICME (2010)
11. Wei, X., Croft, W.: LDA-based document models for ad-hoc retrieval. In: ACM SIGIR, pp. 178–185 (2006)
12. Yi, X., Allan, J.: Evaluating topic models for information retrieval. In: CIKM, pp. 1431–1432 (2008)
Advertisement Image Recognition for a Location-Based Reminder System
Siying Liu, Yiqun Li, Aiyuan Guo, and Joo Hwee Lim
Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis South Tower, Singapore 138632
Abstract. In this paper, we propose a location-based reminder system on mobile phones using image recognition technology. With this system, mobile phone users can actively capture images from their favorite product or event promotional materials. Upon recognition of the image sent to a computer server, location-based reminders are downloaded to the phone. The mobile phone alerts the user when he/she is close to the place where the product is being sold or the event is happening. Near-duplicate image recognition is employed to identify the advertisement. Using scale-invariant features followed by kd-tree image matching and geometric validation, near-duplicates of trained images in the database are recognized. The image recognition provides accurate and efficient retrieval of the corresponding reminders. A mobile client application is developed to capture images, conduct GPS location tracking, and pop up reminders.
1
Introduction
We already have tools in our mobile phones and PC calendars to remind us to do something based on a time schedule. With the increasing popularity of GPS-enabled mobile phones, location-based services are growing quickly. Geominder [1] is one of the applications that provide location-based reminders. By associating to-do tasks with physical locations, location-based reminders provide a more convenient means of task management than conventional time-based ones. Geominder uses the mobile network's cell ID information to obtain the location of a place such as home, market, or office. The mobile phone user is first required to teach Geominder the locations; the reminders for those locations are then created. After that, when the user arrives at a marked location, the reminder notifies the user with either text or a recorded voice message. In our daily life, we are flooded by advertisements competing for our attention. We may be interested in some of the promotions, but it would be troublesome to commit them to memory. Recent advances in wireless communication open up new opportunities for mobile advertising. Many advertisers target location-based advertising using mobile phones: they push advertisements to mobile phones according to their locations. In such push-type advertising, the advertiser must know the user's current location, so delivery of the advertisement depends on whether the user is willing to reveal his current location. To overcome this privacy issue, we
propose a system which gives the user control over the selection of the advertisements for which he/she wishes to create a location-based reminder. To make advertisement and product promotion more effective, we developed a system to help mobile phone users remember an advertisement or product promotion easily. This is realized by providing the mobile phone user with a location-based reminder. Minimum effort is required from the user: he/she can simply capture an image of a printed advertisement using the built-in camera on the mobile phone, without going through the hassle of looking up maps for the destination. Once the image is captured, it is sent to a computer server, and the corresponding reminder is automatically downloaded to the sender's mobile phone. When the user approaches the location where the product promotion is ongoing, the reminder alerts the user to the promotion. In the system, the contents of the reminders are created on the server and sent to the user upon request. The server can hence keep track of all requests for reminders. This provides valuable information for the study of market trends and facilitates timely evaluation of advertising/promotional campaigns. It also enables advertisers to identify their target customers and formulate more effective marketing strategies. The contributions of this work are twofold. We developed a novel location-based reminder system on mobile phones using near-duplicate image recognition, and, with careful consideration of practical issues, we devised an accurate, efficient and robust solution for the system.
2
System Overview
To facilitate ease of input, we propose a new method for the automatic generation of reminders by snapping an image. The mobile phone user is only required to snap an image of a brand logo, product, or any other picture in newspapers, magazines, advertisement posters, product catalogs, etc. Using our application, a reminder is automatically downloaded to the mobile phone, and when the user passes by the locations of interest, he/she is reminded of the promotions. For example, Sandra has a credit card which currently offers a promotion at Swensen's Cafe Restaurant. She snaps an image of the advertisement, sends it to the server and receives the reminder on her phone. When she enters the green box shown on the map in Fig. 1, a reminder pops up with a navigation map showing the directions to the restaurant. When the server receives an image from the mobile client, it uses our image recognition engine to classify the image into one of the categories of the image database. If a match is found, GPS location information together with product promotion information related to the received image is sent to the mobile client in the form of a location-based reminder. To create the reminder, the geographical location of the product promotion venue and the promotion contents are required. The geographical location can be obtained from Google Maps or
Fig. 1. Location-based reminder on mobile phone
Fig. 2. System architecture
provided by the advertisers. This information is linked to the feature data extracted from the advertisement images. As shown in Fig. 2, there are 2 offline processes on the server side, one for reminder generation and the other for feature extraction on the query image. After these 2 processes, we obtain the reminders and image features and store them on the hard disk. When the mobile user sends an image to the server, the server extracts the image features, searches for the best match, and returns the corresponding location reminder. For the online process, the trained image features are organized in a kd-tree for fast search. The server then waits for query input. Once a query image is received, features are extracted and a nearest neighbor search is performed, followed by geometric validation to classify the input.
3
Identifying Advertisement Images
To know which reminder the user wants to download, on the server, our image recognition engine needs to identify the picture sent by the user. In the proposed system, image recognition is cast as a near-duplicate recognition problem. Image near-duplicate refers to a pair of images in which one is close to the exact duplicate of the other, but differs in appearance due to distortions contributed by different capturing devices (camera parameters, color resolution, etc), change in viewpoint or lighting conditions, occlusion, clutter or cropping. Fig. 3 illustrates some typical photometric and geometric distortions that a duplicate may suffer from.
Fig. 3. Effects of typical geometric and photometric distortions: (a) original, (b) scaling, (c) illumination change, (d) occlusion, (e) cropping, (f) background noise
Identifying near-duplicate images amid all geometric and photometric variations as described above is deemed difficult, especially when retrieval time is a concern for a practical application. Related works typically seek to identify features invariant to distortions/transformations. Based on PCA-SIFT, Ke et
al. [4] developed a point-based matching algorithm with an affine geometric verification to eliminate outliers. The algorithm was tested on images with simulated distortions, which are less realistic than the distortions experienced by real duplicate data. Chum et al. [5] addressed the large-scale near-duplicate identification problem by utilizing both global features and local SIFT descriptors. However, they used a bag-of-words approach on SIFT features without considering any geometric configuration of such features. For a practical advertisement image recognition system, handling a large amount of data adds to the challenge; large scale also implies more overhead in the matching process. In this paper, we propose a near-duplicate recognition system which seeks to strike a balance between recognition speed and accuracy, while meeting the scalability requirement.
3.1 Image Feature Extraction
Interest points are commonly employed in image recognition applications because they are relatively insensitive to changes in viewing conditions. There are three considerations in using interest points. First, the description should be distinctive enough to reliably differentiate one interest point from others. Second, the interest points should be invariant to transformations such as scaling, translation, in-plane rotation and moderate 3D rotation, as well as to intensity variations caused by changes in imaging conditions. Lastly, the matching between local descriptors should be computationally efficient; this is of great concern when building large-scale systems. Unfortunately, the high dimensionality needed for distinctive representations limits the matching efficiency and hence the scalability of the recognition system. In view of the above considerations, we adopt SURF [6] for interest point extraction and representation. SURF is a speeded-up variant of SIFT that makes use of Haar-like features and integral images for the detection of scale-space extrema. It extracts features from grayscale images and is thus invariant to the color and illumination changes caused by different cameras. It has good repeatability and is invariant to 2D geometric transformations and mild 3D rotations. The dimensionality of SURF features is 64.
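A minimal sketch of the SURF extraction step described above. It assumes opencv-contrib-python built with the (patented) SURF module; if that is unavailable, another detector such as ORB could stand in for illustration, and the random image is only a placeholder for a real advertisement photo.

```python
import numpy as np
import cv2

# synthetic stand-in for a 240x320 advertisement photo; a real image would be
# loaded with cv2.imread(path, cv2.IMREAD_GRAYSCALE) and resized
img = np.random.randint(0, 255, (320, 240), dtype=np.uint8)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # 64-D descriptors by default
keypoints, descriptors = surf.detectAndCompute(img, None)
print(len(keypoints), None if descriptors is None else descriptors.shape)
```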
3.2 Near-Duplicate Recognition
In the training phase, SURF features are extracted from the sample images and stored in a tree structure for later knn search. For the querying phase, we propose a two-stage matching scheme for recognizing advertisement images. In stage 1, interest points extracted from the query image are matched against those stored in the database by nearest neighbor search; voting is then performed to short-list the top matching candidates. In stage 2, a geometric validation based on a homography estimated from the point correspondences with the top matching classes is carried out to reduce false positives and to improve recognition accuracy.
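A hedged sketch of stage 1, using OpenCV's FLANN-based matcher with the kd-tree parameters described in the following paragraphs (4 randomized trees, 64 leaf checks). The Lowe-style ratio test is a common pruning step added here for robustness; it is not part of the paper's description, and the class labels and descriptor arrays are assumed to come from the offline training step.

```python
import numpy as np
import cv2
from collections import Counter

FLANN_INDEX_KDTREE = 1
matcher = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=4),
                                dict(checks=64))

def shortlist(query_desc, train_desc, class_ids, top_m=3, ratio=0.7):
    """Vote for candidate classes by approximate 2-NN matching of SURF vectors.
    class_ids[j] is the trained class of the j-th reference descriptor."""
    matches = matcher.knnMatch(np.float32(query_desc), np.float32(train_desc), k=2)
    votes = Counter()
    for pair in matches:
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < ratio * n.distance:        # keep only distinctive matches
            votes[class_ids[m.trainIdx]] += 1      # vote for the matched class
    return votes.most_common(top_m)                # top candidates for stage 2
```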
Matching by Fast Nearest Neighbor Search. Matching query features to the trained database is achieved by fast approximate nearest neighbor search using randomized kd-trees [2][3]. This is claimed to be a more efficient knn search algorithm that outperforms previously proposed methods, such as Locality-Sensitive Hashing (LSH), in handling high-dimensional spaces. Another concern in approximate nearest neighbor search lies in the distance measure between feature vectors: since LSH uses the L1 norm, it is less reliable than the L2 (Euclidean) norm. In the randomized kd-tree approach, the trees are built by choosing the split dimension randomly from the first D dimensions on which the data has the greatest variance. We used the FLANN library¹ to implement the fast knn search (the default value of D is 5 in the FLANN implementation). When searching the trees, a single priority queue is maintained across all the randomized trees so that the search can be ordered by increasing distance to each bin boundary. The degree of approximation is determined by examining a fixed number of leaf nodes, n, at which point the search is terminated and the best candidates returned. In our implementation, n is set to 64 and we split the data into 4 randomized kd-trees.

Geometric Validation. Ideally, the greater the number of matches found to a specific reference in the database, the more likely it is that the query image is a near-duplicate. However, false positives are still possible in the keypoint matching phase, as similar patterns can be present in different classes of advertisement images. Due to cropping or background clutter, the actual features detected may be outnumbered by other, non-duplicate features. Point-based matching followed by voting is not sufficient to discriminate one class of advertisement from the others. To resolve this ambiguity, we introduce a geometric constraint, the homography, to validate the top m matching candidates. A homography is the linear mapping which relates corresponding planar points in 2 views. Since printed advertisements are planar objects, there exists a homography matrix which relates 2 images of a planar scene. Under the homography constraint, a keypoint in the reference image, denoted by x_r = [X_r, Y_r, 1]^T in homogeneous coordinates, can be mapped to a point x_q = [X_q, Y_q, 1]^T in the query image by the transformation H:

x_q = H x_r    (1)
where x_r and x_q are the coordinates of keypoints in the reference image and the query image, respectively. H warps a point in the reference image onto the coordinate frame of the query image, and it can be computed from N ≥ 4 corresponding pairs. For a robust solution, we use the RANSAC algorithm to reject outlier matches and minimize the total transformation error between the intensities of the warped reference image and the query image. Given H, we can estimate the coordinates of the points in the query image corresponding to the keypoints in the reference image using Eq. (1), i.e., x_{q'} = H x_r. We evaluate the average matching error between the reference and the query image as follows:
¹ http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
e = \frac{1}{K} \sum_{i=1}^{K} \left| V(x_{r,i}) - V(x_{q',i}) \right|^2    (2)
where K denotes the number of validated correspondence pairs between the reference and the query images, and V(x) denotes the SURF vector at spatial coordinate x. Note that V(x_{q',i}) refers to the SURF features detected on the query image at the projected coordinate x_{q',i} given by the homography transformation. The SURF keypoints may not lie exactly at the estimated location, hence we check a 3 × 3 neighbourhood centred at x_{q',i} for the corresponding keypoints. We evaluate the matching error e for each of the top 3 candidates and take the minimum value, e_min, for decision making. If K ≤ 4, the candidate class is rejected. If e_min ≤ th, the corresponding class label is returned; otherwise, -1 is returned, indicating that the query does not belong to any class.
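A sketch of this stage-2 check under the assumptions above: fit H with RANSAC, reject a candidate with K ≤ 4 validated correspondences, and threshold the mean squared SURF difference of Eq. (2). The 3 × 3 neighbourhood search around the projected point is omitted here for brevity, so matched descriptor pairs are compared directly.

```python
import numpy as np
import cv2

def homography_error(ref_pts, qry_pts, ref_desc, qry_desc, th=0.10):
    """ref_pts/qry_pts: Nx2 arrays of matched keypoint coordinates;
    ref_desc/qry_desc: the corresponding SURF descriptors.
    Returns the error e if the candidate is accepted, otherwise None."""
    if len(ref_pts) < 4:
        return None                                    # cannot fit a homography
    H, mask = cv2.findHomography(np.float32(ref_pts), np.float32(qry_pts),
                                 cv2.RANSAC, 5.0)
    if H is None:
        return None
    inliers = mask.ravel().astype(bool)
    if int(inliers.sum()) <= 4:
        return None                                    # reject: K <= 4
    # mean squared descriptor difference over validated correspondences
    e = np.mean(np.sum((ref_desc[inliers] - qry_desc[inliers]) ** 2, axis=1))
    return e if e <= th else None
```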
4
Experimental Results
To test the validity of the proposed system, we conducted experiments using a dataset of 82 classes with 1 image per class for training and a query dataset of approximately 15 images per class for testing. The minimum error threshold is th = 0.10. To test the system's rejection of invalid queries, another "distractor" dataset containing 362 advertisement images which do not belong to any of the trained classes is used. The 2 datasets are tested separately. All advertisement images in the dataset are crawled from the web, and their printed hardcopies are photographed by 3 different mobile cameras to create the testing data. The training images are resized to a resolution of 240 × 320 and the testing images are acquired at the same resolution. We apply SURF to extract interest points for both training and testing images. The number of features detected on a 240 × 320 image ranges from 50 to over 100, depending on the complexity of the image content. To avoid training background noise as reference features, for each class we train only 1 sample, i.e., the resized advertisement image collected from the web. Feature points extracted from these clean images are stored in a kd-tree as discussed in Section 3.2. Furthermore, as the complexity of the knn search is proportional to the number of data points, training fewer samples helps to reduce the number of reference feature vectors in the kd-trees and hence improves the efficiency of the nearest neighbor search. Some samples of the training and test images are shown in Fig. 4 and Fig. 5. Apart from capturing the test data with different cameras, taking the images from various viewpoints and changing the lighting conditions, to stress-test the recognition system we have included some challenging testing samples in our dataset. These adverse samples are created by casting strong shadows on the printed advertisement or putting the advertisement against a cluttered background. Examples of such test cases are shown in Fig. 6.
Fig. 4. Samples of training images used in our system
Fig. 5. Samples of test images used in our system
Fig. 6. (a) and (b) Severe shadows cast on the advertisement; (c) and (d) Background clutter
To evaluate the performance of our system, we measure the recall and precision of the near-duplicate recognition. These measures are defined as follows:

\text{Recall} = \frac{\text{number of true positives}}{\text{total number of positives}}    (3)

\text{Precision} = \frac{\text{number of true positives}}{\text{total number of matches (correct or false)}}    (4)
To assess the effect of incorporating geometric validation in the matching process, we compare the performance of the system with and without homography validation (refer to Table 1 for the results). For the classification without homography validation, we threshold the median Euclidean distance between the corresponding SURF features. This experiment shows that adding a global geometric constraint helps to reduce false positives. Quantitatively, it improves precision and the true negative rate (for the "Distractor" dataset) by about 4% and 5%, respectively. Compared to the affine transformation used in [4], the homography transform used in our system is better at modeling changes in camera view direction and gives more accurate localization of correspondence points. For a total of 1153 query images over 82 classes, the average recognition time is 326.18 ms on a Core 2 Duo E8400 @ 3.0 GHz desktop with 3.25 GB of memory.

Table 1. Effects of incorporating homography validation

Measure                             Without H   With H
Precision                           92.4%       96.3%
Recall                              91.5%       95.9%
True Negative ("Distractor" only)   94.3%       99.8%
Fig. 7. ROC curves for cases with and without homography validation
Fig. 8. A plot of average recognition time vs. number of classes
It does not include the time taken to transmit the image data from the mobile phone to the server.
4.1 Scalability
Scalability is an important issue for practical recognition systems. To check the scalability of our system, we plot the average recognition time versus the number of classes trained (presented in Fig. 8). With no significant increase in running time, our system scales well with the increase in dataset size, suggesting the potential to handle datasets of larger scales. Our system gains leverage from
efficient interest point extraction and knn search. This is also due to the fact that we train only 1 sample from each class to minimize the overhead introduced by additional classes.
5
Conclusion and Future Work
In this paper, we have presented a system for mobile users to create location-based reminders by snapping an image of printed advertisement material. A near-duplicate image recognition technique was employed to recognize the query image and return a location-specific reminder. The recognition system is efficient and produces promising results. We have also shown that by incorporating geometric validation on the matched features, the recognition system is more robust against local noise and has better rejection of invalid data input. In the future, we will test our near-duplicate recognition on more extensive datasets involving more variants of mobile phone cameras and capturing conditions.
References
1. http://ludimate.com/products/geominder/
2. Silpa-Anan, C., Hartley, R.: Optimised KD-trees for fast image descriptor matching. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2009)
3. Muja, M., Lowe, D.: Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In: International Conference on Computer Vision Theory and Applications (VISAPP), pp. 331–340 (2009)
4. Ke, Y., Sukthankar, R., Huston, L.: Efficient near-duplicate detection and sub-image retrieval. In: ACM International Conference on Multimedia (MM), pp. 869–876 (2004)
5. Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable Near Identical Image and Shot Detection. In: ACM International Conference on Image and Video Retrieval, pp. 549–556 (2007)
6. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
Flow of Qi: System of Real-Time Multimedia Interactive Application of Calligraphy Controlled by Breathing
Kuang-I Chang¹, Mu-Yu Tsai¹, Yu-Jen Su¹, Jyun-Long Chen¹, and Shu-Min Wu²
¹ Center for Measurement Standards, Industrial Technology Research Institute, Hsinchu, Taiwan 300, R.O.C.
² Creativity Laboratory, Industrial Technology Research Institute, Hsinchu, Taiwan 300, R.O.C.
{AlanChang,tsaimuyu,yrsu,ChenJounLong,ShuMin_Wu}@itri.org.tw
Abstract. By using a novel UWB (ultra-wideband) sensor, we can acquire breathing signals from the human body. After real-time analysis, parameters such as the amplitude, frequency, and slope of the breathing waves are obtained continuously. By combining these parameters with the strokes of Chinese calligraphy (for example, amplitude determines the darkness of a stroke, while speed and slope together represent its rhythm), an interactive multimedia application of visual art related to famous Chinese calligraphy is developed, and a suitable algorithm for real-time interaction with the breathing signal is presented. Keywords: UWB, calligraphy, breathing, stroke, darkness, rhythm, visual art, real-time interaction, algorithm, multimedia.
1 Introduction
There are many methods to make multimedia processing more real-time [1][2]. However, a real-time interactive algorithm for sensing human breathing has not been presented so far. By considering not only the frequency of breathing but also the slope and amplitude of the breathing wave, measured by UWB (ultra-wideband) sensing technology, the status of human breathing can be clarified [3]. Different visual effects are interchanged continually according to the changing status of human breathing. UWB is a radio technology with an operating band wider than 0.5 GHz or higher than 20% of its central frequency [4]. A UWB sensor is able to measure very minute movements of the human chest without contacting the body and to retrieve physiological parameters such as breathing and heart beats. Contrary to conventional measurement methods, the UWB detector applies short-pulse electromagnetic waves with power lower than continuous waves and only 0.01% of the power of GSM mobile phones. Without physical contact, the UWB sensor is more convenient and comfortable for users who need long-term monitoring. Due to its high penetration efficiency, UWB sensors can be installed or hidden in the ceiling or in furniture such as a chair. Similar technology can be applied to physiological monitoring in healthcare institutes, intrusion security, the cultural and creative industries, entertainment, anti-collision crushproof systems and positioning systems as well.
Here we present a real-time interactive multimedia art exhibition. Two participants sit facing a field of white sand on the floor. UWB sensors hidden in their chairs continuously record the slightest variations in their breathing (Fig. 1). From an interaction point of view, we choose breathing, which can easily be controlled by the human body, rather than heart beats as the monitored parameter for this interactive art piece. After a brief series of introductory slides explaining the procedure and context, the Chinese calligraphy begins to be projected onto the sand field. The breathing of the two viewers influences the calligraphy through the transmission of the UWB sensor data to a PC with customized software. One person affects the fluidity of the strokes (the faster the breathing, the faster the strokes), while the other affects the density (deeper breathing results in darker calligraphy). Participants can adjust their breathing to feel the rhythm associated with each stroke, and thus engage in a unique dialogue with famous Chinese calligraphers.
Fig. 1. Two participants are sitting and breathing to experience the interactive visual art of Chinese calligraphy which is projected upon the sands on the floor
2 The Measuring Method and Real-Time Signal Processing Algorithm
2.1 The Measuring Method
The UWB sensor operates on the principle of the Doppler effect, the change in frequency of a wave for an observer moving relative to the source of the wave [5]; the received frequency is higher during the approach. Within a range of about 2 meters, the measurement efficiency is about 99%. As UWB does not have a specific main frequency, it is not easily affected by other wireless radio frequencies. Therefore, UWB sensors can measure physiological signals of the human body without being strongly affected by noise (Fig. 2).
Fig. 2. Structures and operating units of a UWB sensor
Fig. 3. UWB sensor is hidden in the chair to monitor participants’ breathing
In order to detect the participants' breathing unobtrusively, two UWB sensors are hidden in the two chairs (Fig. 3), and the breathing signals are precisely transformed into the movement of ink in the calligraphy. This enables participants to
use their breathing to interact with treasured works of calligraphy from the collection of the National Palace Museum in Taiwan. The UWB sensors used in this art piece measure and calculate the depth and speed of the two participants' breathing before converting and projecting the parameters onto the calligraphic works. One participant controls the ink by breathing deeper or shallower; meanwhile, the other controls the speed of writing by breathing faster or slower. By applying UWB technology, this art piece can analyze not only the rhythm of breathing but also its depth while monitoring the change in chest movement caused by breathing, and it feeds visual effects back to the audience by detecting and calculating the physiological parameters of breathing.

2.2 Real-Time Signal Processing Algorithm
Generally speaking, breathing can be controlled by our brain; in other words, we can control it on purpose. However, signal processing for breathing cannot easily support real-time interaction when compared with the rapid nervous system. The resulting delay usually makes the audience feel that there is no real-time response at all. For example, in our case the writing strokes of the calligraphy are controlled by breathing: when an audience member takes a deep breath, he/she expects to see the response immediately, but the UWB sensor takes 3-4 seconds to distinguish whether it is a breath or not, because the respiration rate of adults is about 15 breaths per minute [6], which means every single respiration cycle takes almost 4 seconds. Therefore, the audience will feel that the system is seriously delayed when taking a deep breath while the system is still collecting the signal. In addition, when an audience member holds his/her breath, our system needs a period of time to tell whether it is a breath hold or just an extremely slow breath, which also results in serious delay. Accordingly, we developed a novel algorithm to reduce the feeling of delay. At the same time, the visual effect is intended to reflect the result of writing by breathing instead of by the Chinese writing brush. The algorithm is as follows.

a. The Beginning. We collect the first 30 seconds as a personal database during the brief series of introductory slides, to evaluate the average amplitude of each audience member and thus avoid personal differences in breathing amplitude. We sort the samples by amplitude and regard the one-fourth (lower quartile) amplitude as the minimum limit, mA, and the three-fourth (upper quartile) amplitude as the maximum limit, MA.
b. The Amplitude. We calculate the amplitude over every 3-second period. If the amplitude is larger than MA, we present the darkest ink; if it is smaller than mA, we present the shallowest ink. If the amplitude is between MA and mA, the shade of ink is distributed proportionally.

c. The Speed. The speed of breathing is determined by two parameters: the instantaneous slope of the breath, S, and the breathing rate, F. As long as F is
smaller than 15, which is the average breathing rate of adults, F is defined to be 0. In this case the speed of breathing is set to S only; otherwise, the speed of breathing is set to S + 2F in order to show a dramatic difference in the visual effect. Under this scheme, the audience feels no delay, especially when slowing down their breathing. Moreover, the audience can easily reach the highest speed value without breathing very quickly, which prevents uncomfortable or faint feelings. Unlike the amplitude limits, we define the maximum limit MS and the minimum limit mS from experience, because the average speed of breathing is well known. If the speed of breathing is greater than MS, the calligraphy shows the dry, brush-past effect of the ink; if it is smaller than mS, the visual effect becomes a spreading of ink, which corresponds to what happens when a calligrapher holds the brush at one position. If the speed of the breath is between MS and mS, the writing speed is distributed proportionally.
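The calibration and mapping rules above can be summarized in a short sketch. The function names, the clamping at the limits, and the use of numpy quartiles are illustrative assumptions consistent with the description rather than the exhibited system's code.

```python
import numpy as np

def calibrate(amplitudes):
    """Per-participant limits from the first 30 s: lower and upper quartiles."""
    mA, MA = np.percentile(amplitudes, [25, 75])
    return mA, MA

def ink_darkness(amp, mA, MA):
    """Map the breathing amplitude (per 3 s window) to ink darkness in [0, 1]."""
    if amp >= MA:
        return 1.0                        # darkest ink
    if amp <= mA:
        return 0.0                        # shallowest ink
    return (amp - mA) / (MA - mA)         # proportional in between

def stroke_speed(slope, rate, mS, MS):
    """Combine the instantaneous slope S and breathing rate F into a stroke speed."""
    F = rate if rate > 15 else 0          # 15 breaths/min: adult average
    speed = slope + 2 * F if F else slope
    # below mS the ink spreads; above MS the dry brush-past effect is shown
    return max(mS, min(MS, speed))
```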
Fig. 4. Example of a breathing signal for dark, spreading ink
d. Examples. If the audience member takes a deep and slow breath, the visual effect shows dark, spreading ink (Fig. 4). If the audience member takes a shallow and slow breath, the visual effect shows dilute, spreading ink (Fig. 5). If the audience member takes a deep and rapid breath, the visual effect shows dark, brush-past ink (Fig. 6). If the audience member takes a shallow and rapid breath, the visual effect shows dilute, brush-past ink (Fig. 7). Otherwise, the visual effects of the calligraphy (dark or dilute; spreading or brush-past) are distributed proportionally (Fig. 8). The flow chart of the proposed algorithm for real-time handling of changes in breathing status is shown in Fig. 9.
Fig. 5. Example of a breathing signal for dilute, spreading ink
Fig. 6. Example of a breathing signal for dark, brush-past ink
Fig. 7. Example of a breathing signal for dilute, brush-past ink
Fig. 8. Examples of visual effects of Chinese calligraphy
Fig. 9. Flow chart of the algorithm
2.3 Relation between Breathing and Calligraphy
In order to clarify the status of breathing while writing calligraphy, we invited three people with different levels of calligraphy experience to take part in an experiment. We recorded their breathing signals while they were writing calligraphy; for comparison, their breathing signals in a relaxed state were also recorded. All breathing signals were converted to the frequency domain using the FFT. Fig. 10 shows the breathing signal of a professional calligrapher. When he is relaxed, the breathing signal is disordered: if we set 20% of the highest peak value as a threshold, there are 17 peaks above this threshold. When he is concentrating on writing, there are only 8 peaks, even though he is writing in a difficult style.
Fig. 10. FFT of the professional calligrapher's breathing signals (relaxed: n = 17 peaks above the 20% threshold; concentrated writing: n = 8)
Then we asked an amateur calligrapher to do the same test. Fig. 11 shows the signals we recorded. There are 13 peaks above the threshold in the relaxed state and only 3 peaks during concentrated calligraphy writing.
Fig. 11. FFT of the amateur calligrapher's breathing signals (relaxed: n = 13 peaks above the 20% threshold; concentrated writing: n = 3)
The same result also occurred in the beginner's test: there are 24 peaks above the threshold in the relaxed state and 15 peaks during concentrated calligraphy writing (Fig. 12). According to these results, when a person focuses on writing calligraphy, the
breathing signal in the frequency domain becomes ordered. If we compare the highest peak values, the professional calligrapher's is the greatest and the beginner's is the weakest. This means that the professional calligrapher can keep his breathing in a steady state to produce better calligraphy. The beginner tried to make his breathing regular, but without much experience this is not easy to achieve. Therefore, we designed the algorithm so that the steadier the participant's breathing, the better the visual effect of the calligraphy.
Fig. 12. FFT signals of beginner’s breathing
3 Conclusion
We present a novel algorithm and sensors that are easy to use and well suited for demonstrations and for real-time interactive multimedia applications. Using UWB sensors, the audience can interact with multimedia art such as the visual effects of Chinese calligraphy described above. Thanks to our breathing-evaluation algorithm, the audience does not feel any delay while changing their breathing status. Because each breathing cycle usually takes more than 4 seconds to distinguish, without a proper real-time algorithm such as the one proposed here it is hard to provide a smooth experience in which the audience can anticipate what will happen in the next moment.
Measuring Bitrate and Quality Trade-Off in a Fast Region-of-Interest Based Video Coding
Salahuddin Azad, Wei Song, and Dian Tjondronegoro
Faculty of Science and Technology, Queensland University of Technology, Brisbane 4001, Australia
[email protected], {w1.song,dian}@qut.edu.au
Abstract. Prevailing video adaptation solutions change the quality of the video uniformly throughout the whole frame in the bitrate adjustment process, while region-of-interest (ROI)-based solutions selectively retain the quality in the areas of the frame to which viewers are more likely to pay attention. ROI-based coding can improve perceptual quality and viewer satisfaction while trading off some bandwidth. However, there has been no comprehensive study to measure the bitrate vs. perceptual quality trade-off so far. The paper proposes an ROI detection scheme for videos, characterized by low computational complexity and robustness, and measures the bitrate vs. quality trade-off for ROI-based encoding using a state-of-the-art H.264/AVC encoder to justify the viability of this type of encoding method. The results from the subjective quality test reveal that ROI-based encoding achieves a significant perceptual quality improvement over encoding with uniform quality at the cost of slightly more bits. Based on the bitrate measurements and subjective quality assessments, bitrate and perceptual quality estimation models for non-scalable ROI-based video coding (AVC) are developed, which are found to be similar to the models for scalable video coding (SVC). Keywords: Bitrate modeling, Quality modeling, Region-of-Interest, H.264.
1 Introduction
Adaptive video streaming adjusts the bitrate of the video stream and the perceptual quality to meet the current network bandwidth constraint. Existing adaptation solutions affect the quality of the video equally throughout the whole frame in the bitrate adjustment process. However, it has been found that there are certain regions in the video frame on which viewers concentrate more than on other regions [10]. This is due to the highly non-uniform distribution of photoreceptors on the retina of the human eye. In the retina, only a small region of 2-5 degrees of visual angle (the fovea) around the center of gaze is captured at high resolution, with a logarithmic resolution falloff with distance from the center [4]. Thus, it may not be useful to encode each video frame with uniform
quality, since human observers will crisply perceive only a very small fraction of each frame, depending on their current point of fixation. Region-of-interest (ROI)-based video coding can improve the apparent perceptual quality of videos by selectively retaining the quality in the areas to which viewers are more likely to pay attention [5]. During the last three decades, a great number of object detection and visual attention models have been developed. These techniques consider human visual features and can be used to detect regions of interest. For example, motion vector-based object detection [2] is a fast and robust approach and can be implemented in near real time in the compressed domain; however, it is prone to global motion. A segmentation and region-growing method [7] is based on color-texture features and tracks the segmented objects. The skin-color and face detection methods [15][16] detect the important area of the human face by extracting skin-color pixels. The saliency map method [11] integrates different visual features (color, orientation, movement, etc.) into one single topographic saliency map and selects a spotlight of attention. A recent method combines various features, such as color contrast, object motion and face detection, to determine the ROIs [1]. However, these solutions are computationally expensive and time consuming, which makes them unsuitable for real-time processing of large numbers of videos. Although it is expected that higher ROI detection accuracy can be achieved by using more features of the video content, it is highly difficult to create a proper detection algorithm with various features that suits different videos. In fact, according to our latest study on the evaluation of objective ROI detection methods with subjective assessment, using the motion feature alone gains similar accuracy for most types of video content. The usual method for ROI-based video coding is to use a lower quantization parameter (QP) for the macroblocks within the ROI and a higher QP for the macroblocks outside the ROI, so that it can achieve higher perceptual quality than uniform quality encoding. As a trade-off, ROI-based encoding consumes more bandwidth than uniform encoding. However, ROI-based encoding still takes significantly less bandwidth than the maximum quality video, as it degrades quality in most of the areas. In a nutshell, ROI-based coding should be the winner on both sides: retaining quality and saving bitrate. To the best of the authors' knowledge, there has been no comprehensive study so far to measure the cost-performance trade-off for ROI-based encoding. A recent objective assessment [6] of ROI-based encoding has analyzed the impact on the human visual system of two QP degradation methods: linear quality distance adaptation and logarithmic quality distance adaptation. The former method degrades the quality of each macroblock linearly with distance from the area of maximum user interest (MAUI), while the latter method degrades the quality of each macroblock logarithmically with distance from the MAUI. The experimental results suggest that if the viewer is highly interested in certain areas and has very little interest in other areas, the linear method performs better. However, if the user has relatively balanced interest in various areas of the image with an obvious peak in the MAUI, the logarithmic method performs
better. Nevertheless, both schemes have the disadvantage of providing very poor local quality at the points furthest from the MAUI. This is the reason why this paper considers only two distinct qualities, for the ROI and non-ROI areas. The bitrate and perceptual quality modeling of video bitstreams allows a video adaptation system to estimate the bitrate and perceptual quality of degraded bitstreams without actually extracting them, which is particularly useful when the adaptation decision is taken at the client side. The reduction in bitrate for scalable video bitstreams due to adjustment of the encoding parameters (frame rate, quantization parameter and spatial resolution) has been modeled in [3] and [18]. The modeling of bitrate in terms of QP for non-scalable MPEG video bitstreams was developed in [8]. The modeling of bitrate for ROI-based encoding has not yet been studied, although it is necessary for designing an ROI-based video adaptation system. The perceptual quality modeling of scalable video bitstreams was developed in [17] and [14]. Both the bitrate and perceptual quality models contain parameters that are somehow dependent on the actual contents of the videos. The exact relationship between these parameters and the video contents is hard to determine accurately and is still a challenging research topic. The main objective of this paper is to investigate the cost-performance benefit of ROI-based quality adjustment with non-scalable H.264/MPEG-4 AVC encoding [17]. The paper uses a motion-based ROI detection scheme to automatically identify the ROIs in the test videos. A number of standard and non-standard videos were encoded with different quantization parameters, with both uniform quality and ROI-based quality. The analysis shows that the additional bitrate for enhancing the quality of the ROI is on average 10% of the original bitrate without the ROI. A subjective quality test was also conducted on the encoded videos to assess the perceptual quality gain compared to uniform quality encoding. The test results show that the quality improvement of ROI-based coding over uniform encoding is on average 10%. Besides drawing the above conclusions, the paper develops bitrate and quality models for non-scalable ROI-based encoding that are found to be similar to the models for scalable videos [18]. The rest of the paper is organized as follows. Section 2 addresses the motion-based ROI detection scheme and Section 3 describes the experimental setup for bitrate and quality measurement for ROI-based encoding of videos. The analysis of the bitrate and perceptual quality of ROI-based encoding and the comparison with uniform quality encoding are presented in Sections 4 and 5 respectively. Section 6 provides further discussion. Finally, Section 7 concludes the paper.
2 Motion-Based ROI Detection
The motion-based ROI detection scheme used in this paper to produce the ROI-based encoding operates in the YUV (YCbCr) color space. The technique is based on the fact that the most important objects (e.g., the anchor in news videos) are most likely to remain around the central region of the frame throughout an entire scene. The approach divides each video into a number of shots based on the number of pixels changed between successive frames. In the next step, the
statistics for the likelihood of luma changes throughout all frames in an entire scene are calculated, and the position of the fixed-size rectangular region where change is most likely is determined. The center of that region is considered as the center of the ROI. The proposed technique consists of the following steps:
1. Scene Change Detection. A scene change is simply detected when the number of changed luminous pixels in a frame exceeds a threshold value. The threshold value is defined to set the minimum distance between two successive scenes.
2. Maximum Change Location Detection. The next step finds the location of the rectangular region where the maximum number of pixel changes occurs in each frame. The operation traverses the whole frame with a certain step size in both the horizontal and vertical directions. Using the statistics of all frames throughout a given scene, a histogram is created which records the frequency of occurrences of the maximum number of pixel changes within each rectangular area. This step is executed on each frame of the video.
3. Forming Clusters of Rectangular Areas. This step clusters the rectangular areas found in the previous step. A larger rectangular area represents each cluster, which contains a central rectangular area and all of its surrounding and overlapping rectangular areas. The cluster centers are separated by a distance Δ. The value of Δ is chosen such that clusters formed this way overlap with their neighboring clusters; the smaller the value of Δ, the better the prediction of the center of the ROI. A second-level histogram is then constructed that records the total frequency of occurrences of the maximum number of pixel changes within each particular cluster throughout an entire scene. This step is carried out once per scene.
4. Finding the Center of the ROI. The center of the ROI is the center of the cluster with the maximum frequency value. Given the center of the ROI, it is possible to choose an arbitrarily sized and shaped ROI around that center.
The outputs of the proposed ROI detection algorithm on four of the news videos ("GMnews", "Disease", "AFL", "Tennis") are shown in Fig. 1. In the experiment, the ROI is a rectangular area whose width and height are half of those of the original frame. Unlike the existing motion-based algorithm [2] for detecting ROIs in video frames, the novelty of the proposed scheme is that it can detect the degree of importance of the objects using the extent and clustering of luma changes.
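A simplified sketch of these steps is given below. It is an illustration rather than the authors' implementation: scene-change detection is assumed to have been done already (the function operates on the frames of a single scene), and the window size, step size, change threshold, and cluster spacing are all assumed values.

```python
import numpy as np
from collections import Counter

def roi_center(luma_frames, win=64, step=16, change_thr=30, cluster=32):
    """Estimate the ROI centre for the luma frames of one scene."""
    votes = Counter()
    prev = luma_frames[0].astype(np.int16)
    for frame in luma_frames[1:]:
        cur = frame.astype(np.int16)
        changed = (np.abs(cur - prev) > change_thr).astype(np.int32)
        prev = cur
        best, best_pos = -1, (0, 0)
        # step 2: sliding-window search for the region with the most changed pixels
        for y in range(0, changed.shape[0] - win + 1, step):
            for x in range(0, changed.shape[1] - win + 1, step):
                n = int(changed[y:y + win, x:x + win].sum())
                if n > best:
                    best, best_pos = n, (y, x)
        # step 3: quantise window positions so nearby windows fall into the same cluster
        votes[(best_pos[0] // cluster, best_pos[1] // cluster)] += 1
    # step 4: the most frequently voted cluster gives an approximate ROI centre
    cy, cx = max(votes, key=votes.get)
    return cy * cluster + win // 2, cx * cluster + win // 2
```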
3 Experiment Setup
The experiment used eight standard videos (“city”, “crew”, “football”, “foreman”, “harbour”, “mobile”, “news”, “soccer”) with CIF resolution and six news videos (“AFL”, “CQfire”, “Disease”, “GMnews”, “Spiderman”, “Tennis”) with 480×360 resolution. Each video was first encoded with uniform quality with four different QP values 28, 32, 36, 40 which correspond to quantization stepsize values
Fig. 1. The ROIs (inside the red color rectangles) detected by the proposed scheme in (a)“GMnews” (b)“Disease” (c)“AFL” and (d)“Tennis” news videos
q = 16, 26, 40, 64 respectively. Later on, the same videos were encoded with ROI-based quality with the same series of QP values for the areas outside the ROI, but keeping the QP within the ROI fixed at 28 (corresponding to qmin = 16). The encoding of the videos was done with a custom-modified x264 encoder, which can encode a frame with uniform quality or with different quality for the ROI and non-ROI areas (x264 is a state-of-the-art free encoder library which produces video bitstreams in H.264/MPEG-4 AVC format [17]). The ROI for each video is a rectangular area having height and width equal to half of the height and width of the original frame respectively. The area of the ROI is thus a quarter of the area of the original frame, and according to the authors' experiments, this size is reasonably optimal. This is because for too large an ROI the bitrate will be higher, whereas too small an ROI may not fully cover the objects of interest and, as a result, the perceptual quality improvement may not be noticeable by the users. For the subjective quality test, a subset of the aforementioned standard videos ("city", "crew", "football", "foreman", "harbour", "mobile", "news", "soccer") and news videos ("AFL", "Disease", "GMnews", "Tennis") was used as test material. The subjective test was performed in a laboratory, a soundproof meeting room with controlled lighting conditions according to the ITU's recommendations [13]. The test device was a SAMSUNG R700 laptop with a 17-inch TFT LCD monitor, whose display resolution was set to 1280×768. A total of 20 viewers took part in the subjective perceptual video quality assessment. Among the participants, there were 12 males and 8 females, 10 of them with image/video processing experience. The age of the participants ranged from 22 to 36 and all
of them reported having normal vision. Each test content was presented under the following conditions: an explicit reference and a hidden reference (high uniform quality with QP=28) and six impaired test sequences (uniform quality and ROI-based quality at QP=32, 36, 40). After watching each test sequence, the subject used an 11-point (0-10) slider to rate the quality of the watched video relative to the reference video. After receiving all scores from the subjects, the normality of the score distribution was examined by a one-sample Kolmogorov-Smirnov (K-S) test (p > .05) [9]. The mean opinion score (MOS) for each sequence was calculated by averaging all scores for that sequence.
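The score processing just described can be sketched as follows; the score vector is purely illustrative, and testing against a normal distribution parameterized by the sample mean and standard deviation is our assumption about how the K-S test was applied.

```python
import numpy as np
from scipy import stats

# Illustrative scores from 20 viewers for one test sequence (0-10 scale)
scores = np.array([7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 7, 8, 6, 9, 7, 8, 7, 6, 8, 7], dtype=float)

# One-sample K-S test against a normal distribution fitted to the sample
stat, p_value = stats.kstest(scores, 'norm', args=(scores.mean(), scores.std(ddof=1)))
normally_distributed = p_value > 0.05           # criterion used in the paper (p > .05)

mos = scores.mean()                             # mean opinion score for this sequence
```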
4 Bitrate Modeling
Fig. 2a and 2b show the normalized bitrates due to the increment in quantization stepsize using uniform quality encoding for standard and news videos respectively, while Fig. 3a and 3b show the same using ROI-based encoding. The normalized bitrate means the ratio of the actual bitrate of a given bitstream to the bitrate of the maximum quality (corresponding to qmin = 16) bitstream.
Fig. 2. Normalized bitrate for (a) standard and (b) news videos for different quantization stepsizes compressed with uniform quality encoding
Fig. 3. Normalized bitrate for (a) standard and (b) news videos for different quantization stepsizes compressed with ROI-based encoding when qmin = 16 inside ROI
Fig. 4. Pair-wise normalized bitrate differences under given q values between ROI-based encodings with qmin = 16 inside ROI and the uniform quality encoding for (a) standard and (b) news videos
Fig. 5. Pair-wise normalized bitrate differences between the maximum quality encoding and ROI-based encodings with qmin = 16 inside ROI and given q values outside ROI for (a) standard and (b) news videos
Based on the normalized bitrate curves for the uniform quality encoding, the normalized bitrate can be modeled as an inverse power function, i.e.,

R(q) = (q / q_min)^(-a),  a > 1,    (1)
considering the minimum quantization stepsize as q_min = 16. From the curve-fitting data, it can be observed that the value of a varies slightly for different standard videos, while the value of a is uniform for the news videos. The reason behind the deviation is that the range of quantization parameters chosen was small compared to [19]. The average value of a is found to be approximately 1.2, which is the same as the value reported in [19], despite the fact that the latter model was proposed for scalable videos. For ROI-based encoding, the quantization stepsize is fixed at q_min = 16 within the ROI, which is one quarter of the frame. Hence the non-ROI area, which is the remaining three quarters of the frame, accounts for the bitrate reduction. Therefore, the normalized bitrate for ROI-based encoding can be modeled as
R_c(q) = α + (1 − α)(q / q_min)^(-a),  a > 1.    (2)
Given the ROI size considered in the paper (1/4 of the whole frame), Fig. 3a and 3b confirm that the bitrate curves for the news videos are consistent with the model in formula (2), while the curves for the standard videos deviate slightly from the model. Specifically, the videos "city", "foreman", "harbour", and "mobile" achieve better compression than the model in (2), mainly due to their uniform motion. The reason why the news videos yield more consistent curves than the standard videos is that each standard video contains only a single shot and the contents vary largely from one video to another; in contrast, each news video contains multiple and diverse shots and its length is twice that of a standard video. Fig. 4a and 4b show the differences between the normalized bitrates of the uniform quality and ROI-based encodings for standard and news videos respectively. Since q_min = 16 is used inside the ROI, the difference between the two methods is zero when q = 16. In the other cases, the additional bitrate of ROI-based encoding over uniform quality encoding ranges from 6% to 23%, with a median of 10%, for the news videos. On the other hand, Fig. 5a and 5b show the differences between the normalized bitrates of the maximum quality encoding (using q = 16 for the whole frame) and the ROI-based encodings for standard and news videos respectively. From these figures, the bitrates of the ROI-based encodings are far less than the bitrate of the maximum quality encoding, with the bitrate differences ranging from 40% to 80% and a median of 60%.
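The sketch below evaluates the two bitrate models at the quantization stepsizes used in the experiment. The QP-to-stepsize conversion is the usual H.264 approximation, and taking α equal to the ROI's share of the frame (0.25) and a = 1.2 follows the values quoted in the text; these choices are illustrative rather than a reproduction of the authors' fitting.

```python
Q_MIN, A, ALPHA = 16.0, 1.2, 0.25

def qstep(qp):
    """Approximate H.264 quantization stepsize: it roughly doubles every 6 QP."""
    return 2.0 ** ((qp - 4) / 6.0)             # QP 28, 32, 36, 40 -> about 16, 26, 40, 64

def r_uniform(q):
    """Eq. (1): normalized bitrate for uniform quality encoding."""
    return (q / Q_MIN) ** (-A)

def r_roi(q):
    """Eq. (2): normalized bitrate for ROI-based encoding (q_min inside the ROI)."""
    return ALPHA + (1.0 - ALPHA) * (q / Q_MIN) ** (-A)

for qp in (28, 32, 36, 40):
    q = qstep(qp)
    print(f"QP={qp}  q~{q:4.1f}  uniform={r_uniform(q):.3f}  ROI-based={r_roi(q):.3f}")
```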
5 Quality Modeling
Fig. 6a and 6b show the normalized MOS as the quantization stepsize increases for uniform quality encoding of the standard and news videos respectively, and Fig. 7a and 7b show the same for ROI-based encoding. The normalized MOS is the ratio of the actual MOS for a given bitstream to the MOS of the maximum quality (corresponding to q_min = 16) bitstream. The normalized MOS curves for the standard videos are not very consistent, while the curves for the news videos are. This is because the response of viewers to changes in quality largely depends on the content of the video. Since the content of a standard video remains almost the same throughout the playback period and the contents differ widely from each other, the user's perception of quality also varies accordingly. Based on the normalized MOS curves in Fig. 6a and 6b, the normalized perceptual quality can be modeled as a falling exponential function, i.e.,

Q(q) = e^(-c·q/q_min) / e^(-c),  c < 1,    (3)
where the minimum quantization stepsize is q_min = 16. Based on the curve-fitting data, the approximate value of c is found to be 0.35, which is larger than the value reported in [19]. Although this is a negative exponential function, the small value of c makes the drop in normalized MOS much slower than the drop in
Fig. 6. Normalized perceptual quality for (a) standard and (b) news videos for different quantization stepsizes compressed with uniform quality encoding
Fig. 7. Normalized perceptual quality for (a) standard and (b) news videos for different quantization stepsizes compressed with ROI-based encoding when qmin = 16 inside ROI
normalized bitrate. If the normalized MOS is assumed to follow the same trend as the normalized bitrate for ROI-based encoding, the normalized perceptual quality for ROI-encoded videos can be expressed as

Q_c(q) = α + (1 − α) e^(-c·q/q_min) / e^(-c),  c < 1.    (4)
Fig. 7b confirms that the normalized perceptual quality for the ROI-based encodings is generally very close to Q_c(q). Fig. 8a and 8b show the differences between the normalized MOSs of the uniform quality and ROI-based encodings for standard and news videos respectively. According to these figures, the quality improvement of ROI-based encoding over uniform encoding ranges from 5% to 20%, with a median of 10%. On the other hand, Fig. 9a and 9b show the differences between the normalized MOSs of the maximum quality encodings and the ROI-based encodings for standard and news videos respectively. According to these figures, the quality difference between ROI-based coding and the maximum quality video ranges from 10% to 70%, with a median of 40%.
Fig. 8. Pair-wise normalized MOS differences under given q values between ROI-based encodings with qmin = 16 inside ROI and the uniform quality encoding for (a) standard and (b) news videos
Fig. 9. Pair-wise normalized MOS differences between the maximum quality encoding and ROI-based encodings with qmin = 16 inside ROI and given q values outside ROI for (a) standard and (b) news videos
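To illustrate the trade-off discussed in the next section, the sketch below evaluates the quality models of Eqs. (3) and (4) side by side with the ROI bitrate model of Eq. (2), using c = 0.35 from the curve fitting above and the same assumed α = 0.25 and a = 1.2 as before.

```python
import math

Q_MIN, C, ALPHA, A = 16.0, 0.35, 0.25, 1.2

def q_uniform(q):
    """Eq. (3): normalized MOS for uniform quality encoding."""
    return math.exp(-C * q / Q_MIN) / math.exp(-C)

def q_roi(q):
    """Eq. (4): normalized MOS for ROI-based encoding."""
    return ALPHA + (1.0 - ALPHA) * q_uniform(q)

def r_roi(q):
    """Eq. (2): normalized bitrate for ROI-based encoding, repeated for comparison."""
    return ALPHA + (1.0 - ALPHA) * (q / Q_MIN) ** (-A)

for q in (16, 26, 40, 64):
    print(f"q={q:2d}  MOS(uniform)={q_uniform(q):.3f}  MOS(ROI)={q_roi(q):.3f}  bitrate(ROI)={r_roi(q):.3f}")
```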
6 Discussion
Careful observation reveals that the lowest normalized bitrate achieved in the experiment is around 20% of the maximum bitrate, whereas the lowest perceptual quality is 40% of the maximum quality. Comparing the bitrate overhead and quality improvement between ROI-based encoding and uniform quality encoding, it can be concluded that the median of 10% for quality improvement significantly outweighs the median of 10% bitrate overhead as the normalized quality curve drops much slower than the bitrate curve. Also, the small bitrate overhead for ROI means that it can be a useful tool for fine-tuning bitrates. In spite of the general benefit from ROI-based video coding, it should be noted that it does not contribute to all content types. From Fig. 8a and 8b, it can be observed that ROI-based encoding cannot guarantee quality improvement for shots (“AFL”, “football”, “city” and “mobile”) with global motions and/or too many objects. This is ascribed to two main reasons: content feature and ROI detection. On the one hand, for videos with high textual complexity and slow global motion (e.g., “mobile” and “city”), their perceptual quality is not very
impacted by the quantization level (see the slowly declining curves in Fig. 6a and Fig. 7a). On the other hand, the automatic ROI detection scheme with a fixed ROI size used in this paper may not work perfectly for videos with many objects (e.g., "football" and "AFL") and/or global motion. Therefore, we argue that for this type of shot, uniform encoding rather than ROI-based encoding is the better choice. Future ROI detection schemes should be able to take care of this kind of shot.
7 Conclusion
The paper measures the bitrate vs. perceptual quality trade-off for non-scalable ROI-based encoding using an H.264 encoder. Based on the quantitative measurements, a bitrate model and a perceptual quality model are developed to predict the bitrate and perceptual quality of ROI-based encoding, so that the adaptation decision can be easily made at the client side. The experiment shows that ROI-based encoding gains more in perceptual quality than it trades off in bandwidth. Moreover, it is observed that the quality gain is influenced by the content features, the quantization stepsize, and the efficacy of the ROI detection. The development of more effective ROI detection schemes will improve the perceptual quality even further. In addition, since only one ROI of fixed size was used in this paper, the impact of multiple regions of interest and different ROI sizes should be addressed in future work.
References
1. Abdollahian, G., Taskiran, C.M., Pizlo, Z., Delp, E.J.: Camera Motion-based Analysis of User Generated Video. IEEE Trans. on Multimedia 12(1), 28–41 (2010)
2. Ahmad, A.M.A.: Content-based Video Streaming Approaches and Challenges. In: Ibrahim, I.K. (ed.) Handbook of Research on Mobile Multimedia, pp. 357–367. Idea Group Reference, London (2006)
3. Azad, S., Song, W., Tjondronegoro, D.: Bitrate Modeling of Scalable Videos Using Quantization Parameter, Frame Rate and Spatial Resolution. In: Proc. of ICASSP 2010, pp. 2334–2337. IEEE Press, Los Alamitos (2010)
4. Wandell, B.A.: Foundations of Vision. Sinauer, Sunderland (1995)
5. Chi, M., Chen, M., Yeh, C., Jhu, J.: Region-of-Interest Video Coding Based on Rate and Distortion Variations for H.263+. Image Commun. 23(2), 127–142 (2008)
6. Ciubotaro, B., Muntean, G.-M., Ghinea, G.: Objective Assessment of Region of Interest-aware Adaptive Multimedia Streaming Quality. IEEE Trans. on Broadcasting 55(2), 202–212 (2009)
7. Deng, Y., Manjunath, B.S.: Unsupervised Segmentation of Color-Texture Regions in Images and Video. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(8), 800–810 (2001)
8. Ding, W., Lu, B.: Rate Control of MPEG Video Coding and Recording by Rate Quantization Modeling. IEEE Trans. on Circuits and Sys. for Video Technology 6, 12–20 (1996)
9. Eadie, W.T., Drijard, D., James, F.E., Roos, M., Sadoulet, B.: Statistical Methods in Experimental Physics, pp. 269–271. North-Holland, Amsterdam (1971)
10. Gulliver, S., Ghinea, G.: Stars in Their Eyes: What Eye Tracking Reveals about Multimedia Perceptual Quality. IEEE Trans. on Sys., Man and Cybernetics 34(4), 472–482 (2004)
11. Guo, C., Zhang, L.: A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression. IEEE Trans. on Image Processing 19(1), 185–198 (2010)
12. x264 codec, http://www.videolan.org/developers/x264.html
13. ITU-T: Subjective video quality assessment methods for multimedia applications. Recommendation P.910 (1999)
14. Ou, Y.-F., Ma, Z., Wang, Y.: A novel quality metric for compressed video considering both frame rate and quantization artefacts. In: Proc. of Intl. Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM) (2009)
15. Peer, P., Solina, F.: An automatic human face detection method. In: Proc. of CVWW 1999, pp. 122–130 (1999)
16. Solina, F., Peer, P., Batagelj, B., Juvan, S.: 15 seconds of fame - an interactive computer vision-based art installation. In: Proc. of ICARCV 2002, pp. 198–204 (2002)
17. Sullivan, G.J., Topiwala, P., Luthra, A.: The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions. In: Proc. of the SPIE Conf. on Applications of Digital Image Processing, pp. 1–22 (2004)
18. Sullivan, G.J., Wiegand, T., Schwarz, H.: Amd.3 Scalable Video Coding, ISO/IEC JTC1/SC29/WG11, MPEG08/N9574, Antalya, TR (2008)
19. Wang, Y., Ma, Z., Ou, Y.-F.: Modeling rate and perceptual quality of scalable videos as functions of quantization and frame rate and its application in scalable video adaptation. In: Proc. of 7th International Packet Video Workshop (2009)
Image Annotation with Concept Level Feature Using PLSA+CCA
Yu Zheng, Tetsuya Takiguchi, and Yasuo Ariki
Graduate School of Engineering, Kobe University, 1-1, Rokkodai, Nada, Kobe, 657-8501 Japan
[email protected], {takigu,ariki}@kobe-u.ac.jp http://www.kobe-u.ac.jp/
Abstract. Digital cameras have made it much easier to take photos, but organizing those photos is difficult. As a result, many people have thousands of photos in some miscellaneous folder on their hard disk. If computers can understand and manage these photos for us, we can save time. It would also be useful for indexing and searching web images. In this paper we propose an image annotation system with concept-level search using PLSA+CCA, which generates appropriate keywords to annotate the query image using a large-scale image database. Keywords: image annotation, PLSA, CCA, image recognition.
1 Introduction
With the production of large digital image collections, favored by cheap digital recording and storage devices, there is a clear need for efficient indexing and retrieval systems. In query-by-example (QBE) systems, various low-level visual features are first extracted from the data set and stored as an image index. The query is an example image that is indexed by its features, and retrieved images are ranked with respect to their similarity to this query index. The natural query process, however, is textual, and images in a collection are indexed with words. Automatic image annotation has thus emerged as one of the key research areas in multimedia information retrieval. Image annotation has been an active research topic in recent years due to its potentially large impact on both image understanding and web image search. We target automatic image annotation in a novel search framework. Given an uncaptioned image, first, in the search stage, a set of visually similar images is found from a large-scale image database. The database consists of images from the World Wide Web (Flickr groups) with rich annotations and surrounding text made by users. In the mining stage, a search result clustering technique (PLSA) and Canonical Correlation Analysis (CCA) are utilized to find the most representative keywords from the annotations of the retrieved image subset. These keywords, after ranking, are finally used to annotate the uncaptioned image.
2 Prior Work
A large number of techniques have been proposed in the last decade. Most of these deal with annotation as translation from image instances to keywords. The translation paradigm is typically based on some model of image and text co-occurrences. One such translation model is Correspondence Latent Dirichlet Allocation (CorrLDA) [1], a model that finds conditional relationships between latent variable representations of sets of image regions and sets of words. Although it considers associations through a latent topic space in a generatively learned model, this class of models remains sensitive to the choice of topic model, initial parameters, and prior image segmentation. MBRM [2], shown in Fig. 1, proposed approaches for automatically annotating and retrieving images by learning a statistical generative model, called a relevance model, from a set of annotated training images. The images are partitioned into rectangles and features are computed over these rectangles. A joint probability model for image features and words, the relevance model, is learned and used to annotate test images which have not been seen. Words are modeled using a multiple Bernoulli process and images are modeled using a kernel density estimate. However, the complexity of the kernel density representations may hinder MBRM's applicability to large data sets.
Fig. 1. MBRM. The annotation w is a binary vector. The image is produced by first sampling a set of feature vectors g1, ..., gn, and then generating image regions r1, ..., rn from the feature vectors. The resulting regions are tiled to form the image.
Recent research efforts have focused on extensions of the translation paradigm that exploit additional structure in both the visual and textual domains. For instance, [3] utilizes a coherent language model, eliminating independence between keywords. The added complexity, however, makes the models applicable only to limited settings with small-size dictionaries. [4] developed the real-time ALIPR image search engine, which uses a multiresolution 2D Hidden Markov Model to model concepts determined by a training set. While this method successfully infers higher-level semantic concepts based on global features, identification of more specific categories and objects remains a challenge. In this paper, we propose a method to solve the problem of the trade-off between computational efficiency on a large-scale dataset and precision on complex annotation tasks. We use the concept-level representation to solve the precision
and dimension problems, and we use an Internet database with concept groups as the training data to solve the large-scale dataset problem. Through experiments we choose the best parameter, the number of topics K in the PLSA model, and build the concept feature in order to learn its correlation with the label features.
3 Approach
3.1 Outline
Automatically assigning keywords to images is of great interest as it allows one to index, retrieve, and understand large collections of image data. Given an input image, the goal of automatic image annotation is to assign a few relevant text keywords to the image that reflect its visual content. Automatic image annotation systems take advantage of existing annotated image data sets to link the visual and textual modalities by using machine learning techniques. The question is how to model the relation between captions and visual features to achieve the best textual indexing. This paper investigates this question, proposing a new dependence between words and images based on latent aspects; we propose a probabilistic framework to analyze the contribution of the textual and the visual modalities separately. We assume that the two modalities share the same conditional probability distribution over a latent aspect variable that can be estimated from both or either of the two modalities for a given image. Image annotation is a difficult task for two main reasons. The first is the semantic gap problem, which points to the fact that it is hard to extract semantically meaningful entities using just low-level image features; doing explicit recognition of thousands of objects or classes reliably is currently an unsolved problem. The second is to find a training image set with suitable keywords. We do not only use the low-level image features, but also the concept level of the images. In our automatic image annotation system, shown in Fig. 2, a user gives a query image to the system and finally obtains keywords associated with the given query image. In this paper, we propose a new method to select keywords relevant to the given query image from images gathered from the Web. Our method is based on generative probabilistic latent topic models such as
Fig. 2. Approach. The annotation w is a binary vector. The image is produced by first sampling a set of feature vectors g1, ..., gn. The image regions r1, ..., rn are replaced with the concept representation z1, ..., zn.
Probabilistic Latent Semantic Analysis (PLSA). Firstly, we gather images related to the given query image from the Web, based on image features extracted from the images themselves. Secondly, we use the gathered images as the training data for the PLSA model and train a probabilistic latent topic model with them. Finally, we search for the relationship between the visual and textual modalities at the concept level by CCA. The problem with generic object recognition algorithms is that the semantic gap between the image features and the name of the object has not been solved. We propose an annotation system that uses two latent variable spaces. The computation of the conversion parameters in CCA changes with the number of dimensions; high-dimensional features are not suitable for CCA, so we need to reduce the number of dimensions for the CCA analysis. CCA does not search for a direct relationship between the two variables, but first converts them into new variables P and Q that represent the correlation of the two variables well. The concept feature vector z has p variables and the label feature vector w has q variables. The latent variable P_i is derived from the concept feature z. We obtain the conversion parameter matrices A and B by CCA using the training data D = {(z_1, w_1), ..., (z_N, w_N)}. By learning the correlation between the image concept feature z and the label feature w, we build a model that can convert the two variables into new variables with high correlation. We choose the matrices A and B to maximize the correlation between P_i and Q_i, as shown in Eq. (1) and Eq. (2), where z̄ and w̄ are the means of z and w:

P_i = A^T (z_i − z̄),    (1)

Q_i = B^T (w_i − w̄).    (2)
The two types of features, image features and label features, could not be compared directly before, but with this model they can be compared. We first take training sample image data from the image database and extract the image feature and label feature from the training data, then conduct the CCA analysis for the two vectors, as expressed in Eq. (3), where p(z, w) is the co-occurrence probability of z and w and N_l is the number of latent variables:

p(z, w) = ∑_{i=1}^{N_l} p(P_i) p(z|P_i) p(w|P_i).    (3)
p(z|P_i) = exp(−(1/2)(P − P_i)^T Σ^(-1) (P − P_i)) / √((2π)^d |Σ|)    (4)

p(w|P_i) = μ δ_{w,P_i} + (1 − μ) N_w / N_W    (5)
The probability of z given the latent variable P_i, p(z|P_i), is expressed in Eq. (4); it is a Gaussian distribution with mean P_i in the latent variable space, where z is the concept feature vector and P_i is the latent variable derived from the concept feature vector z. The probability of the label feature given the latent variable P_i, p(w|P_i), is expressed in Eq. (5). It is designed top-down by a language model. N_w is the number of occurrences of label w in the training data. δ_{w,P_i} is 1 when label w is assigned to P_i, and 0 otherwise. μ is fixed to 0.99. The PLSA-CCA construction is shown in Fig. 3.
Fig. 3. We first extract the SIFT features x1, ..., xn of the image, convert them to the concept feature z1, ..., zn by PLSA, and then perform the CCA analysis with the label feature
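A hedged sketch of this CCA step using scikit-learn is shown below. Z holds the PLSA concept features p(z|d) and W the binary label vectors; the array shapes, the random data, and the number of canonical components are assumptions made purely for illustration, not the authors' setup.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
Z = rng.random((200, 8))                            # N images x K topics (concept features)
W = (rng.random((200, 50)) > 0.9).astype(float)     # N images x vocabulary size (label features)

cca = CCA(n_components=4)
cca.fit(Z, W)                   # learns projections playing the role of A and B
P, Q = cca.transform(Z, W)      # canonical variables, cf. Eq. (1) and Eq. (2)
```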
3.2 Training Data Search
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. To search for the most similar image group to use as training data, measuring image similarity is an effective approach: two images are similar if they are likely to belong to the same Flickr groups. We use SIFT as the image feature and quantize the descriptors. On online photo-sharing sites such as Flickr (shown in Fig. 4), people have organized many millions of photos into hundreds of thousands of semantically themed groups. How can we learn whether a photo is likely to belong to a particular Flickr group? We can easily download thousands of images belonging to the group and many more that do not, calculate the SIFT features of the images from the Flickr groups, and quantize them to form the feature vectors, which suggests that we train an SVM classifier, as shown in Fig. 5. For each group, we train an SVM. For a test image, we also calculate its SIFT feature and use the trained group classifiers to predict likely group memberships. We use these predictions to measure similarity and decide which group the test image belongs to.
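A simplified sketch of this group-membership step is given below. It assumes the quantized SIFT descriptors have already been pooled into bag-of-visual-words histograms; the RBF-kernel SVC with probability outputs and all data structures are illustrative choices, not the authors' exact setup.

```python
import numpy as np
from sklearn.svm import SVC

def train_group_classifiers(histograms_per_group, negatives):
    """Train one binary SVM per Flickr group on bag-of-visual-words histograms."""
    classifiers = {}
    for group, positives in histograms_per_group.items():
        X = np.vstack([positives, negatives])
        y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
        classifiers[group] = SVC(kernel="rbf", probability=True).fit(X, y)
    return classifiers

def most_similar_group(classifiers, test_histogram):
    """Score the test image against every group and return the most likely one."""
    scores = {g: clf.predict_proba(test_histogram[None, :])[0, 1]
              for g, clf in classifiers.items()}
    return max(scores, key=scores.get)      # its images are used as the training data
```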
3.3 PLSA-Mixed Concept Feature
A document is a mixture of latent aspects. These latent aspects are defined by multinomial distributions over words that are learned for each text corpus
Fig. 4. Flickr is almost certainly the best online photo management and sharing application in the world. With millions of users, and hundreds of millions of photos and videos, Flickr is an amazing photographic community, with sharing at its heart.
Fig. 5. Searching for training data with Flickr groups. We download thousands of images from many Flickr groups; the groups we use are organized by objects. For each group, we train an SVM classifier. For a test image, we use the trained group classifiers to predict whether it is likely to belong to each Flickr group. The images of the predicted group are taken as the training data.
considered. These distributions characterize the aspects and show that a correspondence between topics identified by humans and latent aspects can exist. The concept of latent aspects is not restricted to text documents. Images are intuitively seen as mixtures of several content types, which makes them good candidates for a latent aspect approach. Different latent aspect models, adapted from the LDA model for text, have been proposed to model annotated images. An image is generally composed of several entities (car, house, door, tree, rocks, ...) organized in often unpredictable layouts. Hence, the content of images from a specific scene type exhibits a large variability. PLSA, an unsupervised probabilistic model for collections of discrete data, integrates the recently proposed scale-invariant feature and probabilistic latent space model frameworks, and has the ability to generate a robust, low-dimensional representation. The bag-of-visterms representation is simple to build. PLSA is a statistical model, as
Fig. 6. Joint probability model. Plate notation representing the PLSA model. d is the document variable, z is a topic drawn from the topic distribution for this document, p(z|d), and w is a word drawn from the word distribution for this topic, p(w|z). d and w are observable variables; the topic z is a latent variable.
shown in Fig. 6, that associates a latent variable z_l ∈ Z = {z_1, ..., z_{N_A}} with each observation (the occurrence of a word in a document). These variables, usually called aspects, are then used to build a joint probability model over images and visterms, defined as the mixture

P(v_j, d_i) = P(d_i) ∑_{l=1}^{N_A} P(z_l|d_i) P(v_j|z_l).    (6)
PLSA introduces a conditional independence assumption: it assumes the occurrence of a visual word v_j to be independent of the image d_i it belongs to, given an aspect z_l. The model in Eq. (6) is defined by the conditional probabilities P(v_j|z_l), which represent the probability of observing the visual word v_j given the aspect z_l, and by the image-specific conditional multinomial probabilities P(z_l|d_i). The model expresses the conditional probabilities P(v_j|d_i) as a convex combination of the aspect-specific distributions P(v_j|z_l). The parameters of the model are estimated using the maximum likelihood principle on a set of training images D; the training images were obtained in the first step. The optimization is conducted using the Expectation-Maximization (EM) algorithm. This estimation procedure learns the aspect distributions P(v_j|z_l). These image-independent parameters can then be used to infer the aspect mixture parameters P(z_l|d) of any image d given its bag-of-visterms (BOV) representation. Consequently, the second image representation we use, the concept-level image representation, is defined by Eq. (7):

(P(z_l|d))_{l=1,2,...,N_A}.    (7)
Compared to the feature representation, the concept representation in Eq. (7) can find more accurate images because it searches the data based on the objects in the images.
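The sketch below gives a compact EM implementation of the PLSA model of Eq. (6), producing the p(z|d) vectors that form the concept representation of Eq. (7). The document-by-visterm count matrix, the number of aspects, the iteration count, and the random initialization are assumptions; it is an illustration, not the authors' code.

```python
import numpy as np

def plsa(counts, n_aspects=8, n_iter=50, seed=0):
    """counts: (n_docs, n_visterms) matrix of visual-word counts n(d, v)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_aspects)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_v_z = rng.random((n_aspects, n_words)); p_v_z /= p_v_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, v), shape (docs, visterms, aspects)
        joint = p_z_d[:, None, :] * p_v_z.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * post
        # M-step: re-estimate p(v|z) and p(z|d)
        p_v_z = weighted.sum(axis=0).T
        p_v_z /= p_v_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_v_z      # rows of p_z_d are the concept features of Eq. (7)
```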
3.4 Annotation Using PLSA-CCA
PLSA has been recently shown to perform well on image classification tasks, using the aspect mixture proportions to learn the classifiers. The conditional
Fig. 7. The algorithm of CCA
probability distribution over aspects P(z|d_new) can be inferred for an unseen document d_new. The folding-in method maximizes the likelihood of the document d_new with a partial version of the EM algorithm, where P(x|z) is obtained from training and kept fixed. In doing so, P(z|d_new) maximizes the likelihood of the document d_new with respect to the previously learned P(x|z) parameters. The PLSA-MIXED model learns a standard PLSA model on a concatenated representation of the textual and visual features x = (w, v). Using a training set of captioned images, P(x|z) is learned over both textual and visual co-occurrences to capture the simultaneous occurrence of visual features and words. Once P(x|z) has been learned, it can be used to infer a distribution over words for a new image as follows. The new image d_new is represented in the concatenated vector space, where all word elements are zero (no annotation): x_new = (0, v_new). The multinomial distribution over aspects given the new image, P(z|d_new), is then computed with the partial PLSA steps and allows the computation of P(x|d_new). The conditional probability distribution over words P(w|d_new) is extracted from P(x|d_new) and allows the annotation of the new image d_new. Given two column vectors X = (x_1, ..., x_n)^T and Y = (y_1, ..., y_m)^T of random variables, canonical correlation analysis seeks vectors a and b such that the random variables a^T X and b^T Y maximize the correlation ρ = cor(a^T X, b^T Y), as expressed in Eq. (8). The random variables U = a^T X and V = b^T Y are the first pair of canonical variables.

ρ = a^T Σ_XY b / ( √(a^T Σ_XX a) √(b^T Σ_YY b) )    (8)

h = ∑_{i=1}^{N} p(z_new|P_i) p(w|P_i)    (9)
When an unknown image is input, its concept feature z_new is computed and then the posterior probability of each label, p(w|z_new), is computed. p(z_new) and p(P_i) have the same value for all labels. The h value of Eq. (9) is computed for every label w, the labels are ranked by h, and the top-ranked labels are assigned to the unknown image.
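The label-scoring step can be sketched as below, combining Eq. (4), Eq. (5), and Eq. (9). The Gaussian covariance, μ = 0.99, and the label-count smoothing follow the text; the data structures (one stored latent point and label per training pair) and the top-k cutoff are assumptions for illustration only.

```python
import numpy as np

def p_z_given_Pi(P_new, P_i, cov):
    """Eq. (4): Gaussian likelihood of the projected query around latent point P_i."""
    d = P_new - P_i
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** len(P_new)) * np.linalg.det(cov))
    return np.exp(-0.5 * d @ inv @ d) / norm

def p_w_given_Pi(word, latent_label, word_counts, total_words, mu=0.99):
    """Eq. (5): language-model smoothing of the label attached to latent point P_i."""
    delta = 1.0 if latent_label == word else 0.0
    return mu * delta + (1.0 - mu) * word_counts.get(word, 0) / total_words

def annotate(P_new, latent_points, latent_labels, vocab, word_counts, cov, top_k=3):
    """Eq. (9): rank every candidate label by its accumulated score h."""
    total = sum(word_counts.values())
    scores = {w: sum(p_z_given_Pi(P_new, P_i, cov) *
                     p_w_given_Pi(w, lbl, word_counts, total)
                     for P_i, lbl in zip(latent_points, latent_labels))
              for w in vocab}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```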
4 Experiment
Predicting annotations with an unlimited vocabulary is a significant advantage of this annotation system, which benefits from Web-scale data; a better similarity measure yields a more semantically relevant image set. To obtain a well-annotated image database, we gathered 1K images from a photo forum site; images in photo forums have rich and accurate descriptions provided by photographers. We used these randomly selected 1K images as the test images. The number of topics: In Table 1 we compare the precision at the concept level for different numbers of topics. It can be seen that the precision changes with the number of topics. Meanwhile, the drop in precision caused by noisy or irrelevant words can be reduced by the concept-level search. We choose the best parameter K, which gives the highest precision. The number of images: In Table 1 we also vary the number of test images. The performance improves when the number of images increases. This implies that although more images may bring more noise, at the concept level the noise can be reduced effectively, and high precision can be achieved thanks to the large-scale data. As shown in Table 2, we compare the performance on the task of automatic image annotation with different models. CRM and CRM-Rectangles are essentially the same model, but the former uses regions produced by a segmentation
Table 1. Average precision with different numbers of topics (K) and different numbers of images

Images   K=1   K=2   K=3   K=4   K=5   K=6   K=7   K=8   K=9   K=10
10       56.7  67.8  42.7  79.8  86.6  70.1  73.8  69.7  87.9  82.5
20       52.0  47.8  69.0  54.9  67.9  57.8  71.5  72.7  70.4  68.9
50       43.7  57.9  48.2  64.7  59.9  79.8  69.0  80.3  76.5  80.1
100      41.5  49.2  45.3  51.5  49.8  63.9  55.1  57.7  64.2  65.4
200      53.6  57.1  49.6  65.9  58.8  67.8  73.4  70.8  75.5  67.9
500      57.9  67.5  59.2  69.9  72.3  71.4  68.0  70.4  77.3  77.0
1000     56.8  58.3  52.4  63.1  59.0  72.3  74.4  79.2  76.0  68.6
Table 2. Performance comparison on the task of automatic image annotation with different models

Images   Translation   CRM   CRM-Rectangles   MBRM   Proposed (best)
100      34            70    75               78     65.4
500      20            59    72               74     77.3
1000     18            47    63               69     79.2
algorithm, while the latter uses a grid. We can see that when 100 images are input, MBRM performs best, whereas when 500 or 1000 images are input, the proposed method performs best. The precision-recall curves are shown in Fig. 8. As shown in Fig. 9, images with much prior knowledge, such as buildings and mountains, can achieve high precision, but images with less prior knowledge, such as the Ferris wheel, do not perform well.
Fig. 8. Precision-recall curves of MBRM and the proposed method when K=8 with the 1000 input images
Fig. 9. Example annotation results (predicted labels include "Ferris wheel, Mountain, Building"; "Lighthouse, Sea, cloud"; "Building, Mountain"; "Tree, Road, lawn"; "Flower, Snow, Mountain"; "Trees, Bicycle, person")
5 Conclusion
In this paper, we have presented a practical and effective image annotation system. We formulate image annotation as searching for similar images and mining key phrases from the descriptions of the resulting images, based on two key techniques: an image search index and a search result clustering technique. We use these techniques to bridge the gap between the pixel representations of images and their semantic meanings. However, identifying objects, events, and activities in a scene is still a topic of intense research with limited success. In the future we will investigate how to improve the annotation quality without any prior knowledge.
References
1. Blei, D.M., Jordan, M.I.: Modeling annotated data. In: Proc. ACM SIGIR, pp. 127–134 (2003)
2. Feng, S.L., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for image and video annotation. In: IEEE Conf. Computer Vision and Pattern Recognition (2004)
3. Jin, R., Chai, J.Y., Si, L.: Effective automatic image annotation via a coherent language model and active learning. In: ACM Multimedia Conference, pp. 892–899 (2004)
4. Li, J., Wang, J.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003)
5. Barnard, K., Duygulu, P., Forsyth, D., Freitas, N., Blei, D., Jordan, M.: Matching words and pictures. JMLR (2003)
6. Carneiro, G., Vasconcelos, N.: A Database Centric View of Semantic Image Annotation and Retrieval. In: SIGIR (2005)
7. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002)
8. Barnard, K., Forsyth, D.A.: Learning the semantics of words and pictures. In: ICCV, pp. 408–415 (2001)
9. Zhang, H., Berg, A., Maire, M., Malik, J.: SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. In: Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2126–2136 (June 2006)
10. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (2008)
Multi-actor Emotion Recognition in Movies Using a Bimodal Approach
Ruchir Srivastava 1, Sujoy Roy 2, Shuicheng Yan 1, and Terence Sim 3
1 Dept. of Electrical and Computer Engineering, National University of Singapore, Singapore 117576
2 Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
3 School of Computing, National University of Singapore, Singapore 117417
Abstract. Approaches for emotion recognition in movie scenes using high-level features consider the emotion of only a single actor. The contribution of this paper is to analyze the use of emotional information from multiple actors present in the scene instead of just one actor. A bimodal approach is proposed for fusing emotional cues from different actors using two different fusion methods. Emotional cues are obtained from facial expressions and dialogs. Experimental observations show that the emotions of other actors do not necessarily provide helpful information about the emotion of the scene, and recognition accuracy is better when the emotions of only the speaker are considered. Keywords: Human Computer Interface, emotion recognition, movie analysis, Semantic Orientation, bimodal.
1 Introduction
As computers come closer to our lives, Human Computer Interaction (HCI) is emerging as a promising research area. An important avenue of HCI is human emotion recognition, by which computers can become aware of human emotions. Emotion Recognition (ER) finds application in various areas such as social robotics, determining consumer response, automated tutoring, surveillance, entertainment and so on. Most of the works on ER have analyzed emotions displayed under highly controlled laboratory environments. This limits the application of these methods in more natural environments, which pose difficulties such as variations in face scale and pose, ambient illumination, occlusion, head motion, presence of background noise and so on. These difficulties of a natural environment bring out a need for developing ER methods capable of dealing with environments less constrained than laboratory environments. However, suitable data are required for developing such methods. A movie environment is closer to a natural setting than a lab environment. Apart from being a step towards emotion recognition in natural environments, ER in movies can also be applied to movie recommendation, ranking and indexing. Movies are usually categorized
Fig. 1. Fusing emotional cues from multiple actors. Consider two scenes with negative emotions in which the male actor is the speaker. a. The non-speaker's neutral expression might not be helpful. b. Note that 1) the speaker's positive emotion is insufficient to indicate the tragedy in the scene, and 2) the emotion of the actress with a near-frontal face may be more reliable than that of the actor with a profile-view face. c. Proposed bimodal multi-actor emotion recognition framework. d. The 23 Facial Feature Points (FFPs) used for FER.
Movies are usually categorized into genres such as action, comedy, adventure and so on. However, this categorization cannot distinguish between comedic adventure movies and horrifying adventure movies. This distinction can be made using ER. Furthermore, by using a probabilistic approach to ER, it can be predicted how fearful a horror movie is. Other possible applications are movie classification and retrieval based on emotional content rather than actors, title or genre. Considering these applications of ER in movies, we may be more interested in the emotion of the scene as a whole and not just of a single actor. In movies, there are often multiple actors in a scene. Current approaches for ER recognize emotions of a single subject (single-actor approach). But the emotion of just one actor may not reflect the emotion of the scene. The main contribution of the proposed work is to analyze multi-actor emotion recognition, i.e., recognizing emotions in movie clips by fusing emotional cues from different actors. There is a need to analyze the effect of recognizing emotions of all the relevant actors in a movie scene (multi-actor approach). Such analysis can be interesting considering the different possibilities that may arise in a movie scenario. E.g., Fig. 1b depicts a scene where an actor consoles the actress after a tragedy. If we consider cues from both the dialog and the facial expression, the speaker (male actor) has a positive emotion while the other actor¹ has a sad expression. The scene represents a sorrowful situation, which is indicated by the non-speaker. In this case, combining emotional information from the non-speaker will be of help. Consider another case in Fig. 1a, where the speaker (male) has a negative expression but the non-speaker has an almost neutral expression. If the emotion cue from the non-speaker is fused with that of the speaker, the probability of the predicted emotion being negative reduces.
¹ For convenience, females are also referred to as actors in this paper.
So, it is debatable whether emotion cues of actors other than the speaker are relevant for recognizing the emotion of the scene. Amongst other challenges, the predicted emotion for each actor might not be equally reliable, as in Fig. 1b, where the prediction of the ER algorithm for the actor with a frontal face may be more reliable than that for the smaller face. The reliability of the emotion of each actor needs to be determined. In the cases discussed above, we have emotional cues from the facial expressions (visual cue) of two actors and the dialog (lexical cue) of the speaker. It is not easy to devise a method that determines the relevance of each of these cues and fuses them automatically to determine the emotion of the scene.
2 Related Works

2.1 Using Low-Level Features
Low-level features have mostly been visual in nature. Rasheed et al. [6] combine four computable video features, viz. average shot length, color variance, motion content and lighting key, in a framework that provides a mapping to four movie genres: comedy, action, drama and horror. Kang [4] uses 15 color features, motion information and shot cut rate to recognize three emotions (fear, sadness and joy) along with a normal state. Wang and Cheong [8] use audio and video content features to classify basic emotions elicited by movie scenes. Audio was classified into music, speech and environment signals and treated separately to shape an affective feature vector. The audio affective vector was used with video-based features such as key lighting and visual excitement to form a scene feature vector. Finally, the scene feature vector was classified and labeled with emotions.
2.2 Using High-Level Features
Black and Yacoob [1] use local parametrized models of image motion for recovering and recognizing the non-rigid and articulated motion of human faces. These models are then used for facial expression recognition on video sequences which include movie clips. However, it is not clear if the movie clips contain more than one actor. In recognizing emotions from audio, Giannakopoulos et al. [2] use a dimensional approach (valence-arousal representation) to recognize emotions from movies. They use 10 audio features based on Mel-frequency cepstral coefficients (MFCC), Fast Fourier Transforms (FFT), Zero Crossing Rate (ZCR), pitch information and variation of chroma elements. Approaches using visual high-level features have considered information from only one actor. However, it is interesting to test the utility of the emotions of other actors as well. The main contribution of the proposed work is an analysis of multi-actor emotion recognition, which fuses emotional information from different actors for recognizing emotions in movie clips. Movie clips are labeled as having positive or negative emotions. Considering the subtle nature of natural emotions as compared to posed emotions, a two-class prediction is made.
Fig. 2. Dealing with pose variations using two pose parameters. a. Profile face: pose1 = 0.04, pose2 = 22.8. b. Non-profile face: pose1 = 0.67, pose2 = 2.3.
This is a step towards a more challenging multi-class prediction in close-to-natural environments. Moreover, even the two-class prediction can be used for the commercial application of movie recommendation. A bimodal fusion methodology is proposed which combines visual cues from facial expressions and lexical cues from dialogs. Cues are weighted according to their relevance for ER. The relevance of facial expression cues is estimated by how much these cues are affected by variations such as pose and scale of the face, expression intensity and so on. The relevance of dialogs is estimated by how much information they convey about the emotion. This estimation is performed using the semantic orientation approach, details of which are in Section 4.
3 Facial Expression Recognition (FER)
FER gives the visual cue about emotions in the movie scene. Given the input image sequence F, containing frames F_f, where f ∈ [1, N_f] denotes the frame number and N_f the total number of frames in the sequence, the FER algorithm gives the probability of each frame F_f having a positive expression. Let this probability be given by P_{iV}^{pos}(f), where i denotes the i-th actor, V refers to visual and pos refers to positive. The FER algorithm uses the displacement of 23 Facial Feature Points (FFPs) (Fig. 1d) from their position in the neutral face (preselected) as features. The pseudocode of the FER algorithm is as follows:

Inputs: Randomly selected training and test image sequences with a neutral frame (denoted by F_ne) corresponding to each sequence.
Output: P_{iV}^{pos}(f) ∀ f ∈ [1, N_f]: probability of each frame having a positive expression.

Training (for all training sequences)
– Initialize the training feature matrix, X_tr = [ ].
– Detect faces in F_1 using the Viola-Jones face detector [3].
– For each face:
  1. Mark 23 FFPs in F_1 and F_ne. Let D_ne be the 23 × 2 matrix storing the (x, y) coordinates of the 23 FFPs for the neutral frame.
  2. Track the FFPs using a PPCA-based algorithm [5]. Let D_f store the (x, y) coordinates of the FFPs for the f-th frame.
  3. Find the residue R_f = D_f − D_ne, normalize it and shift the origin to the nose tip.
  4. Let the residue for the f-th frame after step 3 be RS_f (a 23 × 2 matrix). Form the feature matrix X and concatenate it to X_tr as follows:

     X = [RS_{1x}^T RS_{1y}^T; RS_{2x}^T RS_{2y}^T; ... ; RS_{N_f x}^T RS_{N_f y}^T],   X_tr = [X_tr; X]   (1)

     where subscripts x and y refer to the residues of the x and y coordinates, respectively.
  5. Train SVM classifiers using X_tr.

Testing
– For each face in the test sequence, find X_tes as X was found in steps 1 to 4 above.
– Select relevant features using the significance ratio test [9].
– Use the SVM to classify each frame as having a positive or negative expression.
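The residue-based feature construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming FFP coordinates are already available as NumPy arrays; the use of scikit-learn's SVC and normalizing by the maximum pairwise FFP distance (taken from Section 6.3) are assumptions, since the paper does not name an implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import SVC

def frame_features(D_f, D_ne, nose_idx=2):
    """Residue row for one frame: displacement of the 23 FFPs from the neutral
    frame, scale-normalized and re-centred on the nose-tip residue."""
    R = (D_f - D_ne) / pdist(D_ne).max()   # normalize by max pairwise FFP distance (Sec. 6.3)
    RS = R - R[nose_idx]                   # shift origin to the nose tip
    return np.concatenate([RS[:, 0], RS[:, 1]])   # [RS_x^T, RS_y^T], one row of X

def build_feature_matrix(frames, D_ne):
    """Stack per-frame residue rows into the matrix X of Eq. (1)."""
    return np.vstack([frame_features(D_f, D_ne) for D_f in frames])

def train_fer(train_sequences):
    """train_sequences: list of (frames, D_ne, labels) tuples (hypothetical format).
    Returns an RBF SVM whose probability outputs play the role of P_iV^pos(f)."""
    X_tr = np.vstack([build_feature_matrix(fr, ne) for fr, ne, _ in train_sequences])
    y_tr = np.concatenate([lab for _, _, lab in train_sequences])
    return SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
```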
3.1 Dealing with Large Variations in Facial Pose
In order to deal with the large range of facial poses encountered in the data, two SVM classifiers were trained, one for profile and one for non-profile views. In the testing phase, the test clip is classified into non-profile or profile and then the corresponding classifier is used to classify that clip, i.e., the SVM classifier trained on profile faces is used to test profile clips and the classifier trained on non-profile faces is used to test non-profile clips. In order to classify the facial pose in a particular clip as profile or non-profile, two parameters were used, defined as follows:

1. pose1: If d_{mn} denotes the Euclidean distance between FFPs m and n, pose1 is defined as

   pose1 = d_{49} / (d_{34} + d_{39})   (2)

   For a frontal face this ratio should be around 0.5, while for a profile face, due to a small value of d_{49}, the value of pose1 is close to zero.
2. pose2: pose2 measures the nature of the spread of the FFPs in the xy plane. If we perform PCA on the positions of the FFPs, then since the FFPs of a frontal face are evenly spread along the principal directions, the two eigenvalues are of similar order. For a profile face, however, the spread of the FFPs is not the same in both principal directions, which makes the first eigenvalue much larger than the second one. pose2 is defined as pose2 = λ₁/λ₂, where λ₁ and λ₂ are the first and second eigenvalues after applying PCA on the FFP coordinates. The value of pose2 is much larger for profile faces than for frontal faces.

For pose classification, the first frame of the clip is considered. If, for the first frame, pose1 is less than a certain threshold and pose2 is greater than another threshold, the face pose in the clip is classified as profile; otherwise it is classified as non-profile. The thresholds are manually chosen based on training data. Fig. 2 shows values of pose1 and pose2 for sample clips.
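A compact sketch of the two pose parameters follows, assuming the 23 FFP coordinates of the first frame are given as a (23, 2) NumPy array; the thresholds in is_profile are placeholders, since the paper only says they are chosen manually from training data.

```python
import numpy as np

def pose_parameters(ffps):
    """ffps: (23, 2) array of FFP coordinates; FFP numbers in the paper are
    1-indexed, so FFP n maps to row n - 1 here."""
    d = lambda m, n: np.linalg.norm(ffps[m - 1] - ffps[n - 1])
    pose1 = d(4, 9) / (d(3, 4) + d(3, 9))            # Eq. (2)
    centred = ffps - ffps.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centred.T))  # PCA via the 2x2 covariance
    lam2, lam1 = sorted(eigvals)                     # lam1 >= lam2
    pose2 = lam1 / max(lam2, 1e-8)
    return pose1, pose2

def is_profile(first_frame_ffps, t1=0.2, t2=10.0):
    """Classify the clip's facial pose; t1 and t2 are hypothetical thresholds."""
    pose1, pose2 = pose_parameters(first_frame_ffps)
    return pose1 < t1 and pose2 > t2
```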
4 Lexical Analysis of Dialogs
Lexical analysis of the dialogs finds the emotions expressed through dialogs using Semantic Orientation (SO). The SO of a word or a phrase tells us whether that word or phrase is positively or negatively oriented. In general, positive SO values indicate positive orientation and negative SO values indicate negative orientation. A small magnitude of the SO value indicates the absence of a strong emotion. E.g., the phrase 'That's very sweet' indicates a positive emotion and its SO value is +1.68. On the other hand, the phrase 'You shameless woman!' indicates a negative emotion, which is also reflected by an SO value of -1.37. A mild phrase such as 'not deal' has an SO value of -0.15, whose low magnitude indicates its mildness. SO is calculated using Whissell's Dictionary of Affect in Language (DAL) [10]. DAL gives the emotional connotation of 8742 words along three dimensions, viz. evaluation, activation and imagery. Scores for pleasantness range from 1 (unpleasant) to 3 (pleasant), for activation from 1 (passive) to 3 (active) and for imagery from 1 (difficult to form a mental picture of this word) to 3 (easy to form a mental picture). To calculate SO, DAL scores are mapped to a range of -1 to +1. For each word in the dialog, evaluation is taken as a direct measure of its SO. If a word is missing in the dictionary, the evaluation score of its synonyms or related forms is used. Since very few words were missing in DAL, synonyms and related forms could be entered manually. The SO of the entire dialog is given by the mean SO value of the individual words. The probability of a dialog with SO value S being positive is given by:

   P_{sL}^{pos} = S / SO+        if S ≥ 0   (3)
   P_{sL}^{pos} = 1 − S / SO−    if S < 0   (4)
where SO+ and SO− are the maximum and minimum SO values for training examples and s refers to the speaker.
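The dialog-level computation can be sketched as follows. The DAL lookup table and synonym map are hypothetical dictionaries here, and the linear shift used to map DAL evaluation scores from [1, 3] to [-1, 1] is an assumption; the paper only states the target range.

```python
def word_so(word, dal, synonyms=None):
    """SO of a single word: DAL 'evaluation' score shifted from [1, 3] to [-1, 1].
    Falls back to a synonym when the word is missing; unknown words count as neutral."""
    if word in dal:
        return dal[word] - 2.0
    if synonyms and word in synonyms and synonyms[word] in dal:
        return dal[synonyms[word]] - 2.0
    return 0.0

def dialog_positive_probability(dialog, dal, so_plus, so_minus, synonyms=None):
    """Mean word SO of the dialog, then Eqs. (3)-(4); so_plus and so_minus are the
    maximum and minimum SO values observed on the training examples."""
    words = dialog.lower().split()
    S = sum(word_so(w, dal, synonyms) for w in words) / max(len(words), 1)
    return S / so_plus if S >= 0 else 1.0 - S / so_minus
```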
5 Fusing Visual and Lexical Cues
For fusing visual with lexical cues, identifying the speaker amongst all the actors in the scene is important, since the lexical cue from the dialog can be combined only with the visual cue from the speaker's face and not with those of other actors. In some cases, instead of the speaker, the audience is seen in the video, which makes it necessary to identify whether the speaker is present or not. Speaker detection is done by lip motion analysis; details are in Section 6.2. Visual and lexical cues are fused in a weighted manner. The weights are determined as follows.

5.1 Finding Weights
The weight given to an individual cue is proportional to the confidence that a decision made based on that cue is correct. For the lexical cue, a high SO value indicates a greater confidence level.
If S denotes the SO value, the weight for the lexical cue is chosen as

   w_s = α_s / (α_v^f / α_max + α_s),   where α_s = P_{sL}^{pos} if S ≥ 0, and α_s = 1 − P_{sL}^{pos} if S < 0   (5), (6)

and α_v^f is defined later in Eq. (8). To calculate the weights for visual cues, factors were identified which can affect the performance of emotion recognition from visual cues. These factors are pose and scale of the face, tracking accuracy, intensity of expression and head motion. For each of these factors an associated parameter was extracted from frame f of an image sequence as follows:
– Pose: Let (x_j^f, y_j^f) be the coordinates of the j-th FFP in the f-th frame with the origin at the nose tip. FFPs are numbered as per Fig. 1d. One parameter to estimate pose is taken as the ratio of the horizontal distance between the nose (FFP no. 3) and the outer left eyebrow corner (FFP no. 4) to the horizontal distance between the nose and the outer right eyebrow corner (FFP no. 9). The second parameter is the slope of the line connecting the two outer eyebrow corners. Mathematically these parameters are given as
   p_1^f = (x_3^f − x_4^f) / (x_9^f − x_3^f),   p_2^f = (y_9^f − y_4^f) / (x_9^f − x_4^f)   (7)
For a frontal face p_1^f should be almost 1, and any deviation from 1 indicates rotation of the face in the yaw direction. p_2^f for a frontal face is zero, and deviation from zero shows rotation of the face in the roll direction.
– Tracking accuracy: Any tracking failure is expected to reduce FER performance. To detect tracking failure, an assumption is made that the motions of certain pairs of adjacent points on the face are almost similar when the facial expression changes. E.g., when a person lifts his eyebrow, FFPs 4 and 5 (Fig. 1d) usually move upwards with similar displacement magnitudes. As long as both FFPs of a chosen pair are correctly tracked, their inter-frame displacements will be similar. Any tracking failure is expected to displace the wrongly tracked point more than the correctly tracked one, so the difference of the inter-frame displacements of the two points gives an indication of tracking failure. These differences are summed up for all the chosen pairs of FFPs, which are (4,5), (5,6), (7,8), (8,9), (10,11), (12,13) and (15,16). This summation (say D) should be under a certain threshold (say D_min) for the tracking to be reasonably good. In case of a tracking failure, due to the wrongly predicted position of an FFP, D would cross D_min. D_min is chosen manually.
– Scale: The normalization distance is taken as a measure of the scale of the face and is represented by the parameter p_4^f. See Section 6 for a definition of the normalization distance. This distance should be large for a good recognition performance.
– Intensity of expression: An expression is more intense if the normalized residues are larger in magnitude. For a frame, the intensity is given by the sum of the magnitudes of the residues of that frame. This parameter is represented by p_5^f. A higher value of p_5^f promises a better recognition.
– Head motion: Head motion is taken to be captured most by the nose motion, since there is minimal non-rigid motion at the nose tip. The displacement of the nose tip from its position in the first frame is taken as a measure of head motion for that frame. Let this parameter be p_6^f. For a good recognition performance, p_6^f should be low.

The weight given to the visual cue for a frame f is calculated as

   w_v^f = (α_v^f / α_max) / (α_v^f / α_max + α_s),   α_v^f = (p_3^f p_4^f p_5^f) / [(1 + a_1^f |p_1^f − 1|)(1 + a_2^f p_2^f)(1 + a_3^f p_6^f)]   (8)
where α_max is the maximum value of α_v^f over the training data, and a_1^f, a_2^f and a_3^f are chosen manually. Once the weights of the individual cues are determined, the fusion of P_{iV}^{pos}(f) and P_{sL}^{pos} can be achieved in two ways:

Scheme 1:
1. Fuse P_{sV}^{pos}(f) with P_{sL}^{pos} using their respective weights w_{sv}^f and w_s. Let P_s^{pos}(f) be the resultant probability for the f-th frame.
2. Combine P_s^{pos}(f) with P_{iV}^{pos}(f) in a weighted fashion, where i denotes all actors except the speaker. The weight for P_s^{pos}(f) is the mean of w_s and w_{sv}^f, while the weights for P_{iV}^{pos}(f) are w_{iv}^f for the different values of i. The mean probability over all frames is the probability of the clip having positive emotion.
Scheme 2:
1. Combine P_{iV}^{pos}(f) for all actors (all i's) with weights given by w_{iv}^f, then fuse the resulting visual probability with P_{sL}^{pos}. The weight for P_{sL}^{pos} is given by w_s. Again, the probability of the clip having positive emotion is given by the mean probability over all frames.
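The two fusion schemes can be sketched as below, assuming the per-frame probabilities and weights (Eqs. (5)-(8)) have already been computed as NumPy arrays. The weight given to the pooled visual probability in Scheme 2 is not stated explicitly in the paper, so the mean visual weight used here is an assumption.

```python
import numpy as np

def fuse_scheme1(P_sV, P_sL, P_iV, w_sv, w_s, w_iv):
    """P_sV, w_sv: (Nf,) speaker visual probabilities and weights; P_sL, w_s: scalars
    for the lexical cue; P_iV, w_iv: (num_other_actors, Nf) arrays."""
    # Step 1: fuse the speaker's visual cue with the lexical cue, frame by frame.
    P_s = (w_sv * P_sV + w_s * P_sL) / (w_sv + w_s)
    # Step 2: combine with the other actors; P_s is weighted by mean(w_s, w_sv^f).
    w_speaker = (w_s + w_sv) / 2.0
    P_clip = (w_speaker * P_s + (w_iv * P_iV).sum(axis=0)) / (w_speaker + w_iv.sum(axis=0))
    return float(P_clip.mean())              # clip-level probability of positive emotion

def fuse_scheme2(P_sV, P_sL, P_iV, w_sv, w_s, w_iv):
    # Step 1: pool the visual cues of all actors (speaker included).
    all_P = np.vstack([P_sV[None, :], P_iV])
    all_w = np.vstack([w_sv[None, :], w_iv])
    P_V = (all_w * all_P).sum(axis=0) / all_w.sum(axis=0)
    w_V = all_w.mean(axis=0)                 # assumed weight for the pooled visual cue
    # Step 2: fuse the pooled visual probability with the lexical cue.
    return float(((w_V * P_V + w_s * P_sL) / (w_V + w_s)).mean())
```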
6 Experimental Results and Discussions

6.1 Dataset
The data for the experiments consists of 388 movie clips from 17 movies classified into 5 genres: comedy (6), action (2), adventure (3), drama (1) and horror (5). 232 clips contain negative emotions while 156 contain positive emotions. Each clip contains more than one person and was manually labeled as having positive or negative emotion by 20 volunteers. Of the clips used for each class, 75% were used for training and 25% for testing. The movie scenes contained the different variations possible in the real world, such as low illumination, pose variations, occlusion and small face size. Some example scenes are depicted in Fig. 3.
Fig. 3. Sample movie clips of different difficulty levels. Top row: pose changes and head motion. Bottom row (from left to right): small faces, low illumination and occlusion.
Table 1. Confusion matrices for classification using (a) only the speaker's facial expression, (b) facial expressions of multiple actors, and (c) only the lexical cue (SO). ARR: Average Recognition Rate; Posit: Positive; Negat: Negative. For each class, 291 clips: training; 97 clips: testing.

(a) ARR = 76.3%            (b) ARR = 74.8%            (c) ARR = 84.7%
        Posit  Negat               Posit  Negat               Posit  Negat
Posit   78.5   21.5        Posit   79.2   20.8        Posit   88.2   11.8
Negat   25.9   74.1        Negat   29.6   70.4        Negat   18.8   81.2

6.2 Speaker Detection
The presence of the speaker in the video is determined by analyzing the ratio (R_s) of the inter-frame displacements of the lip FFPs (FFPs 12 to 19) to those of the other FFPs. If R_s exceeds a threshold, a speaker is present, and the actor for which R_s has the maximum mean value is taken to be the speaker.
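A sketch of this lip-motion test is given below, assuming each actor's FFP track is available as an (Nf, 23, 2) array; the threshold value is a placeholder, since the paper does not report it.

```python
import numpy as np

LIP = list(range(11, 19))   # FFPs 12-19 in the paper's 1-indexed numbering

def lip_motion_ratio(track):
    """track: (Nf, 23, 2) FFP coordinates of one actor. Returns R_s per frame pair:
    mean lip-FFP displacement over mean displacement of the remaining FFPs."""
    disp = np.linalg.norm(np.diff(track, axis=0), axis=2)          # (Nf-1, 23)
    other = [i for i in range(23) if i not in LIP]
    return disp[:, LIP].mean(axis=1) / (disp[:, other].mean(axis=1) + 1e-8)

def detect_speaker(tracks, threshold=1.5):
    """tracks: list of per-actor FFP tracks; threshold is hypothetical.
    Returns the index of the speaker, or None if no actor exceeds the threshold."""
    means = [lip_motion_ratio(t).mean() for t in tracks]
    best = int(np.argmax(means))
    return best if means[best] > threshold else None
```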
6.3 Facial Expression Recognition
Training: FER is performed using the algorithm outlined in Section 3. The position of the FFPs in the neutral face is found by selecting a neutral frame of the actor with a facial pose closest to that present in the training video. For scale normalization, the inter-eye distance cannot be used due to pose variations. Instead, the distances between all possible pairs of FFPs are calculated and the maximum of those distances is taken as the normalizing distance. After feature selection using the significance ratio test [9], the top 5 most relevant residues correspond to FFPs 6, 10, 11, 12 and 16. The selected residues are used to train an SVM classifier with an RBF kernel with parameters c = 4096 and g = 0.125 (found using grid search). Frames with tracking failure were automatically excluded using the method described in Section 5.1. Testing: A test clip is preprocessed and features are extracted in the same manner as for the training sequences (Section 3). FER experiments were performed for both the single-actor and multi-actor cases. The SVM classification results for both cases, averaged over 10 runs, are given in the form of confusion matrices in Table 1(a) and (b), respectively. It is observed that the recognition accuracy decreased slightly in the multi-actor approach.
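Two of the training details above can be illustrated briefly: the scale-normalization distance and the grid search that, per the paper, yielded c = 4096 and g = 0.125. The search ranges below and the scikit-learn API are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def normalization_distance(ffps):
    """Maximum pairwise distance among the 23 FFPs, used instead of the
    inter-eye distance because of pose variations."""
    return pdist(ffps).max()

def tune_rbf_svm(X_tr, y_tr):
    """Grid search over the RBF parameters (hypothetical power-of-two grid)."""
    grid = {"C": [2.0 ** k for k in range(-2, 14)],
            "gamma": [2.0 ** k for k in range(-9, 2)]}
    search = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5)
    return search.fit(X_tr, y_tr).best_estimator_
```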
Fig. 4. Examples of failures of the multi-actor approach. a. Dialog by the actor on the left: 'No, you see, you have the wrong idea.' Contradicting facial expressions of the actors led to a wrong FER result; this brings out the need to use lexical cues from dialogs. b. Dialog by the female actor: 'How are you feeling? Okay?...the greatest of luck.' Even fusion of visual and lexical cues gave incorrect results.

Table 2. Confusion matrices for classification using (a) Fusion Scheme 1, single actor, (b) Fusion Scheme 1, multi-actor, (c) Fusion Scheme 2, single actor, and (d) Fusion Scheme 2, multi-actor. ARR: Average Recognition Rate (%); Pos: Positive; Neg: Negative. For each class, 291 clips: training; 97 clips: testing.

(a) ARR: 88.3          (b) ARR: 83.9          (c) ARR: 90.9          (d) ARR: 89.7
     Pos   Neg              Pos   Neg              Pos   Neg              Pos   Neg
Pos  78.5  21.5        Pos  69.2  30.8        Pos  83.8  16.2        Pos  81.5  18.5
Neg   1.8  98.2        Neg   1.5  98.5        Neg   2.1  97.9        Neg   2.1  97.9
This is possibly because the speaker may display facial expressions more intensely than the other actors (e.g., Fig. 1a). Another reason can be a contradiction in the facial expressions of different actors (Fig. 4a and b). This brings in the role of lexical cues. Using lexical cues alone gave an Average Recognition Rate of 84.7% (Table 1c).
6.4 Experiments Combining Visual and Lexical Cues
Fusion was performed using the two fusion schemes (Section 5), the results of which, averaged over 10 runs, are shown in Table 2. For each run, the training and test sets were randomly chosen. Results are shown for both the single-actor and multi-actor cases using both fusion schemes. Even after fusing lexical cues, recognition accuracy is better in the single-actor case. An example of failure of the multi-actor approach is shown in Fig. 4b. On the other hand, an example of the effectiveness of the multi-actor approach is the case of Fig. 1b, which was already discussed in Section 1. It is observed that in cases where the speaker's emotions are not indicative of the emotion of the scene, the multi-actor approach was helpful, but such cases are not many.
7 Conclusion
This paper analyzed the effect of considering the emotions of all the relevant actors in a movie scene for recognizing the emotion of the scene. A bimodal approach was proposed to fuse visual cues from facial expressions and lexical cues from spoken dialogs.
A weighting scheme was proposed considering the relevance of the available emotional cues. The result for a facial expression was considered to be of high confidence if the face was less affected by variations in pose, scale, illumination, etc., while the result of the lexical analysis was relied upon more if its Semantic Orientation was larger in magnitude. Upon experimentation it was observed that the multi-actor approach did not improve recognition accuracy compared to the single-actor approach; instead, it degraded the accuracy. However, the multi-actor approach can be helpful in cases where the speaker displays an emotion contradicting the emotion of the scene. Some improvements to the proposed approach are possible. In the FER part, the tracking algorithm has to be made robust to variations in facial pose and head motion by using a generic model for tracking in which the facial shape is better defined. Manual intervention in selecting the neutral face can be eliminated by clustering the faces of the actor in the movie and automatically detecting the neutral face among them.
References
1. Black, M.J., Yacoob, Y.: Recognizing facial expressions in image sequences using local parameterized models of image motion. Int. J. Computer Vision 25(1), 23–48 (1997)
2. Giannakopoulos, T., Pikrakis, A., Theodoridis, S.: A dimensional approach to emotion recognition from movies. In: IEEE Int. Conference on Acoustics, Speech and Signal Processing (2009)
3. Jones, M., Viola, P.: Fast multi-view face detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Citeseer (2003)
4. Kang, H.B.: Affective content detection using HMMs. In: 11th ACM Int. Conference on Multimedia (2003)
5. Nguyen, T., Ranganath, S.: Tracking facial features under occlusions and recognizing facial expressions in sign language. In: Proc. Conf. Face & Gesture Recognition, FG 2008, pp. 1–7 (2008)
6. Rasheed, Z., Sheikh, Y., Shah, M.: On the use of computable features for film classification. IEEE Trans. Circuits Syst. Video Technol. 15(1) (2005)
7. Srivastava, R., Roy, S., Yan, S., Sim, T.: Bimodal emotion recognition in movies (submitted)
8. Wang, H.L., Cheong, L.F.: Affective understanding in film. IEEE Trans. Circuits Syst. Video Technol. 16(6) (2006)
9. Weiss, S.M., Indurkhya, N.: Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, San Francisco (1998)
10. Whissell, C.M.: The dictionary of affect in language. In: Plutchik, R., Kellerman, H. (eds.) Emotion: Theory, Research, and Experience, pp. 113–131. Academic Press, New York (1989)
RoboGene: An Image Retrieval System with Multi-Level Log-Based Relevance Feedback Scheme
Huanchen Zhang, Haojie Li*, Shichao Dong, and Weifeng Sun
School of Software, Dalian University of Technology
[email protected], [email protected], [email protected], [email protected]
Abstract. This demo presents an image retrieval system named RoboGene that uses a multi-level log-based relevance (MLLR) feedback scheme. By analyzing previous users' perception of the content of images, stored in a user log, MLLR ranks images with stronger log-based relevance to the current user's positive feedback higher than those with weaker relevance. The final rank of images returned by RoboGene is the combination of the log-based relevance rank and the rank based on low-level features.
1 Introduction

The semantic gap between high-level concepts and low-level features like color, texture and shape reduces the performance of content-based image retrieval (CBIR) systems. A user's feedback carries valuable information about the current user's perception of the content of the image and can be used to bridge this semantic gap. It has been shown to be a powerful tool for improving the retrieval performance of CBIR [1]. To reduce the number of iterations and accelerate relevance feedback, some research has focused on active learning techniques [2]. However, in an active learning process, users are asked to label additional images selected by the system, considered the most informative ones, and this additional feedback often causes users' impatience. Relevance scores and similarities of images can be estimated and then used to rank images [3]. The valuable information of previous users' common perception of images can help understand the similarities of images and accelerate relevance feedback [4]. In this demo, we present a new image ranking approach, named multi-level log-based relevance (MLLR), which is the core algorithm of RoboGene, using the user log and positive feedback to rerank images [5]. Compared to previous research [4], which only uses direct log-based relevance, we utilize both direct and indirect log-based relevance and thus better mine the deeper information in the log. Besides, we use only positive feedback, instead of both positive and negative feedback as in the previous study [4], which gives images irrelevant to the user's negative feedback a positive log-based relevance. However, images irrelevant to the user's negative feedback may be another kind of negative feedback rather than positive feedback, so the information in negative feedback can be ambiguous and harmful.
* Corresponding author.
2 Approach

In MLLR, an implicit link is built between two images that are both positive feedback in one feedback session, and the implicit links built by previous users are stored in a log database. These implicit links reflect previous users' judgments on the relevance between images and build a multi-level structure showing the direct and indirect relevance between images. The top level is the current user's feedback and is given 100% log-based relevance to the user's desired target. Suppose image A is one of the positive samples fed back by the user: if D is relevant to C, C is relevant to B and B is relevant to A, then C and D are indirectly relevant to A. We say that B is at a higher level than C, and C is at a higher level than D. The higher-level images' log-based relevance to the user's desired target is transmitted to the lower-level images, level by level, through the implicit links between them. MLLR thus ranks images that are more relevant to the current user's feedback higher by analyzing the previous user log. Then, a combination of the log-based rank and the low-level-feature rank gives each image a final rank.
Fig. 1. Flow of user’s interaction with RoboGene
After a user launches a query in RoboGene, the system first returns the set of images whose cosine distance to the query image exceeds an empirically determined threshold. If the user cannot obtain the desired targets from the initial results, the user may choose to begin a relevance learning procedure by selecting, as feedback, several images in the initial results which satisfy his or her intention. On the one hand, the user's feedback is used to enrich the user log, which can help later users. The feedback information is stored in the log database after one feedback session is over. To manage the log information, an implicit link vector (ILV), ILV = {i, j, c_ij}, is constructed to represent the implicit link information. The first two elements, i and j, denote the images i and j which are the endpoints of the implicit link. The third element, c_ij, denotes the number of sessions in which images i and j are fed back as two positive samples, i.e., the number of implicit links between i and j. On the other hand, the user's feedback is viewed as the reflection of the user's real intention and is used by RoboGene as the top level of the multi-level structure to rerank the images in the initial result.
If the user is still unsatisfied with the result, he or she can choose to begin another feedback session. Figure 1 shows the flow of the user's interaction with RoboGene.
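A sketch of how the ILV log could be stored and how relevance might be propagated level by level is given below. The class and function names, the number of propagated levels, and the per-level decay factor are all assumptions; the paper only states that relevance is transmitted from the top level (100%) downwards through the implicit links.

```python
from collections import defaultdict

class LogDatabase:
    """Stores implicit link vectors ILV = {i, j, c_ij}: c_ij counts the sessions in
    which images i and j were both given as positive feedback."""
    def __init__(self):
        self.c = defaultdict(int)

    def record_session(self, positive_ids):
        for a in positive_ids:
            for b in positive_ids:
                if a < b:
                    self.c[(a, b)] += 1

    def neighbours(self, i):
        for (a, b), c_ij in self.c.items():
            if a == i:
                yield b, c_ij
            elif b == i:
                yield a, c_ij

def log_relevance(log_db, feedback_ids, levels=3, decay=0.5):
    """Propagate relevance from the current feedback (relevance 1.0) down the
    multi-level structure; 'decay' per level is an assumed attenuation."""
    rel = {i: 1.0 for i in feedback_ids}
    frontier = set(feedback_ids)
    for _ in range(levels):
        nxt = set()
        for i in frontier:
            for j, c_ij in log_db.neighbours(i):
                r = rel[i] * decay          # c_ij could also scale this transfer
                if r > rel.get(j, 0.0):
                    rel[j] = r
                    nxt.add(j)
        frontier = nxt
    return rel
```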
3 Demonstration

Figure 2 shows an example of the ranking performance enhanced by RoboGene. (A) is the initial result returned by RoboGene based on low-level features after the user submits an image query of 'white tiger'. Two images of cats are ranked higher than the tigers in the initial result because cats are quite similar to tigers in terms of low-level features. After the user selects several images of tigers as feedback, RoboGene returns the reranked result shown in (B), where those two cat images are ranked lower and the top images are all white tigers.
Fig. 2. An example of RoboGene performance: (A) initial result; (B) reranked result.
Acknowledgements. This work is supported by the Fundamental Research Funds for the Central Universities (1600-852009).
References
1. Zhou, X.S., Huang, T.S.: Relevance Feedback in Image Retrieval: A Comprehensive Review. Multimedia Systems 8, 536–544 (2003)
2. Meng, W., Xian-Sheng, H.: Active Learning in Multimedia Annotation and Retrieval: A Survey. ACM Transactions on Intelligent Systems and Technology (in press)
3. Meng, W., Kuiyuan, Y., Xian-Sheng, H., Hong-Jiang, Z.: Towards Relevant and Diverse Search of Social Images. IEEE Transactions on Multimedia (2010)
4. Hoi, S.C., Lyu, M.R., Jin, R.: A unified log-based relevance feedback scheme for image retrieval. IEEE Trans. Knowl. Data Eng. 18(4), 509–524 (2006)
5. Huanchen, Z., Weifeng, S., Shichao, D., Long, C., Chuang, L.: Multi-level Log-based Relevance Feedback Scheme for Image Retrieval. In: Advanced Data Mining and Applications (2010)
Query Difficulty Guided Image Retrieval System
Yangxi Li¹, Yong Luo¹, Dacheng Tao², and Chao Xu¹
¹ Key Laboratory of Machine Perception, Peking University, Beijing
² School of Computer Engineering, Nanyang Technological University, Singapore
{liyangxi,luoyong}@cis.pku.edu.cn, [email protected], [email protected]
Abstract. Query difficulty estimation is a useful tool for content-based image retrieval. It predicts the performance of the search result of a given query, and thus it can guide the pseudo relevance feedback to rerank the image search results, and can be used to re-write the given query by suggesting “easy” alternatives. This paper presents a query difficulty estimation guided image retrieval system. The system initially estimates the difficulty of a given query image by analyzing both the query image and the retrieved top ranked images. Different search strategies are correspondingly applied to improve the retrieval performance. Keywords: Query difficulty estimation and Content-based image retrieval.
1 Introduction
Popular information retrieval technologies suffer from a radical variance in performance across different queries. Even for retrieval systems that normally perform well, performance can be unsatisfactory for some "difficult" queries, i.e., queries on which the system performs poorly. Therefore, query difficulty estimation, also called query performance prediction, is proposed to quantitatively estimate the retrieval performance of a given query on a pre-specified dataset. Although query difficulty estimation (QDE) has not been studied before for content-based image retrieval (CBIR), it plays an important role and is invaluable for CBIR from the following two perspectives: 1) for CBIR users, QDE provides valuable feedback that can help them refine their queries, e.g., cropping the original images, refining the bounding box that specifies the target query object, and suggesting alternative query images; 2) for the CBIR system, the ranking algorithm can invoke flexible retrieval strategies for different queries adaptively [7,8]. We present in this paper a QDE guided image retrieval system, which utilizes the estimated query difficulty to improve the retrieval performance. Retrieval results of a query image are obtained initially. Afterward, these results and the query image are used jointly to estimate the query difficulty. According to the difficulty score, different retrieval strategies are invoked to improve the final retrieval results.
Fig. 1. The architecture of QDE guided image retrieval system
2
System Overview
Figure 1 presents the system architecture, which consists of three modules. Initially, the system searches the query image in the database with an ad-hoc retrieval model. The top ranked images are then used as auxiliary information for estimating the difficulty of the query image. After mining the information residing in the query image and its returned results from various perspectives and obtaining the difficulty score, two different retrieval strategies, named Guided Pseudo Relevance Feedback (GPRF) and Selective Query Refinement (SQR) with bounding box, are applied to further boost performance. Detailed descriptions of the three modules follow.

Retrieval. The retrieval result is a useful clue for determining a query's difficulty. By representing each image as a bag of visual words (BOVW) [4], we initially retrieve images from the database via an ad-hoc retrieval model (e.g., cosine, language model [5], or a learning approach [6]), which generally scales well due to the sparsity of the BOVW-based image representation.

Query difficulty estimation. In this module, the system calculates the query difficulty by analyzing the query image and the top ranked images from three perspectives: 1) the clarity score [2] measures the divergence between a query's distribution and the whole collection's distribution; the larger the clarity score, the more distinctive the query language model, and thus the better the query's performance should be; 2) spatial verification of visual words provides more information for query difficulty estimation; the degree of spatial consistency between the query and the top results is related to the query's performance; and 3) appearance consistency, which measures the average global-feature similarity between the query and the images in the database, is also taken into account.

Performance improvement strategies. After the query image's difficulty score is determined, two performance improvement strategies are used in our system: GPRF and SQR. GPRF dynamically performs Pseudo Relevance Feedback (PRF) for mid-range queries in terms of difficulty score, while SQR modifies the query images by drawing a bounding box for the difficult queries. In our demo system, the Rocchio algorithm [3] with the 15 top-ranked feedback images is adopted in GPRF, and the bounding box settings associated with the Oxford Buildings [1] dataset are used in SQR.
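As an illustration of the first predictor, the clarity score can be computed as the KL divergence between a query language model (built over visual words) and the collection-wide model. The sketch below assumes aligned BOVW histograms; the smoothing parameter alpha and the use of the top-ranked images to build the query model are assumptions about implementation details the paper does not spell out.

```python
import numpy as np

def query_language_model(query_bovw, top_ranked_bovw, alpha=0.6):
    """Mix the query's visual-word histogram with the mean histogram of the
    top-ranked images (alpha is an assumed smoothing weight)."""
    q = np.asarray(query_bovw, dtype=float)
    fb = np.asarray(top_ranked_bovw, dtype=float).mean(axis=0)
    model = alpha * q / max(q.sum(), 1.0) + (1.0 - alpha) * fb / max(fb.sum(), 1.0)
    return model / model.sum()

def clarity_score(query_model, collection_model, eps=1e-12):
    """KL divergence (in bits) between the query model and the collection model,
    following the definition of Cronen-Townsend et al. [2]."""
    p = np.asarray(query_model, dtype=float) + eps
    q = np.asarray(collection_model, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))
```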
3 Demonstrations
Figure 2 shows an example from the QDE guided image retrieval system. The query "ashmolean 3" is considered a difficult query by the system and receives a difficulty score of 76 out of 100. The system selects SQR according to this score and improves the performance moderately (results in row 3, highlighted in green). The results of GPRF and GPRF+SQR are shown in rows 2 and 4, respectively, which demonstrates that PRF deteriorates the performance of the difficult query. This example suggests the effectiveness of the QDE guided image retrieval system.
Fig. 2. Illustration of QDE guided image retrieval system
Acknowledgement. This paper is supported by the Chinese 973 Program (2011CB302400), NSFC (60975014) and NSFB (4102024).
References
1. http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
2. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 299–306 (2002)
3. Rocchio, J.: Relevance Feedback in Information Retrieval. The SMART Retrieval System (1971)
4. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: ICCV (2003)
5. Geng, B., Yang, L., Xu, C.: A Study of Language Model for Image Retrieval. In: ICDMW (2009)
6. Wang, M., Hua, X.-S.: Active Learning in Multimedia Annotation and Retrieval: A Survey. ACM Transactions on Intelligent Systems and Technology (in press)
7. Bian, W., Tao, D.: Biased Discriminant Euclidean Embedding for Content-Based Image Retrieval. IEEE Transactions on Image Processing 19(2), 545–554 (2010)
8. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric Bagging and Random Subspace for Support Vector Machines-Based Relevance Feedback in Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1088–1099 (2006)
HeartPlayer: A Smart Music Player Involving Emotion Recognition, Expression and Recommendation
Songchun Fan, Cheng Tan, Xin Fan, Han Su, and Jinyu Zhang
Software Institute, Nanjing University, 22 Hankou Road, 210093, Nanjing, P.R. China
{fsc07,tc07,fx07,suh07}@software.nju.edu.com
Abstract. In this demo, we present HeartPlayer, a smart music player that can understand the music it plays. When a song is being played, an animated figure expresses the emotion of the song through facial expressions. Meanwhile, the emotion of the music item is calculated from music features retrieved by a program running in the background. In the GUI, six colors indicate the six classes of songs with different emotions. Moreover, by recording the user's play history, HeartPlayer produces a series of analysis results, including the user's preferences, today's mood and music personality. Our contribution mainly lies in providing a novel music player interface and exploring new facilities in a music player. Keywords: Music player, emotion-based, music information retrieval, music recommendation.
1 Introduction

Today, listening to digital music has become a common way to connect the digital world with the physical world. With most traditional music players we can only "hear" the music, which makes us disregard the emotion that the singers or composers want to express. Some music visualization techniques, such as Isochords [1], only provide effects according to the music structure, without giving any further information about the music's emotion. Most emotion-based music players, such as LyQ [2], use emotion information to recommend related music rather than expressing it. Thus there is no music player that can both collect the emotion information of music items and express it. In this demo, we present HeartPlayer, a smart music player which helps the user "see" the music by creating a visual way for the user to understand and feel the emotion of the tune. Moreover, by analyzing the songs the user has been listening to, HeartPlayer can identify the user's mood and recommend suitable songs. As a music player with complex functions, HeartPlayer consists of four parts: the player, the figure, the music library, and the emotion-calculating program. The first three parts make up the user interface, as shown in Figure 1, while the last part is a program that runs in the back-end.
Every part, especially the last one, is pluggable, so that it can be updated or replaced conveniently. The player part is entirely conventional, so we introduce only the other three parts in the rest of the paper. The rest of the paper is organized as follows. In Section 1.1 we describe the figure which makes facial expressions. In Section 1.2 we briefly introduce the music library. In Section 1.3 we expound the calculating module and its logic. Finally, we summarize our contribution in Section 2.
Fig. 1. A Runtime Screenshot
1.1 Figure

To show its understanding of the music, HeartPlayer uses a figure which looks like a human kid to express emotion. As soon as the emotion is calculated, the figure changes its facial expression. Before that, it only makes body movements such as waving its arms; in particular, the movements change according to instantaneous music information such as amplitude. The figure also shows a rapid change in the color of its eyes when there is an obvious variation in the spectrum.

1.2 Music Library

In the music library, analyses such as "My Favourites", "Today's Mood", "Emotion Log" and "Music Personality" are provided in the form of dynamic charts or diagrams. Figure 2 shows two of them. In particular, six colours stand for the six kinds of music emotion. With the analysis results, the recommendation is made.
Fig. 2. Emotion Log and Music Personality
1.3 Calculating Program

This part is kept isolated from the other parts so that the algorithm can be conveniently updated or replaced. The calculating program is responsible for calculating the emotion of the music synchronously while the music is being played. The program first retrieves some music features of the song, such as tempo, melody and amplitude. A model is built to calculate the emotion score from these music features. We define six intervals in the range of scores, corresponding to the six emotions "excited", "joyful", "peaceful", "sorrowful", "desperate" and "nervous". The classification is made according to an optimization of the Tellegen-Watson-Clark model of mood [3]. The program then calculates the score, and the interval into which the score falls identifies the emotion of the song. To verify the accuracy of the calculating program, an experiment was conducted: 200 college students were invited to a classroom to identify the emotions of 50 songs played on a recorder. We then compared the results agreed upon by the majority of students with those calculated by the program. The program reached an accuracy of 86% over all the songs tested.
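The interval-based classification can be sketched as below. The paper does not publish the score model or the interval boundaries, so the weighted-sum scoring function, the score range and the ordering of the six labels along that range are all placeholders.

```python
# Hypothetical interval boundaries over an assumed score range of [-1, 1].
EMOTION_INTERVALS = [
    (-1.0, -0.6, "desperate"),
    (-0.6, -0.3, "sorrowful"),
    (-0.3,  0.0, "nervous"),
    ( 0.0,  0.3, "peaceful"),
    ( 0.3,  0.6, "joyful"),
    ( 0.6,  1.0, "excited"),
]

def emotion_score(tempo, melody, amplitude, weights=(0.5, 0.3, 0.2)):
    """Stand-in scoring model over normalized features; the real model built by
    HeartPlayer is not described in detail."""
    w1, w2, w3 = weights
    return w1 * tempo + w2 * melody + w3 * amplitude

def classify_emotion(score):
    """Map the emotion score to one of the six classes via the intervals."""
    for low, high, label in EMOTION_INTERVALS:
        if low <= score <= high:
            return label
    return "peaceful"   # fallback for out-of-range scores
```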
2 Contribution

In this demo, we propose a music player, HeartPlayer. Our contribution mainly lies in providing a novel music player interface and in exploring new facilities in a music player. To the best of our knowledge, it is the first music player that can relatively precisely recognize the emotion of the music being played, express that emotion, and analyze the results to give recommendations. That is why we call it a smart player.
References
[1] Bergstrom, T., Karahalios, K., Hart, J.C.: Isochords: visualizing structure in music. In: Proceedings of Graphics Interface 2007, GI 2007, Montreal, Canada, May 28-30, pp. 297–304. ACM, New York (2007), http://doi.acm.org/10.1145/1268517.1268565
[2] Hsu, D.C., Hsu, J.Y.: LyQ-An Emotion-aware Music Player. In: 2006 Workshop of American Association for Artificial Intelligence (2006)
[3] Tellegen, A., Watson, D., Clark, L.A.: On the Dimensional and Hierarchical Structure of Affect. Psychological Science 10, 297 (1999), doi:10.1111/1467-9280.00157
Immersive Video Conferencing Architecture Using Game Engine Technology
Chris Poppe, Charles-Frederik Hollemeersch, Sarah De Bruyne, Peter Lambert, and Rik Van de Walle
Multimedia Lab, Ghent University - IBBT, Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium
{chris.poppe,charlesfrederik.hollemeersch,sarah.debruyne,peter.lambert,rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be
Abstract. This paper introduces the use of gaming technology for the creation of immersive video conferencing systems. The system integrates virtual meeting rooms with avatars and live video feeds, shared across different clients. Video analysis is used to create a sense of immersiveness by introducing aspects of the real world into the virtual environment. This architecture will ease and stimulate the development of immersive and intelligent telepresence systems. Keywords: Video Conferencing, Game Engine, Video Analysis.
1 Introduction

Telepresence allows a person to feel as if they were present at a location other than their true location. Current systems fall short of conveying a true feeling of telepresence, since they merely visualize the other meeting rooms. Recent work shows the interest in more immersive telepresence by creating virtual worlds or 3D representations of participants [1]. Additionally, efforts have been made to introduce different information sources into video conferencing; typical examples are the application of video analysis for head or gaze tracking [2, 3]. In this work we show that a game engine is well-suited to creating such immersive telepresence systems and allows for easy updating, modular architectures, and advanced networking and rendering capabilities. The work is part of the iCocoon (immersive Communication through Computer vision) project¹.
2 Video Conferencing Using Game Engine Technology

Current game engines allow the development of 3D multi-player games. They provide means for creating, editing and rendering virtual worlds, for interaction, and for the network exchange of information. Since they are created to ease game development, we propose to benefit from the efforts made in this area to create an immersive video conferencing system.
¹ http://www.ibbt.be/en/projects/overview-projects/p/detail/icocoon-2
Fig. 1. Application of Unity for immersive video conferencing
For this purpose, Unity², a multi-player game development tool, is used to create game applications (see Fig. 1 for an example). Each of these applications can act as a server or client and presents the end-user with a view of a virtual environment consisting of avatars, live video feeds, and models of the meeting rooms. The user can interact with the virtual world on different levels (e.g., chatting, logging in and off, and selecting different camera views). The latter can be accomplished using C# scripts, which are inherent to Unity for implementing game functionality. Both the virtual world and the avatars are synchronized using the built-in network support of Unity (remote procedure calls and state synchronization), which shields the developer from low-level network programming. The video feeds are also analyzed by external video analysis modules which extract relevant information (using OpenCV³). Different analysis scenarios can be imagined, e.g., detection of illumination changes, camera motion, moving objects, face detection, gaze tracking, and gesture recognition. The results of the analysis are represented as metadata and sent to the server over socket communication. The server processes the metadata and makes appropriate changes to the virtual world. Using the game technology, these changes are automatically communicated to the other clients. Changes of the avatars, meeting rooms, and camera viewpoints are typical examples of actions triggered by the metadata. Figure 2 shows screen shots of a possible set-up. A virtual meeting room is rendered with avatars and live video feeds for each connected client. When the light is turned off in the real world, the illumination change is detected by video analysis and sent to the server. The server updates the light source in the virtual environment, which is consequently reflected on all clients.
² http://unity3d.com/unity/
³ http://opencv.willowgarage.com/wiki/
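A minimal sketch of the analysis-module side of this pipeline is shown below: a crude illumination measure computed with OpenCV and an event sent to the conferencing server as JSON over a plain TCP socket. The message format, port, threshold, and the use of mean gray level as the illumination measure are all assumptions; the project does not specify these details.

```python
import json
import socket
import cv2

def illumination_level(frame):
    """Mean gray level of the frame as a simple illumination measure."""
    return float(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean())

def run_illumination_module(server_addr=("localhost", 9000), camera_id=0, drop=40.0):
    """Watch the camera feed and report sudden darkening to the server as metadata."""
    cap = cv2.VideoCapture(camera_id)
    sock = socket.create_connection(server_addr)
    baseline = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            level = illumination_level(frame)
            baseline = level if baseline is None else 0.9 * baseline + 0.1 * level
            if baseline - level > drop:          # sudden drop in illumination
                event = {"type": "illumination", "value": level}
                sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
    finally:
        cap.release()
        sock.close()
```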
Fig. 2. Screen shots (a) and (b) of the virtual meeting room with video feed
3 Conclusions

We propose the use of game development tools to create immersive video conferencing. This allows the creation of virtual meeting rooms with avatars shared across different clients. Additionally, video analysis is integrated to create a sense of immersiveness by introducing aspects of the real world into the virtual environment. Future work consists of testing the system with multiple users on aspects such as scalability, performance, and quality of user experience.

Acknowledgments. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References
1. Schreer, O., Feldmann, I., Atzpadin, N., Eisert, P., Kauff, P., Belt, H.J.W.: 3D presence - a system concept for multi-user and multi-party immersive 3D video conferencing. In: IET 5th European Conference on Visual Media Production, pp. 10–17 (2008)
2. Kompatsiaris, I., Strintzis, M.G.: Spatiotemporal Segmentation and Tracking of Objects for Visualization of Videoconference Image Sequences. IEEE Trans. on Circuits and Systems for Video Technology 10, 1388–1402 (2000)
3. Vertegaal, R., Weevers, I., Sohn, C., Cheung, C.: GAZE-2: Conveying Eye Contact in Group Video Conferencing Using Eye-Controlled Camera Direction. In: Conference on Human Factors in Computing Systems, pp. 521–528. ACM, New York (2003)
Author Index
Ai, Mingjing II-383 Ardabilian, Mohsen I-206 Ariki, Yasuo II-454 Ashraf, Golam I-536 Assent, Ira I-140 Azad, Salahuddin II-442 Bailer, Werner I-359, II-219 Beecks, Christian I-140, I-381 Benoit, Alexandre I-350 Boeszoermenyi, Laszlo I-129 Boll, Susanne I-84 Cai, Junjie II-77 Cai, Qiyun II-393 Chan, Antoni B. I-317 Chang, King-Jen I-548 Chang, Kuang-I II-432 Chang, Richard I-328 Chang, Shih-Ming I-514, II-296 Chao, Hui II-65 Chen, Bing-Yu I-73, I-435 Chen, Hsin-Hui II-168, II-252 Chen, Hua-Tsung II-315 Chen, Jie II-12 Chen, Jin-Shing I-548 Chen, Jyun-Long II-432 Chen, Kuan-Wen I-171 Chen, Liming I-206 Cheng, Hsu-Yung II-187 Cheng, Jian I-307 Cheng, Kai-Yin I-73 Cheng, Sheng-Yi I-457 Chu, Wei-Ta I-229 Chu, Xiqing II-359 Chua, Tat-Seng I-262, I-392 Chua, TeckWee I-328 Chung, Sheng-Luen II-177 Ciobotaru, Madalina I-350 Cock, Jan De I-29 Collomosse, John I-118 Cooray, Saman H. I-424 Dai, Feng Dai, Wang
I-21, I-51 II-393
De Bruyne, Sarah I-29, II-486 Dejraba, Chabane I-251 Ding, Jian-Jiun II-168, II-252 Doman, Keisuke II-135 Dong, Shichao II-476 Du, Ruo II-35 Duan, Ling-Yu II-12 El Sayad, Ismail Emmanuel, Sabu
I-251 II-285
Fan, Chih-Peng I-10 Fan, Jianping II-46, II-111 Fan, Jingwen II-359 Fan, Songchun II-483 Fan, Xin II-483 Fang, Yuchun II-393 Fang, Yuming I-370 Feng, Xiaoyi II-46, II-111 Foley, Colum II-337 Fu, Chuan II-274 Fu, Qiang II-57 Gao, Sheng II-99 Gao, Wen II-12 Goto, Satoshi I-161 Gu, Hui-Zhen II-315 Gu, Zhouye I-40 Guo, Aiyuan II-421 Guo, Jing-Ming II-177 Guo, Jinlin II-337 Gurrin, Cathal II-337 Han, Tony X. II-1 Hang, Kaiyu I-413 He, Xiangjian II-35 He, Xun I-161 He, Ying I-217, II-371 Hoeffernig, Martin II-263 Hoi, Steven C.H. I-217, II-371 Hollemeersch, Charles-Frederik I-29, II-486 Hong, Wei-Tyng II-187 Hou, ZuJun I-328
490
Author Index
Hsiao, Pei-Yung II-208 Hsu, Chiou-Ting I-503 Hsu, Hui-Huang I-514, II-296 Hsu, Su-Chu I-548 Hsu, Winston H. II-146 Hsu, Yu-Ming II-146 Hua, Xian-Sheng I-107 Huang, Chun-Kai I-435 Huang, Di I-206 Huang, Jingjing I-403 Huang, Jiun-De II-168 Huang, Shih-Ming II-326 Huang, Shih-Shinh II-208 Huang, Szu-Hao I-151 Huang, Yea-Shuan I-457, I-525 Hung, Chia-Jung I-435 Hung, Shang-Chih I-151 Hung, Yi-Ping I-171, I-548 H¨ urst, Wolfgang II-157, II-230 Ide, Ichiro II-135 Ikenaga, Takeshi I-492 Ionescu, Bogdan I-350 Jabbar, Khalid I-182 Jang, Lih-Guong I-446 Jeng, Bor-Shenn II-187 Ji, Rongrong II-12 Jia, Wenjing II-35 Jian, Er-Liang II-208 Jiang, Xinghao II-359 Jin, Jesse S. II-25 Jin, Xiaocong I-492 Jin, Xin I-161 Jose, Joemon I-118 Kaiser, Rene II-263 Kankanhalli, Mohan II-285 Katayama, Norio I-284 Kuai, Cheng Ying II-135 Kuo, Tzu-Hao I-73 Lai, Shang-Hong I-151 Lambert, Patrick I-350 Lambert, Peter I-29, II-486 Lao, Songyang II-337 Lau, Chiew Tong I-40, I-370 Lay, Jose I-296
Le, Nguyen Kim Hai I-536 Lee, Bu-sung I-40, I-370 Lee, Chien-Cheng II-187 Lee, Felix II-219 Lee, Hyowon I-424 Lee, Ming-Sui I-548 Lee, Pei-Jyun I-171 Lee, Su-Ling II-196 Lee, Suh-Yin II-315 Lee, Tung-Ying I-151 Lee, Tzu-Heng II-168 Leman, Karianto I-328 Levy, David I-296 Li, Bing II-12 Li, Chu-Yung I-525 Li, Fan I-470 Li, Haizhou II-99 Li, Haojie II-476 Li, Meng-Luen I-229 Li, Peng I-307 Li, Tom L.H. I-317 Li, Yangxi II-479 Li, Yiqun I-262, II-421 Li, Zechao I-307 Liang, Chao II-88 Liao, Zhuhua II-274 Lien, Cheng-Chang I-446 Lim, Joo Hwee II-421 Lin, Che-Hung II-177 Lin, Chia-Wen I-370 Lin, Chin-Lin II-401 Lin, Daw-Tung II-401 Lin, Pao-Yen II-168, II-252 Lin, Shou-De I-339 Lin, Shouxun I-21 Lin, Tsung-Yu I-151 Lin, Weisi I-40, I-370 Lin, Yen-Liang II-146 Lin, Yu-Shin II-296 Liu, Guizhong I-470 Liu, Haiming II-241 Liu, Jingjing I-481 Liu, Siying II-421 Liu, Yan I-1 Liu, Yang II-88 Lo, Hung-Yi I-339 Lou, Chengsheng II-393 Lu, Hanqing I-307, I-481 Luo, Jie II-393 Luo, Jiebo I-403
Author Index Luo, Suhuai II-25 Luo, Yong II-479 Mahapatra, Amogh I-273 Mak, Mun-Thye II-411 Mark Liao, Hong-Yuan I-503 Martinet, Jean I-251 Matsuoka, Yuta I-96 Mayer, Harald II-263 Mei, Tao I-107 Miao, Chen-Hsien I-10 Mo, Hiroshi I-284 Mulholland, Paul II-241 Murase, Hiroshi II-135 Nakamura, Naoto II-348 Nguyen, Duc Dung II-371 Nijholt, Anton II-122 O’Connor, Noel E. I-424 Okada, Yoshihiro II-348 Ong, Alex I-182 Ouji, Karima I-206 Park, Mira II-25 Peng, Jinye II-46, II-111 Peng, Yu II-25 Pham, Nam Trung I-328 Plass-Oude Bos, Danny II-122 Poel, Mannes II-122 Poppe, Chris I-29, II-486 Preneel, Bart I-62 Qian, Xueming I-413 Quah, Chee Kwang I-182 Qu´enot, Georges I-240 Rabbath, Mohammad I-84 Ren, Reede I-118 Roy, Sujoy II-411, II-465 R¨ uger, Stefan II-241 Safadi, Bahjat I-240 Sandhaus, Philipp I-84 Satoh, Shin’ichi I-284 Schoeffmann, Klaus I-129 Seah, Hock Soon I-182
Seidl, Thomas I-140, I-381 Shao, Ling I-1 She, Lanbo I-403 Shen, Chuxiong II-359 Shih, Shen-En I-193 Shih, Timothy K. I-514, II-296 Shirahama, Kimiaki I-96 Sim, Terence II-465 Smeaton, Alan F. II-337 Snoek, Cees G.M. II-230 Song, Dawei II-241 Song, Wei II-442 Spoel, Willem-Jan II-230 Srivastava, Jaideep I-273 Srivastava, Ruchir II-465 Su, Han II-483 Su, Jia I-492 Su, Yu-Jen II-432 Sun, Tanfeng II-359 Sun, Weifeng II-476 Tai, Shih-Chao II-304 Takahashi, Tomokazu II-135 Takano, Shigeru II-348 Takiguchi, Tetsuya II-454 Tam, King Yiu I-296 Tan, Cheng II-483 Tang, Feng II-57 Tang, Nick C. I-503 Tao, Dacheng II-479 Tian, Qi II-77 Tian, Yonghong I-273 Tjondronegoro, Dian II-442 Tomin, Mate II-230 Tong, Yubing I-240 Tretter, Dan II-65 Tsai, Chang-Lung II-304 Tsai, Chun-Yu I-73 Tsai, Hsin-Ming II-208 Tsai, Joseph I-514 Tsai, Joseph C. II-296 Tsai, Ming-Hsiu I-446 Tsai, Mu-Yu II-432 Tsai, Wei-Chin II-315 Tsai, Wen-Hsiang I-193 Tsai, Ya-Ting I-446 Tseng, Chien-Cheng II-196 Tu, Meng-Qui II-208 Tyan, Hsiao-Rong I-503
491
492
Author Index
Uehara, Kuniaki I-96 Uren, Victoria II-241 Urruty, Thierry I-251 Uysal, Merih Seran I-381 Wagner, Claudia II-263 Walle, Rik Van de I-29, II-486 Wan, Kong Wah II-411 Wan, Xin I-273 Wang, Brian II-146 Wang, Dayong I-217 Wang, Hsin-Min I-339 Wang, Jinqiao I-481, II-12 Wang, Lin II-35 Wang, Minghui I-161 Wang, Yan I-107 Wang, Yue I-328 Wang, Yunhong I-206 Wang, Zengfu II-77 Wei, Xin I-492 Weng, Li I-62 Wezel, Casper van II-157 Why, Yong Peng I-536 Wu, Bin II-359 Wu, Peng II-57 Wu, Pengcheng II-371 Wu, Qiang II-35 Wu, Shu-Min II-432 Xu, Changsheng II-1, II-88 Xu, Chao II-479
Yan, Chenggang I-51 Yan, Shuicheng II-1, II-465 Yan, Zhe I-413 Yang, Chunlei II-111 Yang, Jar-Ferr II-326 Yang, Jing II-274 Yeh, Wei-chang II-35 Yu, Chih-Chang II-187 Yu, Jen-Yu II-315 Yu, Like I-21 Yu, Meng-Chieh I-548 Yu, Nenghai I-403 Zha, Zheng-Jun I-262, II-77 Zhang, Guoqing II-274 Zhang, Huanchen II-476 Zhang, Hui I-1 Zhang, Jinyu II-483 Zhang, Peng II-285 Zhang, Qing I-470 Zhang, Shanfeng II-359 Zhang, Tong II-65 Zhang, Xinming II-88 Zhang, Yongdong I-21, I-51 Zhang, Zheng I-182 Zhao, Lili II-383 Zhao, Yi-Liang I-392 Zheng, Yan-Tao I-262, I-392 Zheng, Yu II-454 Zhou, Ning II-46 Zhou, Xiangdong I-392 Zhu, Guangyu II-1 Zhuang, Liansheng I-403