Paolo Remagnino, Dorothy N. Monekosso, and Lakhmi C. Jain (Eds.) Innovations in Defence Support Systems – 3
Studies in Computational Intelligence, Volume 336 Editor-in-Chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6 01-447 Warsaw Poland E-mail:
[email protected]

Further volumes of this series can be found on our homepage: springer.com
Paolo Remagnino, Dorothy N. Monekosso, and Lakhmi C. Jain (Eds.)
Innovations in Defence Support Systems – 3 Intelligent Paradigms in Security
Dr. Paolo Remagnino
Kingston University, Faculty of Computing, Information Systems and Mathematics, Penrhyn Road Campus, Kingston upon Thames, Surrey KT1 2EE, United Kingdom
E-mail: [email protected]

Prof. Lakhmi C. Jain
SCT-Building, University of South Australia, Mawson Lakes Campus, Adelaide, South Australia, Australia
E-mail: [email protected]

Dr. Dorothy N. Monekosso
University of Ulster at Jordanstown, Faculty of Computing and Engineering, School of Computing and Mathematics, Shore Road, Newtownabbey BT37 0QB, United Kingdom
E-mail: [email protected]
ISBN 978-3-642-18277-8
e-ISBN 978-3-642-18278-5
DOI 10.1007/978-3-642-18278-5
Studies in Computational Intelligence ISSN 1860-949X
© 2011 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper.
springer.com
This book collection is dedicated to all researchers in the field of intelligent environments.
Preface
Intelligent Paradigms in Security is a collection of articles introducing the latest advances in the field of intelligent monitoring. The book is intended for readers from a computer science or engineering background. It describes techniques for the interpretation of sensor data in the context of automatic understanding of complex scenes captured in large public areas. The aim is to guide the reader through a number of research topics for which the existing video surveillance literature offers only partial and incomplete solutions, and to introduce the next challenges in intelligent video surveillance.

Each chapter in the book presents a solution to one aspect of the problem of monitoring a public space. The environments of interest are often characterized by clutter and by complex interactions between people, and between people and objects. Each chapter proposes a sophisticated solution to a specific problem. Public environments, such as an airport concourse, a shopping mall or a train station, are large and require numerous sensors. The deployment of a large number of sensors produces a large quantity of video data (petabytes or more) that must be processed; in addition, different scenes may require processing at different levels of granularity. Service robots might soon inhabit public areas, collecting global information but also approaching regions of interest in the scene to collect more detailed information, for instance about an abandoned piece of luggage.

The processing of visual data, or of data from any other sensor modality, must occur at a speed consistent with the objective. This might be real time, although some data and information can be processed off-line, for instance to generate new knowledge about the scene for the purpose of enhancing the system. The performance of such systems must be evaluated according to criteria that include time latency, accuracy in the detection of an event or of the global dynamics, and response to queries about the monitored event or the expected dynamics in the environment. To achieve this, the concept of normality is used: a model of what constitutes normality permits deviations and anomalies in the behaviour of people, or in the position of objects in the monitored environment, to be detected.

We are confident that our collection will be of great use to practitioners of the covered fields of research, but it could also be formative for doctoral students and researchers in
intelligent environments. We wish to express our gratitude to the authors and reviewers for their time and vision, as well as to Springer for its assistance during the publication phase of the book.

London, Belfast, Adelaide
September 2010
Paolo Remagnino Dorothy N. Monekosso Lakhmi Jain
Acknowledgements
The editors wish to thank all the contributors of the collection for their endeavors and patience.
Contents
1  Data Fusion in Modern Surveillance
   Lauro Snidaro, Ingrid Visentini, Gian Luca Foresti
   1.1 Introduction
       1.1.1 Terminology in Data Fusion
       1.1.2 Motivation to Sensor Arrays
       1.1.3 The JDL Fusion Process Model
   1.2 Data Fusion and Surveillance
       1.2.1 JDL Model Contextualized to Surveillance: An Example
   1.3 A Closer Look to Fusion in Level 1
       1.3.1 Data Level
       1.3.2 Feature Level
       1.3.3 Classifier Level
       1.3.4 Combiner Level
       1.3.5 An Example: Target Tracking via Classification
   1.4 Sensor Management: A New Paradigm for Automatic Video Surveillance
   1.5 Conclusions
   References

2  Distributed Intelligent Surveillance Systems Modeling for Performance Evaluation
   Stefano Maludrottu, Alessio Dore, Carlo S. Regazzoni
   2.1 Introduction
       2.1.1 Surveillance Systems as Data Fusion Architectures
       2.1.2 Surveillance Tasks Decomposition
   2.2 Intelligent Surveillance System Modeling
       2.2.1 Introduction
       2.2.2 Smart Sensor Model
       2.2.3 Fusion Node Model
       2.2.4 Control Center Model
       2.2.5 Performance Evaluation of Multi-sensor Architectures
   2.3 A Case Study: Multi-sensor Architectures for Tracking
       2.3.1 Introduction
       2.3.2 Multisensor Tracking System Simulator
       2.3.3 Performance Evaluation of Tracking Architectures
       2.3.4 Experimental Results
   2.4 Conclusions
   References

3  Incremental Learning on Trajectory Clustering
   Luis Patino, François Bremond, Monique Thonnat
   3.1 Introduction
       3.1.1 Related Work
   3.2 General Structure of the Proposed Approach
   3.3 On-Line Processing: Real-Time Object Detection
   3.4 Trajectory Analysis
       3.4.1 Object Representation: Feature Analysis
       3.4.2 Incremental Learning
   3.5 Trajectory Analysis Evaluation
   3.6 Results
   3.7 Conclusions
   References
   Appendix 1
   Appendix 2

4  Highly Accurate Estimation of Pedestrian Speed Profiles from Video Sequences
   Panagiotis Sourtzinos, Dimitrios Makris, Paolo Remagnino
   4.1 Introduction
   4.2 Background
   4.3 Methodology
       4.3.1 Motion Detection and Tracking
       4.3.2 Static Foot Localization
       4.3.3 Speed Estimation
   4.4 Results
   4.5 Conclusions
   References

5  System-Wide Tracking of Individuals
   Christopher Madden, Massimo Piccardi
   5.1 Introduction
   5.2 Features
       5.2.1 Shape Features
       5.2.2 Appearance Features
       5.2.3 Mitigating Illumination Effects
   5.3 Feature Fusion Framework
   5.4 Results
   5.5 Conclusions
   References

6  A Scalable Approach Based on Normality Components for Intelligent Surveillance
   Javier Albusac, José J. Castro-Schez, David Vallejo, Luis Jiménez-Linares, Carlos Glez-Morcillo
   6.1 Introduction
   6.2 Previous Work
   6.3 Formal Model to Build Scalable and Flexible Surveillance Systems
       6.3.1 General Overview
       6.3.2 Normality in Monitored Environments
       6.3.3 Global Normality Analysis by Aggregating Independent Analysis
   6.4 Model Application: Trajectory Analysis
       6.4.1 Normal Trajectory Concept
       6.4.2 Set of Variables for the Trajectory Definition
       6.4.3 Preprocessing Modules
       6.4.4 Constraint Definition
   6.5 Experimental Results
   6.6 Conclusions
   References

7  Distributed Camera Overlap Estimation – Enabling Large Scale Surveillance
   Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, Christopher Madden, Rhys Hill
   7.1 Introduction
   7.2 Previous Work
   7.3 Activity Topology and Camera Overlap
       7.3.1 Formulation
   7.4 Estimating Camera Overlap
       7.4.1 Joint Sampling
       7.4.2 Measurement and Discrimination
       7.4.3 The Original Exclusion Estimator
       7.4.4 Accuracy of Pairwise Occupancy Overlap Estimators
   7.5 Distribution
       7.5.1 The Exclusion Approach
       7.5.2 Partitioning of the Joint Sampling Matrices
       7.5.3 Analysis of Distributed Exclusion
       7.5.4 Evaluation of Distributed Exclusion
   7.6 Enabling Network Tracking
   7.7 Conclusion
   References

8  Multi-robot Teams for Environmental Monitoring
   Maria Valera Espina, Raphael Grech, Deon De Jager, Paolo Remagnino, Luca Iocchi, Luca Marchetti, Daniele Nardi, Dorothy Monekosso, Mircea Nicolescu, Christopher King
   8.1 Introduction
   8.2 Overview of the System
       8.2.1 Video-Surveillance with Static Cameras
       8.2.2 Multi-robot Monitoring of the Environment
       8.2.3 Experimental Scenario
   8.3 Related Work
       8.3.1 Multi-robot Patrolling
       8.3.2 Multi-robot Coverage
       8.3.3 Task Assignment
       8.3.4 Automatic Video Surveillance
   8.4 Representation Formalism
       8.4.1 Problem Formulation
   8.5 Event-Driven Distributed Monitoring
       8.5.1 Layered Map for Event Detection
       8.5.2 From Events to Tasks for Threat Response
       8.5.3 Strategy for Event-Driven Distributed Monitoring
   8.6 Implementation and Results
       8.6.1 A Real-Time Multi-tracking Object System for a Stereo Camera – Scenario 1
       8.6.2 Maximally Stable Segmentation and Tracking for Real-Time Automated Surveillance – Scenario 2
       8.6.3 Multi-robot Environmental Monitoring
       8.6.4 System Execution
   8.7 Conclusion
   References

Index
The Editors
Dr Paolo Remagnino (PhD 1993, University of Surrey) is a Reader in the Faculty of Computing, Information Systems and Mathematics at Kingston University. His research interests include image and video understanding, pattern recognition, machine learning, robotics and artificial intelligence.
Dr Dorothy N. Monekosso (MSc. 1992 and PhD 1999, from University of Surrey) is a Senior Lecturer in the School of Computing and Mathematics at the University of Ulster at Jordanstown, Northern Ireland. Her research interests include machine learning, intelligent systems and intelligent environments applied to assisted living and robotics.
Professor Lakhmi C. Jain is Director and Founder of the Knowledge-Based Intelligent Engineering Systems (KES) Centre, located at the University of South Australia. He is a Fellow of the Institution of Engineers Australia. His interests focus on artificial intelligence paradigms and their applications in complex systems, art-science fusion, virtual systems, e-education, e-healthcare, unmanned air vehicles and intelligent agents.
List of Contributors
Chapter 1 L.Snidaro · I.Visentini · G.L.Foresti Department of Mathematics and Computer Science, University of Udine, 33100 Udine, Italy,
[email protected] Chapter 2 S. Maludrottu · C.S. Regazzoni Department of Biophysical and Electronic Engineering (DIBE) - University of Genoa, Genoa, ITALY A.Dore Institute of Biomedical Engineering, Imperial College London, London, SW7 2AZ, UK,
[email protected] Chapter 3 Luis Patino · François Bremond · Monique Thonnat INRIA Sophia Antipolis - Méditerranée, 2004 route des Lucioles - BP 93 - 06902 Sophia Antipolis, Jose-Luis.Patino
[email protected] Chapter 4 Panagiotis Sourtzinos · Dimitrios Makris · Paolo Remagnino Digital Imaging Research Centre, Kingston University, UK,
[email protected] Chapter 5 Christopher Madden University of Adelaide, Australian Centre for Visual Technologies, Adelaide, SA 5007, Australia
Massimo Piccardi University of Technology, Sydney, Department of Computer Science, Ultimo, NSW 2007, Australia,
[email protected] Chapter 6 J.Albusac · J.J.Castro-Schez · D.Vallejo · L.Jiménez-Linares · C.Glez-Morcillo Escuela Superior de Informática, Universidad de Castilla-La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain,
[email protected] Chapter 7 Anton van den Hengel · Anthony Dick · Henry Detmold · Alex Cichowski · Christopher Madden · Rhys Hill University of Adelaide, Australian Centre for Visual Technologies, Adelaide, SA 5007, Australia,
[email protected] Chapter 8 Maria Valera Espina · Raphael Grech · Deon De Jager · Paolo Remagnino Digital Imaging Research Centre, Kingston University, London, UK Luca Iocchi · Luca Marchetti · Daniele Nardi Department of Computer and System Sciences, University of Rome “La Sapienza”, Italy Dorothy Monekosso Computer Science Research Institute, University of Ulster, UK Mircea Nicolescu · Christopher King Department of Computer Science and Engineering, University of Nevada, Reno
[email protected]
Chapter 1
Data Fusion in Modern Surveillance Lauro Snidaro, Ingrid Visentini, and Gian Luca Foresti
Abstract. The performance of systems that fuse data coming from different sources is deemed to benefit from the heterogeneity and diversity of the information involved. The rationale behind this view is the capability of one source to compensate for the errors of another, offering advantages such as increased accuracy and failure resilience. While in the past ambient security systems focused on the extensive use of arrays of single-type sensors, modern scalable automatic systems can be extended to combine information coming from mixed-type sources. All this data and information can be exploited and fused to enhance situational awareness in modern surveillance systems. From biometrics to ambient security, from robotics to military applications, the blooming of multi-sensor and heterogeneous-based approaches confirms the increasing interest in the data fusion field. In this chapter we highlight the advantages of fusing information coming from multiple sources for video surveillance purposes. We present a survey of existing methods to outline how the combination of heterogeneous data can lead to better situation awareness in a surveillance scenario, and we discuss a new paradigm that could be taken into consideration for the design of next-generation surveillance systems.
1.1 Introduction

Automatic surveillance systems have gained significant attention in the past few years. This is due to an increasing need to assist and extend the capabilities of human operators in remotely monitoring large and complex spaces such as public areas, airports, railway stations, parking lots, bridges and tunnels. The last generation of surveillance systems was designed to cover larger and larger areas, dealing with multiple streams from multiple sensors [61, 12, 20]. They
are meant to automatically assess the ongoing activities in the monitored environment, flagging and presenting to the operator suspicious events as they happen in order to prevent dangerous situations. A key step that can help in carrying out this task is analysing the trajectories of the objects in the scene and comparing them against known patterns. The system can in fact be trained by the operator with models of normal and suspicious trajectories in the domain at hand. As recent research shows, this process can even be carried out semi-automatically [66].

Real-time detection, tracking, recognition and activity understanding of moving objects from multiple sensors represent fundamental issues to be solved in order to develop surveillance systems that are able to autonomously monitor wide and complex environments. Research can be conducted at different levels: from the preprocessing steps, involving mainly image processing algorithms, to the analysis and extraction of salient features representing the objects moving in the scene, to their classification, to the tracking of their movements, and to the analysis of their behaviour. The algorithms that are needed therefore span from image processing to event detection and behaviour understanding, and each of them requires dedicated study and research. In this context, data fusion plays a pivotal role in managing the information and improving system performance.

Data fusion is a relatively old concept: from the 1970s, when it bloomed in the United States, through the 1990s until today, this research field has remained popular because of its polymorphism and the effective benefits it provides. The manifest example of human and animal senses has been a critical motivation for integrating information from multiple sources to obtain a reliable and feature-rich perception. In fact, even in the case of sensor deprivation, biological systems are able to compensate for lacking information by reusing data obtained from sensors with an overlapping scope. Based on this domain model, the individual interacts with the environment and makes decisions about present and future actions [26].

We can say that two main motivations brought data fusion to its current point of evolution. First of all, the amount of collected information was growing very fast, coherently with the expansion of commodity hardware. Storing the consequent increasing volume of available data was often considered too expensive in terms of resources such as space and time. Thus, the idea of condensing all this data into a single set of decisions, or into a subset of information with reduced dimensionality (but rich in semantics), rapidly took hold [28]. As a natural evolution, new forms of sensors have appeared in the last two decades, opening the door to a wide range of possibilities and potential. On the other hand, the concepts of redundancy, uncertainty and sensor feasibility promoted the marriage between sensors and fusion algorithms. The resemblance with human and animal senses encouraged the integration of information from heterogeneous sources to improve the knowledge of the observed environment. The natural ability to fuse multi-sensory signals has evolved to a high degree in many animal species and has been in use for millions of years. Today the application of fusion concepts in technical areas constitutes a new discipline that spans many
fields of science, from economics [4, 36] to military applications [28], from commercial software [27] to video surveillance [64] and multimodal applications [62, 33], to cite a few.
1.1.1 Terminology in Data Fusion

At this point, it is necessary to formalize the concepts regarding data fusion. Unfortunately, there is not much agreement on the terminology for fusion systems. The terms sensor fusion, data fusion, information fusion, multi-sensor data fusion, and multi-sensor integration have been widely used in the technical literature to refer to a variety of techniques, technologies, systems, and applications that use data derived from multiple information sources. Fusion applications range from real-time sensor fusion for the navigation of mobile robots to the on-line fusion of human or technical strategic intelligence data [63].

Several attempts have been made to define and categorize fusion terms and techniques. In [73], Wald proposes the term "data fusion" to be used as the overall term for fusion. However, while the concept of data fusion is easy to understand, its exact meaning varies from one scientist to another. Wald uses "data fusion" for "a formal framework in which are expressed the means and tools for the alliance of data originating from different sources. It aims at obtaining information of greater quality; the exact definition of greater quality will depend upon the application". This is also the meaning given by the Geoscience and Remote Sensing Society, by the U.S. Department of Defense, and in many papers regarding motion tracking, remote sensing, and mobile robots. Unfortunately, the term has not always been used with the same meaning over the years: in some models, data fusion denotes the fusion of raw data [15].

Classic books on fusion, such as "Multisensor Data Fusion" [74] by Waltz and Llinas and Hall's "Mathematical Techniques in Multisensor Data Fusion" [29], propose the extended term multisensor data fusion. According to Hall, it is defined as the "technology concerned with the combination of data from multiple (and possible diverse) sensors in order to make inferences about a physical event, activity, or situation".

To avoid confusion about the meaning, Dasarathy decided to use the term information fusion as the overall term for fusion of any kind of data [16]. As a matter of fact, information fusion is an all-encompassing term covering all aspects of the fusion field. A literal definition of information fusion can be found in [16]:

"Information Fusion encompasses theory, techniques and tools conceived and employed for exploiting the synergy in the information acquired from multiple sources (sensor, databases, information gathered by human, etc.) such that the resulting decision or action is in some sense better (qualitatively or quantitatively, in terms of accuracy, robustness, etc.) than would be possible if any of these sources were used individually without such synergy exploitation."

By defining a subset of information fusion, the term sensor fusion is introduced as:
"Sensor Fusion is the combining of sensory data or data derived from sensory data such that the resulting information is in some sense better than would be possible when these sources were used individually."
1.1.2 Motivation to Sensor Arrays

Systems that employ sensor fusion methods expect a number of benefits over single-sensor systems. A physical sensor measurement generally suffers from the following problems:

• Sensor failure: The deprivation of the sensor means the total loss of the input.
• Limited spatial coverage: Depending on the environment, a single sensor may simply not be enough. As an obvious example, a fixed camera cannot monitor areas wider than its field of view.
• Limited temporal coverage: Some sensors may not be able to provide data at all times. For instance, colour cameras are useless at night without proper illumination and are typically substituted by infrared ones.
• Imprecision: The accuracy of the data is the accuracy of the sensor.
• Uncertainty: Uncertainty, in contrast to imprecision, depends on the object being observed rather than on the observing device. Uncertainty arises when features are missing (e.g., occlusions), when the sensor cannot measure all relevant attributes of the percept, or when the observation is ambiguous [51].

The video surveillance application example is self-explanatory. A single camera cannot alone provide the needed spatial and temporal coverage. Data can be extremely inaccurate, since precision is seriously affected by the camera-target distance: the closer the target is to the limit of the field of view, the more imprecise the information on the target's location the camera can provide. Noise and occlusions are the main sources of detection uncertainty in this case, not to mention the total blindness caused by sensor failures. On the contrary, broad areas can be monitored 24/7 by a network of heterogeneous sensors. Robust behaviour against sensor deprivation can be achieved by using cameras with overlapping fields of view. This also greatly reduces uncertainty, since multiple detections of the same target from different views can be available, which helps discriminate true positives (target presence) from false positives (noise), and false negatives (occlusions) from true negatives (target absence).

The following advantages can be expected from the fusion of sensor data from a set of heterogeneous or homogeneous sensors [8]:

• Robustness and reliability: Multiple sensor suites have an inherent redundancy which enables the system to provide information even in case of partial failure.
• Extended spatial and temporal coverage: One sensor can look where others cannot, or can perform a measurement while others cannot.
• Increased confidence: A measurement of one sensor is confirmed by measurements of other sensors covering the same domain.
• Reduced ambiguity and uncertainty: Joint information reduces the set of ambiguous interpretations of the measured value.
• Robustness against interference: By increasing the dimensionality of the measurement space (e.g., measuring the desired quantity with optical sensors and ultrasonic sensors) the system becomes less vulnerable to interference.
• Improved resolution: When multiple independent measurements of the same property are fused, the resolution of the resulting value is better than a single sensor's measurement (a minimal numerical sketch is given at the end of this section).

In [59], the performance of sensor measurements obtained from an appropriate fusing process is compared to the measurements of the individual sensors. According to this work, an optimal fusing process can be designed if the distribution function describing the measurement errors of each particular sensor is precisely known. Intuitively, if all this information is available, this optimal fusing process performs at least as well as the best single sensor.

A further advantage of sensor fusion is the possibility of reducing system complexity. In a traditionally designed system the sensor measurements are fed into the application, which has to cope with a large number of imprecise, ambiguous and incomplete data streams. This is also the case for multisensor integration, as will be described later. In a system where sensor data is preprocessed by fusion methods, the input to the controlling application can be standardized independently of the employed sensor types, thus facilitating application implementation and providing the capability of adapting the number and type of employed sensors without changing the software [19].
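To make the "improved resolution" point above concrete, the following minimal sketch fuses redundant measurements of the same quantity by inverse-variance weighting, assuming independent zero-mean Gaussian errors with known variances; the sensor readings and noise levels are invented for illustration and are not taken from [59].

```python
# Minimal sketch: inverse-variance fusion of redundant measurements of the
# same quantity (e.g. a target's range) coming from heterogeneous sensors.
# Assumes independent, zero-mean Gaussian errors with known variances.

def fuse(measurements):
    """measurements: list of (value, variance) pairs from different sensors."""
    weights = [1.0 / var for _, var in measurements]
    fused_value = sum(w * v for w, (v, _) in zip(weights, measurements)) / sum(weights)
    fused_variance = 1.0 / sum(weights)      # never larger than the best input
    return fused_value, fused_variance

# Hypothetical readings: a colour camera, an infrared camera and a radar
readings = [(10.3, 0.9), (10.8, 0.4), (9.9, 1.5)]   # (value, variance)
value, variance = fuse(readings)
print(f"fused estimate: {value:.2f} (variance {variance:.2f})")
```

The fused variance is never larger than the smallest input variance, which is precisely the sense in which fusing redundant measurements improves resolution.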
1.1.3 The JDL Fusion Process Model

Several fusion process models have been developed over the years. The first and best known originates from the US Joint Directors of Laboratories (JDL) in 1985, under the guidance of the Department of Defense (DoD). The JDL model [28] comprises five levels of data processing and a database, which are all interconnected by a bus. The five levels are not meant to be processed in a strict order and can also be executed concurrently. Steinberg and Bowman proposed revisions and expansions of the JDL model involving broadening the functional model, relating the taxonomy to fields beyond the original military focus, and integrating a data fusion tree architecture model for system description, design, and development [67]. This updated model, sketched in Figure 1.1, is composed of the following levels:

• Level 0 - Sub-Object Data Assessment: estimation and prediction of signal/object observable states on the basis of pixel/signal level data association and characterization;
• Level 1 - Object Assessment: estimation and prediction of entity states on the basis of observation-to-track association, continuous state estimation (e.g. kinematics) and discrete state estimation (e.g. target type and ID);
Fig. 1.1 The JDL data fusion process model.
• Level 2 - Situation Assessment: estimation and prediction of relations among entities, to include force structure and cross-force relations, communications and perceptual influences, physical context, etc.;
• Level 3 - Impact Assessment: estimation and prediction of effects on situations of planned or estimated/predicted actions by the participants; to include interactions between the action plans of multiple players (e.g. assessing susceptibilities and vulnerabilities to estimated/predicted threat actions given one's own planned actions);
• Level 4 - Process Refinement: adaptive data acquisition and processing to support mission objectives.

The model is deliberately very abstract, which sometimes makes it difficult to properly interpret its parts and to appropriately apply it to specific problems. However, as already mentioned, it was originally conceived more as a basis for common understanding and discussion between scientists than as a real guide for developers in identifying the methods that should be used [28]. A recent paper by Llinas et al. [37] suggests revisions and extensions of the model in order to cope with the issues and functions of today's applications. In particular, further extensions of the JDL model are proposed with an emphasis on four areas: (1) remarks on issues related to quality control, reliability, and consistency in data fusion (DF) processing, (2) assertions about the need for co-processing of abductive/inductive and deductive inferencing processes, (3) remarks about the need for and exploitation of an ontologically-based approach to DF process design, and (4) discussion of the role of Distributed Data Fusion (DDF).
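As a purely structural aid (not part of the JDL specification itself), the toy skeleton below arranges the five levels as stages of a processing loop, with Level 4 feeding adjustments back to the earlier stages; every class name, stage body and threshold is a placeholder invented for illustration.

```python
import random

# Illustrative skeleton only: the five JDL levels as stages of a processing
# loop, with Level 4 (Process Refinement) feeding back to earlier stages.
# All stage bodies are trivial placeholders, not part of the JDL model.

class JDLPipeline:
    def __init__(self, sources):
        self.sources = sources                  # raw data streams / sensors
        self.noise_threshold = 0.5              # a knob Level 4 may adjust

    def level0(self, raw):                      # Sub-Object Data Assessment
        return [x for x in raw if x > self.noise_threshold]

    def level1(self, signals):                  # Object Assessment
        return [{"id": i, "value": s} for i, s in enumerate(signals)]

    def level2(self, objects):                  # Situation Assessment
        return {"n_objects": len(objects)}

    def level3(self, situation):                # Impact Assessment
        return {"alarm": situation["n_objects"] > 2}

    def level4(self, impact):                   # Process Refinement (feedback)
        self.noise_threshold *= 0.9 if impact["alarm"] else 1.0

    def step(self):
        raw = [read() for read in self.sources]
        signals = self.level0(raw)
        objects = self.level1(signals)
        situation = self.level2(objects)
        impact = self.level3(situation)
        self.level4(impact)
        return situation, impact

pipeline = JDLPipeline(sources=[random.random for _ in range(5)])
print(pipeline.step())
```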
1.2 Data Fusion and Surveillance

While in the past ambient security systems were focused on the extensive usage of arrays of single-type sensors [34, 71, 38, 50], modern surveillance systems aim to combine information coming from different types of sources. Multi-modal systems [62, 33], increasingly used in biometrics, or multi-sensor multi-cue approaches [45, 21] fuse heterogeneous data in order to provide a more robust response and enhance situational awareness.

The JDL model presented in Section 1.1.3 can be contextualized and fitted to a surveillance context. In particular, we can imagine a typical surveillance scenario where multiple cameras monitor a wide area. A concrete example of how the levels of the JDL scheme can be reinterpreted can be found in Figure 1.2. In the proposed example, the levels correspond to specific video surveillance tasks or patterns as follows:

• Level 0 - Sub-Object Data Assessment: the raw data streams coming from the cameras can be individually pre-processed. For example, they can be filtered to reduce noise, processed to increase contrast, or scaled down to reduce the processing time of subsequent elaborations.
• Level 1 - Object Assessment: multiple objects in the scene (typically pedestrians, vehicles, etc.) can be detected, tracked, classified and recognized. The objects are the entities of the process, but no relationships are involved yet at this point. Additional data, such as the site map or the sensitive areas, constitute a priori contextual information.
• Level 2 - Situation Assessment: spatial or temporal relationships between entities are drawn here: a target moving, for instance, from a sensitive zone Zone1 to Zone2 can constitute an event (a small illustrative sketch is given at the end of this section). Simple atomic events are built considering brief stand-alone actions, while more complex events are obtained by joining several simple events. Possible alarms to the operator are raised at this point.
• Level 3 - Impact Assessment: a prediction of an event is an example of what, in practice, may happen at this step. The estimation of the trajectory of a potential target, or the prediction of the behaviour of an entity, can be a focus of this level. For instance, knowing that an object crossed Zone1 heading to Zone2, we can presume it will also cross Zone3 according to its current trajectory.
• Level 4 - Process Refinement: after the prediction given by Level 3, several optimizations can be made in this phase regarding all the previous levels. For instance, the sensors can be relocated to better monitor Zone3, new thresholds can be imposed in the Level 0 procedures, or different algorithms can be employed in Level 1.

Fig. 1.2 Example of contextualization of the JDL scheme of Figure 1.1 to a surveillance scenario.
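To make the Level 2 example above concrete, the sketch below raises a simple atomic event when a tracked object moves between named zones; the zone geometry, track format and rule are invented for illustration and are not part of the JDL model or of any specific system.

```python
# Illustrative sketch of a Level 2 rule: raise a simple atomic event when a
# tracked object transitions between two named zones (e.g. Zone1 -> Zone2).
# Zones, track format and the rule itself are hypothetical examples.

ZONES = {                       # axis-aligned boxes: (x_min, y_min, x_max, y_max)
    "Zone1": (0, 0, 50, 50),
    "Zone2": (50, 0, 100, 50),
    "Zone3": (100, 0, 150, 50),
}

def zone_of(position):
    x, y = position
    for name, (x0, y0, x1, y1) in ZONES.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

def detect_transitions(track):
    """track: list of (frame, (x, y)) for one object; yields atomic events."""
    previous = None
    for frame, position in track:
        current = zone_of(position)
        if previous and current and current != previous:
            yield {"frame": frame, "event": f"moved {previous} -> {current}"}
        previous = current or previous

track = [(1, (20, 10)), (2, (48, 12)), (3, (60, 15)), (4, (120, 20))]
for event in detect_transitions(track):
    print(event)
```

More complex events would then be obtained at the same level by combining such atomic events with temporal or logical operators.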
1.3 A Closer Look to Fusion in Level 1

Within Level 1 of the JDL model, several fusion approaches can be considered. For instance, we can outline a hierarchy of different fusion steps, as shown in Figure 1.3, that combine data, features, and classifiers according to various criteria [41].

Fig. 1.3 Several fusion levels [41] in JDL Level 1.

The data level aims to fuse different data sets, providing, for instance, different views of the same object. The feature level merges the outputs of different features, such as colour histograms, shape features, or the responses of a heterogeneous feature set. The classifier level focuses on the use of several heterogeneous base classifiers, while the combiner level is dedicated to the study of different combiners that follow various ensemble fusion paradigms. Under these assumptions, each level can be considered as a black box, in which different functions and methods can be alternated transparently.
More generally, this taxonomy reflects bottom-up processing, from a low-level fusion that involves raw data coming straight from the sensors to a high-level combination of abstract and refined information. The data flow through the levels can be seen as a fan-in tree, as each step provides a response to the higher levels and receives input from the previous ones. A critical point in designing a data fusion application is deciding where the fusion should actually take place.
1.3.1 Data Level

The data sources for a fusion process are not required to originate from identical sensors. McKee distinguishes direct fusion, indirect fusion, and fusion of the outputs of the former two [48]. Direct fusion means the fusion of sensor data from a set of heterogeneous or homogeneous sensors and of history values of sensor data, while indirect fusion uses information sources like a priori knowledge about the environment and human input. Therefore, sensor fusion describes direct fusion systems, while information fusion also includes indirect fusion processes.

The sensor fusion definition of Section 1.1.1 does not require that inputs are produced by multiple sensors; it only says that sensor data, or data derived from sensor data, have to be combined. For example, the definition also comprises sensor fusion systems with a single sensor that takes multiple measurements subsequently at different instants which are then combined.

Another frequently used term is multisensor integration, which means the synergistic use of sensor data for the accomplishment of a task by a system. Sensor fusion is different from multisensor integration in the sense that it includes the actual combination of sensory information into one representational format [46]. The difference is outlined in Figure 1.4: while sensor fusion combines data (possibly applying a conversion) before handing it to the application, an application based on multisensor integration directly processes the data from the sensors. A classical video surveillance sensor fusion application is remote sensing, where imagery data acquired with heterogeneous sensors (e.g. colour and infrared cameras) is fused in order to enhance sensing capabilities before further processing. Sensors do not have to produce commensurate data either: as a matter of fact, sensors can be very diverse and produce different quantities. Sensor fusion reduces the incoming data into a format the application can further process.
Fig. 1.4 Conceptual comparison between (left) Sensor Fusion and (right) Multisensor Integration
1.3.2 Feature Level

Feature fusion (or selection) is a process used in machine learning and pattern recognition to obtain a subset of features capable of discriminating between different input classes. Object detection and tracking in video sequences are known to benefit from the employment of multiple (heterogeneous) features (e.g. colour, orientation histograms, etc.), as shown in [13, 23, 75]. The combination of different features within a higher-level framework, such as the boosting meta-algorithm, can be a winning strategy in terms of robustness and accuracy [81].

In a surveillance context, a feature selection mechanism such as the one presented in [13] can be used to estimate the performance of the sensor in detecting a given target. The approach selects the most discriminative colour features to separate the target from the background by applying a two-class variance ratio to log-likelihood distributions computed from samples of object and background pixels. The algorithm generates a likelihood map for each feature, ranks the maps according to their variance ratio values, and then proceeds with a mean-shift tracking system that adaptively selects the top-ranked discriminative features for tracking. The idea was further developed in [78], where additional heterogeneous features were considered and fused via likelihood maps.

Another example of successful feature selection for target tracking can be found in [54], where the fusion focus is on a high-level integration of responses: the confidence maps coming from a feature extraction step are merged in order to find the target position. Heterogeneous weighted data is combined without involving classifiers and without maintaining a model of the object, but performing frame-to-frame tracking with salient features extracted at each epoch. In [23], the heterogeneity of the features improves recognition performance even when the target is occluded.
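As an illustration of the kind of criterion discussed above, the following sketch scores a candidate feature with a two-class variance ratio of the log-likelihood ratio between object and background samples, in the spirit of [13]; the histogram binning, smoothing constant and test data are arbitrary choices made here, not those of the original work.

```python
import numpy as np

# Sketch of a two-class variance-ratio score in the spirit of the feature
# selection strategy discussed above: features whose log-likelihood ratio
# separates object from background pixels more sharply get a higher score.
# Bin count, smoothing constant and sample data are illustrative choices.

def variance_ratio(object_pixels, background_pixels, bins=32, eps=1e-3):
    lo = min(object_pixels.min(), background_pixels.min())
    hi = max(object_pixels.max(), background_pixels.max())
    p, _ = np.histogram(object_pixels, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(background_pixels, bins=bins, range=(lo, hi), density=True)
    p, q = p + eps, q + eps
    L = np.log(p / q)                          # log-likelihood ratio per bin

    def var(weights):                          # variance of L under 'weights'
        w = weights / weights.sum()
        mean = np.sum(w * L)
        return np.sum(w * L ** 2) - mean ** 2

    # High spread between the classes, low spread within each class
    return var((p + q) / 2.0) / (var(p) + var(q))

# Hypothetical samples of one colour feature for object and background pixels
rng = np.random.default_rng(0)
obj = rng.normal(0.7, 0.05, 500)
bg = rng.normal(0.4, 0.15, 2000)
print(f"variance ratio: {variance_ratio(obj, bg):.2f}")
```

In a tracker, such a score would be recomputed for each candidate feature at every few frames, and only the top-ranked features would be kept for target localization.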
1.3.3 Classifier Level

Classification is a crucial step in surveillance systems [79, 35]. Usually, a classifier is considered as a stand-alone entity that formulates a decision over a problem, providing a real-valued or a binary output, or simply a label. A classifier aims to translate the feature outputs into meaningful information, producing higher-level decisions. One classifier can work on raw data or be bound to one or more features; these can be independent or conditionally independent when referring to the same object. Some synonyms of classifier are hypothesis, learner, and expert [41]. A classifier can be trained, which means that it exploits past knowledge to formulate a decision (e.g., neural networks, Bayesian classifiers), or not (e.g., nearest neighbours, clustering algorithms); in the latter case, a classifier usually considers a neighbourhood of samples to formulate its response. A wide survey of base classifiers is presented in [7, 41].

In the scheme presented in Figure 1.3, the classifier level refers to the usage of different heterogeneous classifiers to create a fusion scheme that aims to improve performance with respect to a single-type classifier algorithm. Heterogeneity
has, in fact, been linked to diversity [30] and its contribution to performance improvement has been empirically demonstrated in several cases [70, 30, 5].
1.3.4 Combiner Level The problem of fusing different classifiers to provide a more robust and accurate detection has been studied since the early 90s, when the first approaches to the problem were detailed [31, 60, 55, 32]. Classifier fusion is proved to benefit from the diverse decision capabilities of the combined experts, thus improving classification performance with particular respect to accuracy and efficiency [39]. Two advisable conditions are the accuracy (low error rates) and the diversity (committing different errors) of the base learners. A strong motivation to classifier fusion is the idea that the information coming from several classifiers is combined to obtain the final classification response, considering each one of their individual opinion, and mutually compensating their errors [42]. The literature on classifier fusion techniques, that is the design of different combiners, is really vast: the MCS workshops series has a noticeable importance in this sense, but classifier ensembles are often used in a broad number of application fields, from medical imaging [57] to network security [22], from surveillance [72] to handwritten digit recognition [43, 80] including a large range of real-world domains [53]. Trying to formalize the benefits of the aggregation of multiple classifiers, Dietterich [18] and Kuncheva [41] gave a few motivations why multiple classifiers systems may be better than a single classifier. The first one is statistical: the ensemble may be not better1 than the best individual classifier, but the risk of picking an “inadequate single classifier” is reduced. The second reason why we should prefer an ensemble rather than a single classifier is computational: a set of decision makers can provide a solution working each one on a subsection of the problem in less time than a single base learner. The last motivation refers to the possibility of the classifiers’ space not containing the optimum. However, the ability of a mosaic of classifiers to approximate a decision boundary has to be considered; in this respect, it may be difficult or even impossible to adapt the parameters of an individual classifier to fit the problem, but a set of tuned classifier can approach the solution with a good approximation. Under these considerations, the focus is on the fusion criterion that, even suboptimal, achieves better performances than a single trained classifier. Typically, classifiers can be combined by several fusion rules, from simple fixed ones (i.e., sum, mean, vote by majority, etc.) to more complex fusion schemes (i.e., belief functions, Dempster-Shafer evidence theories, etc.). Some meta-learners, as the Boosting technique, can be employed as well, preceded by an off-line training phase. An exhaustive survey of combination methods can be found in [41]. 1
The group’s average performance is not guaranteed to improve on the single best classifier [41].
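The fixed combination rules mentioned above are simple enough to sketch directly. The snippet below applies the sum, mean and majority-vote rules to the class scores of three hypothetical base classifiers; the classifiers and their outputs are invented for illustration.

```python
import numpy as np

# Sketch of fixed combination rules for a classifier ensemble: each row of
# `scores` holds one base classifier's class scores for the same sample.
# The three classifiers and their outputs are invented for illustration.

scores = np.array([
    [0.2, 0.7, 0.1],    # classifier 1 (e.g. colour-based)
    [0.4, 0.5, 0.1],    # classifier 2 (e.g. shape-based)
    [0.6, 0.3, 0.1],    # classifier 3 (e.g. motion-based)
])

sum_rule = scores.sum(axis=0)                   # sum rule (mean ranks identically)
mean_rule = scores.mean(axis=0)
votes = np.bincount(scores.argmax(axis=1), minlength=scores.shape[1])
majority = votes.argmax()                       # majority vote on crisp decisions

print("sum rule decision:     ", sum_rule.argmax())
print("mean rule decision:    ", mean_rule.argmax())
print("majority vote decision:", majority)
```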
1.3.5 An Example: Target Tracking via Classification

Modern surveillance systems operating in complex environments, like an urban setting, have to cope with many objects moving in the scene at the same time and with events of ever increasing difficulty. In order to single out and understand the behaviour of every actor in the scene, the system should be able to track each target as it moves and performs its activities. Position, velocity, and the trajectory followed constitute basic pieces of information from which simple events can be promptly inferred. For this reason, a robust and accurate tracking process is of paramount importance in surveillance systems.

Tracking is not an easy task, since sensor noise and occlusions are typical issues that have to be dealt with to keep each target associated with its ID. In a system with multiple cameras this task (multi-sensor multi-target tracking) is even more daunting, since measurements from different sensors that are generated by the observation of the same target have to be correctly associated. Significant progress on the object tracking problem has been made during the last few years (see [77] for a recent survey). However, no definitive solution has been proposed that addresses challenging situations such as illumination variations, appearance changes, or unconstrained video sources.

Tracking of an object can be performed at different levels. At sensor level, on the image plane of each sensor, the system executes an association algorithm to match the current detections with those extracted in the previous frame. In this case, available techniques range from template matching, to feature matching [11], to more recent and robust algorithms [14]. A further step considers mixing different features to enhance the robustness of the tracker; as seen in Section 1.3.2, heterogeneous features have brought a big improvement in target localization and tracking. Recently, the tracking-via-classification concept has received a new boost [3, 9, 25, 56, 69, 52]: a single classifier or a classifier ensemble tracks an object by separating the target from the background.

In general, using each one of these approaches to tackle the tracking problem we can potentially consider several fusion levels, as shown in Figure 1.5. We can combine more than one source of information, such as audio and video, since
they have been repeatedly considered as a potential improvement over the limitations imposed by single-type sensors [10, 6, 62]. Different techniques can be applied to the data to extract heterogeneous relevant features, and the joint usage of different information can significantly improve the quality of the tracking results [77]. Finally, tracking via classification is considered more robust to occlusions and illumination changes [24], and greater system robustness and performance are achievable with an ensemble of classifiers through information fusion techniques [9, 56].

Fig. 1.5 Example of application
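As a toy illustration of fusing heterogeneous cues for target localization, the sketch below combines per-cue confidence maps with a weighted sum and takes the peak of the fused map as the target position; the maps, weights and grid size are synthetic and do not reproduce any of the cited methods.

```python
import numpy as np

# Toy sketch: localize a target by fusing per-cue confidence maps (e.g. one
# from colour, one from motion, one from a classifier) with a weighted sum.
# Map contents, weights and grid size are invented for illustration.

h, w = 60, 80
ys, xs = np.mgrid[0:h, 0:w]

def blob(cy, cx, sigma):                      # synthetic confidence map
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

maps = {
    "colour":     blob(30, 40, 6.0),
    "motion":     blob(32, 43, 9.0),
    "classifier": blob(29, 39, 4.0) + 0.3 * blob(10, 70, 5.0),  # with a distractor
}
weights = {"colour": 0.4, "motion": 0.2, "classifier": 0.4}     # e.g. cue reliability

fused = sum(weights[name] * conf for name, conf in maps.items())
y, x = np.unravel_index(fused.argmax(), fused.shape)
print(f"estimated target position: ({y}, {x})")
```

Weighting the cues by an estimate of their current reliability is what allows a weak or distracted cue to be compensated by the others.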
1.4 Sensor Management: A New Paradigm for Automatic Video Surveillance Video surveillance systems have always been based on multiple sensors, since their first generation (CCTV systems) [61]. Video streams from analog cameras were multiplexed on video terminals in control rooms to help human operators monitor entire buildings or wide open areas. The latest generation makes use of digital equipment to capture and transmit images that can be viewed virtually everywhere over the Internet. Initially, multi-sensor systems were employed to extend surveillance coverage over wide areas. Recent advances in sensor and communication technology, in addition to lower costs, make it possible to use multiple sensors for the monitoring of the same area [64, 2]. This has opened new possibilities in the field of surveillance, as multiple and possibly heterogeneous sensors observing the same scene provide redundant data that can be exploited to improve detection accuracy and robustness, enlarge monitoring coverage and reduce uncertainty [64]. While the advantages of using multiple sources of information are well known to the data fusion community [44], the full potential of multi-sensor surveillance is yet to be discovered. In particular, the enrichment of available sensor assets has made it possible to take advantage of data fusion techniques for solving specific tasks such as target localization and tracking [65], or person identification [62]. This can be formalized as the application of JDL Level 1 and 2 fusion techniques [44] to surveillance, strictly following a processing stream that exploits multi-sensor data to achieve better system perception performance and, in the end, improved situational awareness. A brief exemplification of the techniques that can be employed in Levels 1 and 2 has been presented in Section 1.2. While many technical problems remain to be solved for integrating heterogeneous suites of sensors for wide area surveillance, a principled top-down approach is probably still unexplored. Given the acknowledged complexity of the architectures that can be developed nowadays, a full exploitation of this potential is probably beyond the possibilities of a human operator. Consider, for example, all the possible combinations of configurations made available by modern sensors: Pan-Tilt-Zoom (PTZ) cameras can be controlled to cover different areas, day/night sensors offer different sensing modalities, radars can operate at different frequencies, etc. The larger the system, the more likely it is to be called upon to address many different surveillance needs. A top-down approach would be needed in order to develop surveillance systems that are
able to automatically manage large arrays of sensors in order to enforce surveillance directives provided by the operator, which in turn translate the security policies of the owning organization. Therefore, a new paradigm is needed to guide the design of architectures and algorithms in order to build the next generation of surveillance systems, able to organize themselves to collect data relevant to the objectives specified by the operator. This new paradigm would probably need to take inspiration from the principles behind the Sensor Management policies foreseen by JDL Level 4 [37]. JDL Level 4 is also called the Process Refinement step, as it implies adaptive data acquisition and processing to support mission objectives. Conceptually, this refinement step should be able to manage the system in its entirety: from controlling hardware resources (e.g. sensors, processors, storage, etc.) to adjusting the processing flow in order to optimize the behaviour of the system to best achieve mission goals. It is therefore apparent that the Process Refinement step encompasses a broad spectrum of techniques and algorithms that operate at very different logical levels. In this regard, a fully implemented Process Refinement would provide the system with a form of awareness of its own capabilities and of how they relate and interact with the observed environment. The part of Process Refinement dedicated to sensors and data sources is often called Sensor Management, and it can be defined as "a process that seeks to manage, or coordinate, the use of a set of sensors in a dynamic, uncertain environment, to improve the performance of the system" [76]. In other words, a Sensor Management process should be able, given the current state of affairs of the observed environment, to translate mission plans or human directives into sensing actions aimed at acquiring additional or missing information in order to improve situational awareness and fulfil the objectives. A five-layered procedure has been proposed in [76] and is reproduced in Figure 1.6. The chart schematizes a general sensor management process that can be used to guide the design of a real sensor management module. In the following, the different levels are contextualized in the case of a surveillance system.
Mission Planning This level takes as input the current situation and the requests from the human operator, and performs a first breakdown of the objectives by trying to match them with the available services and functionalities of the system. In a surveillance system, the requests from the operator can be events of interest to be detected (e.g. a vehicle being stationary outside a parking slot) and alarm conditions (e.g. a person trespassing into a forbidden area). Each of the events should be given a priority by the operator. The Mission Planning module is in charge of selecting the functions to be used in order to detect the required events (e.g. target tracking, classification, plate reading, face recognition, trajectory analysis, etc.). In practice, this module should work in a way similar to a compiler: starting from the description of the events of interest expressed in a high-level language, it parses the description and determines
Fig. 1.6 Five-layered sensor management process [76].
the relevant services to be employed. The module will also identify the areas to be monitored, the targets to look for, the frequency of measurements and the accuracy level.
Resource Deployment This level identifies the sensors to be used among the available ones. If mobile and/or active sensors are available, their repositioning may be needed [49]. In particular, this level takes into consideration aspects such as coverage and sensing modality. For example, depending on the time of day at which a certain event is to be detected, one sensor may be preferred over another.
Resource Planning This level is in charge of tasking the individual sensors (e.g. movement planning for active sensors [17, 49]) and coordinating them (e.g. sensor hand-overs) in order to carry out a certain task (e.g. tracking). The level also deals with sensor selection
techniques that choose, at every instant and for every target, the optimal subset of sensors for tracking or classifying it. Several approaches to sensor selection have been proposed in the literature, for example based on information gain [40] or on detection quality [65].
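A minimal sketch of a quality-based selection rule of the kind cited above is given below: for each target, the k best-scoring sensors are retained. The quality function, the `covers` interface on the sensor objects and the value of k are illustrative assumptions, not part of the cited methods.

```python
def select_sensors(target, sensors, quality_fn, k=2):
    """Pick the k sensors with the highest quality score for the given target.

    quality_fn(sensor, target) returns a scalar figure of merit, e.g. expected
    information gain or detection quality; higher is better."""
    covering = [s for s in sensors if s.covers(target)]   # only sensors seeing the target
    ranked = sorted(covering, key=lambda s: quality_fn(s, target), reverse=True)
    return ranked[:k]
```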
Sensor Scheduling Depending on the planning and requests coming from Resource Planning, this level is in charge of determining a detailed schedule of commands for each sensor. This is particularly appropriate for active (e.g. PTZ cameras), mobile (e.g. robots) and multimode (day/night cameras, multi-frequency radar) sensors. The problem of sensor scheduling has been addressed in [47], and a recent contribution on the scheduling of visual sensors can be found in [58].
Sensor Control This is the lowest level and possibly also the simplest. Its purpose is to optimize sensor parameters given the current commands imposed by Levels 1 and 2. For video sensors this may involve regulating iris and focus to optimize image quality. Although this is performed automatically by the sensor hardware in most cases, it could be beneficial to manage sensor parameters directly according to some figure of merit dependent on the content of the image. For example, contrast and focus may be adjusted specifically for a given target. An early treatment of the subject may be found in [68], while a recent survey may be found in [1].
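As an illustration of an image-content figure of merit of the kind mentioned above, the sketch below estimates sharpness as the variance of a Laplacian-filtered patch around the target; a focus controller would then adjust the lens to maximize it. The patch selection and control loop are assumptions and are not taken from the cited works.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def sharpness(patch):
    """Variance of the Laplacian response: a common focus figure of merit."""
    h, w = patch.shape
    resp = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            resp[i, j] = np.sum(patch[i:i + 3, j:j + 3] * LAPLACIAN)
    return float(resp.var())

# A focus controller could sweep the lens position and keep the setting that
# maximizes sharpness(target_patch) for the currently tracked target.
```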
1.5 Conclusions Borrowed from biological systems, where multiple sensors with overlapping scope are able to compensate for the lack of information from other sources, the data fusion field has gained momentum since its first appearance in the early 70s. Its rapid spread from biometrics to ambient security, from robotics to everyday applications, matches the rise of multi-sensor and heterogeneous approaches. In this regard, modern surveillance systems are moving away from the single-type sensors of the past toward techniques that rely on mixed-type sensors and exploit multiple cues to solve real-time tasks. Data fusion is a necessary tool to combine heterogeneous information, to provide the flexibility to manage unpredictable events and to enhance situational awareness in modern surveillance systems. In this chapter, we discussed the impact of fusing information coming from multiple sources, presenting a survey of existing methods and proposing some examples to outline how the combination of heterogeneous data can lead to better situation awareness in a surveillance scenario. We have also presented a possible new paradigm for sensor management that could be taken into account in the design of next generation surveillance systems.
References [1] Abidi, B.R., Aragam, N.R., Yao, Y., Abidi, M.A.: Survey and analysis of multimodal sensor planning and integration for wide area surveillance. ACM Computing Surveys 41(1), 1–36 (2008), DOI http://doi.acm.org/10.1145/1456650.1456657 [2] Aghajan, H., Cavallaro, A. (eds.): Multi-Camera Networks. Elsevier, Amsterdam (2009) [3] Avidan, S.: Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 261–271 (2007), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.35 [4] Baker, K., Harris, P., O’Brien, J.: Data fusion: An appraisal and experimental evaluation. Journal of the Market Research Society 31(2), 152–212 (1989) [5] Bian, S., Wang, W.: On diversity and accuracy of homogeneous and heterogeneous ensembles. International Journal of Hybrid Intelligent Systems 4(2), 103–128 (2007) [6] Bigun, J., Chollet, G.: Special issue on audio-based and video-based person authentication. Pattern Recognition Letters 18(9), 823–825 (1997) [7] Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer, Heidelberg (2007) [8] Boss, E., Roy, J., Grenier, D.: Data fusion concepts applied to a suite of dissimilar sensors. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 692–695 (1996) [9] Chateau, T., Gay-Belille, V., Chausse, F., Laprest, J.T.: Real-time tracking with classifiers. In: European Conference on Computer Vision (2006) [10] Choudhury, T., Clarkson, B., Jebara, T., Pentl, A.: Multimodal person recognition using unconstrained audio and video. In: International Conference on Audio- and VideoBased Person Authentication, pp. 176–181 (1998) [11] Collins, R.T., Lipton, A.J., Fujiyoshi, H., Kanade, T.: Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE 89(10), 1456–1477 (2001) [12] Collins, R.T., Lipton, A.J., Kanade, T.: Introduction to the special section on video surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 745–746 (2000) [13] Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1631– 1643 (2005) [14] Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis Machine Intelligence 25(5), 564–575 (2003) [15] Dasarathy, B.V.: Sensor fusion potential exploitation-innovative architectures and illustrative applications. Proceedings of the IEEE 85, 24–38 (1997) [16] Dasarathy, B.V.: Information fusion - what, where, why, when, and how? Information Fusion 2(2), 75–76 (2001) [17] Denzler, J., Brown, C.: Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(2), 145–157 (2002) [18] Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000) [19] Elmenreich, W., Pitzek, S.: Using sensor fusion in a time-triggered network. In: Proceedings of the 27th Annual Conference of the IEEE Industrial Electronics Society, Denver, CO, USA, vol. 1, pp. 369–374 (2001)
[20] Foresti, G.L., Regazzoni, C.S., Varshney, P.K.: Multisensor Surveillance Systems: The Fusion Perspective. Kluwer Academic Publisher, Dordrecht (2003) [21] Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. Int. J. Comput. Vision 73(1), 41–59 (2007), DOI http://dx.doi.org/10.1007/s11263-006-9038-7 [22] Giacinto, G., Perdisci, R., Rio, M.D., Roli, F.: Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion 9(1), 69–82 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2006.10.002 [23] Gouet-Brunet, V., Lameyre, B.: Object recognition and segmentation in videos by connecting heterogeneous visual features. Computer Vision and Image Understanding 111(1), 86–109 (2008) [24] Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings of the British Machine Vision Conference (BMVC), vol. 1, p. 4756 (2006) [25] Grabner, H., Sochman, J., Bischof, H., Matas, J.: Training sequential on-line boosting classifier for visual tracking. In: International Conference on Pattern Recognition (2008) [26] Grossmann, P.: Multisensor data fusion. The GEC Journal of Technology 15, 27–37 (1998) [27] Hall, D.L., Linn, R.J.: Survey of commercial software for multisensor data fusion. In: Aggarwal, J.K., Nandhakumar, N. (eds.) Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 1956, pp. 98–109 (1993) [28] Hall, D.L., Llinas, J.: An introduction to multisensor data fusion. Proceedings of the IEEE 85(1), 6–23 (1997) [29] Hall, D.L., McMullen, S.A.: Mathematical Techniques in Multisensor Data Fusion. Artech House, Boston (2004) [30] Hsu, K.W., Srivastava, J.: Diversity in combinations of heterogeneous classifiers. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 923–932. Springer, Heidelberg (2009) [31] Hu, W., Hu, W., Maybank, S.: Adaboost-based algorithm for network intrusion detection. IEEE Transactions on Systems, Man, and Cybernetics, Part B 38(2), 577–583 (2008) [32] Islam, M., Yao, X., Nirjon, S., Islam, M., Murase, K.: Bagging and boosting negatively correlated neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B 38(3), 771–784 (2008), doi:10.1109/TSMCB.2008.922055 [33] Jain, A., Hong, L., Kulkarni, Y.: A multimodal biometric system using fingerprints, face and speech. In: 2nd International Conference on Audio- and Video- based Biometric Person Authentication, pp. 182–187 (1999) [34] Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 952–957 (2003), doi:10.1109/ICCV.2003.1238451 [35] Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 343–357. Springer, Heidelberg (2002) [36] Jephcott, J., Bock, T.: The application and validation of data fusion. Journal of the Market Research Society 40(3), 185–205 (1998) [37] Llinas, J., Bowman, C.L., Rogova, G.L., Steinberg, A.N., Waltz, E.L., White, F.E.: Revisiting the JDL data fusion model II. In: Svensson, P., Schubert, J. (eds.) Proceedings of the Seventh International Conference on Information Fusion, vol. II, pp. 1218–1230. International Society of Information Fusion, Stockholm (2004), http://www.fusion2004.foi.se/papers/IF04-1218.pdf
[38] Kang, J., Cohen, I., Medioni, G.: Multi-views tracking within and across uncalibrated camera streams. In: IWVS 2003: First ACM SIGMM International Workshop on Video Surveillance, pp. 21–33. ACM, New York (2003), DOI http://doi.acm.org/10.1145/982452.982456 [39] Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) [40] Kreucher, C., Kastella, K., Hero III, A.O.: Sensor management using an active sensing approach. Signal Processing 85(3), 607–624 (2005), http://www.sciencedirect.com/science/article/ B6V18-4F017HY-1/2/a019fa31bf4135dfdcc38dd5dc6fc6c8, doi:10.1016/j.sigpro.2004.11.004 [41] Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley Interscience, Hoboken (2004) [42] Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles. Machine Learning 51, 181–207 (2003) [43] Lee, S.W., Kim, S.Y.: Integrated segmentation and recognition of handwritten numerals with cascade neural network. IEEE Transactions on Systems, Man, and Cybernetics, Part C 29(2), 285–290 (1999) [44] Liggins, M.E., Hall, D.L., Llinas, J.: Multisensor data fusion: theory and practice, 2nd edn. The Electrical Engineering & Applied Signal Processing Series. CRC Press, Boca Raton (2008) [45] Liu, H., Yu, Z., Zha, H., Zou, Y., Zhang, L.: Robust human tracking based on multi-cue integration and mean-shift. Pattern Recognition Letters 30(9), 827–837 (2009) [46] Luo, R.C., Kay, M.: Multisensor integration and fusion in intelligent systems. IEEE Transactions on Systems, Man, and Cybernetics 19(5), 901–930 (1989) [47] McIntyre, G., Hintz, K.: Sensor measurement scheduling: an enhanced dynamic, preemptive algorithm. Optical Engineering 37, 517 (1998) [48] McKee, G.T.: What can be fused? In: Multisensor Fusion for Computer Vision. Nato Advanced Studies Institute Series F, vol. 99, pp. 71–84 (1993) [49] Mittal, A., Davis, L.: A general method for sensor planning in multi-sensor systems: Extension to random occlusion. International Journal of Computer Vision 76(1), 31–52 (2008) [50] Monekosso, D., Remagnino, P.: Monitoring behavior with an array of sensors. Computational Intelligence 23(4), 420–438 (2007) [51] Murphy, R.R.: Biological and cognitive foundations of intelligent sensor fusion. IEEE Transactions on Systems, Man and Cybernetics 26(1), 42–51 (1996) [52] Nguyen, H.T., Smeulders, A.W.: Robust tracking using foreground-background texture discrimination. International Journal of Computer Vision 69(3), 277–293 (2006) [53] Oza, N.C., Tumer, K.: Classifier ensembles: Select real-world applications. Information Fusion 9(1), 4–20 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2007.07.002 [54] Parag, T., Porikli, F., Elgammal, A.: Boosting adaptive linear weak classifiers for online learning and tracking. In: International Conference on Computer Vision and Pattern Recognition (2008) [55] Parikh, D., Polikar, R.: An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B 37(2), 437–450 (2007), doi:10.1109/TSMCB.2006.883873
[56] Petrovi´c, N., Jovanov, L., Piˇzurica, A., Philips, W.: Object tracking using naive bayesian classifiers. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2008. LNCS, vol. 5259, pp. 775–784. Springer, Heidelberg (2008) [57] Polikar, R., Topalis, A., Parikh, D., Green, D., Frymiare, J., Kounios, J., Clark, C.M.: An ensemble based data fusion approach for early diagnosis of alzheimer’s disease. Information Fusion 9(1), 83–95 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2006.09.003 [58] Qureshi, F., Terzopoulos, D.: Surveillance camera scheduling: A virtual vision approach. Multimedia Systems 12(3), 269–283 (2006) [59] Rao, N.S.V.: A fusion method that performs better than best sensor. In: Proceedings of the First International Conference on Multisource-Multisensor Information Fusion, pp. 19–26 (1998) [60] Ratsch, G., Mika, S., Scholkopf, B., Muller, K.: Constructing boosting algorithms from svms: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9), 1184–1199 (2002) [61] Regazzoni, C.S., Visvanathan, R., Foresti, G.L.: Scanning the issue / technology - Special Issue on Video Communications, processing and understanding for third generation surveillance systems. Proceedings of the IEEE 89(10), 1355–1367 (2001) [62] Ross, A., Jain, A.: Multimodal biometrics: An overview. In: Proc. XII European Signal Processing Conf., pp. 1221–1224 (2004) [63] Rothman, P.L., Denton, R.V.: Fusion or confusion: Knowledge or nonsense? In: SPIE Data Structures and Target Classification, vol. 1470, pp. 2–12 (1991) [64] Snidaro, L., Niu, R., Foresti, G., Varshney, P.: Quality-Based Fusion of Multiple Video Sensors for Video Surveillance. IEEE Transactions on Systems, Man, and Cybernetics 37(4), 1044–1051 (2007) [65] Snidaro, L., Visentini, I., Foresti, G.: Quality Based Multi-Sensor Fusion for Object Detection in Video-Surveillance. In: Intelligent Video Surveillance: Systems and Technology, pp. 363–388. CRC Press, Boca Raton (2009) [66] Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000) [67] Steinberg, A.N., Bowman, C.: Revisions to the JDL data fusion process model. In: Proceedings of the 1999 National Symposium on Sensor Data Fusion (1999) [68] Tarabanis, K., Allen, P., Tsai, R.: A survey of sensor planning in computer vision. IEEE Transactions on Robotics and Automation 11(1), 86–104 (1995) [69] Tomasi, C., Petrov, S., Sastry, A.: 3d tracking = classification + interpolation. In: ICCV 2003: Proceedings of the Ninth IEEE International Conference on Computer Vision, p. 1441 (2003) [70] Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective voting of heterogeneous classifiers. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 465–476. Springer, Heidelberg (2004) [71] Valin, J.M., Michaud, F., Rouat, J.: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robot. Auton. Syst. 55(3), 216–228 (2007), http://dx.doi.org/10.1016/j.robot.2006.08.004 [72] Visentini, I., Snidaro, L., Foresti, G.: On-line boosted cascade for object detection. In: Proceedings of the 19th International Conference on Pattern Recognition (ICPR), Tampa, Florida, USA (2008) [73] Wald, L.: A european proposal for terms of reference in data fusion. 
International Archives of Photogrammetry and Remote Sensing 7, 651–654 (1998)
[74] Waltz, E., Llinas, J.: Multisensor Data Fusion. Artech House, Norwood (1990) [75] Wu, B., Nevatia, R.: Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) [76] Xiong, N., Svensson, P.: Multi-sensor management for information fusion: issues and approaches. Information fusion 3(2), 163–186 (2002) [77] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4), 13 (2006), DOI http://doi.acm.org/10.1145/1177352.1177355 [78] Yin, Z., Porikli, F., Collins, R.: Likelihood map fusion for visual object tracking. In: IEEE Workshop on Applications of Computer Vision, pp. 1–7 (2008) [79] Zhang, L., Li, S.Z., Yuan, X., Xiang, S.: Real-time object classification in video surveillance based on appearance learning. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) [80] Zhang, P., Bui, T.D., Suen, C.Y.: A novel cascade ensemble classifier system with a high recognition performance on handwritten digits. Pattern Recognition 40(12), 3415–3429 (2007) [81] Zhang, W., Yu, B., Zelinsky, G.J., Samaras, D.: Object class recognition using multiple layer boosting with heterogeneous features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 323–330. IEEE Computer Society, Washington (2005)
Chapter 2
Distributed Intelligent Surveillance Systems Modeling for Performance Evaluation Stefano Maludrottu, Alessio Dore, and Carlo S. Regazzoni
Abstract. In the last few years, the decrease of hardware costs and the contemporary increase of processing capabilities have made possible the development of increasingly complex surveillance systems with hundreds of sensors deployed in large areas. New functionalities are now available, such as scene understanding, context-dependent processing and real-time user-driven functionality selection. The efficient use of these tools in architecturally complex systems for advanced scene analysis requires the development of specific data fusion algorithms able to merge multi-source information provided by a large number of homogeneous or heterogeneous sensors. In this context, the possibility of distributing the intelligence is one of the most innovative and interesting research directions for such systems. Therefore, several studies focus on how the logical tasks can be partitioned between smart sensors, intermediate processing nodes and control centers. Typical tasks of surveillance systems (context analysis, object detection, tracking...) are organized into hierarchical chains, where lower levels in the architecture provide input data for higher level tasks. Each element of the architecture is capable of autonomous data processing. The complexity of such systems underlines the importance of good design choices for both the logical and the physical architecture. The main objective of this book chapter is to present possible solutions to evaluate the overall performance and technical feasibility, as well as the interactions of the subparts, of distributed multi-sensor surveillance systems. Performance evaluation of a multi-level hierarchical architecture does not pertain only to the accuracy of the algorithms involved: several other aspects must be considered, such as the data communication between smart sensors and higher level nodes, the computational complexity and the memory used. Then, in order to define
Stefano Maludrottu · Carlo S. Regazzoni, Department of Biophysical and Electronic Engineering (DIBE) - University of Genoa, Genoa, Italy
Alessio Dore, Institute of Biomedical Engineering, Imperial College London, London, SW7 2AZ, UK
a procedure to assess the performance of this kind of system, a general model of smart sensors, intermediate processing nodes and control centers has been studied, taking into account the elements involved in multi-sensor distributed surveillance tasks. The advantages of the proposed architecture analysis method are: 1) to allow quantitative comparison of different surveillance structures through the evaluation of performance metrics, and 2) to validate the algorithm choice with respect to the physical structure (communication links, computational load...) available. Finally, example structures are compared in order to find the best solution to some benchmark problems.
2.1 Introduction In recent years, due to the increasing robustness and accuracy of computer vision algorithms, to the decrease of sensor and processing hardware prices, and to an increasing demand for reliable security systems in different areas (such as transport, industrial plants, public places or crowded events), surveillance systems have become more and more sophisticated. Hundreds, even thousands of sensors can be deployed in very large areas to perform the different security tasks typical of those systems: people counting, tracking, abandoned object detection, movement detection and so on. Accuracy can be achieved through sensor redundancy or by combining different sensor typologies in order to operate in a number of difficult situations, such as crowded scenes or poorly illuminated areas. The large amount of multi-sensor data produced by heterogeneous and spatially distributed sensors is fused to automatically understand the events in the monitored environment. This data fusion process is instantiated at different levels of the architecture, in a distributed and sequential way, in order to provide remote control centers (and the attention of operators) with aggregated data in the form of alarms or collections of salient video sequences about events of interest. A typical description of surveillance systems is a tree-like hierarchical structure composed of smart sensors, intermediate processing nodes and control centers connected by heterogeneous communication links. In this structure, each sensor contributes to the global surveillance task by monitoring a small physical area and gathering local data about the environment. Since in advanced surveillance systems data can be locally analyzed by sensors with processing capabilities (therefore defined as smart sensors), only salient information needs to be transmitted to fusion nodes devoted to specific data association and data fusion tasks. Since modern surveillance systems can monitor very large areas, the data fusion process itself can be subdivided into a hierarchical task performed by multiple processing units. More than one layer of intermediate fusion nodes can be dedicated to data fusion. Control centers are devoted to gathering the data pertaining to the whole monitored area and to presenting them through appropriate interfaces to the human operators.
This decomposition does not imply a centralized monitoring/control approach; rather, it reflects a wide range of current systems and is intended as a descriptive model of the general structure of surveillance systems [10] [5]. Moreover, modern surveillance systems can be defined as "intelligent", following the definition in [25], if they are capable of integrating the ambient intelligence paradigm [21] into safety or security applications. Many works in the last decade have been devoted to linking traditional computer vision tasks to high-level context-aware functionalities such as scene understanding, behavior analysis, interaction classification or recognition of possible threats or dangerous situations ([26] [16] [3] [27]). Another notable improvement of modern state-of-the-art surveillance systems (third generation surveillance systems) is the capability of internal logical task decomposition and distributed data processing. The so-called distribution of intelligence in such systems is made possible by the autonomous processing capabilities of the elements of the architecture. It can be described as a dynamic process of mapping the logical architecture (decomposed into a set of sequential sub-tasks) onto the physical structure of the surveillance system itself. In this way, processing and communication resources can be optimized within the system. Modern surveillance systems can be used to provide different services to a number of different end-users; all the communication and processing resources of the system can be shared between different users according to specific strategies. Therefore, a modern third generation intelligent surveillance system whose characteristics comply with the above definitions can be a large, expensive and structurally complex system. The design of such a system is an articulated task: hardware selection, communication link design and sensor placement are some of the possible constraints that have to be considered. Moreover, existing architectures can be reused for new security applications, so it is important to assess in advance how new functionalities will perform on existing structures. Different aspects of surveillance architectures should be taken into account when assessing their performance, beyond the accuracy of the algorithms. Communication and computational complexity issues that can arise in complex multi-sensor systems should also be considered. An efficient performance evaluation should not be an a posteriori analysis of such systems: modifications or refinements at such a large scale can be difficult, time consuming and economically disadvantageous. A better solution is to link performance evaluation to the design phase, in order to produce useful feedback regarding specific design choices. Therefore, a suitable model of intelligent surveillance systems can be defined in order to take into account all the design-critical aspects of surveillance systems.
2.1.1 Surveillance Systems as Data Fusion Architectures The main goal of an intelligent surveillance system is to provide human operators with a complete, reliable and context-aware description of the environment based
Fig. 2.1 Scheme of the data fusion process of a surveillance system derived from the JDL model.
upon raw data obtained from the deployed sensors. A data fusion process is distributed between different elements of the architecture, during which information extraction and refinement is performed at different abstraction levels. The data fusion process can be modeled, according to the JDL model [11], as a hierarchical process composed of four steps (see Fig. 2.1):
0 - preprocessing: Raw sensor data are filtered, conditioned and processed before any fusion with data from other sensors; significant information is obtained from raw data using feature extraction algorithms. The preprocessing techniques adopted depend on the specific nature of the sensors.
1 - processing: Spatial and temporal alignment of data from different sensors is performed. A common description of the objects present in the environment is provided, for instance in terms of position, velocity, attributes and characteristics. Specific recognition or classification algorithms are instantiated in order to label objects perceived in the scene.
2 - situation analysis: Automated reasoning or ambient intelligence methods are adopted to provide a complete description of events in the environment. Interactions between objects are compared to predefined models and classified. The meaning of events within a scene is determined.
3 - threat analysis: The evolution of the current situation in the environment is evaluated to determine potential threats using predictive models or probabilistic frameworks. Alarms are raised if potentially dangerous situations are recognized.
Aggregated data and alarms resulting from the data fusion process are provided to human operators through an appropriate human-computer interface (HCI). HCIs may contain maps of the environment, icons or message boards in order to facilitate an immediate understanding of potential threats or suspicious situations. Aside from the data fusion process, database management can be considered a key task to effectively handle the large amount of data involved. Since the database (DB) is a critical component in a surveillance system, many works have been devoted to DB modeling in complex systems. To avoid risks
of data loss or DB failure, a common approach is a (virtual) replication of the DB in every node of the architecture [14]. Therefore, modern third-generation surveillance systems can be defined and identified by the mapping of specific data analysis and data fusion tasks onto an architecture of intelligent hardware devices. This mapping can be an a priori design decision or can be dynamically modified according to the resources available in real time and to specific events in the environment.
2.1.2 Surveillance Tasks Decomposition Distribution of intelligence is a critical feature of surveillance systems in order to concurrently optimize bandwidth usage and processing speed. For instance, data analysis performed at the lower levels of the architecture can lead to more efficient bandwidth and central processing usage, offering at the same time robustness through redundancy, since distributed processing can better handle component failures. Typical surveillance tasks (e.g. tracking) can be decomposed into a chain of logical modules organized in a hierarchical structure: lower modules provide as output the data needed as input by higher level modules. In this way, starting from raw sensor data, a higher level of inference can be reached at each level. This decomposition, originally proposed in [13], defines the paradigm of intelligence distribution in terms of module allocation to different physical devices with autonomous processing capabilities. Three kinds of modules have been defined at different abstraction levels: representation modules (information-processing tasks whose output is a higher-level symbolic representation of the input data), recognition modules (algorithms that compare input data with a predefined set of event descriptors) and communication modules (which produce a coded representation of the input data suitable for transmission). In [2] a mobile agent system (MAS) is defined as a framework for the dynamic assignment of tasks to smart cameras in surveillance applications: high-level tasks are allocated to sensor clusters. The system itself decomposes them into subtasks and maps this set of subtasks to single sensors according to the available processing resources of the cameras and the current state of the system. Moreover, software agents can dynamically migrate between different sensors according to changes in the observed environment or in the available resources. In [19] an optimal task partitioning framework for distributed systems is defined, capable of dynamically adapting to system or environment changes by taking into consideration Quality of Service (QoS) constraints. In [22] an agent-based architecture is presented, where agents, defined as software modules, are capable of autonomous migration between nodes of the architecture. Within this framework, camera agents are responsible for detection and tracking tasks, while object agents provide updated descriptions of temporal events.
More generally, the mapping of tasks onto the architecture of an intelligent surveillance system can be defined as an association function A:

A : T → N    (2.1)

where T is the complete set of processing tasks t_i that the system is required to perform and N is the set of intelligent devices (or nodes) n_j in the system itself. For every node n_j it must hold that ∑_i p_ij ≤ p_j, where p_ij is the cost, in terms of processing resources (memory or CPU), of task t_i allocated to node n_j and p_j is the total processing capability of node n_j.
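As an illustration of the association function A in Eq. (2.1) under the stated capacity constraint, the sketch below assigns tasks greedily to the node with the largest residual capacity. The greedy policy and the data structures are illustrative assumptions, not a strategy prescribed by the chapter.

```python
def assign_tasks(tasks, capacity, cost):
    """Greedy instance of A : T -> N respecting sum_i p_ij <= p_j for every node j.

    tasks:    list of task identifiers t_i
    capacity: dict node -> total processing capability p_j
    cost:     dict (task, node) -> processing cost p_ij
    """
    load = {n: 0.0 for n in capacity}
    assignment = {}
    # place the heaviest tasks first to reduce the chance of late failures
    for t in sorted(tasks, key=lambda t: -max(cost[(t, n)] for n in capacity)):
        feasible = [n for n in capacity if load[n] + cost[(t, n)] <= capacity[n]]
        if not feasible:
            raise ValueError(f"no node can host task {t}")
        best = max(feasible, key=lambda n: capacity[n] - load[n])
        assignment[t] = best
        load[best] += cost[(t, best)]
    return assignment
```

A dynamic version of this mapping, as discussed above, would re-run the assignment when resources or user requests change.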
2.2 Intelligent Surveillance System Modeling 2.2.1 Introduction In the remainder of this section, a model is proposed to provide a general framework for the performance evaluation of complex multi-sensor intelligent surveillance systems. To this aim, three main components of the system are identified, i.e., smart sensor, fusion node and control center. In this work we will consider a structure where sensors and control centers can exchange data only with the fusion nodes, although it would be possible to add other communication links (e.g. direct communication between sensors) to model specific architectures. This structure has been chosen since this work focuses on a hierarchical architecture in which the fusion node oversees the sensors and, when needed, modifies their processing tasks or internal parameters, or passes on information gathered from other sensors. The model can include one or more fusion nodes, according to the extent of the monitored area and the number of sensors. The fusion nodes send the processed data to a common remote control center. If the considered environment is very large or structurally complex, or if a large number of sensors has to be considered, more than one level of intermediate nodes can be used to perform the data fusion process. A hierarchical multi-level partitioning of the environment can be adopted in order to manage the data flow between multiple layers of intermediate fusion nodes. Each zone at the i-th level ("macro-zone") is divided into a set of smaller zones ("micro-zones") at the (i + 1)-th level according to a specific partitioning strategy such as the Adaptive Recursive Tessellation [24]. Higher-level fusion nodes take as input the metadata produced by lower-level fusion nodes. These data (related to contiguous micro-zones) are fused in order to obtain a global representation of a macro-zone of the environment. Control centers are directly connected to all the fusion nodes (or to the highest-level fusion nodes in a multi-level fusion structure as defined before) of the architecture. In this way, the metadata describing the entire environment are collected in order to provide a complete representation of the environment to the final users. Usually a single control center is present, although in multi-user surveillance systems more than one control center can be deployed.
Without loss of generality, we will consider the overall fields of view of the fusion nodes to be contiguous and non-overlapping. A global metric used to evaluate the overall performance of a multi-sensor system will be described in Sect. 2.2.5. This metric will take into account not only specific data processing errors but also problems or failures of structural elements of the architecture. In Section 2.4 it will be shown how, keeping the processing algorithms unaltered, different design choices (in terms of sensor selection and positioning) can significantly affect the overall system performance.
2.2.2 Smart Sensor Model Each smart sensor S_i generates a local representation of specific physical quantities (such as light, motion, sound...). Input data are locally processed by the sensor and appropriate metadata are sent to higher levels of the architecture. The processing function T_i(t) represents a data analysis task that produces an output data vector D_fi(t + Δt_i) given an input data vector D_i(t) referring to the scene observed in the sensor's field of view V_Si(t). The quantity Δt_i (defined as the time response of sensor S_i) accounts for the computational time needed by the smart sensor to process the input data. The value Δt_i depends on several factors, such as the specific processing task considered, its computational complexity and the available processing resources. The input noise is modeled by E_i(t) and the noise on the processed data sent to the fusion node by E_fi(t + Δt_i). The variables D_i(t), D_fi(t + Δt_i), E_i(t) and E_fi(t + Δt_i) are vectors whose dimension is proportional to the complexity of the scene observed by S_i at time t (e.g. in a tracking application those vectors have dimension equal to the number of targets N_i(t)). A computational complexity C, function of D_i(t), is associated with the processing function. The error vector in this application coincides with the accuracy error, which has to be evaluated through a suitable performance metric. To further improve modularity, different performance metrics can be adopted (including the ones mentioned in Section 2.3.3). Since a third generation intelligent surveillance system is a context-aware and (possibly) multi-user, multi-functional system, the specific processing task T_i(t) associated with each sensor S_i can be modified according to specific user requests and salient changes in the environment. The database D_i represents the memory M of the smart sensor. It is addressed as a database since many surveillance algorithms require comparing current data with known models or previous data. The processing algorithm exchanges information with the database, which can also send additional information to the fusion node. The communication unit C_i is responsible for transmitting data to the fusion node. The communication unit has to satisfy the requirement of error-free data transmission, considering possible limitations of the bandwidth B of the link toward the fusion node. In order to comply with real-time constraints, a maximum-delay value Δt_max,s can
be defined. Thus, if Δt_i > Δt_max,s, the output of the processing task is considered to contain obsolete data and is discarded. Each smart sensor is considered to be placed in a known position of the environment; its field of view (FOV) depends on its angle of view and its orientation. If a smart sensor has a fixed orientation, its FOV V_Si(t) becomes a constant V_Si. Therefore, a smart sensor S_i can be modeled as composed of four fundamental elements:
• a processing function T_i(t),
• a database D_i,
• a communication unit C_i,
• a field of view V_Si(t).
The schematic diagram of a smart sensor model can be seen in Fig. 2.2.
Fig. 2.2 Representation of the smart sensor model
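The following is a minimal sketch of the smart sensor model just described, assuming a simple callable for T_i and treating the real-time constraint Δt_max,s as a hard deadline. The field-of-view interface, the timing mechanism and the callback names are illustrative assumptions.

```python
import time

class SmartSensor:
    """Smart sensor S_i: processing function T_i, database D_i,
    communication unit C_i, field of view V_Si."""

    def __init__(self, process_fn, fov, dt_max):
        self.process_fn = process_fn   # T_i: input data -> (output data, error estimate)
        self.fov = fov                 # V_Si, assumed fixed here
        self.db = []                   # D_i: models / past data
        self.dt_max = dt_max           # real-time constraint Δt_max,s (seconds)

    def observe_and_send(self, scene, send_fn):
        """Process the portion of the scene inside the FOV and forward the result,
        discarding it if the time response exceeds dt_max (obsolete data)."""
        data_in = [obj for obj in scene if self.fov.contains(obj)]
        t0 = time.monotonic()
        data_out, err = self.process_fn(data_in, self.db)
        dt = time.monotonic() - t0
        if dt <= self.dt_max:
            send_fn(data_out, err)     # C_i: transmit metadata to the fusion node
        return dt
```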
2.2.3 Fusion Node Model In this stage the data coming from each sensor are integrated to produce an accurate single model of each object perceived in the environment. The structure of the fusion node N is similar to that of the smart sensor, but some differences are present in the processing function and in the communication unit. The fusion node processing function T_F(t) takes as input M data vectors D_fi(t) (together with the respective error measures E_fi(t), i = 1, ..., M), where M is the number of smart sensors connected to the fusion node F_k. Due to failures of specific
sensors or communication links, or due to data loss caused by real-time constraints or bandwidth limitations, only a subset of the input data may be received by the fusion node. An "information loss" parameter I_F(t) is defined to account for the percentage of the input data not correctly received by the fusion node. Thus, the transfer function is T_F(D_f1(t), ..., D_fM(t), E_f1(t), ..., E_fM(t), I_F(t)). The cardinality of the D_F(t) and E_F(t) vectors is N_F(t) = card(D_f1(t) ∪ D_f2(t) ∪ ... ∪ D_fM(t)). The output data D_ci(t + Δt_F) of the processing function T_F(t) is sent to the control centers. The time response of the fusion node (Δt_F) accounts for the computational time needed by the fusion node to process the input data. An efficient data fusion algorithm produces, as output of T_F(E_f1(t), ..., E_fM(t)), an error vector E_F(t + Δt_F) in which each value is lower than the respective error value in E_fi(t) received from the i-th sensor. The robustness of data fusion algorithms can be defined as a function R_F : (E_fi, I_F) → E_F that records how the error of the output data is affected by the errors of the input data and by the information loss. It is worth noting that the computational complexity C of the fusion node processing is related to the size of the overlapped fields of view and to the intricacy of the events analyzed in this region. Similarly to the smart sensor model, the processing function T_F(t) depends on the specific task decomposition at time t. The database D_F performs the same tasks described for the smart sensor model. The communication unit C_F analyzes the transmission load of each communication link connecting the fusion node N to the smart sensors S_i. Moreover, it manages the data transmission to the control centers through a link of bandwidth B_F. It can also send messages to the smart sensor communication units in order to perform an adaptive optimization procedure. For example, if one link is overused, the fusion node can ask all the smart sensors connected to that link to reduce their transmission rate through higher compression, or a dynamic reallocation of processing tasks can be performed. Taking into account real-time constraints, a maximum delay value Δt_max,F can be defined for the fusion node. Output data are sent to the control centers only if the time response value Δt_F ≤ Δt_max,F. A fusion node N is modeled by three fundamental elements:
• a processing function T_F(t),
• a database D_F,
• a communication unit C_F.
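As a simple illustration of the fusion node's transfer function, the sketch below associates per-target estimates from the M sensors by shared target IDs and fuses them by inverse-error weighting, while the information-loss parameter I_F is computed as the fraction of expected sensor reports that did not arrive. The scalar estimates, the ID-based association and the weighting scheme are assumptions made only for this sketch.

```python
def fuse_reports(reports, expected):
    """reports:  list of (sensor_id, {target_id: (estimate, error)}) actually received.
    expected: number of sensors M connected to the node.
    Returns fused estimates per target and the information-loss parameter I_F."""
    info_loss = 1.0 - len(reports) / expected
    per_target = {}
    for _, targets in reports:
        for tid, (est, err) in targets.items():
            per_target.setdefault(tid, []).append((est, err))
    fused = {}
    for tid, obs in per_target.items():
        weights = [1.0 / max(err, 1e-6) for _, err in obs]   # trust low-error sensors more
        total = sum(weights)
        fused[tid] = sum(w * est for (est, _), w in zip(obs, weights)) / total
    return fused, info_loss
```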
2.2.4 Control Center Model The control center C receives the data produced by the fusion nodes. The collected data can be further processed in order to produce a global representation of the environment. Aggregated data are shown to operators by means of an appropriate HCI. At this stage, changes in the allocation of data processing tasks within the system are issued if user requests for specific functionalities are received or as a result of context analysis (e.g. if threats or suspicious situations are detected by the system). The control center processing function T_C(t) can be decomposed as T_P(t) + T_HCI, where T_P(t) is the input data processing task and T_HCI is the processing related to
the user interface (assumed constant over time). If more than one fusion node N_k is connected to the control center C, T_P(t) is a high-level data fusion task: T_P(t) takes as input M data vectors D_ci(t) (together with the respective error measures E_Fi(t), i = 1, ..., M), where M is the number of intermediate fusion nodes connected to the control center. An "information loss" parameter I_C(t) can be defined for the control center as the percentage of input data not correctly received at time t. Thus, the transfer function is T_P(D_c1(t), ..., D_cM(t), E_F1(t), ..., E_FM(t), I_C(t)). The output data D_C(t + Δt_C), affected by an error E_C(t + Δt_C), is presented to the operators through an appropriate interface. The value Δt_C is the time response of the control center. The output error is affected by the information loss parameter similarly to the fusion node model. Similarly to the fusion node model, a robustness function can be defined as R_C : (E_Fi, I_C) → E_C. The database D_C performs the same tasks described for the smart sensor model. The communication unit C_C analyzes the transmission load of each communication link that connects the control center C to the intermediate fusion nodes N_i. It sends messages to the fusion node communication units to optimize processing and communication resources or to manage/start/terminate specific surveillance tasks. The control center C is modeled considering:
• a processing function T_C(t),
• a database D_C,
• a communication unit C_C.
2.2.5 Performance Evaluation of Multi-sensor Architectures The performance evaluation of data fusion systems, such as multi-sensor surveillance systems, is usually carried out considering only the information processing and refinement algorithms. Much work has been dedicated in recent years to evaluating the performance of surveillance algorithms, usually by comparing the output of those algorithms with a manually obtained ground truth. In this way a quantitative evaluation and an effective comparison of different approaches can usually be obtained. In visual surveillance, for instance, since typical tasks are tracking or detection of moving objects, the corresponding ground truth can be obtained by manually labeling or drawing bounding boxes around objects. However, the evaluation of those systems does not pertain only to algorithm accuracy: several other aspects must be considered as well, such as the data communication between sensors, fusion nodes and control centers, the computational complexity and the memory used. A complete model must take into account all the significant elements in a multi-sensor system. Therefore, in order to define a more global procedure to assess the performance of intelligent third-generation surveillance systems, a general model of smart sensors, fusion nodes and control centers has been studied. A global evaluation function P(F(T, R), D_F, I) can be defined in order to assess the overall performance of surveillance architectures. The result of this function implicitly takes into account both the robustness and reliability of the processing algorithms and the specific structure of the surveillance system itself.
P depends on several parameters. The performance metric F(T, R) is related to the specific surveillance task (tracking, detection, classification, ...) and takes into account the processing functions T = {T_i, T_F, T_C}, with robustness R = {R_i, R_F, R_C}, assigned to the nodes of the architecture. I and D_F are, respectively, the information loss (accounting for bandwidth/CPU limitations, real-time constraints, ...) and the database failure rate (accounting for storage/retrieval errors) of every element of the structure. In this way, different intelligent surveillance systems can be evaluated and compared on the basis of different design choices. The impact on the overall performance of the specific data processing functions assigned to smart sensors (T_i), intermediate fusion nodes (T_F) and control centers (T_C) can be assessed. Sensor selection and positioning can be validated for specific surveillance applications. Finally, the cost of data loss due to different causes (bandwidth limitations, real-time requirements, ...) can be evaluated.
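The chapter does not give a closed form for P; purely as an illustration, the sketch below combines a task-specific metric F with penalties for the information loss I and the database failure rate D_F. The multiplicative form and the weights are arbitrary assumptions.

```python
def global_performance(task_metric, info_loss, db_failure_rate,
                       w_loss=0.5, w_db=0.5):
    """Illustrative P(F(T, R), D_F, I): a task metric in [0, 1] discounted by
    information loss and database failure rate (both in [0, 1])."""
    penalty = w_loss * info_loss + w_db * db_failure_rate
    return max(0.0, task_metric * (1.0 - penalty))
```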
2.3 A Case Study: Multi-sensor Architectures for Tracking 2.3.1 Introduction In order to test the proposed model on a real-world surveillance system, a distributed tracking architecture has been chosen as a benchmark problem, as in [1]. The structure consists of a set of sensors S_i that transmit, over a common transmission channel C, the local results of tracking/localization tasks to a single fusion node N. In the following case study, video sensors have been considered, since in real-world surveillance systems video cameras are by far the most common sensors. However, this is not a limitation of the proposed model, since heterogeneous sensors (audio, radio, GPS, ...) can be used as well in tracking applications to provide reliable information on moving objects in the environment. In the fusion node, data association as well as data fusion is performed. The local object identifiers (IDs) are replaced by common identifiers and point-to-track association is carried out. A correct modeling solution of this problem can be obtained through a hierarchical decomposition of the architecture of the tracking system into its physical and logical subparts, according to the model defined in Section 2.2.1. In the literature many similar approaches have been proposed for distributed fusion architecture modeling: in [15], a hierarchical architecture for video surveillance and tracking is proposed, in which the camera network sends the gathered data to upper level nodes where tracking data are fused in a centralized manner. In some related works, specific metrics are introduced in order to measure the reliability of each subpart involved in the data fusion. Typical problems arising in video data streams are due to the superimposition, on the image plane, of targets and scene elements that occupy different 3D positions. This issue, called occlusion, causes a momentary lack of observations, with consequent difficulties in the position estimation procedure. Moreover, background elements
can produce misleading observations (or clutter) that have to be excluded from the measurement-to-target association procedure. A possible and widely exploited approach to overcome these issues and to enhance tracking robustness consists in using several sensors, with completely or partially overlapping fields of view, to monitor the same area. In this way it is possible to obtain multiple observations of the same entity, whose joint processing likely improves tracking performance. The fusion module can be considered as a function that properly associates the data produced by the sensors with overlapping fields of view and estimates a unique scene representation. An example of the above mentioned architecture is shown in Figure 2.3. Sensors are considered as smart sensors, i.e. with an embedded processing unit and with the capability of performing autonomous localization and tracking tasks. Fusion nodes gather data from two or more sensors and process them jointly to generate a unique scene representation.
Fig. 2.3 Example of a data fusion system for target tracking
2.3.2 Multisensor Tracking System Simulator

On the basis of the architectural model described in sect. 2.2.1, a tracking simulator has been realized to demonstrate the possibility of evaluating system performance. A context generator has been designed in order to produce a set of realistic target trajectories simulating moving objects with stochastic dynamics. The simulated smart sensor S_i acquires data from the context generator only within its field of view V_Si(t) and processes them using a Kalman filter. Multiple sensors
with overlapping fields of view send their processed data to a fusion node F_k, which processes them using a modified Kalman filter together with a Nearest Neighbor data association technique [28].

2.3.2.1 Context Generator
The context generator produces trajectories with a linear autoregressive first-order model with Gaussian noise. Trajectories lie in a two-dimensional space representing the map of the monitored environment. A Poisson distribution is used to describe the probability of target birth and death. Occlusions are also simulated, when two targets are aligned with respect to the camera line of sight and are sufficiently close to each other. Clutter noise is also introduced at each scan as a Poisson process. The context generator (see Fig. 2.4) allows us to have a realistic but controllable environment in which different situations can be simulated while automatically generating the corresponding ground truth. Yet, the proposed context generator has to be considered as a simplified model of the environment, since some of the typical issues of visual-based surveillance applications (such as sudden or continuous lighting variations, adverse weather conditions, camera vibrations...) are not taken into consideration. In order to assess how these problems affect the overall performance of a given multi-sensor architecture, either a more detailed context generator can be implemented (as in [23]) or the simulator can be modified to use real data as input (as in [4]). However, in this work a simple context generator module has been chosen, since this case study intends to show the possibility of using the proposed model for performance evaluation. A more realistic simulator can be implemented according to the specific application of the proposed general model.
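The following sketch illustrates how such a context generator can be implemented. It is an illustrative reading of the description above, not the simulator used in the chapter: the birth/death rates, noise levels and clutter intensity are hypothetical values, and occlusion simulation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_context(n_steps=200, area=(100.0, 100.0),
                     birth_rate=0.2, death_rate=0.02,
                     process_noise=0.5, clutter_rate=1.0):
    """Generate 2D target trajectories plus clutter (illustrative parameters only)."""
    targets = {}   # id -> current state [x, y, vx, vy]
    tracks = {}    # id -> list of (t, x, y): the ground truth
    clutter = []   # list of (t, x, y): spurious observations
    next_id = 0
    for t in range(n_steps):
        # Target birth: Poisson-distributed number of new targets per scan.
        for _ in range(rng.poisson(birth_rate)):
            x, y = rng.uniform(0, area[0]), rng.uniform(0, area[1])
            vx, vy = rng.normal(0.0, 1.0, size=2)
            targets[next_id] = np.array([x, y, vx, vy])
            tracks[next_id] = []
            next_id += 1
        # Target death.
        for tid in [k for k in targets if rng.random() < death_rate]:
            del targets[tid]
        # First-order autoregressive (constant-velocity) motion with Gaussian noise.
        for tid, s in targets.items():
            s[:2] += s[2:]                               # position update
            s[2:] += rng.normal(0, process_noise, 2)     # velocity perturbation
            tracks[tid].append((t, s[0], s[1]))
        # Clutter: Poisson number of false observations per scan.
        for _ in range(rng.poisson(clutter_rate)):
            clutter.append((t, rng.uniform(0, area[0]), rng.uniform(0, area[1])))
    return tracks, clutter

if __name__ == "__main__":
    tracks, clutter = simulate_context()
    print(len(tracks), "targets,", len(clutter), "clutter points")
```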
2.3.2.2 Smart Sensor
When a target trajectory enters the field of view V_Si(t) of the i-th sensor, a single-view tracking task is initialized to detect the position and follow the movements of targets. A Kalman filter is used to estimate the position of each tracked object; its equations are:

    x(k+1) = A_w x(k) + w_w(k)    (2.2)

    z(k) = H_w x(k) + v_w(k)    (2.3)

where w_w and v_w are, respectively, the transition model and the measurement model noise; they are independent, white, Gaussian and zero mean. A_w and H_w are, respectively, the state transition matrix and the measurement matrix. In this work the computational complexity is a linear function of the number of targets present in the FOV. When this function exceeds a threshold that is defined to represent the unit processing capability, a random lack of data to be sent to the fusion node is simulated.
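A minimal per-sensor Kalman filter consistent with Eqs. (2.2)–(2.3) can be sketched as follows; the constant-velocity transition matrix, the noise covariances Q and R and the sampling interval are assumptions made for illustration and are not specified in the chapter.

```python
import numpy as np

dt = 1.0
A_w = np.array([[1, 0, dt, 0],
                [0, 1, 0, dt],
                [0, 0, 1,  0],
                [0, 0, 0,  1]], dtype=float)   # state transition (Eq. 2.2)
H_w = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)    # measurement matrix (Eq. 2.3)
Q = 0.1 * np.eye(4)                            # assumed process noise covariance
R = 0.5 * np.eye(2)                            # assumed measurement noise covariance

def kf_predict(x, P):
    x = A_w @ x
    P = A_w @ P @ A_w.T + Q
    return x, P

def kf_update(x, P, z):
    S = H_w @ P @ H_w.T + R                    # innovation covariance
    K = P @ H_w.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H_w @ x)
    P = (np.eye(4) - K @ H_w) @ P
    return x, P

# One predict/update cycle for a target at the origin moving with unit velocity.
x, P = np.array([0.0, 0.0, 1.0, 1.0]), np.eye(4)
x, P = kf_predict(x, P)
x, P = kf_update(x, P, np.array([1.1, 0.9]))
```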
Fig. 2.4 Context generator for multi-camera tracking: white dots are the moving targets and green dots are the sensors placed in the environment
In this example the communication unit of the sensor is responsible for sending data regarding the occupation of the link to the fusion node. A function has been implemented to simulate the optimization of data transmission. In this simulator the database is not modeled, since the performed tasks do not need data storage.

2.3.2.3 Fusion Node
The communication unit handles the data stream coming from the smart sensors. A maximum bandwidth value B_max has been defined according to the physical constraints of the architecture. If ∑_i B_i(k) > B_max (where B_i is the bandwidth occupation of sensor i at time k), a sensor subset S*(k) of the set S of all the sensors connected to the fusion node is chosen in order to optimize the bandwidth usage, such that ∑_i B*_i(k) ≤ B_max. Various strategies can be implemented to select the sensor subset S*(k); in this example a priority value p_i is assigned to each sensor S_i and the subset of selected sensors is initialised to the empty set, S*(k) = ∅; sensors are then added one by one, choosing at each step the remaining sensor with the highest priority value. After each selection a feasibility check is performed; if the bandwidth constraint is satisfied, the sensor is added to the selected subset (S*(k) = S*(k) ∪ {S_j}).
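One possible reading of this priority-based selection is sketched below; the data structures and the behaviour of the greedy loop when a sensor does not fit the remaining budget (here it is simply skipped) are assumptions for illustration.

```python
def select_sensors(bandwidth, priority, b_max):
    """Greedy priority-based subset selection under a bandwidth budget.

    bandwidth: dict sensor_id -> B_i(k); priority: dict sensor_id -> p_i.
    Returns the selected subset S*(k)."""
    selected, used = set(), 0.0
    for sid in sorted(priority, key=priority.get, reverse=True):
        if used + bandwidth[sid] <= b_max:     # feasibility check
            selected.add(sid)
            used += bandwidth[sid]
    return selected

# Example: three sensors, budget of 10 bandwidth units.
print(select_sensors({1: 6.0, 2: 5.0, 3: 3.0}, {1: 0.9, 2: 0.5, 3: 0.7}, 10.0))
# -> {1, 3}: adding sensor 2 would exceed B_max once sensor 1 is selected.
```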
In the processing function, the association of measurements with targets is performed first. Subsequently, these targets are updated and a target initialization procedure is applied for unmatched measurements. Each target is represented, for each frame k, by its state x_t(k), state covariance P_t(k) and a category estimate e_t(k). The state vector is:

    x = [x  y  ẋ  ẏ]^T    (2.4)

where (x, y) are the ground plane coordinates and (ẋ, ẏ) is the velocity. In the measurement m_t we include the ground plane position z_t(k), whose covariance is R_t(k), where z = [x  y]^T. Therefore we measure the state on the ground plane using a Kalman filter. The association between measurements and tracks is evaluated by using the Mahalanobis distance; the innovation covariance is computed as:

    S_jt^(i)(k) = H_w P_t(k/k−1) H_w^T + R_j^(i)(k)    (2.5)

Thus, defining the innovation I(k) = z_j^(i) − H_w x̂(k/k−1), we can write:

    d_jt² = I(k)^T S_jt(k)^−1 I(k)    (2.6)

where x̂(k/k−1) is the target prediction. To determine the association, this distance is compared against a threshold and the Nearest Neighbor algorithm is applied. We make the hypothesis that, for each target, in one view there is at most one measurement, i.e. ∑_j β_jt^(i) ≤ 1. Obviously, if the measurement is missing then ∑_j β_jt^(i) = 0.

The overall measurement for each target is the result of the single-camera measurements weighted with their covariances R_j^(i) as follows:

    R_t = [ ∑_i ∑_j β_jt^(i) (R_j^(i))^−1 ]^−1    (2.7)

    z_t = R_t ∑_i ∑_j β_jt^(i) (R_j^(i))^−1 z_t^(i)    (2.8)

These equations express the accuracy improvement obtained thanks to the fusion of all the views; as a matter of fact, Equation 2.7 implies that the overall uncertainty is lower than the one provided by a single camera. Moreover, Equation 2.8 ensures that the fused measurement is biased towards the most accurate camera measurement. The target update is then obtained in the following way:

    x̂_t(k/k) = x̂_t(k/k−1) + K_t(k) [z_t(k) − H_w x̂_t(k/k−1)]    (2.9)

    K_t(k) = P_t(k/k−1) H_w^T [H_w P_t(k/k−1) H_w^T + R_t(k)]^−1    (2.10)

and its covariance is P_t(k/k) = [I − K_t(k) H_w] P_t(k/k−1), with 0 < η < 1.
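The covariance-weighted fusion of Eqs. (2.7)–(2.8) and the subsequent update of Eqs. (2.9)–(2.10) can be sketched as follows; this is illustrative only, and the association step and the category estimate are omitted.

```python
import numpy as np

def fuse_measurements(zs, Rs):
    """Covariance-weighted fusion of associated single-camera measurements
    (Eqs. 2.7-2.8): R_t = (sum R_i^-1)^-1, z_t = R_t * sum R_i^-1 z_i."""
    info = sum(np.linalg.inv(R) for R in Rs)   # sum of information (inverse covariance) matrices
    R_t = np.linalg.inv(info)
    z_t = R_t @ sum(np.linalg.inv(R) @ z for z, R in zip(zs, Rs))
    return z_t, R_t

def update_target(x_pred, P_pred, z_t, R_t, H_w):
    """Target update with the fused measurement (Eqs. 2.9-2.10)."""
    S = H_w @ P_pred @ H_w.T + R_t
    K = P_pred @ H_w.T @ np.linalg.inv(S)
    x = x_pred + K @ (z_t - H_w @ x_pred)
    P = (np.eye(len(x_pred)) - K @ H_w) @ P_pred
    return x, P

# Two cameras observing the same target: the more accurate one dominates the fusion.
z_t, R_t = fuse_measurements(
    [np.array([10.0, 5.0]), np.array([10.4, 5.2])],
    [0.2 * np.eye(2), 1.0 * np.eye(2)])
print(z_t, np.diag(R_t))   # fused position closer to the first (more accurate) measurement
```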
For those measurements that do not match any existing target, a target initialization procedure has to be applied. Hence, every unmatched measurement from a camera must be checked against the other unmatched measurements from the other cameras to find new targets. We define z_j*^(i*) as the unmatched measurement from the i*-th camera and x_n the new target associated with it. All the association matrices β^(i) are extended by one column, β_jn^(i), for the new target x_n; its elements represent the association between the j-th measurement from the i-th camera and the new target. For the i*-th camera, the elements of the n-th column, referring to the new target, are:

    β_jn^(i*) = 1 if j = j*, 0 otherwise    (2.11)

Then the Mahalanobis distance between z_j*^(i*) and each unmatched measurement z_j^(i) from the i-th camera is calculated as:

    d_i*j² = (z_j^(i)(k) − z_j*^(i*))^T (R_j^(i) − R_j*^(i*))^−1 (z_j^(i)(k) − z_j*^(i*))    (2.12)

where R_j^(i) and R_j*^(i*) are the measurement covariances. Then, the new target is initialised as follows:

    x̂_n(k/k) = [z_n(k)^T  0  0]^T    (2.13)

    P_n(k/k) = [ R_n(k)   O_2
                 O_2      σ_v² I_2 ]    (2.14)

where O_2 and I_2 are the 2 × 2 zero and identity matrices, respectively.

The processing capability of the fusion node is limited by its hardware-specific processing resources (CPU, memory...). Similarly to the communication unit, we define a maximum CPU load C_max; each of the sensors S_i ∈ S*(k) (the output of the communication unit at a given time k) produces a CPU usage C_i(k). If ∑_i C_i(k) ≥ C_max, a priority-based selection of the data streams is performed in order to obtain a feasible subset S′(k) whose total CPU usage does not exceed C_max. This selection causes a loss of input data in order to preserve computational feasibility; thus the data fusion is performed using only a subset of the whole input to the fusion node. The information loss parameter for the fusion node is defined as I_F(k) = card(S′(k)) / card(S). In this simulator the database is not modeled, since the performed tasks do not need data storage.
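A sketch of the new-target initialization of Eqs. (2.13)–(2.14) above is given below; sigma_v is an assumed prior standard deviation on the unknown velocity.

```python
import numpy as np

def init_new_target(z_n, R_n, sigma_v=1.0):
    """Initialise a new target from an unmatched measurement (Eqs. 2.13-2.14)."""
    x_n = np.array([z_n[0], z_n[1], 0.0, 0.0])   # position from the measurement, zero velocity
    P_n = np.zeros((4, 4))
    P_n[:2, :2] = R_n                            # measurement covariance on the position block
    P_n[2:, 2:] = (sigma_v ** 2) * np.eye(2)     # large uncertainty on the unobserved velocity
    return x_n, P_n

x_n, P_n = init_new_target(np.array([12.3, 7.8]), 0.5 * np.eye(2), sigma_v=2.0)
print(x_n, np.diag(P_n))
```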
2.3.3 Performance Evaluation of Tracking Architectures

The importance of evaluating multi-sensor tracking architectures is demonstrated by the relevant body of research focused on the evaluation of the reliability and accuracy
of automatic tracking applications. The general considerations about surveillance system evaluation (see Sect. 2.2.5) are still valid. The vast majority of the works found in the literature define metrics related to specific parts of the architecture, missing a more generic analysis. In order to evaluate the performance of such applications, different metrics have been defined: common performance metrics can be computed from true positive (TP), false positive (FP), true negative (TN) or false negative (FN) values. Other metrics can be used as well: in [17], an evaluation approach based on receiver operating characteristic (ROC) curves and an effectiveness F-measure metric is proposed. In [20], a multi-objective metric is adopted to evaluate and guide an evolutionary optimization of image processing algorithms; this function takes into account different accuracy and continuity values specific to a tracking application. The work described in [6] focuses on the definition of the main requirements for effective performance analysis of tracking methods. In [18], a number of metrics are defined for trajectory comparison, in order to determine how precisely a target is tracked by a computer vision system. In [29], an evaluation benchmark is presented to compare different visual surveillance algorithms developed for the PETS (Performance Evaluation of Tracking and Surveillance) workshops. PETS metrics are specifically developed to provide an automatic mechanism to quantitatively compare algorithms with similar purposes operating on the same data.
2.4 Experimental Results

Three scenarios have been investigated in order to demonstrate the possibility of a complete evaluation of surveillance systems using the proposed method. In this experimental setup, no real-time constraints and no limitations induced by bandwidth/processing inadequacy are considered. Therefore, the information loss parameter I_i is considered equal to zero for each node i of the architecture. Moreover, since the DB is not modeled, we assume D_F equal to zero. In this case, the global performance metric P(F(T, R), D_F, I) → P(F(T, R)) and is defined as

    P = ( ∑_i |f_i − g_i| + w · m_i ) / TOT_i    (2.15)

where |f_i − g_i| is the Euclidean distance between the target position f_i and its corresponding ground truth g_i at timestep i, w is a penalty value set equal to 10, m_i is the number of missing associations between ground-truth data and fused tracks at timestep i, and TOT_i is the total number of timesteps of the generated context (see Figure 2.5). The proposed metric combines two evaluation functions defined in [7]: a tracking matching error measure (TME) and a track detection failure measure (TDF). The former accounts for the positional error of tracking, considered as the average distance between system tracks and ground-truth tracks, while the latter is the ratio of frames in which a moving object is not detected.
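A small illustrative computation of the metric of Eq. (2.15) is sketched below. The representation of the fused tracks and the ground truth as per-timestep dictionaries, and the convention that every ground-truth target absent from the fused output counts as one missing association, are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np

def tracking_performance(fused, ground_truth, w=10.0):
    """Metric of Eq. (2.15): positional error plus a penalty w for each missing
    association, averaged over the timesteps of the generated context.

    fused / ground_truth: dict timestep -> dict target_id -> (x, y)."""
    total = 0.0
    for t, gt in ground_truth.items():
        est = fused.get(t, {})
        for tid, g in gt.items():
            if tid in est:
                total += np.linalg.norm(np.asarray(est[tid]) - np.asarray(g))
            else:
                total += w                       # m_i: missing association penalty
    return total / len(ground_truth)             # divide by the number of timesteps

gt = {0: {1: (0.0, 0.0)}, 1: {1: (1.0, 1.0)}}
fs = {0: {1: (0.2, 0.1)}, 1: {}}                 # target lost at timestep 1
print(tracking_performance(fs, gt))              # ~ (0.224 + 10) / 2
```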
Fig. 2.5 Representation of the ground truth (’o’) and fused target tracking (’*’)
A reference physical architecture has been defined as follows: five smart cameras c1...5 are placed in a parking lot area as shown in Fig. 2.4. Their fields of view V_Si(θ_i, a_i) are determined by two parameters: a_i is the angle of view and θ_i is an orientation angle that accounts for the pose of the camera, defined as the angle between the central line of the FOV and a horizontal reference line (both parameters are measured in degrees). The sensors are connected to a common fusion node N through a wireless communication channel W. In the following scenarios, a_i and θ_i have the following values: a_1 = a_2 = a_3 = a_4 = a_5 = 45, θ_1 = 90, θ_2 = 200, θ_3 = 90, θ_4 = 45, θ_5 = 330.

First scenario. Sensor placement is an important design issue that has a critical impact on the overall performance of a multi-sensor surveillance system; many works address the sensor placement optimization problem and several comparative studies have been devoted to a careful comparison of different approaches (e.g. [8] and [9]). In this scenario a pose optimization problem is studied. The position of the cameras C_i is fixed but the orientation θ_i can be changed in real time in order to obtain better tracking performance. An alternative optimized tracking architecture is defined, to be evaluated and compared, using the proposed approach, with the default fixed-camera architecture. In this way it can be evaluated how the implementation of a pose optimization algorithm affects the overall performance of an already existing surveillance system. The overall structure (5 smart sensors, 1 fusion node and wireless communications) is kept equal to the reference architecture; only the orientation of the sensors is changed over time to find a better configuration, according to a genetic-based optimization technique. Genetic algorithms are population-based search algorithms [12] used as optimization tools in many different applications.
In this particular case, a GA-based optimization algorithm has been implemented to find the best orientations of a group of smart cameras in a distributed multi-sensor architecture. The input of the GA is the current position of the moving objects in the environment, as perceived by the fusion node, and the output is the orientation angle of each camera. The fitness function, optimized for each sensor separately, is formulated as ∑_i w_i d_i, where d_i is the perpendicular distance from an object position to the central line of the FOV of the sensor and w_i is the number of objects at a given position. The camera is considered best oriented when this sum of weighted distances is minimized. The parameters of the GA are initialized as follows: population size of 20, number of generations equal to 100, and crossover and mutation rates of 0.8 and 0.04, respectively. Ten different scenes of various lengths with a fixed number of trajectories (10) have been provided by the context generator as input for the model. Many parameters influence the behaviour of the context generator algorithm: the total number of trajectories, the generation rate, the Gaussian noise (i.e. the measurement error in the smart sensors), etc. These parameters have been randomly modified for each context generation in order to provide a ground truth as diversified as possible. Figure 2.6 presents the results of the sample architectures and shows the connection between different camera placement strategies and performance: the dynamic camera orientation approach outperforms a fixed orientation solution by up to 20%.
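The sketch below shows one possible GA of this kind, using the population size, number of generations and crossover/mutation rates given above. The selection, crossover and mutation operators, as well as the unit object weights in the fitness, are illustrative assumptions; the chapter does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(theta_deg, cam_xy, objects):
    """Sum of perpendicular distances from the objects to the central line of the
    FOV oriented at theta_deg (weights w_i taken as 1 here; lower is better)."""
    theta = np.radians(theta_deg)
    d = np.array([np.cos(theta), np.sin(theta)])        # unit vector of the FOV centre line
    rel = objects - cam_xy
    return float(np.sum(np.abs(rel[:, 0] * d[1] - rel[:, 1] * d[0])))

def optimise_orientation(cam_xy, objects, pop=20, gens=100, cx=0.8, mut=0.04):
    """Real-coded GA over the orientation angle of one camera."""
    P = rng.uniform(0.0, 360.0, pop)
    for _ in range(gens):
        f = np.array([fitness(t, cam_xy, objects) for t in P])
        parents = P[np.argsort(f)][: pop // 2]                  # truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.choice(parents, 2)
            c = (a + b) / 2.0 if rng.random() < cx else a       # arithmetic crossover
            if rng.random() < mut:
                c = (c + rng.normal(0.0, 20.0)) % 360.0         # mutation
            children.append(c)
        P = np.concatenate([parents, children])
    return min(P, key=lambda t: fitness(t, cam_xy, objects))

objs = np.array([[10.0, 2.0], [12.0, -1.0], [15.0, 0.5]])
print(optimise_orientation(np.zeros(2), objs))   # orientation roughly towards the object cluster
```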
Fig. 2.6 Performance comparison of reference architecture (fixed camera orientation - in blue) and sensor-optimized architecture (in purple)
Table 2.1 (a) Performance comparison of the subparts of Architecture 1 (fixed cameras); (b) Performance comparison of the subparts of Architecture 2 (optimized camera orientations). Cont refers to the generated context scene number used as input. Avg is the average performance over the ten experiments.

(a)
Cont   Node     S1     S2     S3     S4     S5
1      0.0188   0.16   0.11   0.61   0.02   0.03
2      0.0214   0.11   0.19   0.65   0.02   0.03
3      0.0182   0.15   0.12   0.60   0.03   0.03
4      0.0218   0.14   0.22   0.69   0.03   0.02
5      0.0362   0.17   0.2    0.98   0.1    0.05
6      0.0356   0.17   0.28   1.04   0.08   0.06
7      0.0439   0.14   0.26   1.17   0.09   0.08
8      0.0306   0.13   0.15   0.75   0.04   0.07
9      0.0348   0.18   0.19   0.68   0.04   0.11
10     0.035    0.12   0.17   0.68   0.04   0.09
Avg    0.0296   0.15   0.19   0.78   0.05   0.06

(b)
Cont   Node     S1     S2     S3     S4     S5
1      0.0162   0.12   0.10   0.55   0.03   0.03
2      0.0189   0.11   0.17   0.58   0.02   0.04
3      0.0153   0.11   0.10   0.49   0.02   0.03
4      0.0184   0.11   0.18   0.58   0.03   0.04
5      0.03     0.13   0.14   0.67   0.08   0.05
6      0.0288   0.19   0.21   0.81   0.07   0.06
7      0.0347   0.14   0.23   0.8    0.06   0.05
8      0.0275   0.11   0.15   0.65   0.03   0.05
9      0.0296   0.13   0.15   0.52   0.04   0.10
10     0.033    0.11   0.21   0.55   0.04   0.11
Avg    0.0252   0.13   0.16   0.62   0.04   0.06
Table 2.2 Performance comparison of different fusion architectures

Context        Arch. A   Arch. B
1              0.0254    0.0303
2              0.0198    0.023
3              0.0238    0.0316
4              0.0176    0.0201
5              0.0164    0.0271
6              0.0176    0.0275
7              0.0258    0.0295
8              0.022     0.0278
9              0.0178    0.025
10             0.0312    0.0404
Average Perf.  0.0217    0.0282
Second scenario. Another critical issue in the design phase is the analysis of possible bottlenecks, defined as parts of the system that limit the overall performance, in the architecture. Detecting bottlenecks can help avoid possible data losses or guide a better hardware selection in the design phase. Focusing on the architectures previously defined, the overall result in terms of tracking performance has been compared with the tracks produced by each smart sensor in the architecture, in order to point out the contribution of each one to the overall performance. As shown in Table 2.1, sensor 3 has the worst performance and acts as a bottleneck for the entire structure; a better placement, or perhaps a more efficient sensor, could improve the performance of the entire structure.
The new architecture (B) is constituted by only three sensors (C1, C2 and C5 in Fig. 2.4) with fixed and wider fields of view (a1 = a2 = a3 = 60), more accurate target detection and tracking algorithms, and better fusion node hardware. Ten different scenes with a fixed number of trajectories (10) have been provided as input for the model. Table 2.2 shows that architecture A, even with less capable hardware, obtains a better overall result thanks to the higher number of sensors placed in the area.
2.5 Conclusions

In this work a model of a system architecture for multi-sensor third generation surveillance systems has been proposed. These systems are defined as tree-like hierarchical structures, whose nodes (smart sensors, intermediate processing nodes and control centers) are capable of autonomous data processing functions. Surveillance tasks can be decomposed into a chain of logical modules according to a hierarchical data fusion structure. Each node has been modeled taking into account specific descriptive elements such as processing functions and communication units. Distribution of intelligence is realized as the mapping process of subtasks to single elements of the architecture. The aim of this model is to represent the basic tasks performed by such a system in order to address the problem of performance evaluation. This description can be used to validate algorithms as well as system design with respect to the monitored scenario. A simulator of a multi-sensor hierarchical tracking system has been presented to show the appropriateness of the proposed model as a tool to assess system architectures and algorithms. Different design issues have been investigated, such as sensor selection and placement and bottleneck analysis.
References

[1] Areta, J., Bar-Shalom, Y., Levedahl, M., Pattipati, K.R.: Hierarchical track association and fusion for a networked surveillance system. Journal of Advances in Information Fusion 1(2), 140–157 (2006)
[2] Bramberger, M., Doblander, A., Maier, A., Rinner, B., Schwabach, H.: Distributed embedded smart cameras for surveillance applications. Computer, 68–75 (2006)
[3] Brdiczka, O., Chen, P., Zaidenberg, S., Reignier, P., Crowley, J.: Automatic acquisition of context models and its application to video surveillance. In: Proceedings of the International Conference on Pattern Recognition, vol. 1, pp. 1175–1178 (2006)
[4] Chen, T., Haussecker, H., Bovyrin, A., Belenov, R., Rodyushkin, K., Kuranov, A., Eruhimov, V.: Computer vision workload analysis: Case study of video surveillance systems. Intel Technology Journal 9, 109–118 (2005)
[5] Dore, A., Pinasco, M., Regazzoni, C.: Multi-modal data fusion techniques and applications. In: Multi-camera Networks, pp. 213–237 (2009)
[6] Ellis, T.: Performance metrics and methods for tracking in surveillance. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2002)
[7] Eom, K.Y., Ahn, T.K., Kim, G.J., Jang, G.J., Jung, J.Y., Kim, M.H.: Hierarchically categorized performance evaluation criteria for intelligent surveillance system. In: International Symposium on Web Information Systems and Applications, pp. 218–221 (2009)
[8] Ercan, A.O., Yang, D.B., El Gamal, A., Guibas, L.: Optimal placement and selection of camera network nodes for target localization. In: Gibbons, P.B., Abdelzaher, T., Aspnes, J., Rao, R. (eds.) DCOSS 2006. LNCS, vol. 4026, pp. 389–404. Springer, Heidelberg (2006)
[9] Erdem, U., Sclaroff, S.: Automated camera layout to satisfy task-specific and floor-plan specific coverage requirements. Computer Vision and Image Understanding 103, 156–169 (2006)
[10] Foresti, G., Snidaro, L.: A Distributed Sensor Network for Video Surveillance of Outdoors, ch. 1, pp. 7–27 (2003)
[11] Hall, D., McMullen, S.: Mathematical Techniques in Multisensor Data Fusion, ch. 2, pp. 37–72. Artech House, Boston (2004)
[12] Holland, J.H.: Adaptation in natural and artificial systems. MIT Press, Cambridge (1992)
[13] Marcenaro, L., Oberti, F., Foresti, G., Regazzoni, C.: Distributed architectures and logical-task decomposition in multimedia surveillance systems. Proceedings of the IEEE (10), 1419–1440 (2001)
[14] Mathiason, G., Andler, S., Son, S., Selano, L.: Virtual full replication for wireless sensor networks. In: Proceedings of the 19th ECRTS (WiP) (2007)
[15] Micheloni, C., Foresti, G., Snidaro, L.: A network of cooperative cameras for visual surveillance. IEE Visual, Image and Signal Processing 152(2), 205–212 (2005)
[16] Moncrieff, S., Venkatesh, S., West, G.: Context aware privacy in visual surveillance. In: International Conference on Pattern Recognition (ICPR 2008), pp. 1–4 (2008)
[17] Nazarevic-McManus, N., Renno, J., Jones, G.: Performance evaluation in visual surveillance using the F-measure. In: ACM International Workshop on Video Surveillance and Sensor Networks, pp. 45–52 (2006)
[18] Needham, C.J., Boyle, R.D.: Performance evaluation metrics and statistics for positional tracker evaluation. In: Crowley, J.L., Piater, J.H., Vincze, M., Paletta, L. (eds.) ICVS 2003. LNCS, vol. 2626, pp. 278–289. Springer, Heidelberg (2003)
[19] Nogueira, L., Pinho, L.: Iterative refinement approach for QoS-aware service configuration. In: Kleinjohann, B., Kleinjohann, L., Machado, R., Pereira, C., Thiagarajan, P. (eds.) From Model-Driven Design to Resource Management for Distributed Embedded Systems, IFIP International Federation for Information Processing. Springer, Boston
[20] Pérez, Ó., García, J., Berlanga, A., Molina, J.M.: Adjustment of surveillance video systems by a performance evaluation function. In: Mira, J., Álvarez, J.R. (eds.) IWINAC 2005. LNCS, vol. 3562, pp. 499–508. Springer, Heidelberg (2005)
[21] Remagnino, P., Foresti, G.: Ambient intelligence: A new multidisciplinary paradigm. IEEE Transactions on Systems, Man and Cybernetics - Part A 35(1) (2005)
[22] Remagnino, P., Shihab, A., Jones, G.: Distributed intelligence for multi-camera visual surveillance. Pattern Recognition (37), 675–689 (2004)
[23] Taylor, G.R., Chosak, A.J., Brewer, P.C.: OVVV: Using virtual worlds to design and evaluate surveillance systems. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
[24] Tsui, P., Brimicombe, A.J.: The hierarchical tessellation model and its use in spatial analysis. Transactions in GIS 2(3), 267–279 (1997)
[25] Valera, M., Velastin, S.: Intelligent distributed surveillance systems: a review. IEE Proceedings - Vision, Image and Signal Processing 152, 192–204 (2005)
[26] Velastin, S., Khoudour, L., Lo, B., Sun, J., Vicencio-Silva, M.: Prismatica: A multisensor surveillance system for public transport network. In: Proceedings of 12th IEE Road Transport Information and Control Conference (RTIC 2004) (2004)
[27] Xiaoling, X., Li, L.: Real time analysis of situation events for intelligent surveillance. In: Proceedings of International Symposium on Computational Intelligence and Design (ICID 2008), pp. 122–125 (2008)
[28] Xu, M., Orwell, J., Lowey, L., Thirde, D.: Architecture and algorithms for tracking football players with multiple cameras. IEE Proceedings - Vision, Image and Signal Processing 152(2), 232–241 (2005)
[29] Young, D., Ferryman, J.: PETS metrics: On-line performance evaluation service. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2005)
Chapter 3
Incremental Learning on Trajectory Clustering

Luis Patino, François Bremond, and Monique Thonnat
Abstract. Scene understanding corresponds to the real-time process of perceiving, analysing and elaborating an interpretation of a 3D dynamic scene observed through a network of cameras. The whole challenge consists in managing this huge amount of information and in structuring all the knowledge. On-line clustering is an efficient way to process such huge amounts of data. On-line processing is indeed an important capability required to perform monitoring and behaviour analysis on a long-term basis. In this paper we show how a simple clustering algorithm can be tuned to perform on-line. The system works by finding the main trajectory patterns of people in the video. We present results obtained on real videos corresponding to the monitoring of the Toulouse airport in France.
3.1 Introduction

Scene understanding corresponds to the real-time process of perceiving, analysing and elaborating an interpretation of a 3D dynamic scene observed through a network of sensors (including cameras and microphones). This process consists mainly in matching signal information coming from the sensors observing the scene with the large variety of models which humans use to understand the scene. This scene can contain a number of physical objects of various types (e.g. people, vehicles) interacting with each other or with their more or less structured environment (e.g. equipment). The scene can last a few instants (e.g. the fall of a person) or a few months (e.g. the depression of a person), and can be limited to a laboratory slide observed through a microscope or go beyond the size of a city. Sensors usually include cameras (e.g. omnidirectional, infrared), but may also include microphones and other sensors (e.g. optical cells, contact sensors, physiological sensors, smoke detectors, GNSS).

Luis Patino · François Bremond · Monique Thonnat
INRIA Sophia Antipolis - Méditerranée, 2004 route des Lucioles - BP 93 - 06902 Sophia Antipolis
Despite a few success stories, such as traffic monitoring (e.g. Citilog), swimming pool monitoring (e.g. Poseidon) and intrusion detection (e.g. ObjectVideo, Keeneo), scene understanding systems remain erratic and can function only under restrictive conditions (e.g. during day rather than night, diffuse lighting conditions, no shadows). Having poor performance over time, they are hardly modifiable and contain little a priori knowledge of their environment. Moreover, these systems are very specific and need to be redeveloped from scratch for new applications. To address these issues, most researchers have tried to develop original vision algorithms with focused functionalities, robust enough to handle real-life conditions. Up to now, no vision algorithms have been able to address the large variety of conditions characterising real-world scenes, in terms of sensor conditions, hardware requirements, lighting conditions, physical object varieties, application objectives...

Here we state that the scene understanding process relies on the maintenance of the coherency of the representation of the global 3D scene throughout time. This approach, which can be called 4D semantic interpretation, is driven by models and invariants characterising the scene and its dynamics. The invariants (also called regularities) are general rules characterising the scene dynamics. For instance, the intensity of a pixel can change significantly mostly in two cases: a change of lighting conditions (e.g. shadow) or a change due to a physical object (e.g. occlusion). Another rule verifies that physical objects cannot disappear in the middle of the scene. There is still an open issue which consists in determining whether these models and invariants are given a priori or are learned. The whole challenge consists in managing this huge amount of information and in structuring all this knowledge in order to capitalise on experience, to share it with other computer vision systems and to update it along experimentations. To face this challenge, several knowledge engineering tools are needed:

Tools for scene perception. A first category of tools contains vision algorithms to handle all the varieties of real-world conditions. The goal of all these algorithms is to detect and classify the physical objects which are defined as interesting by the users. A first set of algorithms consists of robust segmentation algorithms for detecting the physical objects of interest. These segmentation algorithms are based on the hypothesis that the objects of interest are related to what is moving in the video, which can be inferred by detecting signal changes. Specific algorithms to separate physical objects from different categories of noise (e.g. due to light changes, ghosts, moving contextual objects), and algorithms to extract meaningful features (e.g. 3D HOG, wavelet-based descriptors, colour histograms) characterising the objects of interest belong to this category.

Tools for verification of the 3D coherency throughout time (physical world). A second category of tools are the ones combining all the features coming from the detection of the physical objects observed by different sensors and tracking these objects throughout time. Algorithms for tracking multiple objects in 2D or 3D with one camera or a network of cameras belong here; for instance, algorithms that take advantage of contextual information and of a graph of tracked moving regions where an object trajectory can be seen as the most probable path in the graph. This
property enables the processing of long video sequences and ensures trajectory coherence. Moreover, these tracking algorithms compute the uncertainty of the tracking process by estimating the matching probability of two objects at successive instants. A second example is to fuse the information coming from several sensors at different levels, depending on the environment configuration. Information fusion at the signal level can provide more precise information, but information fusion at higher levels is more reliable and easier to accomplish. In particular, we are using three types of fusion algorithms: (1) multiple cameras with overlapping fields of view, (2) a video camera with pressure sensors and sensors to measure the consumption of water and electrical appliances, and (3) video cameras coupled with other sensors (contact sensors and optical cells).

Tools for Event recognition (semantic world). At the event level, the computation of relationships between physical objects constitutes a third category of tools. Here, the real challenge is to explore efficiently all the possible spatio-temporal relationships of these objects that may correspond to events (also called actions, situations, activities, behaviours, scenarios, scripts and chronicles). The varieties of these events are huge and depend on their spatial and temporal granularities, on the number of physical objects involved in the events, and on the event complexity (number of components constituting the event and the type of temporal relationship). Different types of formalism can be used: HMM and Bayesian networks, temporal scenarios [28].

Tools for knowledge management. To be able to improve scene understanding systems, we need at one point to evaluate their performance. Therefore we have proposed a complete framework for performance evaluation which consists of a video data set associated with ground-truth, a set of metrics for all the tasks of the understanding process, automatic evaluation software and a graphical tool to visualise the algorithm performance results (i.e. to highlight algorithm limitations and to perform comparative studies). Scene understanding systems can be optimised using machine learning techniques in order to find the best set of program parameters and to obtain an efficient and effective real-time process. It is also possible to improve system performance by adding a higher reasoning stage and a feedback process towards lower processing layers. Scene understanding is essentially a bottom-up approach consisting in abstracting information coming from the signal (i.e. an approach guided by data). However, in some cases, a top-down approach (i.e. an approach guided by models) can improve lower process performance by providing a more global knowledge of the observed scene or by optimising available resources. In particular, the global coherency of the 4D world can help to decide whether some moving regions correspond to noise or to physical objects of interest.

Tools for Communication, Visualisation, Knowledge Acquisition and Learning. Even when the correct interpretation of the scene has been performed, a scene understanding system still has to communicate its understanding to the users or to adapt its processing to user needs. A specific tool can be designed for acquiring a priori knowledge and the scenarios to be recognised through end-user interactions.
3D animations can help end-users to define and to visualize these scenarios. Thus, these tools aim at learning the scenarios of interest for users. These scenarios can often be seen as complex frequent events or as frequent combinations of primitive events, also called event patterns. The present work belongs to the latter category. We aim at designing an unsupervised system for the extraction of structured knowledge from large video recordings. By employing clustering techniques, we define the invariants (as mentioned above) characterising the scene dynamics. However, not all clustering techniques are well adapted to perform on-line. On-line learning is indeed an important capability required to perform scene analysis on a long-term basis. In this work we show how meaningful scene activity characterisation can be achieved through trajectory analysis. To handle the difficulty of processing large amounts of video, we employ a clustering algorithm that has been tuned to perform on-line. The approach has been validated on video data from the Toulouse airport in France (European project COFRIEND [1]). The remainder of the paper is organised as follows. We review some relevant work on scene interpretation from trajectory analysis in the following subsection (section 3.1.1, Related Work). We present the general structure of our approach in section 3.2 (General structure of the proposed approach). A brief description of the object detection and tracking employed in our system is given in section 3.3 (On-line processing: Real-time Object detection). The detailed description of the trajectory analysis undertaken in this work is given in section 3.4 (Trajectory analysis), including the algorithm used to tune the clustering parameters. The evaluation of the trajectory analysis work is given in the following section. The results obtained after processing the video data from the Toulouse airport are presented in section 3.6. Our general remarks and conclusions are given at the end of the paper.
3.1.1 Related Work

The extraction of the activities contained in a video by applying data-mining techniques is a field that has only started to be addressed. Recently it has been shown that the analysis of the motion of mobile objects detected in videos can give meaningful activity information. Trajectory analysis has become a popular approach due to its effectiveness in detecting normal/abnormal behaviours. For example, Piciarelli et al. [22] employ a splitting algorithm applied on very structured scenes (such as roads) represented as a zone hierarchy. Foresti et al. [11] employ an adaptive neural tree to classify an event occurring in a parking lot (again a highly structured scene) as normal/suspicious/dangerous. Anjum et al. [2] employ PCA to search for trajectory outliers. In these cases the drawback of the approach is that the analysis is only adapted to highly structured scenes. Similarly, Naftel et al. [20] first reduce the dimensionality of the trajectory data employing Discrete Fourier Transform (DFT) coefficients and then apply a self-organizing map (SOM) clustering algorithm to find normal behaviour. Antonini et al. [3] transform the trajectory data employing Independent Component Analysis (ICA), while the final clusters
are found employing an agglomerative hierarchical algorithm. In these approaches it is however delicate to select the number of coefficients that will represent the data after dimensionality reduction. Data mining of trajectories has also been applied with statistical methods. Gaffney et al. [13] have employed mixtures of regression models to cluster hand movements, although the trajectories were constrained to have the same length. Hidden Markov Models (HMM) have also been employed [5, 21, 24]. In addition to activity clustering, so as to enable dynamic adaptation to unexpected event processing or to newly observed data, we need a system able to learn the activity clusters in an on-line way. On-line learning is indeed an important capability required to perform behaviour analysis on a long-term basis and to anticipate the evolution of human interactions. An on-line learning algorithm gives a system the ability to incrementally learn new information from datasets that consecutively become available, even if the new data introduce additional classes that were not formerly seen. This kind of algorithm does not require access to previously used datasets, yet it is capable of largely retaining the previously acquired knowledge and has no problem accommodating any new classes that are introduced in the new data [23]. Various restrictions, such as whether the learner has partial or no access to previous data [26, 17, 15], or whether new classes or new features are introduced with additional data [30], have also been proposed [19]. Many popular classifiers, however, are not structurally suitable for incremental learning, either because they are stable (such as the multilayer perceptron (MLP), radial basis function (RBF) networks, or support vector machines (SVM)), or because they have high plasticity and cannot retain previously acquired knowledge without having access to old data (such as k-nearest neighbor) [19]. Specific algorithms have been developed to perform on-line incremental learning, such as Leader [14], Adaptive Resonance Theory modules (ARTMAP) [8, 7], Evolved Incremental Learning for Neural Networks [25], leaders-subleaders [29], and BIRCH [18]. Among them, the Hartigan algorithm [14], also known as the Leader algorithm, is probably the most employed in the literature. The Leader algorithm computes the distance between new data and already built clusters to decide whether to associate the new data with the clusters or to generate new ones that better characterise the data. However, all these approaches rely on a manually selected threshold to decide whether the data are too far away from the clusters. To improve this approach we propose to control the learning rate with coefficients indicating how flexibly a cluster can be updated with new data.
3.2 General Structure of the Proposed Approach

The monitoring system is mainly composed of two different processing components (shown in Figure 3.1). The first one is a video analysis subsystem for the detection and tracking of objects; this processing operates on a frame-by-frame basis. The second subsystem achieves the extraction of trajectory patterns from the video and is composed of two modules: the trajectory analysis module and the statistical analysis module. In the first module we perform the analysis of trajectories
by clustering and obtain behavioural patterns of interaction. In the second module we compute meaningful descriptive measures of the scene dynamics. For the storage of the video streams and of the trajectories obtained from the video processing module, a relational database has been set up. The trajectory analysis modules read the trajectories from the database and return the identified trajectory types, the activities discovered in the video and the resulting statistics calculated from the activities. Streams of video are acquired at a speed of 10 frames per second. The video analysis subsystem takes its input directly from the data acquisition component; the video is stored in the DB in parallel with the analysis process. The whole system helps the manager or designer who wants to get global and long-term information from the monitored site. The user can specify a period of time for which he/she wishes to retrieve and analyse stored information. In particular, the user can access the whole database to visualize specific events, streams of video and off-line information.
Fig. 3.1 General architecture of the system.
3.3 On-Line Processing: Real-Time Object Detection

Tracking objects in video is not the main contribution of this paper and therefore only a general description is given here. Detecting objects in an image is a difficult and challenging task. One widely employed solution consists in thresholding the difference between the pixel intensities of each frame and those of a background reference image. The latter can be a captured image of the same scene having no foreground objects, or no moving objects in front of the camera. The result of the thresholding operation is a binary mask of foreground pixels. The neighbouring foreground pixels are grouped together to form regions often referred
to as blobs, which correspond to the moving regions in the image. If the projections of the moving objects in the image plane do not overlap with each other, i.e. there is no dynamic occlusion, then each detected moving blob corresponds to a single moving object. The detailed description of the background subtraction algorithm, which also estimates when the background reference image needs to be updated, can be found in [12]. Having 3D information about the scene under view enables the calibration of the camera. Point correspondences between selected 3D points in the scene and their corresponding points in the 2D image plane allow us to generate the 3D location of any point belonging to a moving object. Thus, the 3D dimensions (i.e. width and height) of each detected moving blob can be measured, as well as its 3D location on the ground plane in the scene with respect to a chosen coordinate system. The 3D object information is then compared against several 3D models provided by the user; from this comparison, a detected object is linked to a semantic class. Detected and classified 3D objects in a scene can be tracked within the scope of the camera using the 3D information of their location on the ground as well as their 3D dimensions. Tracking a few objects in a scene can be easy as long as they do not interact heavily in front of the camera, i.e. occlusions are rare and short. However, tracking several mobile objects becomes a non-trivial and very difficult task when several object projections overlap with each other on the image plane. Occluded objects have missing or wrong 3D locations, which can create incoherency in the temporal evolution of their 3D location. Our tracking algorithm [4] builds a temporal graph of objects connected over time to cope with the problems encountered during tracking. The detected objects are connected between each pair of successive frames by a frame-to-frame (F2F) tracker. Links between objects are associated with a weight (i.e. a matching likelihood) computed from three criteria: the comparison between their semantic classes, their 3D dimensions, and their 3D distance on the ground plane. The graph of linked objects provided by the F2F tracker is then analysed by the tracking algorithm, also referred to as the Long Term tracker, which builds paths of mobiles according to the link weights. The best path is then extracted as the trajectory of the related mobile. The proposed tracking approach has the advantage of being simple to implement and able to run at a 'high' frame rate. However, it is sensitive to noise, and this could prevent long trajectories from being tracked correctly.
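For illustration, the background subtraction and blob grouping described above can be sketched as follows; the threshold and minimum blob area are hypothetical values and this is not the algorithm of [12].

```python
import numpy as np
from scipy import ndimage

def detect_blobs(frame, background, threshold=25, min_area=50):
    """Background subtraction followed by connected-component grouping.
    frame/background: greyscale images as 2D arrays.
    Returns a list of (area, bounding_box_slices) per detected moving blob."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > threshold                        # binary mask of foreground pixels
    labels, n = ndimage.label(mask)                # group neighbouring pixels into blobs
    blobs = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        area = int((labels[sl] == i).sum())        # number of pixels in this blob
        if area >= min_area:                       # discard small noisy regions
            blobs.append((area, sl))
    return blobs

# Synthetic example: one bright moving region on an empty background.
bg = np.zeros((120, 160)); fr = bg.copy(); fr[40:60, 70:90] = 200.0
print(detect_blobs(fr, bg))                        # one blob covering the bright square
```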
3.4 Trajectory Analysis

The second layer of analysis in our approach is related to the knowledge discovery of higher semantic information from the analysis of activities recorded over a period of time that can span, for instance, from minutes to a whole day (or several days of recording). Patterns of activity are extracted from the analysis of trajectories.
3.4.1 Object Representation: Feature Analysis

For the trajectory pattern characterisation of the object, we have selected a comprehensive, compact, and flexible representation. It is also suitable for further analysis,
54
L. Patino, F. Bremond, and M. Thonnat
as opposed to many video systems, which actually store the sequence of object locations for each frame of the video, thus building a cumbersome representation with little semantic information. If the dataset is made up of N objects, the trajectory of object O_j in this dataset is defined as the set of points [x_j(t), y_j(t)] corresponding to its position over time; x and y are time series vectors whose length is not equal for all objects, as the time they spend in the scene is variable. Two key points defining these time series are the beginning and the end, [x_j(1), y_j(1)] and [x_j(end), y_j(end)], as they define where the object is coming from and where it is going to. We build a feature vector from these two points. Additionally, we also include the directional information given as [cos(θ), sin(θ)], where θ is the angle of the vector joining [x_j(1), y_j(1)] and [x_j(end), y_j(end)]. A mobile object seen in the scene is thus represented by the feature vector:

    v_j = [x_j(1), y_j(1), x_j(end), y_j(end), cos(θ), sin(θ)]    (3.1)
This feature vector constitutes a set of simple descriptors that have proven experimentally to be sufficient to describe activities in a large variety of domains (such as traffic monitoring, subway control and the monitoring of smart environments), mainly because they are the most salient descriptors, but also because they are appropriate for real-world videos depicting unstructured scenes where trajectories of different types strongly overlap, and because they are usually the ones used by end-users of different domains.
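Building the feature vector of Eq. (3.1) is straightforward; a minimal sketch is given below, assuming the trajectory is given as two coordinate sequences.

```python
import numpy as np

def trajectory_features(xs, ys):
    """Build the feature vector of Eq. (3.1) from a trajectory given as two
    equal-length sequences of ground-plane coordinates."""
    x0, y0 = xs[0], ys[0]                  # entry point
    x1, y1 = xs[-1], ys[-1]                # exit point
    theta = np.arctan2(y1 - y0, x1 - x0)   # direction of the entry-to-exit vector
    return np.array([x0, y0, x1, y1, np.cos(theta), np.sin(theta)])

print(trajectory_features([0.0, 1.0, 2.5], [0.0, 0.5, 2.5]))
```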
3.4.2 Incremental Learning

We need a system able to learn the activity clusters in an on-line way. On-line learning is indeed an important capability required to perform behaviour analysis on a long-term basis. A first approach proposed in the state of the art for on-line clustering is the Leader algorithm [14]. In this method, it is assumed that a rule for computing the distance D between any pair of objects and a threshold T are given. The algorithm constructs a partition of the input space (defining a set of clusters) and a leading representative for each cluster, so that every object in a cluster is within a distance T of the leading object. The threshold T is thus a measure of the diameter of each cluster. The clusters are numbered CL_1, CL_2, CL_3, ..., CL_k. The leading object representative associated with cluster CL_j is denoted by L_j. The algorithm makes one pass through the dataset, assigning each object to the cluster whose leader is the closest, and making a new cluster, with a new leader, for objects that are not close enough to any existing leader. The process is repeated until all objects are assigned to a cluster. The leaders-subleaders [29], ARTMAP [8] and BIRCH [18] algorithms are of this type. The strongest advantages of the Leader algorithm are that it requires a single scan of the database and that only cluster representatives need to be accessed during processing. However, the algorithm is extremely sensitive to the threshold parameter defining the minimum activation of a cluster CL. A new input object defined by its feature vector v will be allocated to cluster CL_j if v falls into its input receptive field (a hyper-sphere whose radius is given by r_j = T). Defining T is application dependent: it can be supplied by an expert with a deep knowledge
of the data or by employing heuristics. In this work we propose to learn this parameter employing a training set and a machine learning process. Let each cluster CL_i be defined by a radial basis function (RBF) centred at the position given by its leader L_i:

    CL_i(v) = φ(L_i, v, T) = exp(−‖v − L_i‖² / T²)    (3.2)

The RBF function has a maximum of 1 when its input is v = L_i and thus acts as a similarity detector, with decreasing values output whenever v strays away from L_i. We can make the choice that an object element will be included into a cluster if CL_i(v) ≥ 0.5, which is a natural choice. The cluster receptive field (hyper-sphere) is controlled by the parameter T. Now, consider C = {CL_1, ..., CL_k} a clustering structure of a dataset X = {v_1, v_2, ..., v_N}; {L_1, L_2, ..., L_k} are the leaders in this clustering structure, P = {P_1, ..., P_s} is the true partition of the data (ground-truth), and {M_1, ..., M_s} are the main representatives (or leaders) in the true partition. We can define an error function given by

    E = (1/N) ∑_{j=1}^{N} E_j    (3.3)

    E_j =  0   if L(v_j), v_j ∈ CL_i and L(v_j), v_j ∈ P_i
          −1   if v_j ∈ CL_i, |CL_i| = 1 and L(v_j) ≠ M(v_j)
           1   otherwise    (3.4)

where L(v_j) is the leader associated with v_j in the clustering structure C and |CL_i| is the cluster cardinality; M(v_j) is the leader associated with v_j in the true partition P. In the above equation, the first case represents a good clustering, when the cluster prototype and the cluster elements match the ground-truth partition P: the error is zero and the cluster size is correct. The second case corresponds to a cluster made of a singleton element whose prototype does not correspond to any expected cluster prototype in the 'true' partition P; in this case the cluster size has to grow in order to enclose the singleton element. The remaining case is where an element is wrongly included into a cluster; the cluster size then has to decrease to exclude unwanted border elements. Minimising this error is equivalent to refining the clustering structure C, or equivalently to adjusting the parameter T that controls the cluster receptive field. A straightforward way to adjust T and minimise the error is an iterative gradient-descent method:

    T(t+1) = T(t) − η ∂E(t)/∂T    (3.5)

where the error gradient at time t is:

    ∂E(t)/∂T = (1/N) ∑_j E_j(t) ∂Φ̂/∂T    (3.6)
and the cluster activation gradient is:

    ∂Φ̂/∂T = ∂/∂T exp(−‖v_j − L(v_j)‖² / T²)    (3.7)

    ∂Φ̂/∂T = −‖v_j − L(v_j)‖² (2T) exp(−‖v_j − L(v_j)‖² / T²)    (3.8)

The threshold update can thus be written as:

    T(t+1) = T(t) − η (1/N) ∑_j E_j(t) (−2T ‖v_j − L(v_j)‖²) Φ̂    (3.9)
The final value is typically set when the error is sufficiently small or when the process reaches a given number of iterations. Convergence to an optimum value for both T and E is only guaranteed if the data in the 'true' partition P are well structured (having high intra-class homogeneity and high inter-class separation; see the unsupervised/supervised evaluation below). With the purpose of tuning the parameter T for this application, we have defined a training dataset (with associated ground-truth) containing sixty-nine synthetic trajectories. The ground-truth trajectories were manually drawn on a top-view scene image. Figure 3.2 shows the empty scene of the Toulouse airport with some drawn trajectories. Semantic descriptions such as From Taxi parking area to Tow-tractor waiting point were manually given. There are twenty-three such annotated semantic descriptions, which are called trajectory types in the following. Each trajectory type is associated with a main trajectory that best matches that description; in addition, two complementary trajectories define the confidence limits within which we can still associate that semantic description. In Figure 3.2 the main trajectory of each trajectory type is represented by a red continuous line, while blue broken lines represent the complementary trajectories of the trajectory type. Thus, each ground-truth trajectory is associated with a semantic descriptor or trajectory type, and each trajectory type contains a triplet of trajectories. The proposed gradient-descent methodology was applied to the ground-truth dataset. The threshold T in the Leader algorithm is initially set to a large value (which causes a merge of most trajectory types). Figure 3.3 shows how this threshold value evolves as the gradient algorithm iterates; the graph of the corresponding error is shown in Figure 3.4. Note that for this application we have not encountered local minima problems. However, as gradient-descent algorithms are clearly exposed to this problem, it could be envisaged to verify whether the minimum found is indeed the global optimum; a multiresolution analysis would help here. It is also possible to evaluate, in an unsupervised or supervised manner, the quality of the resulting clustering structure. Unsupervised evaluation: typical clustering validity indexes evaluating the intra-cluster homogeneity and inter-cluster separation, such as the Silhouette [6, 16], Dunn [10] and Davies-Bouldin [9] indexes (given in Annex 1), can be employed.
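The sketch below combines the Leader assignment rule with a simple iterative adjustment of T in the spirit of Eqs. (3.3)–(3.9). The error signal used here is a simplified surrogate (grow T for singletons of multi-member classes, shrink it for mixed clusters) rather than the chapter's exact gradient, and all numeric values are illustrative.

```python
import numpy as np

def rbf(v, leader, T):
    """Cluster activation of Eq. (3.2)."""
    return np.exp(-np.sum((v - leader) ** 2) / T ** 2)

def leader_clustering(data, T):
    """On-line Leader algorithm: assign each vector to the closest leader if its
    activation is >= 0.5, otherwise create a new cluster led by that vector."""
    leaders, assignment = [], []
    for v in data:
        acts = [rbf(v, L, T) for L in leaders]
        if acts and max(acts) >= 0.5:
            assignment.append(int(np.argmax(acts)))
        else:
            leaders.append(v.copy())
            assignment.append(len(leaders) - 1)
    return leaders, assignment

def tune_threshold(data, true_labels, T0=2.0, eta=0.05, iters=50):
    """Adjust T on a labelled training set: shrink the receptive field when
    elements of different classes are merged, enlarge it when an element of a
    multi-member class ends up as a singleton cluster."""
    T = T0
    for _ in range(iters):
        _, assign = leader_clustering(data, T)
        grad = 0.0
        for v, a, y in zip(data, assign, true_labels):
            same_cluster = [yy for aa, yy in zip(assign, true_labels) if aa == a]
            if len(same_cluster) == 1 and true_labels.count(y) > 1:
                grad += 1.0      # singleton that should be merged: grow T
            elif any(yy != y for yy in same_cluster):
                grad -= 1.0      # mixed cluster: shrink T
        T = max(1e-3, T + eta * grad / len(data))
    return T

data = [np.array(p, dtype=float) for p in [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9)]]
print(tune_threshold(data, [0, 0, 1, 1]))   # T already separates the two groups here
```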
Fig. 3.2 Ground-truth for different semantic clusters.
Fig. 3.3 Evolution of the threshold T controlling the cluster receptive field.
Figure 3.5 shows the evolution of these three indexes on the clustering of trajectories as the gradient-descent algorithm evolves. Supervised evaluation: supervised validity indexes, which in this case compare the clustering results to the true data partition, such as the Jaccard index [27] (given in
Fig. 3.4 Evolution of the gradient-descent error with the number of iterations. The error gives an indication of how many elements of different trajectory types are merged together in a single cluster.
Fig. 3.5 Validity indexes such as Silhouette (higher values are better), Dunn (higher values are better) and Davies-Bouldin (lower values are better) at each iteration step of the gradient-descent algorithm.
Annex 2) can also be employed. Figure 3.6 shows the evolution of this index on the clustering of trajectories as the gradient-descent algorithm evolves. For large values of the threshold T (above 1.5) it is possible to see that a large number of trajectories are badly clustered (about 1/3 of the dataset). The unsupervised indexes are also unstable (presenting some oscillatory changes over the different iterations) and indicative of a bad clustering structure (meaning low inter-cluster distance and high intra-cluster distance). The mapping with the true partition is also poor (indicated by low values of the Jaccard index). For values of the threshold T below 1.4 there is an almost monotonic improvement of the unsupervised and
3
Incremental Learning on Trajectory Clustering
59
Fig. 3.6 Supervised evaluation at each iteration step of the gradient-descent algorithm. The Jaccard index compares the resulting clustering with the partition given by the trajectory ground-truth.
supervised clustering indexes. The Jaccard index reaches its maximum value (meaning a perfect matching with the true partition) for a threshold T=0.79, which is then selected for our analysis. The unsupervised indexes are also indicative of a good clustering structure. The Leaders defined from this process are selected as the initial cluster centres that will guide the partition of new incoming data.
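As an illustration of this supervised selection of T, the sketch below sweeps a range of candidate thresholds and keeps the one whose partition maximises a supervised score against the ground truth. It is a grid-search stand-in for the gradient-descent scheme described above, and the names in the commented example (`leader_clustering`, `jaccard_index`) refer to the other sketches in this chapter rather than to the authors' code.

```python
import numpy as np

def tune_threshold(trajectories, gt_labels, candidates, cluster_fn, score_fn):
    """Pick the receptive-field threshold whose partition best matches the
    ground-truth labels. cluster_fn(trajectories, T) -> (leaders, labels);
    score_fn(labels, gt_labels) is a supervised index such as the Jaccard
    index defined in Annex 2."""
    best_T, best_score = None, -1.0
    for T in candidates:
        _, labels = cluster_fn(trajectories, T)
        score = score_fn(labels, gt_labels)
        if score > best_score:
            best_T, best_score = T, score
    return best_T, best_score

# Hypothetical usage, assuming the other sketches in this chapter:
# best_T, best_J = tune_threshold(trajs, gt, np.arange(0.4, 2.0, 0.05),
#                                 leader_clustering, jaccard_index)
```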
3.5 Trajectory Analysis Evaluation

In order to test the efficiency of the trajectory clustering algorithm we have analysed a new set of synthetic trajectories, which we denote the 'experimental set'. This new set is composed of 230 trajectories with the same structure as the training dataset; that is, each trajectory is associated with a semantic meaning. Moreover, each trajectory in the test dataset was generated from the ground-truth dataset in the following way. Trajectories are generated by randomly selecting among points uniformly distributed, on each side, between the main trajectory and the two adjacent trajectories. These points lie on segments linking the main trajectory to the two adjacent trajectories. Each segment starts at a sample point of the main trajectory and goes to the nearest sample point in the adjacent trajectory. Adjacent trajectories are up-sampled for a better point distribution, and ten points lie on each segment linking the main trajectory and an adjacent trajectory. For this reason, the trajectories generated from the principal trajectory convey the same semantics. Figure 3.7 shows a few examples. The clustering algorithm is then run again on the experimental data set without any knowledge of the semantic description of each trajectory. After the clustering process is complete, the resulting partition can be assessed by comparing it with the partition initially defined by the ground-truth, which is what the Jaccard index does.
Fig. 3.7 Four different sets of synthetic trajectories. Each set contains twenty trajectories different from the main and adjacent trajectories previously defined in the Ground-truth dataset (Figure 3.2).
In this case, the Jaccard index takes a value of 0.9119 (our baseline). Moreover, typical metrics related to the ROC space (receiver operating characteristics) can be computed, which evaluate the mapping between the clustering algorithm output and the ground-truth partition. These measures are the true positive rate (TPR) and false positive rate (FPR), which in this case take the following values: TPR=0.9565, FPR=0.002.
In order to assess the robustness of the trajectory clustering algorithm, we have evaluated our approach on additional sets of synthetic trajectories. Each set has the particularity of containing groups of very similar trajectories (even overlapping in the most difficult cases), yet associated with different semantics. The evaluation thus consists in assessing how much the clustering algorithm is affected by the different levels of complexity/noise introduced. To characterise the different datasets, we have computed the unsupervised clustering indexes Silhouette, Dunn and Davies-Bouldin. Table 3.1 summarises the results. Each experimental set in the table contains increasing complexity. For instance, some groups of trajectories defined in 'experimental set 2', which overlap with one another, are also present in the following experimental sets (experimental sets 3, 4, 5, ...). For each experimental set, some new groups of trajectories are added, which in turn induce more overlapping situations and are also present in the following experimental sets. Figure 3.8 presents some examples of such trajectories in the different experimental sets. The structuring indexes Silhouette, Dunn and Davies-Bouldin reflect the less distinct separation induced between trajectories with different semantic meanings (the Silhouette and Dunn indexes decrease, while the Davies-Bouldin index increases).
Table 3.1 Clustering results on different synthetic datasets. Input columns: dataset name, number of trajectories, number of ground-truth trajectory types (Nb. GT's) and ground-truth structure characteristics (Sil, Dunn, DB). Output columns: number of clusters found, cluster structure characteristics (Sil, Dunn, DB), Jaccard index and ROC measures (TPR, FPR).

Name       Nb. Trj.  Nb. GT's  Sil   Dunn  DB    | Nb. Clusters  Sil   Dunn  DB    Jaccard  TPR   FPR
ExpSet     230       23        0.85  0.79  0.22  | 22            0.82  0.79  0.30  0.91     0.95  0.0020
ExpSet2    280       28        0.82  0.59  0.24  | 27            0.80  0.80  0.33  0.92     0.96  0.0013
ExpSet3    340       34        0.80  0.58  0.26  | 32            0.76  0.03  0.41  0.87     0.94  0.0018
ExpSet4    440       44        0.76  0.47  0.33  | 42            0.73  0.06  0.47  0.85     0.92  0.0016
ExpSet5    520       52        0.75  0.46  0.33  | 51            0.74  0.30  0.42  0.91     0.95  0.00075
ExpSet6    590       59        0.73  0.47  0.37  | 55            0.70  0.14  0.49  0.86     0.93  0.0012
ExpSet7    650       65        0.72  0.39  0.39  | 59            0.67  0.08  0.57  0.79     0.88  0.0017
ExpSet8    710       71        0.70  0.37  0.40  | 65            0.67  0.05  0.54  0.82     0.91  0.0012
ExpSet9    750       75        0.70  0.39  0.41  | 62            0.57  0.03  0.62  0.61     0.80  0.0025
ExpSet10   840       84        0.67  0.25  0.45  | 71            0.58  0.03  0.65  0.66     0.82  0.0020
ExpSet11   890       89        0.66  0.13  0.48  | 72            0.55  0.04  0.66  0.60     0.77  0.0024
ExpSet12   920       92        0.65  0.21  0.50  | 77            0.54  0.02  0.68  0.62     0.79  0.0022
ExpSet13   990       99        0.64  0.22  0.52  | 86            0.53  0.04  0.66  0.62     0.77  0.0019
ExpSet14   1070      107       0.60  0.04  0.61  | 77            0.45  0.02  0.74  0.45     0.66  0.0030
The different experimental datasets cover situations ranging from a very strong separation between groups of trajectories with different semantic meanings (structuring indexes Silhouette=0.85, Dunn=0.79, Davies-Bouldin=0.22), to partial confusion (Silhouette=0.7588, Dunn=0.4690, Davies-Bouldin=0.3331), high confusion (Silhouette=0.6517, Dunn=0.2131, Davies-Bouldin=0.5031) and very high confusion (Silhouette=0.6098, Dunn=0.0463, Davies-Bouldin=0.6157). The trajectory clustering algorithm performs accordingly, having more difficulty retrieving all initial semantic groups as the confusion increases (that is, as the internal structure of the input data decreases); at the same time, the mapping between the trajectory clustering results and the semantic groups (GT) also worsens, as exposed by the Jaccard index. However, the overall behaviour shown by the true positive rate (TPR) and false positive rate (FPR) remains globally correct, with TPR values near or above 0.77 for all studied cases except the worst case, experimental set 14, where the TPR is below 0.7.
In order to assess the generalisation capability of the trajectory clustering algorithm, we have carried out new experiments employing the CAVIAR dataset (http://www-prima.inrialpes.fr/PETS04/caviar_data.html). The dataset contains people observed at the lobby entrance of a building. The annotated ground-truth includes, for each person, the bounding box (id, centre coordinates, width, height) with a description of his/her movement type (inactive, active, walking, running) for a given situation (moving, inactive, browsing) and, most importantly, gives contextual information for the acted scenarios (browsing, immobile, left object, walking, drop down). In Figure 3.9, some examples of the acted scenarios in the CAVIAR dataset and the involved contextual objects are shown. From the CAVIAR dataset we have kept only the representative trajectory of each acted scenario (which we further call the principal trajectory). Other trajectories, such as supplementary movements or non-actor trajectories not related to the acted scenario, are filtered out. In total, forty different activities can be distinguished.
Fig. 3.8 Different examples of trajectories added to a given experimental data set under study. Different colours in an experimental data set correspond to different semantics attributed to the trajectories (each trajectory is associated with only one semantic meaning). Although the semantics between trajectories may be different, their spatial similarity can be very close.
Fig. 3.9 Two different trajectory types of people going to look for information (Browsing) at two different places.
They include Browsing at different places of the scene, Walking (going through the hall) from different locations, and leaving or dropping an object at different locations of the hall. We have created the new evaluation set by applying the following formulae:

    xi = N(α rx, x1) + N(β rx, xi)    (3.10)

    yi = N(α ry, y1) + N(β ry, yi)    (3.11)

where N(σu, u) is a random number drawn from a normal distribution with mean u and standard deviation σu; rx = |max(xi) − min(xi)| is the range function on x (ry is defined analogously on y); and α, β are two constants controlling the spread of the random functions. For each principal trajectory, we have generated 30 new synthetic trajectories by adding random noise as explained above.
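A possible transcription of this generation step is sketched below. It reads Eqs. (3.10)-(3.11) literally, so each generated coordinate is the sum of one term drawn around the first sample (spread α·range) and one term drawn around the current sample (spread β·range); the values of α and β are illustrative assumptions, since the chapter does not specify them, and only the number of copies (30) comes from the text.

```python
import numpy as np

def generate_noisy_trajectories(principal, n_copies=30, alpha=0.05, beta=0.02, rng=None):
    """Generate synthetic copies of a principal trajectory by adding Gaussian
    noise following a literal reading of Eqs. (3.10)-(3.11).
    `principal` is an array of shape (n_points, 2) holding (x, y) samples."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = principal[:, 0], principal[:, 1]
    rx = abs(x.max() - x.min())   # range function on x
    ry = abs(y.max() - y.min())   # range function on y
    copies = []
    for _ in range(n_copies):
        # Trajectory-level term centred on the first sample, plus
        # per-point jitter centred on each sample, as written in the text.
        xi = rng.normal(x[0], alpha * rx) + rng.normal(x, beta * rx)
        yi = rng.normal(y[0], alpha * ry) + rng.normal(y, beta * ry)
        copies.append(np.column_stack([xi, yi]))
    return copies
```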
Fig. 3.10 Synthetic trajectories in the CAVIAR dataset generated from the activities shown in the previous figure. The trajectories are plotted employing their 3D coordinates on the ground.
In total, the synthetic CAVIAR dataset contained 1200 trajectories. Figure 3.10 shows the synthetic trajectories generated from the principal trajectories of the activities shown before. The CAVIAR synthetic evaluation data set was further divided into a Learning set (containing 1/3 of the trajectories in the synthetic evaluation data set, plus all of the principal trajectories) and a Test set (containing the remaining 2/3 of the trajectories). The Learning set was employed to tune the threshold T, which, as indicated before, is critical to the clustering algorithm. In this case, the tuning algorithm found a value of T=1.05, for which the best clustering partition matches the 'true' data partition. When evaluating the algorithm on the Test set, the supervised Jaccard index used to compare the resulting partition with the 'true' expected partition gives a value of 0.99. One supplementary evaluation set was created by adding more trajectories which spatially overlap those already defined, yet convey a different semantics (the same procedure carried out for the first synthetic dataset). The structuring indexes Silhouette, Dunn and Davies-Bouldin again reflect the less distinct separation induced between trajectories with different semantic meanings. Table 3.2 summarises the results on both CAVIAR experimental datasets. Again, the same trend as for the synthetic COFRIEND dataset appears: when the confusion between semantics increases (and thus the internal structure of the input data decreases), retrieving all initial semantic groups is more difficult and the mapping between the trajectory clustering results and the semantic groups also worsens (the Jaccard index decreases).

Table 3.2 Clustering results on CAVIAR synthetic datasets. Input columns: dataset name, number of trajectories, number of ground-truth trajectory types and ground-truth structure (Sil). Output columns: number of clusters, cluster structure (Sil), Jaccard index and ROC measures (TPR, FPR).

Name     Nb. Trj.  Nb. GT's  Sil   | Nb. Clusters  Sil   Jaccard  TPR   FPR
ExpSet1  440       22        0.79  | 22            0.79  0.99     0.99  0.0001
ExpSet2  740       37        0.65  | 26            0.60  0.56     0.69  0.008
3.6 Results

We have processed in total five video datasets corresponding to different monitoring instances of an aircraft in the airport docking area (in the following, these video datasets are named cof1, cof2, cof3, cof4 and cof8). They correspond to about five hours of video, or about 8000 trajectories. The system was first tuned and initialised as previously described (i.e. employing a learning dataset with 230 trajectories distributed into 23 trajectory types). Figure 3.11 shows the online system learning as the different video sequences (datasets) are processed.
Fig. 3.11 Number of processed trajectories (blue curve) and number of trajectory clusters created by the online system as the different datasets are sequentially processed. Note that the number of trajectory clusters does not increase much in relation to the number of trajectories analysed after the processing of the 'cof2' dataset.
The structure of the clustering resulting after the processing of a given dataset can again be measured with unsupervised evaluation indexes (i.e. the intra-cluster homogeneity and inter-cluster separation), as in Section 3.4 'Trajectory analysis'. We calculated the Silhouette index for the clustering partition induced on each analysed dataset; Table 3.3 gives these results. When comparing these Silhouette values with those obtained in the 'Trajectory analysis' section for the evaluation of the clustering algorithm, we can deduce that the analysed datasets still contain high levels of complexity/noise.

Table 3.3 Clustering structure evaluated by the Silhouette index on the processed datasets.

dataset   Silhouette Index
cof1      0.45
cof2      0.24
cof3      0.43
cof4      0.39
cof8      0.39
We employed the trajectory clusters to measure the similarity between the different datasets. For this purpose a histogram was built for each dataset, where each bin of the histogram represents the number of mobile objects associated with that particular trajectory cluster. The similarity between datasets then comes down to measuring the similarity between the established histograms. For this purpose we employ the Kullback-Leibler divergence measure, given next for any two different histograms h1 and h2:

    KL(h1, h2) = ∑_r p_h1(r) log [ p_h1(r) / p_h2(r) ] + ∑_r p_h2(r) log [ p_h2(r) / p_h1(r) ]    (3.12)

where r is a given bin of the trajectory histogram. Because the Kullback-Leibler divergence is an unbounded measure, which equals zero when h1 = h2, we actually calculate the correlation (corr) between the different datasets by adding a normalisation factor and a unit offset:

    corr(h1, h2) = 1 + KL(h1, h2) / [ ∑_r p_h1(r) log(p_h1(r)) + ∑_r p_h2(r) log(p_h2(r)) ]    (3.13)
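The sketch below mirrors Eqs. (3.12)-(3.13) for two cluster-occupancy histograms. It is an illustration only: the small eps smoothing for empty bins and the normalisation of raw counts to probabilities are implementation choices not stated in the text.

```python
import numpy as np

def histogram_correlation(h1, h2, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two trajectory-cluster
    histograms (Eq. 3.12) and the derived correlation score (Eq. 3.13).
    h1 and h2 are raw bin counts over the same set of trajectory clusters."""
    p1 = np.asarray(h1, dtype=float) + eps
    p2 = np.asarray(h2, dtype=float) + eps
    p1 /= p1.sum()
    p2 /= p2.sum()
    kl = np.sum(p1 * np.log(p1 / p2)) + np.sum(p2 * np.log(p2 / p1))   # Eq. (3.12)
    norm = np.sum(p1 * np.log(p1)) + np.sum(p2 * np.log(p2))           # normalisation term
    return 1.0 + kl / norm                                             # Eq. (3.13)

# Example: corr = histogram_correlation(counts_cof1, counts_cof2)
```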
The correlation between the different datasets is given in Table 3.4.

Table 3.4 Trajectory-based correlation between the different analysed datasets.

        cof1   cof2   cof3   cof4   cof8
cof1    1      0.77   0.75   0.79   0.76
cof2    0.77   1      0.78   0.80   0.78
cof3    0.75   0.78   1      0.76   0.71
cof4    0.79   0.80   0.76   1      0.81
cof8    0.76   0.78   0.71   0.81   1
From the trajectory-based correlation table it can be observed that sequences cof2, cof4 and cof8 are the most similar, although in general all five sequences contain a large number of common trajectories, as their minimum pairwise correlation is above 0.7.
3.7 Conclusions

Activity clustering is one of the new trends in video understanding. Here we have presented an on-line learning approach for trajectory and activity learning. Previous state of the art has mainly focused on the recognition of activities, mostly with the aim of labelling them as normal, suspicious or dangerous. However, the adaptation/update of the activity model through the analysis of long-term periods has only been partially addressed. Moreover, most state of the art on activity analysis has been designed for the case of structured motions such as those observed in traffic monitoring (vehicle
going straight on the road, vehicle turning on a round-about, ...) or specific isolated body motions like walking, running, jumping. In this paper, we have addressed the problem of incremental learning of unstructured spatial motion patterns. The proposed algorithm allows large periods of time (large amounts of data) to be monitored and processed, and thus analysis to be performed on a long-term basis. The proposed approach employs a simple, yet advantageous incremental algorithm: the leader algorithm. Its strongest advantage is that it requires only a single scan of the data, and only cluster representatives need to be stored in main memory. Generally, incremental approaches rely on a manually selected threshold to decide whether the data is too far away from the clusters. To improve on this, we propose to control the learning rate with coefficients indicating when a cluster can be updated with new data, and we solve the difficulty of tuning the system by employing a training set and machine learning. The system respects the main principles of incremental learning: it learns new information from datasets that consecutively become available; the algorithm does not require access to previously used datasets, yet it is capable of largely retaining the previously acquired knowledge and has no problem accommodating any new classes introduced in the new data. In terms of the studied application, the system thus has the capacity to create new clusters for new trajectories whose type had not been previously observed. Exhaustive evaluation is made on synthetic and real datasets employing unsupervised and supervised evaluation indexes. Results show the ability of trajectory clusters to characterise the scene activities. In this work we have addressed only the recognition of single mobiles appearing in the scene. In future work, we will address group-related activities such as 'Meeting' (trajectory merging) and 'Splitting'. Our future work will also include a more exhaustive analysis with temporal information, extracted from trajectories, to achieve a more precise behaviour characterisation and to distinguish between mobiles moving at 'walking' speed or higher speed. We will also include normal/abnormal behaviour analysis from trajectory clustering.
Acknowledgements

This work was partially funded by the EU FP7 project CO-FRIEND with grant no. 214975.
References [1] Cofriend, http://easaier.silogic.fr/co-friend/ [2] Anjum, N., Cavallaro, A.: Single camera calibration for trajectory-based behavior analysis. In: IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 147–152. IEEE, Los Alamitos (2007) [3] Antonini, G., Thiran, J.: Counting pedestrians in video sequences using trajectory clustering. IEEE Transactions on Circuits and Systems for Video Technology 16, 1008–1020 (2006)
[4] Avanzi, A., Bremond, F., Tornieri, C., Thonnat, M.: Design and assessment of an intelligent activity monitoring platform. EURASIP Journal on Advances in Signal Processing 2005, 2359–2374 (2005) [5] Bashir, F., Khokhar, A., Schonfeld, D.: Object trajectory-based activity classification and recognition using hidden markov models. IEEE Transactions on Image Processing 16, 1912–1919 (2007) [6] Campello, R., Hruschka, E.: A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems 157, 2858–2875 (2006) [7] Carpenter, G.A., Grossberg, S.: A self-organizing neural network for supervised learning, recognition, and prediction. IEEE Communications Magazine 30, 38–49 (1992) [8] Carpenter, G.A., Grossberg, S., Reynolds, J.: Artmap: supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991) [9] Davies, D., Bouldin, D.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224–227 (1979) [10] Dunn, J.: Well-separated clusters and optimal fuzzy partitions. Cybernetics and Systems 4, 95–104 (1974) [11] Foresti, G., Micheloni, C., Snidaro, L.: Event classification for automatic visual-based surveillance of parking lots. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3, pp. 314–317. IEEE, Los Alamitos (2004) [12] Fusier, F., Valentin, V., Bremond, F., Thonnat, M., Borg, M., Thirde, D., Ferryman, J.: Video understanding for complex activity recognition. Machine Vision and Applications 18, 167–188 (2007) [13] Gaffney, S., Smyth, P.: Trajectory clustering with mixtures of regression models. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, United States (1999) [14] Hartigan, J.A.: Clustering algorithms. John Wiley & Sons, Inc., New York (1975) [15] Jantke, K.: Types of incremental learning. In: AAAI Symposium on Training Issues in Incremental Learning, pp. 23–25 (1993) [16] Kaufman, L., Rousseeuw, P.: Finding groups in data. An introduction to cluster analysis, New York (1990) [17] Lange, S., Grieser, G.: On the power of incremental learning. Theoretical Computer Science 288, 277–307 (2002) [18] Livny, M., Zhang, T., Ramakrishnan, R.: Birch: an efficient data clustering method for very large databases. In: ACM SIGMOD International Conference on Management of Data, Montreal, vol. 1, pp. 103–114 (1996) [19] Muhlbaier, M.D., Topalis, A., Polikar, R.: Learn++.nc: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks 20, 152–168 (2009) [20] Naftel, A., Khalid, S.: Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space. Multimedia Systems 12, 227–238 (2006) [21] Oliver, N., Rosario, B., Pentland, A.: A bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 831–843 (2000) [22] Piciarelli, C., Foresti, G., Snidaro, L.: Trajectory clustering and its applications for video surveillance. In: Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pp. 40–45. IEEE, Los Alamitos (2005)
[23] Polikar, R., Upda, L., Upda, S., Honavar, V.: Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, 497–508 (2001) [24] Porikli, F.: Learning object trajectory patterns by spectral clustering. In: 2004 IEEE International Conference on Multimedia and Expo (ICME), vol. 2, pp. 1171–1174. IEEE, Los Alamitos (2004) [25] Seipone, T., Bullinaria, J.: Evolving neural networks that suffer minimal catastrophic forgetting. In: Modeling Language, Cognition and Action - Proceedings of the Ninth Neural Computation and Psychology Workshop, pp. 385–390. World Scientific Publishing Co. Pte. Ltd., Singapore (2005) [26] Sharma, A.: A note on batch and incremental learnability. Journal of Computer and System Sciences 56, 272–276 (1998) [27] Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley, Reading (2005) [28] Vu, V.-T., Brémond, F., Thonnat, M.: Automatic video interpretation: A recognition algorithm for temporal scenarios based on pre-compiled scenario models. In: Crowley, J.L., Piater, J.H., Vincze, M., Paletta, L. (eds.) ICVS 2003. LNCS, vol. 2626, pp. 523–533. Springer, Heidelberg (2003) [29] Vijaya, P.: Leaders subleaders: An efficient hierarchical clustering algorithm for large data sets. Pattern Recognition Letters 25, 505–513 (2004) [30] Zhou, Z., Chen, Z.: Hybrid decision tree. Knowledge-based systems 15, 515–528 (2002)
Appendix 1

Silhouette Index

The Silhouette index is defined as follows. Consider a data object v_j, j ∈ {1, 2, ..., N}, belonging to cluster cl_i, i ∈ {1, 2, ..., c}. This means that object v_j is closer to the prototype of cluster cl_i than to any other prototype. Let the average distance of this object to all objects belonging to cluster cl_i be denoted by a_ij. Also, let the average distance of this object to all objects belonging to another cluster cl_i', i' ≠ i, be called d_i'j. Finally, let b_ij be the minimum d_i'j computed over i' = 1, ..., c, which represents the dissimilarity of object j to its closest neighbouring cluster. The Silhouette index is then

    S = (1/N) ∑_{j=1}^{N} s_j,   with   s_j = (b_ij − a_ij) / max(a_ij, b_ij)

Larger values of S correspond to a good clustering partition.
Dunn Index

The Dunn index is defined as follows. Let cl_i and cl_i' be two different clusters of the input dataset. The diameter Δ of cl_i is defined as

    Δ(cl_i) = max_{v_j, v_j' ∈ cl_i} d(v_j, v_j')

Let δ be the distance between cl_i and cl_i', defined as

    δ(cl_i, cl_i') = max_{v_j ∈ cl_i, v_j' ∈ cl_i'} d(v_j, v_j')

where d(x, y) indicates the distance between points x and y. For any partition, the Dunn index is

    D = min_i { min_{i'} [ δ(cl_i, cl_i') / max_{i''} Δ(cl_i'') ] },   i, i' ∈ {1, ..., N}, i ≠ i'
Larger values of D correspond to a good clustering partition.
Davies-Bouldin Index

The Davies-Bouldin index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The scatter within cluster cl_i is computed as

    S_i = (1/|cl_i|) ∑_{v_j ∈ cl_i} ||v_j − m_i||

where m_i is the prototype of cluster cl_i. The distance δ between clusters cl_i and cl_i' is defined as

    δ(cl_i, cl_i') = ||m_i − m_i'||

The Davies-Bouldin (DB) index is then defined as

    DB = (1/N) ∑_{i=1}^{N} R_i,   with   R_i = max_{i'≠i} R_ii'   and   R_ii' = (S_i + S_i') / δ(cl_i, cl_i'),   i, i' ∈ {1, ..., N}, i ≠ i'
Low values of the DB index are associated with a proper clustering.
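A possible computation of these three indexes is sketched below. Silhouette and Davies-Bouldin come from scikit-learn; the Dunn index is implemented directly, following the definition above in which the inter-cluster distance δ is the maximum pairwise distance between two clusters (note that many common Dunn implementations use the minimum pairwise distance instead). The feature matrix X and integer cluster labels are assumed inputs.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score, davies_bouldin_score

def dunn_index(X, labels):
    """Dunn index as defined in Appendix 1: smallest inter-cluster distance
    (here the maximum pairwise distance between the two clusters, as in the
    text) divided by the largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diam = max(cdist(c, c).max() for c in clusters)
    inter = [cdist(ci, cj).max()
             for i, ci in enumerate(clusters)
             for cj in clusters[i + 1:]]
    return min(inter) / max_diam

def clustering_validity(X, labels):
    # X: (n_samples, n_features) feature vectors; labels: integer cluster ids.
    labels = np.asarray(labels)
    return {"silhouette": silhouette_score(X, labels),
            "dunn": dunn_index(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels)}
```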
Appendix 2

Jaccard Index

Consider a clustering structure C = {CL_1, ..., CL_m} of a data set X = {v_1, v_2, ..., v_n}, and a defined partition P = {P_1, ..., P_s} of the data. We refer to a pair of points (v_i, v_j) from the data set using the following terms:
• SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P.
• SD: if the points belong to the same cluster of C and to different groups of P.
• DS: if the points belong to different clusters of C and to the same group of P.
• DD: if both points belong to different clusters of C and to different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, the total number of pairs in the data set (that is, M = N(N − 1)/2, where N is the total number of points in the data set). The Jaccard index (J), measuring the degree of similarity between C and P, is

    J = a / (a + b + c)
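A direct transcription of this definition is sketched below; it counts the SS, SD, DS and DD pairs with a quadratic loop, which is adequate for datasets of the size used in this chapter.

```python
from itertools import combinations

def pair_counts(labels_c, labels_p):
    """Count the SS, SD, DS and DD pairs defined above for a clustering C
    and a reference partition P over the same points (labels as sequences
    of cluster/group ids)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1      # SS
        elif same_c:
            b += 1      # SD
        elif same_p:
            c += 1      # DS
        else:
            d += 1      # DD
    return a, b, c, d

def jaccard_index(labels_c, labels_p):
    a, b, c, _ = pair_counts(labels_c, labels_p)
    return a / (a + b + c)
```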
Chapter 4
Highly Accurate Estimation of Pedestrian Speed Profiles from Video Sequences

Panagiotis Sourtzinos, Dimitrios Makris, and Paolo Remagnino
Abstract. This paper presents a system that accurately estimates the speed of individual pedestrians walking at constant speed, from monocular image sequences in which people are captured from a side-view camera. Such accurate estimates are needed to tune speed models for pedestrian simulation software. The system uses a combination of image segmentation and motion tracking to localise the foot positions of pedestrians and convert them to ground-plane speeds using a camera calibration model.
4.1 Introduction

Pedestrian simulation is increasingly being used in the design and optimization of public spaces such as transport terminals; sport, entertainment and leisure venues; shopping centers; commercial and public buildings; and venues for major international events such as the Olympics. To accurately simulate pedestrian behavior and crowd dynamics in such environments, simulation tools must be calibrated and validated using precise real world data [1]. A key determinant of crowd behavior is the preferred walking speed of individuals within the crowd; this has been shown to vary by context and region [2], thus appropriate speed profiles are required for each study. While attempts have been made to automate the collection of pedestrian data for preferred walking speed [3] [4], the developers of pedestrian simulation software have found the precision to be insufficient and have adopted manual methods to extract pedestrian speeds from video footage. The most common approach is for a human operator to examine video sequences and manually mark the position of pedestrians in each frame of the video. The paths of the tracked pedestrians are analyzed and the preferred speeds extracted. These manual techniques are time consuming, resource intensive and error prone; therefore there is demand for automated video analysis which can deliver a high degree of accuracy.

Panagiotis Sourtzinos · Dimitrios Makris · Paolo Remagnino
Digital Imaging Research Centre, Kingston University, UK
In this work we present an automated system for the accurate estimation of the preferred walking speed of pedestrians walking alone and unimpeded. Such a constraint comes from the requirements of microscopic pedestrian simulation, where each entity is modeled from an estimated speed profile; the speed profile should therefore be modeled by examining the speed of pedestrians walking without obstructions. The proposed system uses a combination of image segmentation and motion tracking methods to identify the foot locations of a pedestrian and convert them to a ground-plane speed using camera de-projection and calibration models.
4.2 Background

To automatically estimate the speed of a pedestrian, the person must be tracked in every frame; a trajectory is therefore extracted as the sequence of their locations over time. Pedestrian tracking methods may be classified as motion segmentation tracking, pedestrian appearance detection tracking and feature-based tracking. Motion segmentation aims to detect pedestrians as moving regions in image sequences. Motion segmentation techniques include temporal differencing, optical flow and background subtraction. In temporal differencing [5], moving regions are detected by pixel-wise differences between two or three consecutive frames, while optical flow techniques [6] use the flow vectors of moving objects in order to perform motion segmentation. In background subtraction techniques, motion is detected as the difference between the current image and a reference background, as in [7], where an adaptive mixture of Gaussians is used to model the background. Alternatively, pedestrians may be detected using pedestrian appearance models, such as boosted edgelet body part detectors [8] and Histograms of Oriented Gradients (HOG) [9], which model the appearance of a person using statistical models. Pedestrian detections (blobs) extracted by either motion segmentation or appearance model methods are then temporally grouped into trajectories using methods such as the Kalman filter [10] or the particle filter [11], which associate detections across consecutive frames. In feature-based tracking, a set of interest points of sufficient texture is selected and matched in successive frames, and the points are then clustered into pedestrian trajectories. For instance, Kanade-Lucas-Tomasi (KLT) features are used in [12], while in [13] Harris corners are employed as features. Although most of the above algorithms are sufficient for general pedestrian tracking, they are not accurate enough for speed estimation. Accurate speed estimation requires pinpoint accuracy of the pedestrian location on the ground plane in real-world coordinates. Even if camera calibration is used to project the tracking information onto the ground plane, accuracy will be affected by the accuracy of tracking and the uncertainty of the pedestrian height. Ismail et al. [14] dealt with the same problem as in our work and presented a system for automated pedestrian speed estimation. Camera calibration is performed by collecting linear field observations of entities appearing in the video images, and the tracks of pedestrians are calculated using the KLT feature tracker. However, their approach makes use of top-view scenarios, where the pedestrians are small in size. As a consequence, small errors in image-based position estimation
may have a great impact on the speed estimation and cause significant errors. Thus, in different scenarios, such as looking at pedestrians sideways from a closer distance, their approach is not able to provide accurate results.
4.3 Methodology

Accurate speed estimation of walking pedestrians, viewed from a side-view camera, is achieved by estimating the sequence of locations at which a pedestrian's foot touches the ground. Our system performs this detection by first applying motion detection through foreground-background separation on a video sequence, and then roughly estimating the pedestrian location by tracking a foreground blob. Then, we locate the static foot of the pedestrian throughout the video frames. Finally, we calculate the ground-plane speed by projecting the spatiotemporal information of the pedestrian's feet using a camera calibration model. The following sections describe our methodology in more detail. An overview of our system is presented in Figure 4.1.
Fig. 4.1 Proposed Methodology.
4.3.1 Motion Detection and Tracking

We perform background separation using the approach of KaewTraKulPong and Bowden [7] (Figure 4.3 (b)). We track all blobs whose size is above a threshold by using a connected-component tracking algorithm, with a mean-shift particle filter as a resolver for collisions between them [15][16]. Both the motion detection and the multi-target tracking methods are implemented in OpenCV [17].
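A minimal sketch of this stage is given below. It uses OpenCV's MOG2 background subtractor as a readily available adaptive-Gaussian-mixture stand-in (the model of [7] corresponds to the older MOG variant available in opencv-contrib as cv2.bgsegm.createBackgroundSubtractorMOG), and it only extracts per-frame blobs; the connected-component tracker and mean-shift particle-filter collision resolver used in the chapter are not reproduced. The area threshold and morphology settings are illustrative assumptions.

```python
import cv2

def detect_blobs(video_path, min_area=500):
    """Yield, per frame: the frame, the foreground mask and the bounding
    boxes (x, y, w, h) of foreground blobs larger than min_area."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow pixels (value 127)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)        # remove small noise
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        boxes = [tuple(stats[i, :4]) for i in range(1, n)
                 if stats[i, cv2.CC_STAT_AREA] >= min_area]
        yield frame, mask, boxes
    cap.release()
```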
4.3.2 Static Foot Localization

We assume that the only moving objects in the video sequences are pedestrians who walk parallel to the image plane I, so every blob Bi(t) detected at frame t is
considered to correspond to a pedestrian i. In order to measure the speed of a moving person we need to identify, localise and track a specific body part of this person throughout a video sequence. We choose to locate the static feet of a pedestrian, because from them we can directly derive ground-plane positions that allow accurate speed estimation. Based on the approach of Bouchrika and Nixon [18], we locate the static foot of a pedestrian during their walking gait. When a pedestrian is viewed from the side they appear in an upright position, and we assume that their feet will be located in the bottom part Li(t) (as defined in Figure 4.2) of the foreground blob associated with this pedestrian (Figure 4.3(c)). During walking, one foot is static while the other foot steps forward; thus the accumulation of image corners located in the area of the static foot must be higher, while corners corresponding to the moving foot or other body parts are spread over larger areas due to their motion.
Fig. 4.2 The black rectangle is the blob bounding box, while the red rectangle defines Li(t).
Using the bottom part of the foreground blob as a mask (Figure 4.3(d)), for every frame we select the corresponding area of the image plane from the original image and apply a corner detection algorithm [19] (Figure 4.3(e)) for as long as the blob is visible. We create a 2D histogram map C^i, of the same size as the original image I, for every tracked person i; C^i accumulates the corners c^i_{x,y}(t), found at every frame t, within the lower part Li(t) of the blob Bi(t).
    C^i_{x,y} = ∑_t c^i_{x,y}(t),   where   c^i_{x,y}(t) = 1 if a corner is detected at I^i_{x,y} at frame t, and 0 otherwise    (4.1)
Also, for each pixel we create a set T^i_{x,y}, which stores the temporal information of the presence of the corners. That is:

    T^i_{x,y} = { t : a corner is detected at I^i_{x,y} at frame t }    (4.2)
Fig. 4.3 (a) Part of the original frame, (b) foreground pixels, (c) tracked blob, (d) lower part of tracked blob that is used for foot localisation, (e) detected corners at the lower part of tracked blob.
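The corner-accumulation step of Eqs. (4.1)-(4.2) might be sketched as follows. The Shi-Tomasi detector cited as [19] is available in OpenCV as goodFeaturesToTrack; the detector parameters and the fraction of the bounding box taken as the lower part Li(t) are illustrative assumptions, and a single tracked blob per frame is assumed.

```python
import cv2
import numpy as np

def accumulate_foot_corners(frames, masks, boxes, lower_frac=0.2):
    """Accumulate corners detected in the lower part L_i(t) of a tracked
    blob into a histogram map C (Eq. 4.1) and record, for each pixel, the
    frames at which a corner fired there (Eq. 4.2)."""
    h, w = masks[0].shape[:2]
    C = np.zeros((h, w), dtype=np.float32)
    T = [[[] for _ in range(w)] for _ in range(h)]       # per-pixel frame lists
    for t, (frame, mask, box) in enumerate(zip(frames, masks, boxes)):
        x, y, bw, bh = box
        y0 = y + int((1.0 - lower_frac) * bh)            # top of the lower part L_i(t)
        roi_mask = np.zeros_like(mask)
        roi_mask[y0:y + bh, x:x + bw] = mask[y0:y + bh, x:x + bw]
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=50, qualityLevel=0.05,
                                          minDistance=3, mask=roi_mask)
        if corners is None:
            continue
        for cx, cy in corners.reshape(-1, 2).astype(int):
            C[cy, cx] += 1.0                             # Eq. (4.1)
            T[cy][cx].append(t)                          # Eq. (4.2)
    return C, T
```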
However, it is difficult to identify local maxima in the histogram map due to the non-smooth continuity of the brightness of the points in the image plane. Therefore, we calculate a proximity image P^i for each pedestrian i by smoothing the histogram map using a combination of mean-average and Gaussian filters of size N × N. The result of this process is illustrated in Figure 4.4(a). Then, we use the filter of equation (4.3) to force the peak location towards the lower part of the foot; the size of N varies based on the video resolution and the expected size of the pedestrians. Finally, we locate the local maxima (see Figure 4.4(b)) by scanning a window of size 4N × 4N.

    F_{r,c} = 1 if r < N/2,   −2 if r ≥ N/2    (4.3)

The set of peaks (local maxima points) contains estimates of the feet positions along with potential outliers. Assuming that a person is walking straight, the peaks which correspond to the feet must be located on a line. Therefore, we fit a line to the peaks and discard any outlier peaks whose distance from the line is above
a threshold ϑ_α (see Figure 4.4(c)). The set of remaining peaks may contain multiple estimates of the same foot (e.g. because of a high concentration of corners at the front and the back of the foot); this happens because their distance (the foot size) may be larger than the local-maximum scanning window. We use the temporal information of the corners, as recorded in T^i_{x,y}, to calculate the average frame at which a corner was detected. Peaks whose average frames of appearance differ by less than a threshold ϑ_β are considered to belong to the same static foot, and we therefore group them together into a set S_f, where f is the number of foot hypotheses identified (see Figure 4.4(d)). Since each step of a pedestrian should be described by only one peak, we derive all the possible combinations of peaks Q_n, where

    n = |S_1| · |S_2| · ... · |S_f|    (4.4)

To identify which combination best describes the walk of the pedestrian under examination, we exploit the assumption of constant speed, or equivalently that each step has a similar length to the next and the previous one (Figure 4.4(e)); that is, we select the combination with the smallest variance in the distance between steps:

    argmin_{Q_n} var(B^k_{Q_n}),   where   B^k_{Q_n} = d(Q^k_n, Q^{k+1}_n),   k = 1, 2, ..., f − 1    (4.5)

where d(a, b) is the Euclidean distance between a and b.
Fig. 4.4 (a) Proximity image, (b) local maxima, (c) discarding any outliers (large circle) that do not fit the assumption of straight walking, (d) local maxima that seem to belong to the same foot are linked together, (e) final feet location estimates obtained by minimizing the variance of the step length.
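The combination-selection step of Eqs. (4.4)-(4.5) can be sketched as an exhaustive search over the product of the foot-hypothesis sets; the earlier line-fitting and temporal-grouping stages are assumed to have been applied already and are not reproduced here.

```python
from itertools import product
import numpy as np

def select_step_combination(foot_hypotheses):
    """Given S_1, ..., S_f (each a list of candidate (x, y) peak positions
    for one static-foot hypothesis, ordered along the walk), pick one peak
    per foot so that the variance of the step lengths is minimal."""
    best_combo, best_var = None, np.inf
    for combo in product(*foot_hypotheses):             # all n = |S_1|...|S_f| combinations
        pts = np.asarray(combo, dtype=float)
        steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # B^k = d(Q^k, Q^{k+1})
        v = steps.var() if len(steps) > 1 else 0.0
        if v < best_var:
            best_combo, best_var = combo, v
    return best_combo, best_var
```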
4.3.3 Speed Estimation

In order to identify the real-world positions of the peaks, we perform camera calibration using Tsai's coplanar calibration method [20]. The dimensions of the floor tiles are known, and we construct an artificial checkered board (see Figure 4.5) based on the tile positions. The corner locations of the checkered board are therefore known both in 2D image coordinates and in 3D real-world coordinates, and the camera model is estimated using the Tsai method.
Fig. 4.5 An artificially made checkered board on top of tiles is used to perform camera calibration.
The extracted camera model allows direct conversion of the image-based coordinates to real-world coordinates, as long as they are constrained to the ground plane. After identifying the real-world position R of a static foot location, and knowing the average frame ϕ of appearance of that location and the frame rate r of our video sequence, we can estimate the speed V^k_i for each step of a pedestrian using equation (4.6):

    V^k_i = r · d(R^k_n, R^{k+1}_n) / | ϕ_{R^k_n} − ϕ_{R^{k+1}_n} |    (4.6)
Similarly, we estimate the average speed of a pedestrian by considering the first and the last static foot locations detected. We are interested in pedestrians who move with constant speed. To identify these pedestrians we calculate the mean step speed V̄_i for each of them, and we consider them valid if they satisfy equation (4.7):

    max_k | V^k_i − V̄_i | / V̄_i < 0.15    (4.7)
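Assuming the image-to-ground-plane mapping from the Tsai calibration has already been applied, Eqs. (4.6)-(4.7) reduce to a few lines:

```python
import numpy as np

def step_speeds(ground_positions, avg_frames, frame_rate):
    """Per-step speeds from Eq. (4.6): ground_positions is the list of
    real-world static-foot locations R^k (in metres), avg_frames the
    average frame of appearance of each location, frame_rate the video
    frame rate in frames per second."""
    R = np.asarray(ground_positions, dtype=float)
    phi = np.asarray(avg_frames, dtype=float)
    dists = np.linalg.norm(np.diff(R, axis=0), axis=1)
    return frame_rate * dists / np.abs(np.diff(phi))

def has_constant_speed(speeds, tol=0.15):
    """Constant-speed test of Eq. (4.7)."""
    mean = speeds.mean()
    return np.max(np.abs(speeds - mean) / mean) < tol
```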
4.4 Results

Our dataset consists of 6 video sequences (720 by 576 pixel resolution) of pedestrians recorded in an underground station in Hong Kong (a frame is displayed in Figure 4.6). In order to produce our ground-truth data, we manually marked the position of the heel strike for each step of each pedestrian under consideration, at the frame when the foot becomes static and inside the measurement area. In total we marked 502 pedestrians who moved on a straight line, and we calculated their speed for each step.
Fig. 4.6 Sample frame from video sequence. The red line encloses the area within which measurements were estimated.
We used our approach to measure the speed of the pedestrians which were marked manually, and we discarded those with 2 or fewer footsteps detected, since this implies that the tracking was not reliable enough to extract sufficient information. In total we estimated the feet locations of 398 people. Using the ground-truth data, we discarded 10 of these 398 pedestrians because their speed was not constant. In order to evaluate our method, we compare it with the speed estimate calculated from the localisation given by bounding-box tracking. For a fair comparison, we convert the results of bounding-box tracking to a sequence of steps; we achieve that by sampling uniformly the trajectory of the mid-lower point of the bounding box between the frames of the first and last step, as identified by our approach. Using the constant-speed constraint (eq. 4.7), we discarded 203 pedestrians using our method for calculating the static foot locations, while for the bounding-box tracking we discarded 154.
The high rate of rejected pedestrians is due to the noise generated when people walk close to one another, since our system then generates corners that do not belong to the person under observation. However, our method managed to discard all 10 false positives, while the bounding-box tracking discarded 5 false positives. Since our objective is to estimate accurate speed profiles, discarding false-negative tracks is not an issue; what really matters is to have very few (ideally zero) false positives and very accurate speed estimation of the true positives.
Fig. 4.7 Error rates.
Fig. 4.8 Speed profiles.
In Figure 4.7 we display the error rates of the two approaches. Using our approach, 69% of all measured pedestrians have less than 5% speed error with respect to the ground-truth measurements, while 97% of all pedestrians have less than 10% error. On the other hand, using the bounding-box tracker, at the same error rates we get 50% and 93% of the pedestrians. In Figure 4.8 we can see the estimated speed profiles. Our approach produces almost identical results to those of the ground truth. The Bhattacharyya distance between the speed-profile distribution of the ground truth and that of our approach is 0.0262, while the distance of the ground truth from the bounding-box tracker is 0.1568.
4.5 Conclusions

In this paper we have presented an algorithm for the accurate estimation of pedestrian speed. Our results have shown that the estimated speed profile is highly accurate. Our method may fail to track some individuals successfully because of noise around the heel-strike positions; fortunately, such tracks are filtered out by the constant-speed assumption. We will also investigate methods to calculate the speed profile of pedestrians who move perpendicular to the horizontal axis of the image plane.
References 1. Berrou, J.L., Beecham, J., Quaglia, P., Kagarlis, M.A., Gerodimos, A.: A calibration and validation of the legion simulation model using empirical data. In: Pedestrian and Evacuation Dynamics, pp. 167–181. Springer, Heidelberg (2005) 2. Lam, W., Cheung, C.: Pedestrian speed/flow relationships for walking facilities in Hong Kong. Journal of Transportation Engineering 126(4), 343–349 (2000) 3. Hoogendoorn, S., Daamen, W.: Pedestrian behavior at bottlenecks. Transportation Science 39(2), 147–159 (2005) 4. Willis, A., Kukla, R., Kerridge, J., Hine, J.: Laying the Foundations: The Use of Video Footage to Explore Pedestrian Dynamics in PEDFLOW, pp. 181–186 (2002) 5. Patil, R.S., Fujiyoshi, H., Lipton, A.J.: Moving target classification and tracking from real time video. In: Proceedings of the Image Understanding Workshop, pp. 129–136 (1998) 6. Meyer, D., Denzler, J.: Model based extraction of articulated objects in image sequences for gait analysis. In: Proceedings of the International Conference on Image Processing, pp. 78–81 (1997) 7. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model for realtime tracking with shadow detection. In: Proceedings of the European Workshop on Advanced Video Based Surveillance Systems (2001) 8. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. International Journal of Computer Vision 75(2), 247–266 (2007) 9. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005) 10. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering 82(1), 35–45 (1960)
11. Isard, M., Blake, A.: CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 12. Polana, R., Nelson, R.: Low level recognition of human motion. In: Proceedings of the IEEE Workshop Motion of Non-Rigid and Articulated Objects, pp. 77–82 (1994) 13. Perbet, F., Maki, A., Stenger, B.: Correlated probabilistic trajectories for pedestrian motion detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1647–1654 (2009) 14. Ismail, K., Sayed, T., Saunier, N.: Automated collection of pedestrian data using computer vision techniques. In: Transportation Research Board Annual Meeting Compendium of Papers, Washington D.C. (January 2009) Reference 09-1122 15. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 142–149 (2000) 16. Nummiaro, K., Koller-Meier, E., Gool, L.V.: A color-based particle filter. In: Proceedings of the International Workshop on Generative-Model-Based Vision, pp. 53–60 (2002) 17. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Sebastopol (2008) 18. Bouchrika, I., Nixon, M.S.: People detection and recognition using gait for automated visual surveillance. In: IET Conference on Crime and Security, pp. 576–581 (2006) 19. Shi, J., Tomasi, C.: Good features to track. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600 (1994) 20. Tsai, R.Y.: A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation 3(4), 323–344 (1987)
Chapter 5
System-Wide Tracking of Individuals

Christopher Madden and Massimo Piccardi
Abstract. Tracking the movements of people within large video surveillance systems is becoming increasingly important in the current security-conscious environment. Such system-wide tracking is based on algorithms for tracking a person within a single camera, which typically operate by extracting features that describe the shape, appearance and motion of that person as they are observed in each video frame. These features can be extracted and then matched across different cameras to obtain global tracks that span multiple cameras within the surveillance area. In this chapter, we combine a number of such features within a statistical framework to determine the probability of any two tracks being made by the same individual. Techniques are presented to improve the accuracy of the features. These include the application of spatial or temporal smoothing, the identification and removal of significant feature errors, and the mitigation of other potential error sources, such as illumination. The results of tracking using individual features and the combined system-wide tracks are presented, based upon an analysis of people observed in real surveillance footage. These show that software operating on current camera technology can provide significant assistance to security operators in the system-wide tracking of individual people.
Christopher Madden
University of Adelaide, Australian Centre for Visual Technologies, Adelaide SA 5007, Australia
e-mail: [email protected]
http://www.acvt.com.au/research/surveillance/

Massimo Piccardi
University of Technology, Sydney, Department of Computer Science, Sydney NSW, Australia
e-mail: [email protected]

5.1 Introduction

This chapter investigates the automated tracking of individual people within a set of cameras that defines a surveillance system. Such systems aim to locate the position
of an individual person as they are observed over time, called a track, based upon a set of assumptions which reflect the limitations of current surveillance technology. Whilst technologies such as RFID tags, security codes, or fingerprint scanners on every door may facilitate tracking, they are not widely available and could become an annoyance to use constantly. This work focuses upon CCTV surveillance cameras as they are the dominant technology that currently exists in the typical surveillance system to provide broad coverage. We specifically address building or campus surveillance as it allows for the modelling of an individual's features, because there is a limited level of crowding compared to other environments, such as train stations. Tracking individuals in heavily crowded scenes is such a difficult task that little research has been effective in this area [26].
Computer vision-based object tracking from surveillance cameras is based upon shape, motion, and appearance features of the tracked target [11]. These features are used to evaluate a set of observations to determine the target's location over time, or its 'track'. Motion features, such as location and velocity, are the primary features used in tracking as they are less reliant upon camera quality, whilst low camera resolution reduces the ability of surveillance systems to exploit shape or appearance features effectively. Surveillance systems are created to assist human operators to view key locations around the surveillance area at a minimum cost. They therefore consist of a relatively small number of low-quality cameras sparsely located to cover key regions of the area under surveillance. Coverage of such key security locations is improved through the installation of additional cameras or other security devices, where they are considered cost effective. Minimisation of video footage size for storage through image compression is problematic for automatic analysis, as are the changes in illumination conditions and colour responses for each camera in the system. Such low camera coverage, image compression and illumination conditions can be difficult for human operators to manage effectively, but are more problematic for automating aspects of the tracking process.
Advances in camera technology are improving the affordability of cameras with higher resolutions, providing improved image quality where such cameras are installed. Increasing camera resolution provides improved information about the object being observed, including more reliable observations of shape and appearance features. Upgrading every camera in a system can be very costly, so most surveillance is still conducted on low-resolution cameras. Motion features are extremely useful in single camera views, as well as in groups of overlapping or near-overlapping camera views, where there are strict restrictions on transitions between cameras [12, 27, 28]; however they are not reliable for tracking, or for matching the tracks of individuals across large unobserved regions. Such regions occur within large gaps in camera coverage, where the movements of humans are not observed and may not reflect average motion. Indeed the most important tracks may occur where an individual differs significantly from the average pattern as they attempt to avoid detection by performing illicit activities in blind spots. Shape and appearance features provide an ability to enhance feature models and improve the accuracy of matching individuals observed across multiple cameras.
Through the combination of such matched tracks, a system-wide track of individuals is obtained.
Previous work on matching or combining the tracks of an individual across disjoint cameras has focused upon colour matching [3, 8, 13, 16] as a key feature for matching people, though some other biometrics such as gait and height estimates have also been explored [2, 3, 17]. Statistics on the transitions of objects between cameras have also been proposed recently [4, 27]. In this work we use the word 'track' to refer to the information obtained from the uninterrupted location of a single individual as they are observed in a single camera over time. A track includes the indexes of the first and last frame of the track, the individual's segmented region and appearance in each frame, and other features that can be derived from this information, such as gait and shape. Individuals can be recognised based on their colour appearance where they wear differently coloured clothes; however, pose and illumination changes can be a major problem for the invariance of appearance features. Javed et al. [13] proposed to compensate for illumination variations by training their system to recognise sets of frequent illumination conditions in order to transform colours to a normalised colour set. This approach cannot compensate for different illumination in different regions of the images, or for changes over time. Gandhi and Trivedi [8] present a cylindrical representation of the individual to obtain spatial colour information using multiple overlapping cameras; this is sensitive to the alignment of the cylindrical representation and to articulated motion. Darrel et al. [3] propose the fusion of facial patterns with height and colour features, although typical surveillance cameras do not offer sufficient resolution for accurately measuring facial patterns. BenAbdekader et al. [2] present height estimation based upon converting the height of an individual object's bounding box into a measurement using camera properties, without a full camera calibration. Stride length and periodicity are also determined; however, they require a stable frame rate higher than twice the gait frequency, which may not always occur in surveillance systems. Model-based gait features such as those developed by Zhou et al. [29] may be used as a feature within the framework presented; however, gait has not yet been proven across multiple cameras [22]. Works by Zadjel et al. [27] and Monari et al. [4] explore the use of transitions between cameras as an additional feature to provide information about matching tracks, though they are based upon average statistics with small gaps between cameras, which does not reflect many existing surveillance systems.
This chapter presents research into a framework that overcomes many of the existing limitations by fusing multiple features into a robust model extracted from the track of an individual target. This model can be used in a multi-camera surveillance system to compare the tracks obtained from individuals and build up the set of locations at which an individual target has been observed: a system-wide track of that individual. We define a surveillance session as 'a portion of one day where people enter the surveillance area from a known set of entry points to perform their activities before leaving through known exit points', where the entry and exit points are observed by a camera.
This definition leads to the following simplifying assumptions about the surveillance area, and the people viewed within that area:
1. All entry and exit points of the surveillance area are in view of a surveillance camera, or multiple cameras.
2. Individuals are unlikely to change their clothing or footwear; hence, intrinsic shape and appearance features will remain relatively constant for the surveillance session.
3. Individuals are tracked accurately whilst they remain within the view of any of the system's cameras.
4. Individuals are segmented from the background into a single region, or singly labelled group, but not necessarily accurately.
5. Individuals are often observed at a distance from a single surveillance camera without any other equipment available. Thus some biometric features, such as faces or fingerprints, are not generally available.
6. Where cameras are significantly distant, motion features may vary unpredictably between those cameras as individuals can move freely in such unobserved regions.
7. Illumination may vary between observable areas, though this will be limited by the nature of the area under surveillance.
Whilst the above assumptions limit some difficulties inherent in the problem, they aim to reflect a realistic surveillance environment. These assumptions hold for current tracking technology on video where traffic is sparse, though situations where they can be relaxed are discussed throughout the chapter. The more problematic assumptions relate to the reliability of object segmentation and object tracking. Some fragmentation of segmented and tracked objects does occur, especially when traffic becomes dense; however, such errors tend to generate extra objects or broken tracks, which may be merged through matching within the framework. The assumptions also suggest that motion features may be unreliable for regions where camera views do not overlap, so transitions between camera views, other than those with no other exit, are not included within the feature fusion framework presented in this chapter. Due to the articulated motion of people, few shape features other than height or gait are likely to remain stable during walking. Most appearance features are likely to remain stable within the extent of a surveillance session, although they may also be affected by articulation, motion direction and illumination changes. The framework for tracking via matching presented here is therefore based upon the expandable Bayesian fusion of multiple features, weighted by their reliability. Whilst exceptional cases may easily be constructed for any feature set to fail, the features proposed are designed to provide sufficient discrimination (at the ground-truth level) for a majority of real cases. Where people may be difficult to discriminate between accurately, selecting the best matches can reduce the amount of manual footage revision that would be required for human operators to perform the task alone.
This chapter first presents an exploration of object features that are utilised within a surveillance system. We include both shape and appearance features, as well as methods to mitigate the effect of illumination and other general noise or errors in feature extraction. The framework for fusing these features to perform tracking via matching is presented in the context of the features that have been explored.
Results are presented for the individual features, and the fusion of multiple features to explore the impact of the selected features. The chapter concludes by examining the impact of the results and possible extensions to expand the framework.
5.2 Features

Exploring features that can be used to distinguish between individuals is an important component of this research. Such features need to measure some stable differences between individuals tracked in surveillance cameras so they can be incorporated into a feature model. Such feature models can be combined and compared within a feature fusion framework to identify tracks that were created by the same individual, and hence build a system-wide track of that individual. It is important to consider the stability of features across multiple cameras spread across a large area, as this can create significant measurement noise. The assumptions in the previous section provide a guide to the conditions which are typical within surveillance, allowing for an initial investigation of features. They suggest that features which might provide a reliable identification of an individual, such as facial features [10, 23] or fingerprints [14], are not likely to be available with enough resolution in most locations, so other features to match individuals must be used. Non-biometric features are a trade-off for use within a multiple feature model as they are less discriminative between individuals, but tend to remain more invariant throughout the system. These include a range of features related to shape, appearance, or location and movement. Of these features, location and movement have been the most widely used features in tracking [11]. This is because objects can be accurately observed at a location at a given point in time and followed whilst in view of that camera, or other overlapping cameras, until they exit the observed scene. Where cameras are near to overlapping, the movement of the individual can predict when they will enter the adjacent view [12, 27]; however, where camera views are separated by large distances, the apparent movement of people becomes more erratic. Where there are direct transitions between cameras, such as a hallway observed at both ends, location can be used [27]. This is because a person cannot move out of the corridor without being observed by the system, thus providing some location information when they are not observed. Care needs to be taken to handle cases where errors occur, as individuals may not be accurately observed or tracked on some occasions in some cameras. Where an individual may move outside an area without being observed, then knowing that they have entered that area is less useful. Indeed, transition times between cameras become much less reliable as they grow larger, because the certainty that individuals maintain either their perceived motion or some average motion dramatically diminishes. In such situations the reliable information is primarily about how far it is possible for them to move, as some transitions between cameras will be impossible given the time available. Due to the uncertainty of these features, we do not investigate such indirect transitions further.
Shape and appearance features are explored in detail in the subsequent subsections, along with methods to increase their robustness to errors. Examples are provided to demonstrate each of the proposed features, with a focus upon individual features, as the feature fusion framework that combines the features to improve overall results is presented in Section 5.3.
5.2.1 Shape Features

A variety of features can be extracted based upon the shape of an individual object observed in surveillance cameras, such as height or image moments of the object; however, few of these are biometric in nature due to the articulation of human motion. As the resolution of camera views increases, shape feature accuracy may also increase, although the cost of camera upgrades will delay their widespread deployment. A person's height [2, 17] and gait features [22, 29] have been proposed to provide a stable estimate from surveillance footage, which can be used to discriminate between individuals. The accuracy of gait features from current systems has been low, as evidenced by the gait challenge of Sarkar et al. [22]. Factors such as different flooring, and even minor segmentation errors, can lead to errors, especially when extracted from low resolution surveillance cameras, so this section will focus upon height features and how they can be extracted from an existing surveillance system. The most common method of producing height estimates is from overlapping camera views, such as in [2]. This method produces a reliable estimate based upon the 3D position of the top of the head; however, it is reliant upon multiple overlapping cameras, which are rare within surveillance systems. In [17] we demonstrated that height estimation can be achieved using the single camera views that dominate surveillance systems, although it requires camera calibration and reasonable segmentation of the individual to obtain useful measurements. Rather than using the intersection of the top of the head on two camera planes, this method uses a ground plane homography from the camera calibration to determine the real-world location of the feet. When an individual is walking, which is the usual case, the head is directly above the ground-plane line joining the feet, so the real-world location of the head can be determined from a combination of the head location in the image and the real-world location of the feet. The results of this method are shown per frame from 5 tracks of 2 individuals in Figure 5.1, and summarised in Table 5.1. In [20] we extended this idea by automatically extracting the key positions at the top of the head and a reasonably accurate estimation of the ground plane position. This provides a method to determine a height estimate for each frame of an individual's track; however, gait effects will make this height estimate vary in a cyclical manner, which is sampled depending upon the frame rate. The automatic estimation of height from a single camera begins with a ground plane homography calculated by camera calibration [17]. This is used to convert the position of the feet from image coordinates, u,v, into real-world coordinates, x,y,z.
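As an illustration of this conversion step, a minimal sketch is given below; the function name, the NumPy formulation and the convention that H maps image coordinates onto the ground plane are assumptions made here for clarity rather than the implementation used in [17].

```python
import numpy as np

def feet_to_ground_plane(H, u, v):
    """Map the image-plane feet point (u, v) onto the world ground plane
    using the 3x3 image-to-ground homography H from camera calibration."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]   # world (x, y) on the z = 0 plane
```

The head location h(u, v) is then combined with this ground-plane location and the camera calibration to recover the height of the person in that frame.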
Fig. 5.1 Per frame height estimates calculated from manually identified points of 2 individuals observed in 5 tracks
Table 5.1 Statistical analysis of track heights from Figure 5.1

Track   Average Height (cm)   Standard Deviation (cm)   Matching Tracks
1       170.64                1.63                      2, 3
2       171.85                1.34                      1, 3
3       171.23                1.73                      1, 2
4       166.15                1.04                      5
5       166.64                0.92                      4
[2] uses the centre of the bottom line of the bounding box, but this can be improved by estimating the actual position of the feet using a k-curvature technique [7]. This method analyses the boundary of the object to find its high curvature regions. As described in [20], high curvature at the bottom of the object can determine the feet location and thus estimate a 'bottom' point b(u, v) in the image, as shown in Figure 5.2. High curvature at the top of the object is likely to be the top of the head h(u, v). Using this method the two key image points can be used with the ground plane homography to estimate the height from a single image. Unlike the multiple camera height estimation method, the monocular height estimate is more reliant upon segmentation, as effects like shadows or splitting of the object can lead to significant errors in the height estimate from a frame. A robust method of combining the height estimate of each frame is needed to provide a reliable height estimate. In [17] a robust median was found to be the closest estimate of the true height of that individual over a track, which could then be compared with estimates from other tracks. It was suggested that the errors are likely to decrease as more observations are made at higher resolutions, either from
Fig. 5.2 Finding b(u, v) using two feet
improved cameras, or the object being closer to the camera. This is because each pixel error in location will lead to a smaller overall error. Given any two tracks to compare, their robust height estimates can then be tested for similarity. In [20], rather than comparing the robust height estimates of the two tracks, we proposed to directly compute a robust estimate of their differences. To this aim, we used the pairwise height differences, Hd, between each of the frames of two tracks under comparison. The height similarity sH is then given as:

sH = σ(Hd) / μ(Hd)    (5.1)
where μ(Hd) and σ(Hd) are the average and standard deviation estimates of Hd respectively. Where there is an intrinsic difference in height between the two individuals, μ(Hd) will greatly exceed σ(Hd), which will be reflected as a low similarity. This statistical comparison better reflects the measurement error that may occur in the height estimate. The following steps outline the height difference estimation process:

1. Determine the height estimate of the object in each frame i of the track:
   a. The silhouette of the segmented object is analysed using a k-curvature technique [7].
   b. Areas near the bottom of the object with high curvature k are then used to determine where the feet are positioned, and thus extract a midpoint at the bottom of the object, b(u,v).
   c. This point is converted into world ground plane coordinates b(x,y,z) to determine location.
   d. This location is then used with the image plane position of the top of the head h(u,v) to estimate the height of the person in this frame, Hfi.
2. The pairwise differences, Hd, between the height in each frame, Hfi, and every other frame are computed between the two tracks under comparison.
3. The estimated height differences between the object tracks are statistically analysed to determine sH in (5.1).
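A minimal sketch of this comparison is given below, assuming that the per-frame height estimates Hf of both tracks have already been computed as in step 1; the use of absolute pairwise differences and the function names are illustrative assumptions rather than the exact implementation of [20].

```python
import numpy as np

def height_similarity(heights_a, heights_b):
    """Height similarity of Equation (5.1) from two tracks' per-frame
    height estimates Hf (absolute pairwise differences are assumed,
    since the sign convention is not specified in the text)."""
    a = np.asarray(heights_a, dtype=float)
    b = np.asarray(heights_b, dtype=float)
    hd = np.abs(a[:, None] - b[None, :]).ravel()   # pairwise differences Hd
    mu, sigma = hd.mean(), hd.std()
    # Intrinsically different heights give mu >> sigma, i.e. a low similarity.
    return sigma / mu if mu > 0 else 1.0
```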
5.2.2 Appearance Features

The appearance of an object can be studied using a wide range of features, of which the most commonly used are facial appearance and colour. Analysis of facial appearance, often called face recognition, is a large and growing field that aims to identify individuals based upon differences in their facial structures in images. Whilst there are many advances in the field to account for variations in aspects like scene illumination or face rotation, the most reliable results tend to come from systems with multiple cameras and images where the face occupies most of the image [24], and results are often inaccurate where large numbers of people are observed. [10] proposes a method to use PTZ cameras to obtain high resolution facial images of people observed in parts of their surveillance system; however, such cameras are not often installed for this purpose. PTZ cameras are typically used on a predefined guard tour to observe a large outdoor area, and are not available to obtain facial information. Facial features are therefore likely only to be available at specific locations, and will not provide a generally useful feature from surveillance footage. A variety of colour features can be used to describe the appearance of individual people [9]. The fundamental issues are the colour representation to use and the spatial domain over which to compute it. At one extreme, one could represent the appearance of an individual simply as the colour map of its segmented region, or blob; however, due to the continuous deformation of the person's shape, the comparison of two such representations would be problematic as they are not invariant to pose. At the other extreme, one could disregard completely the spatial information about colours within the blob and just collect overall colour statistics. While this representation would be highly invariant to deformation, it cannot discriminate, for instance, between a person with a red blouse and black skirt and one with a black shirt and red pants. A practical trade-off is therefore needed between invariance and discrimination, such as in [25]. Additional issues for consideration are invariance to changes in illumination, computational efficiency, and memory efficiency. Appearance features need to trade off these issues whilst preserving enough information to discriminate between individual people [16]. Many colour models have been proposed for analysing an individual's colour appearance [3, 8, 13, 15, 4, 27], though many of these provide simplistic contractions of the colour space, limiting the individual colour variation available for discriminating between individual people. In [16] we proposed a sparse histogram model based upon [15], which stores colours that are clustered in a normalised 3D RGB colour space. This requires no colour space changes and limits the memory requirements by only storing clusters of colours that occur in the object. Using the normalised colour space reduces the impact of high illumination conditions, whilst still capturing some intrinsic changes when the light is at lower levels. An additional step of 'controlled equalisation' is also used to mitigate the impact of colour effects such as illumination changes and camera colour response [16]; this step is detailed further in the following section. The advantage of using histograms over lower order statistics, such as Gaussians or Gaussian mixtures, is that they are non-parametric and thus adaptable to a wide
variety of clothing, as shown in Figure 5.3. The obvious disadvantages lie in the increased storage requirements and computational costs.
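The sparse histogram idea described above can be illustrated with the following sketch, which greedily clusters object pixels in a scaled RGB cube; the [0, 1] scaling and the distance threshold are placeholders standing in for the normalisation and clustering details of [15, 16].

```python
import numpy as np

def sparse_colour_clusters(pixels, threshold=0.1):
    """Greedy online clustering of object pixels in a scaled RGB cube,
    keeping only the colour clusters that actually occur (a sparse histogram).
    pixels: iterable of (R, G, B) values in [0, 255]."""
    clusters = []                                        # [centroid array, count] pairs
    for rgb in np.asarray(pixels, dtype=float) / 255.0:  # scale to the unit cube
        for c in clusters:
            if np.linalg.norm(rgb - c[0]) < threshold:
                c[1] += 1
                c[0] += (rgb - c[0]) / c[1]              # incremental mean update
                break
        else:
            clusters.append([rgb.copy(), 1])
    return clusters
```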
Fig. 5.3 Examples of red histograms from different people segmented with minor errors
To cater for spatial discrimination of colour appearance, in [20] we proposed two extra colour features relating to the upper and lower clothing colours of an individual, similar to those used in [4, 27], in addition to the global colours used in [3, 5, 16]. These features are chosen to represent the often different colours of the clothing on the upper torso and on the legs. The narrow spatial extent of these features also allows for an analysis of the spatial positioning of an individual's colour appearance, ensuring that changes in the position of the colours can be detected and used for discrimination. The narrowness and positioning of the upper and lower sparse histogram regions also allow them to remain constant under minor segmentation errors, which have only a minimal impact upon a person's features, although they remain sensitive to large segmentation errors. Using the difference between
Fig. 5.4 Examples of upper and lower regions of segmented individuals
these features allows for the identification of single frames with large segmentation errors due to their difference from the rest of the frames within an individual target's track [19]. Changes in the features over extended periods of time are indicative of errors in the tracking process, such as where the wrong target has been included. Figure 5.5 shows the upper MCR feature region (enclosed between the two top lines) and the lower MCR feature region (between the two bottom lines). The positioning of such regions in the first and third frames of the figure is regarded as acceptable. In the second frame, instead, the lower MCR feature region is significantly displaced; in this case, the sudden change in the lower MCR feature clearly indicates a very poorly segmented frame. In [19] we explored this use of change in features, as well as the change in size of the bounding box, to identify frames with large segmentation errors, with the results summarised in Table 5.2. The accuracy of the major segmentation error detection was as high as 84 percent with only 3 percent false alarms when a combination of features was used, indicating that whilst most of the erroneous frames were identified and discarded, the vast majority of reliable frames remained available. The low false alarm rate for segmentation errors is important, as some of the tracks consisted of as few as 12 frames before the segmentation error removal process was applied.
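A minimal sketch of the bounding-box component of this error detection is given below; the tolerance value is an arbitrary placeholder, and the fused detector reported in Table 5.2 additionally analyses the MCR features.

```python
import numpy as np

def flag_segmentation_errors(box_heights, tolerance=0.25):
    """Flag frames whose vertical bounding-box size deviates strongly from
    the robust (median) size of the track, a simple stand-in for the fused
    detector of [19] summarised in Table 5.2."""
    h = np.asarray(box_heights, dtype=float)
    median = np.median(h)
    return np.abs(h - median) > tolerance * median   # True => suspect frame
```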
Fig. 5.5 Example of upper and lower regions from three segmentations of an individual
Table 5.2 PD and PFA values of Bounding Box and MCR features for detecting segmentation errors

Feature                         PD (%)   PFA (%)
Vertical Bounding Box changes   72       9
Global MCR                      66       31
Upper MCR                       53       11
Lower MCR                       66       5
Fused MCR                       84       3
Extracting the MCR for each of these three colour features utilises the same process for each feature, but analyses a different spatial component of the appearance of the segmented object. The process for extracting the MCRs is described in detail in [16], and is summarised as follows:

• A controlled equalisation step performs a data-dependent intensity transformation that spreads the histogram to mitigate illumination changes that may occur within surveillance environments.
• Online K-means clustering of pixels of similar colour within a normalised colour distance generates the MCR of each spatial region.
• Once segmentation errors are removed, robust MCR features can be obtained over a small window of frames to improve robustness to articulated motion.

The three colour features are:

• The global MCR feature represents the colours of the whole segmented object without any spatial information.
• The upper MCR feature represents the colour of the top portion of clothing. This corresponds to the region from 30-40 percent of the person from the top of the object's bounding box, as shown in Figure 5.5. This narrow band was chosen to ensure that it avoids the inclusion of the head and hair of the object, as well as low necklines, but does not extend so low as to overlap with the leg area.
• The lower MCR feature represents the colour of the lower portion of clothing. This corresponds to the region from 65-80 percent of the object from the top of the object's bounding box, as shown in Figure 5.5. This narrow band avoids the very bottom of the object, which can be prone to shadows or artifacts where the feet touch the ground. It also avoids overlapping with the belt or upper torso area of the person.
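The following sketch illustrates how the two clothing bands can be cut out of a segmented frame before clustering; the array conventions and function name are assumptions made here for illustration only.

```python
import numpy as np

def upper_lower_pixels(image, mask):
    """Collect the pixels feeding the upper (30-40%) and lower (65-80%) MCR
    features, measured from the top of the object's bounding box.
    image: H x W x 3 frame; mask: H x W segmentation of the person
    (assumed non-empty)."""
    mask = np.asarray(mask, dtype=bool)
    rows = np.flatnonzero(mask.any(axis=1))        # rows containing the object
    top, height = rows[0], rows[-1] - rows[0] + 1

    def band(lo, hi):
        r0, r1 = top + int(lo * height), top + int(hi * height)
        return image[r0:r1][mask[r0:r1]]           # masked pixels in the band

    return band(0.30, 0.40), band(0.65, 0.80)
```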
5.2.3 Mitigating Illumination Effects

The colour of an object in a camera view is not the intrinsic colour of the object itself, but rather a view-dependent measurement of the light reflected from the object and the camera sensitivity to that light [6]. By recording the camera response to different wavelengths of light, the colour sensitivity can be estimated and exploited for model-based or empirical camera characterisation [1]; however, illumination provides a more difficult challenge. Approaches to compensating for illumination changes are broadly classified by Finlayson et al. [6] into colour invariants, which seek transformations of the colours that are illumination independent, and colour constancy, which seeks to estimate the illumination of the scene in order to extract the intrinsic colours of objects. Whilst accurate models of the illumination of the scene could extract the intrinsic colours of objects, the implementation of this technique is very difficult given that the illumination can change over time. In [18] we explored a number of transformation methods to mitigate the effect of illumination upon an object's appearance. These adaptations were local to the object, as the illumination changes on the 3D object surface may not form a large enough component of the image to be captured by global methods. Of the methods explored in [18], controlled equalisation provided the lowest probability of missing correct matches, whilst full equalisation provided a slightly better overall error rate. These methods were shown to reduce the overall errors of the appearance component of the track matching framework by over 30 percent. Histogram equalisation, also denoted here as full equalisation, aims to spread a given histogram across the entire bandwidth in order to equalise as far as possible the histogram values in the frequency domain. This operation is data-dependent and
inherently non-linear, as shown in Figure 5.6; however, it retains the rank order of the colours within the histogram. 'Controlled equalisation' is based upon equalising a combination of the object pixels and an amount of pre-equalised pixels that is a proportion k of the object size. These pre-equalised pixels effectively 'control' the amount of equalisation, such that the pixels are spread to a limited degree within the spectrum instead of being spread fully. Thus, although an object becomes more matchable under a range of illumination conditions, it retains a higher degree of discrimination from objects of differing intrinsic colour. This technique is demonstrated on the 256-bin red colour channel at varying levels of k in Figure 5.6, where the vertical axis represents the percentage of pixels occurring in any bin, and k = 0 denotes full equalisation.
Fig. 5.6 Controlled equalisation of the individual's pixels with varying k values
This equalisation can be formally described by designating the set of N pixels in a generic object as A, and calling B a second set of kN pixels which are perfectly equalised in their R, G, and B components. Note that the parameter k designates the proportion of equalised pixels relative to the number of pixels in A. From their union A ∪ B, the histograms of the R, G, and B components, pr(i), pg(i), and pb(i) for i = 0 ... 255, are computed. A histogram equalisation of the individual colour channels is then derived as shown in Equations 5.2-5.4:

Tr(i) = (255 / ((1 + k)N)) · Σ_{j=0}^{i} pr(j)    (5.2)

Tg(i) = (255 / ((1 + k)N)) · Σ_{j=0}^{i} pg(j)    (5.3)

Tb(i) = (255 / ((1 + k)N)) · Σ_{j=0}^{i} pb(j)    (5.4)
These intensity transforms can then be applied to re-map the R, G, and B components of the object's pixels, providing the 'controlled equalisation'. The parameter k controls the amount of pre-equalised pixels used, which in turn controls the spread of the object histogram. In [18] a number of illumination mitigation techniques were compared, with the results for the unmitigated data and the controlled equalisation technique summarised in Table 5.3. These results are based upon the similarity values obtained from 50 matching and 70 non-matching track pairs, and compare their estimated probability of false alarm (PFA), probability of missed detection (PMD), and total error rate.

Table 5.3 Results of similarity measurements for matching and non-matching tracks

Method   Parameter   Matching mean   Matching std   Non-matching mean   Non-matching std   PMD (%)   PFA (%)   Total error (%)
None     -           0.8498          0.1013         0.2230              0.3467              2.32      11.01     13.33
Equal    Full        0.9095          0.0452         0.2522              0.3791              0.60      7.57      8.17
Equal    0.5         0.9080          0.0423         0.2637              0.3939              0.59      8.60      9.20
Equal    1           0.9116          0.0348         0.2680              0.3996              0.46      8.32      8.78
Equal    2           0.9135          0.0378         0.2726              0.4060              0.69      9.37      10.07
The results in Table 5.3 show that, for all values of k investigated, there is a reduction in appearance matching errors within the track matching framework. More importantly, the variation of the similarity measurements for matching tracks is reduced, even as the mean is increased, whilst the mean and variation of the non-matching similarity measurements are not greatly affected. This leads to the demonstrated reduction in both the probability of missed detection and of false alarms.
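For concreteness, a minimal single-channel sketch of the transform in Equations 5.2-5.4 is given below; it would be applied separately to the R, G and B values of the object's pixels, and the NumPy formulation is illustrative rather than the implementation used in [18].

```python
import numpy as np

def controlled_equalisation(channel, k=1.0):
    """Controlled equalisation of a single colour channel (Equations 5.2-5.4).
    channel: array of the object's pixel values in [0, 255]; k = 0 gives
    full equalisation of the object pixels."""
    values = np.asarray(channel).astype(np.uint8).ravel()
    n = values.size
    hist = np.bincount(values, minlength=256).astype(float)
    hist += k * n / 256.0                            # the kN pre-equalised pixels of B
    T = 255.0 * np.cumsum(hist) / ((1.0 + k) * n)    # the transform T(i)
    return np.clip(np.round(T[values]), 0, 255).astype(np.uint8)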
5.3 Feature Fusion Framework

A framework is required to allow for the comparison of models generated from tracks within the surveillance system. The previous sections provided details on how to compare the similarity of shape and appearance features, but it is important to note that such similarities are not always directly comparable. In [20] we proposed the use of Bayes' theorem for classification, obtaining the desired posteriors from trained likelihoods to form a framework for fusing features. The posterior probabilities of features being matched or non-matched are derived from likelihoods trained using a small but
representative sample set of tracks from individual people that are known to be either matching (H0) or non-matching (H1), with the probability distributions being comparable between features. The track similarities of each feature are fused together as shown in (5.5) and (5.6), where sH is the height similarity, sUC relates to the upper clothing colour, sLC relates to the lower clothing colour and sGC relates to the global colour. A bias term B allows the results to be adjusted to favour either matching H0 or non-matching H1, depending upon the cost of missed detections or false alarms within the application. This method also allows for the extension of the feature framework by adding extra terms relating to H0 and H1 to (5.5) and (5.6) in a similar manner.

P(H0 | sH, sUC, sLC, sGC) = B · P(sH|H0) P(sUC|H0) P(sLC|H0) P(sGC|H0)    (5.5)

P(H1 | sH, sUC, sLC, sGC) = P(sH|H1) P(sUC|H1) P(sLC|H1) P(sGC|H1)    (5.6)
The classification of the track pair is provided by the maximum probability between hypotheses H0 and H1. The fusion scheme holds for additional features where they are available, provided they can be treated as mutually independent. In addition, an indication of the information contributed by a given feature can be determined by calculating equations (5.5) and (5.6) with and without that feature. Where a feature complements the other features, an improvement should be noted in the framework accuracy. Using this method, the global appearance feature sGC was found to provide little additional information beyond that provided by the upper appearance sUC and lower appearance sLC features. This Bayesian feature fusion framework is based upon track-level model similarities, but some features are available for comparison at other levels, such as the frame level. Features such as appearance vary little from frame to frame, and could therefore be compared on a frame-by-frame basis, whilst other cyclic features, such as the height estimate, are more accurately estimated over the whole track duration. Such frame-level comparisons could be combined to form a track-based similarity measure using a simple averaging process, but the application of robust statistical analysis for outlier removal, such as that proposed in [21], should improve their accuracy.
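A minimal sketch of this maximum-probability classification is given below; the dictionary interface and the callable likelihood functions are illustrative assumptions, with the likelihoods presumed to have been trained on the labelled sample of track pairs as described above.

```python
def classify_track_pair(similarities, likelihoods, bias=1.0):
    """Naive-Bayes fusion of feature similarities (Equations 5.5-5.6).
    similarities: {'height': sH, 'upper': sUC, 'lower': sLC, 'global': sGC}
    likelihoods:  feature name -> (p_match, p_nonmatch) callables trained on
                  the labelled matching / non-matching track pairs.
    bias:         the term B favouring H0."""
    p_h0, p_h1 = bias, 1.0
    for name, s in similarities.items():
        p_match, p_nonmatch = likelihoods[name]
        p_h0 *= p_match(s)
        p_h1 *= p_nonmatch(s)
    return 'match' if p_h0 > p_h1 else 'non-match'
```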
5.4 Results

The results presented here report track matching accuracy for each feature, and for the fused case, based upon a comparison of 26 tracks from four people across two cameras, giving over 300 possible comparison combinations. Of these, 60 comparison combinations are used as training data for the fusion framework, with the remainder used for testing. An indication of the clothing colour and good
Fig. 5.7 Four people of interest and good automatically segmented masks
segmentation examples for the four individuals is given in Figure 5.7, where it can be seen that the individuals are wearing clothing whose colours differ by approximately 50 percent or more. Ground-truth height differences between the individuals ranged from approximately 5 centimetres to 30 centimetres. These results are from individuals observed in video surveillance cameras with a resolution of 293 x 214 pixels at approximately 6 frames per second. The results are given in Figure 5.8 as ROC curves of the independent and fused variables, determined by varying the bias term B.
Fig. 5.8 ROC curves of the height, colour and fused feature results
Figure 5.8 demonstrates that the fusion of the chosen features can provide a probability of detection of 91 percent with only 5 percent false alarms at a chosen operating point. A detailed analysis of the results demonstrated that some cameras lead
to more accurate segmentation and feature measurements, due to increased contrast between the individual and the background and more stable lighting conditions, indicating that the feature probability distributions could be better defined based upon the camera pairs within which the tracks occur; however, this was not utilised for these experiments. Compensation for the effects of variable illumination on the object's appearance was performed as outlined in Section 5.2.3; however, the application of colour calibration [1] may further improve the discriminative power of the colour features.
Fig. 5.9 ROC curves of the height, colour and fused feature results
Figure 5.9 provides a more detailed analysis of the complementary nature of the features on a subset of the available track data. It specifically examines the information gain of the system when the global appearance feature sGC is applied. The ROC curve shows that the performance when height and global colours are fused does not improve significantly, and even performs worse at some operating points. The fusion of height with all three colour features produces very similar performance to the system when just the upper and lower colour features are fused with height. This suggests that whilst the global colour feature will include some colour components from regions such as the skin and hair, these additional components do not provide additional ability to distinguish between the people in this dataset. The maximum-a-posteriori classification of (5.5) and (5.6), based upon the knowledge that only 1 in 4 tracks are matching, provides an indication of the bias term B which minimises the total Bayesian error. We suggest that it would be preferable to work at a different operating point along the ROC curve in order to achieve a higher probability of detection, because a human operator can quickly and easily identify a falsely matched pair by a manual review of the potentially matched tracks
Fig. 5.10 Pictorial storyboard summary of potentially matching tracks
as the difference in appearance is likely to be visually obvious to the operator. Manually correcting a missed detection is more arduous, as an operator would then need to manually compare the current track to all other possible tracks to determine the best match. Hence, we have opted to adjust B in (5.5) by a factor of three, achieving the results reported. Whilst this method relies upon good segmentation, the overall detection rates show that it works well, using existing surveillance cameras, when frames with segmentation errors are automatically detected and removed. The results of this fusion framework indicate which track pairs are likely to be matching, and these can therefore be combined to provide the system-wide track of an individual person. This similarity is highly dependent upon the assumption that clothing and footwear do not change during a surveillance session; however, where such changes occur, it is likely that a human operator will be capable of rectifying them. For such an operator to compare each track to all other observed tracks within a surveillance system would involve a time-consuming search, which can be minimised by automatically suggesting the most likely matches, as shown in Figure 5.10. The number of comparisons can be reduced by using the assumption that all people are observed entering and leaving the surveillance system, as this provides a limit on when they were within the system. Errors in observations at these key locations will reduce the effectiveness of this measure. Such a visual storyboard verification process is likely to limit both the time required for investigations and for generating system-wide tracks of individual people, as well as reducing the errors within such tracks; however, significant further implementation work is required to evaluate the true effectiveness of such a system.
5.5 Conclusions

System-wide tracking of individual people is seen as an increasingly important security task. Currently this task is performed by a set of security operators, who
monitor anywhere from tens to thousands of cameras to attempt to identify suspicious individuals or activities, or to analyse events after they have happened. A key component of operator activity is the difficult task of tracking the movements of individual people of interest as they move through the system, either in real time or for post-event analysis. This chapter has explored features that can be extracted and utilised within a Bayesian framework to assist operators in combining the observations of an individual across many cameras, generating a system-wide track. It has explored a variety of motion, shape and appearance features, as well as methods to make their measurements more robust, in order to determine those features which might provide useful complementary object information. The results presented are based upon an implementation of the framework using the robust feature estimates after error correction has been applied. This error correction occurs primarily through the identification and removal of frames with large segmentation errors, which distort estimated object features, via the identification of changes in those features. The results of analysing such feature changes demonstrated that the majority of frames with large segmentation errors were removed, with few false alarms. The results of the track matching based upon robust features showed that features can be chosen carefully for the information they provide. For instance, fused spatial appearance is not improved when global colour is included; however, height provides complementary information which improves the overall result. The surveillance dataset examined was captured across a small set of real surveillance cameras, with a correct matching rate for the 4 analysed individuals as high as 91%, with only 5% false matches. Subsequent experiments on the appearance of a larger dataset of 6 individuals observed across 4 cameras support the appearance components of these results, though more experimentation is required to accurately evaluate its use in large surveillance systems. The results indicate that where the chosen features are used within the proposed framework, they can provide significant information to assist security operators. This assistance, through the automatic analysis of the surveillance footage, can provide useful visual suggestions to an operator based upon the tracks which are most likely to have originated from a single individual. This can improve the efficiency of the task, as operators will no longer have to search the entire set of tracks to determine matches, but can quickly verify accurately identified matching tracks to build up the set of tracks which define the system-wide track of an individual.
References

[1] Barnard, K., Funt, B.: Camera characterization for color research. Color Research and Application 27(3), 153–164 (2002)
[2] BenAbdelkader, C., Cutler, R., Davis, L.: Person identification using automatic height and stride estimation. In: Proceedings of International Conference on Image Processing (2002)
[3] Darrell, T., Gordon, G., Harville, M., Woodfill, J.: Integrated person tracking using stereo, colour, and pattern detection. International Journal of Computer Vision 37(2), 175–185 (2000)
[4] Monari, E., Maerker, J., Kroschel, K.: A robust and efficient approach for human tracking in multi-camera systems. In: Proceedings of Advanced Video and Signal-based Surveillance (2009)
[5] Erdem, C.E., Ernst, F., Redert, A., Hendriks, E.: Temporal stabilization of video object segmentation for 3d-tv applications. In: Proceedings of International Conference on Image Processing (2004)
[6] Finlayson, G., Hordley, S., Schaefer, G., Tian, G.Y.: Illuminant and device invariant colour using histogram equalisation. Pattern Recognition 38(2), 179–190 (2005)
[7] Freeman, H., Davis, L.: A corner-finding algorithm for chain-coded curves. IEEE Transactions on Computers 26, 297–303 (1977)
[8] Gandhi, T., Trivedi, M.: Panoramic appearance map (PAM) for multi-camera based person re-identification. In: Advanced Video and Signal Based Surveillance (2006)
[9] Gonzalez, R., Woods, R.: Digital Image Processing. Prentice-Hall, Englewood Cliffs (2002)
[10] Hampapur, A., Brown, L., Connell, J., Ekin, A., Haas, N., Lu, M., Merkl, H., Pankanti, S.: Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking. IEEE Signal Processing Magazine 22(2), 38–51 (2005)
[11] Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man and Cybernetics 34, 334–352 (2004)
[12] Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: International Conference on Computer Vision (2003)
[13] Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple non-overlapping cameras. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)
[14] Lee, H., Gaensslen, R.: Advances in Fingerprint Technology. CRC Press, Boca Raton (2001)
[15] Li, L., Huang, W., Gu, I., Tian, K., Tian, Q.: Principal color representation for tracking persons. In: International Conference on Systems, Man, and Cybernetics, vol. 1, pp. 1007–1012 (2003)
[16] Madden, C., Cheng, E., Piccardi, M.: Tracking people across disjoint camera views by an illumination-tolerant appearance representation. Machine Vision and Applications 18, 233–247 (2007)
[17] Madden, C., Piccardi, M.: Height measurement as a session-based biometric for people matching across disjoint camera views. In: Proceedings of Image and Vision Computing, New Zealand (2005)
[18] Madden, C., Piccardi, M.: Comparison of techniques for mitigating illumination changes on human objects in video surveillance. In: International Symposium on Visual Computing (2007)
[19] Madden, C., Piccardi, M.: Detecting major segmentation errors for a tracked person using colour feature analysis. In: Proceedings of International Conference on Image Analysis and Processing (2007)
[20] Madden, C., Piccardi, M.: A framework for track matching across disjoint cameras using robust shape and appearance features. In: Advanced Video and Signal based Surveillance Conference (2007)
[21] Mosteller, C.F., Tukey, J.W.: Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading (1977)
[22] Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The HumanID gait challenge problem: Data sets, performance, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 162–177 (2005)
[23] del Solar, J.R., Navarrete, P.: Eigenspace-based face recognition: a comparative study of different approaches. IEEE Transactions on Systems, Man and Cybernetics, Part C 35(3), 315–325 (2005)
[24] Wechsler, H.: Reliable Face Recognition Methods: System Design, Implementation and Evaluation. Springer, Heidelberg (2007)
[25] Yang, Y., Harwood, D., Yoon, K., Davis, L.: Human appearance modeling for matching across video sequences. Machine Vision and Applications 18(3), 139–149 (2007)
[26] Zhang, Z., Gunes, H., Piccardi, M.: Tracking people in crowds by a part matching approach. In: Proceedings of Advanced Video and Signal-based Surveillance (2008)
[27] Zajdel, W., Krose, B.: A sequential algorithm for surveillance with non-overlapping cameras. International Journal of Pattern Recognition and Artificial Intelligence 19(9), 977–996 (2005)
[28] Zhao, T., Nevatia, R.: Tracking multiple humans in complex situations. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1208–1221 (2004)
[29] Zhou, Z., Prugel-Bennett, A., Damper, R.I.: A Bayesian framework for extracting human gait using strong prior knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11), 1738–1752 (2006)
Chapter 6
A Scalable Approach Based on Normality Components for Intelligent Surveillance

Javier Albusac, José J. Castro-Schez, David Vallejo, Luis Jiménez-Linares, and Carlos Glez-Morcillo
Abstract. Since their first developments, traditional video surveillance systems have been designed to monitor environments. However, these systems have several limitations in automatically understanding events and behaviours without human collaboration. In order to overcome this problem, intelligent surveillance systems arise as a possible solution. This kind of system is not affected by negative factors such as fatigue or tiredness, and can be more effective than people when recognising certain kinds of events, such as the detection of suspicious or unattended objects. Intelligent surveillance refers to using Artificial Intelligence and Computer Vision techniques in order to improve traditional surveillance and to process semantic information obtained from low-level security devices. Normally these systems consist of a set of independent analysis modules that deal with particular problems, such as the trajectory analysis of pedestrians in parking lots, speed estimation of vehicles, gait or facial recognition, etc. However, most of them present a common problem: a lack of flexibility and scalability to include new kinds of analysis and combine all of them in order to obtain a global interpretation. In this work, a formal model to define normal events and behaviours in monitored environments and to build scalable surveillance systems is presented. This model is based on the use of normality components, which are independent and reusable for environments with different characteristics and different kinds of objects. Each component specifies how an object should ideally behave according to a surveillance aspect, such as trajectory or velocity. The model also includes the fusion mechanisms required for combining the particular analyses made by each component. Finally, when a new component is designed making use of the proposed model, the system increases its ability to detect new kinds of abnormal events, and the normality of an object depends on a higher number of factors.

Javier Albusac · José J. Castro-Schez · David Vallejo · Luis Jiménez-Linares · Carlos Glez-Morcillo
University of Castilla-La Mancha, Paseo de la Universidad 4, Ciudad Real (Spain) ZIP 13170
Fax: +34 926 295 354
E-mail:
[email protected]
6.1 Introduction

Currently, the security of citizens and infrastructures in public and private environments is one of the main topics of interest for most governments and institutions [26]. One representative example is London, where more than four million cameras have been deployed and connected to security control centres [13], in which human operators supervise the video streams received in order to detect anomalous situations that might represent a danger to people. The main disadvantage of this kind of surveillance is the huge dependence on human supervisors, which usually involves a decrease in efficiency after observing monitors for long periods of time, as stated in previous research [29]. Furthermore, there exist surveillance activities, such as the detection of unattended objects, where automatic systems can be more effective than human surveillance [13]. Therefore, traditional video surveillance has several limitations in meeting the security demands of monitored environments. In other words, security systems should be driven by intelligence and proactiveness. Intelligent surveillance systems therefore need to be developed in order to improve the robustness of traditional systems and reduce the workload of security staff [7], calling their attention only at particular moments. In the last two decades, the research community has proposed different algorithms, techniques and models to develop the so-called second and third generations of surveillance systems [31]. Second generation systems combine CCTV (closed-circuit television) technology with Computer Vision and Artificial Intelligence methods, while third generation surveillance systems provide solutions for highly distributed and complex environments. There exist different proposals for intelligent surveillance systems, such as the analysis of trajectories of vehicles and pedestrians in parking lots, access control in buildings, the interaction among human beings in underground railway stations, the activities carried out in indoor environments, etc. Most of these systems have normally been designed to solve particular problems in specific scenarios. They can be understood as a set of pieces that act separately to detect specific events in particular scenarios, without establishing relationships between them to obtain a global analysis. Moreover, these proposals are not usually reused in scenarios different from the one for which they were initially conceived. Within this context, a scalable and flexible general model for the development of surveillance systems is needed, which allows instances to be deployed and is adaptable to most monitored environments. This model should also facilitate the analysis of different events of interest depending on the surveillance requirements, increasing the capability of the system when new aspects need to be monitored. In short, an advanced surveillance system should be able to include new analysis modules that can be customised for the scenario where they will be deployed, determining the normality or abnormality of such a scenario at every moment. The discussed limitations motivate the present work, where a formal model for the design and normality analysis of surveillance environments is proposed. This model is inspired by component-based software development, multi-layered design and the division of complex problems into subproblems. According to this model, an intelligent surveillance system consists of a set of normality components
and a set of methods to combine their output. A normality component defines how an object should ideally behave according to the monitored events of interest, considering a non-normal situation as suspicious or anomalous. These components can be easily integrated into the surveillance system and customised depending on the requirements and the security level desired. In addition, knowledge acquisition tools and machine learning techniques are used to configure and deploy the components. The output of the components designed by means of the proposed model is highly descriptive, so that the security expert responsible for interacting with the surveillance system can easily check whether the system works correctly and identifies anomalous or suspicious situations. This feedback, close to natural language, facilitates system maintenance and the reuse of knowledge in environments with similar features. On the other hand, security systems should provide responses in as short a time as possible to reduce the hypothetical damage caused by the occurrence of anomalous situations (e.g. crowd detection). This is the reason why the systems developed through the proposed model must provide a good balance between response time and robustness. The rest of the chapter is organised as follows: the next section discusses previous work and its shortcomings. Section 3 describes a formal model to build scalable and flexible surveillance systems. In Section 4, the model previously described is applied to design a new normality component, which deals with the trajectory analysis problem. The experimental results are presented and discussed in Section 5. The final section presents conclusions and future work.
6.2 Previous Work

In the last decade several research projects have been developed in order to achieve advances in the area of intelligent surveillance. Some of the problems addressed in these projects are moving object detection, tracking, object classification (usually between pedestrians and vehicles), automatic learning of activity models to understand events and behaviours, or support for decision making. One relevant research project was CROMATICA (Crowd Monitoring with Telematic and Communication Assistance) [18], conceived with the goal of improving the surveillance of passengers in public transport. This project combined video analysis-based technologies and wireless data transfer [31]. Such a project was later extended by PRISMATICA (Pro-active Integrated Systems for Security Management by Technological Institutional and Communication Assistance) [33], where the social, ethical, organisational and technical aspects of surveillance in public transport were especially taken into account [31]. Regarding the technical advances, the system had a distributed nature and allowed communication between heterogeneous devices by means of the CORBA middleware. Some of the anomalous events detected by the system were people staying in forbidden areas, intrusions (access to restricted areas), anomalous queue lengths, or people walking in the wrong direction. In these systems, the analysis and interpretation of events and behaviours is carried out in indoor environments and is mainly focused on people.
Another relevant research project was VSAM (Video Surveillance and Monitoring) [12], funded by the U.S. Defence Department, concretely by DARPA (Defence Advanced Research Projects Agency). The system used the video streams gathered by the security cameras to classify moving objects as people, groups of people or vehicles. The layer responsible for identifying events was designed to analyse vehicle trajectories and people movements. Thanks to the study of people's walk and their poses, the system was able to determine whether a person was walking, running or performing other activities. W4 [16] was a system focused on analysing people's behaviour from their poses in outdoor environments, ignoring the behaviour of elements such as vehicles. The system detected and determined the people's figures by using the images obtained by security cameras and the information captured by infrared sensors. Once the figure was obtained, a second module identified relevant parts of the human body (head, trunk, arms, and legs). The position of these parts was used to identify events of interest such as people running, walking with a bag, carrying objects, putting their hands up (which might indicate an external threat), etc. ADVISOR (Annotated Digital Video for Intelligent Surveillance and Optimised Retrieval) [28] was another European project developed with the goal of analysing people's behaviour in subway stations. Like PRISMATICA, this system is composed of multiple cameras distributed around a station, connected to a central server that performs the image analysis. This module stores in a database the sequences of images that represent anomalous situations. Each one of these sequences is described by means of textual information. Precisely, this information is later used by human operators to retrieve the previously identified abnormal events. AVITRACK [5, 8, 2], Co-Friend [30] and ARGOS [6] are more recent research projects. The first two are related to CROMATICA and PRISMATICA, while Co-Friend evolves from AVITRACK. Such projects aim to analyse the behaviour of people and vehicles at airports, especially in maintenance tasks such as the loading and unloading of aeroplanes, refuelling, single repairs, etc. On the other hand, ARGOS was conceived to deal with the control of sea traffic. Most of the previously cited works are composed of multiple analysis modules designed in an independent way without following a common scheme. These systems can be understood as a set of pieces that act separately to detect specific events in particular scenarios, without establishing relationships between them to obtain a global analysis. Nevertheless, the tendency when designing intelligent surveillance systems should be towards the development of scalable systems that provide mechanisms to include new analysis modules. Furthermore, this integration of components should not affect those already deployed, making their combination easy in order to increase the ability of the system to perform a global analysis of the environment. Another existing limitation to overcome is the low flexibility when configuring the behaviour analysis applied to each class of object. A flexible surveillance system should offer the possibility of easily specifying which analysis modules monitor each class of object, according to the characteristics of the environment and the surveillance demands. Thus, if a system is composed of two modules to analyse trajectories and speed, it should be possible to choose which modules are applied to monitor people and
vehicles, a choice that may vary for different environments. For these reasons, and as stated in the previous section, the main motivation of this work is to define a model that allows scalable and flexible surveillance systems to be built, which will be discussed in depth in the next sections.
6.3 Formal Model to Build Scalable and Flexible Surveillance Systems

6.3.1 General Overview

The complexity of surveillance systems can be reduced by adopting a multi-layered approach in the design and development phases, providing the system with high flexibility and modularity and making maintenance and the integration of new functionalities easy. The low-level layers process the information gathered by the environment sensors and generate spatio-temporal data associated with each one of the recognised objects of interest. The intermediate layers analyse the identified events and behaviours. Finally, the high-level layers monitor the results obtained by the previous layers and give support to the decision-making process when a critical situation is detected. In order to simplify the design and development of surveillance systems, the problem of monitoring an environment can be structured into a set of subproblems. This is possible since surveillance generally involves monitoring a scenario from different points of view: the elements that exist in such a scenario (e.g. moving or static objects), the aspects or events of interest to monitor (e.g. trajectories, object speed, normality at a pedestrian crossing, etc.) or even the physical scenarios that compose the whole environment (e.g. different rooms in an indoor environment). Another relevant aspect when reducing the complexity of the design of these systems is time. For instance, the period of time in which the surveillance is performed determines the functionality of the system and should be taken into account when carrying out the design of the system. On the other hand, reusing existing and well-tested designs and developments may facilitate the task of devising new surveillance systems. This idea leads us to develop intelligent surveillance systems based on components that can be reused as customised modules that deal with different surveillance tasks. These components may be previously designed and developed by third parties and made available in public or private repositories in order to use them when designing a new surveillance system. The adoption of this approach implies that each component has a set of tools that allow its instantiation for particular problems or environments. From a general point of view and according to the previously introduced ideas, the development of surveillance systems implies the analysis of the surveillance tasks carried out to address these problems. This analysis is made by means of the following questions:
• What objects or elements of the environment will be analysed?
• What aspects or events of interest will be monitored in relation to the previous objects or elements?
• What kinds of sensors are available in the monitored environment? What new kinds of sensors need to be deployed on the scenario to gather useful information to perform the surveillance?
• When may these events of interest take place?

Once these questions are answered, a study of the existing solutions that can be used or adapted must be carried out, taking into account the following issues:

• What processing mechanisms are there to process the data gathered by the deployed sensors? The main goal of these processing elements is to transform and adapt these data to generate the information used by the components that perform the different surveillance tasks.
• What existing components are there to carry out such tasks? If a component was already designed and developed to accomplish a surveillance task, it has to be customised for the environment where it will be deployed. Otherwise, the component must be designed and developed.

In this way, a surveillance component is associated with a scenario element, with an aspect or event of interest to monitor, with a sub-environment, and with a particular time interval. Generally, these facts imply that the perceptual or low-level layer obtains the information from the environment (e.g. objects of the scenario, their location, physical features such as size or orientation, the time interval of the monitoring, etc.). The intermediate layer processes this information and performs an analysis of the environment state to determine whether the studied situations are normal or abnormal. To do that, this layer makes use of a model that defines in a general way the normality of objects according to a particular aspect, generating the so-called normality components, which are independent of any environment. Within this context, the most undesired situation that can take place under this approach is an anomalous situation that was not previously defined and considered by the surveillance system; even so, the system is aware that something suspicious or abnormal is happening at that time. These normality components, defined in a general way, must be instantiated for particular environments. To instantiate a component under the approach discussed in this work implies customising it by means of knowledge acquisition tools and/or machine learning algorithms. The use of this last kind of algorithm is advisable in dynamic environments where the environmental conditions may easily change or where the number of required instances is too high. When a normality component determines that an object does not behave in a correct way according to the monitored aspect, the identification module of abnormal situations is activated. The combination of the multiple normality analyses performed by each component and the output of the identification modules of anomalies constitutes the global normality analysis for a sub-environment. Similarly, the global analysis of the whole scenario is the result of applying multiple fusion techniques to
Fig. 6.1 General scheme of the proposed scalable surveillance system.
the different analyses performed in each of the monitored sub-environments (if the whole scenario was previously divided to reduce complexity). The surveillance system makes use of this information to support the security personnel in the decision-making process. Finally, the upper layer also comprises the monitoring tools used by the security staff to analyse the results provided by the rest of the layers.

The design of surveillance systems based on the previously discussed ideas, that is, layers, normality components and sub-environments, offers the following advantages:

• Reuse: developing normality components in a general way, independently of the monitored environment, facilitates the reuse of expert knowledge.
• Reduction of the development cycle, as a consequence of the previous advantage.
• Quality improvement: reusing previous experience helps to avoid repeating mistakes.
• Reliability: if a component has been exhaustively tested and evaluated in multiple environments, debugging possible malfunctions, its inclusion in a new surveillance system offers a high probability of success.
• Flexibility: the combination of components makes it possible to build different kinds of surveillance systems depending on the security requirements. These components represent the ingredients of the application, which are aggregated to reach a particular goal; in this case, the analysis of specific situations.

In the next sections, the concept of normality component that supports the ideas discussed in this section is formalised, establishing how the normality analysis of an environment is performed by means of these components.
6.3.2 Normality in Monitored Environments

The normality analysis of an environment is carried out by using the normality components responsible for monitoring the required aspects. Each of these components is understood as a black box that receives information from the environment and, depending on how the defined constraints (spatial, temporal, etc.) are satisfied, provides an output that denotes the normality of the environment as a list of grades (e.g. normal, suspicious, abnormal, ...). A situation is considered normal or not depending on the degree of satisfaction of such constraints. This approach leads us to Fuzzy Logic [36, 37], a mathematical model widely used to deal with the uncertainty and vagueness of real-world problems, as the mechanism for representing these normality constraints. This model is also useful for working with the data gathered by the low-level layers, which are usually imprecise.
6.3.2.1 The Surveillance Problem: Divide and Conquer
By means of the divide and conquer technique, the surveillance problem can be defined as follows:

Definition 6.1. The surveillance of an environment E is the understanding of the different perceptions obtained from the sensors deployed on the monitored environment {S_1, S_2, ..., S_n}. Each perception of the global environment is considered, at the same time, as an environment E_i. Thus, the global normality definition in the environment E is composed of the particular normality definitions for each sub-environment E_i. Therefore, the surveillance problem P in an environment E is defined as the monitoring of multiple sub-environments, which are simpler and built from the perceptions of the sensors deployed on the environment:

P = \{E_1, E_2, \ldots, E_n\} \quad (6.1)
In this way, the complexity of defining normality is reduced, since the number of situations that can take place in a sub-environment E_i is lower than in the global environment E.

Example 6.1. The surveillance of a shopping centre E, where multiple security cameras are deployed, may be carried out by analysing the situations
that happen in those parts of the shopping centre E_i monitored by some of the deployed cameras. The global normality analysis of E is obtained by aggregating the individual analyses of each E_i.
6.3.2.2 Subproblem of Surveillance: Monitored Environments
Definition 6.2. A monitored sub-environment E is defined as a 4-tuple composed of the following elements:

E = \langle V; O; C; O \times C \rangle \quad (6.2)

where

• V is the set of input variables used to perform the required surveillance tasks and to report the features of the objects and the current state of the environment. The values of these variables can be directly gathered by the sensors or generated by preprocessing modules. Furthermore, such values may be precise or, on the contrary, affected by imprecision and vagueness.
• O is the set of classes of monitored objects in the sub-environment, whose behaviour must be analysed (e.g. people, group of people, car, truck, bicycle, etc.).
• C refers to the set of monitored aspects in the sub-environment, denoted as concepts from now on.
• O × C determines the concepts that must be used to analyse the normality of each class of object. Depending on the class of each object, different concepts will be used to determine whether its behaviour is normal or not.

Example 6.2. Let E_i be a part of the environment monitored by a security camera, where there are traditional urban elements such as gardens, pavements, pedestrian crossings, traffic lights, roads, etc. The classes of objects that may appear are O = {pedestrian, vehicle}. The concepts to monitor in E_i are C = {trajectories, speed, pedestrian crossings}; that is, the normality of trajectories, of speed and of the behaviour at pedestrian crossings will be analysed. Some examples of the variables V needed to perform the surveillance are: object location, object size, time, list of key regions of the environment, position of relevant static elements, etc. The concepts used to analyse the behaviour of each class of object are given by O × C = {(pedestrian, trajectories), (pedestrian, pedestrian crossing), (vehicle, trajectories), (vehicle, speed), (vehicle, pedestrian crossing)}. According to this example, a pedestrian behaves in a normal way if he or she follows normal trajectories and crosses the road through the pedestrian crossing. In the case of vehicles, their speed is also monitored at every moment.
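The following C++ sketch illustrates one possible in-memory representation of the tuple of Definition 6.2 for the urban scene of Example 6.2. The structure and field names are assumptions made for this illustration only; they are not part of the system described in this chapter.

// Illustrative sketch of Definition 6.2: a monitored sub-environment
// E = <V, O, C, O x C>. All names and types are assumptions.
#include <iostream>
#include <map>
#include <set>
#include <string>

struct SubEnvironment {
    std::set<std::string> variables;                       // V: input variables
    std::set<std::string> objectClasses;                   // O: monitored classes
    std::set<std::string> concepts;                        // C: monitored aspects
    std::map<std::string, std::set<std::string>> analysis; // O x C: concepts per class
};

int main() {
    SubEnvironment e;
    e.variables = {"object_location", "object_size", "time", "key_regions"};
    e.objectClasses = {"pedestrian", "vehicle"};
    e.concepts = {"trajectories", "speed", "pedestrian_crossings"};
    // O x C as in Example 6.2: which concepts apply to which class.
    e.analysis["pedestrian"] = {"trajectories", "pedestrian_crossings"};
    e.analysis["vehicle"]    = {"trajectories", "speed", "pedestrian_crossings"};

    for (const auto& [cls, cs] : e.analysis) {
        std::cout << cls << " is analysed with:";
        for (const auto& c : cs) std::cout << ' ' << c;
        std::cout << '\n';
    }
    return 0;
}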
6.3.2.3 Surveillance Based on the Analysis of Concepts
As previously discussed, the surveillance of an environment is carried out by means of concepts, so that there is one concept for each event of interest to monitor.
Definition 6.3. A concept c_i (c_i ∈ C) is defined as a 3-tuple composed of the following elements:

c_i = \langle V_i; DDV_i; \Phi_i \rangle \quad (6.3)

where V_i is the set of input variables used to define the concept c_i, so that V_i ⊆ V. DDV_i is the set of definition domains of the variables that belong to V_i; therefore, if V_i = {v_{1i}, v_{2i}, ..., v_{ni}}, then DDV_i = {DDV_{1i}, DDV_{2i}, ..., DDV_{ni}}, where DDV_{ji} is the definition domain of the variable v_{ji}. The definition domain of a variable specifies the possible values that it can take. Finally, Φ_i is the set of constraints used to complete the definition of the concept c_i according to the elements of V_i (Φ_i = {μ_{1i}, μ_{2i}, ..., μ_{ki}}). The normality analysis of c_i depends on how the constraints associated with c_i are met.
6.3.2.4 Normality Constraint
Definition 6.4. A normality constraint associated with a concept c_i is defined as a fuzzy set X_i over the domain P(V_i), with an associated membership function μ_{X_i}:

\mu_{X_i}: P(V_i) \longrightarrow [0, 1] \quad (6.4)
where 1 represents the maximum degree of satisfaction of the constraint and 0 the minimum; the remaining values represent intermediate degrees of normality. Sometimes it is more suitable and practical to define constraints by means of crisp sets (the object either meets the constraint or it does not). In such a case, the membership function is as follows:

\mu_{X_i}(x) = \begin{cases} 1 & \text{if } x \in X_i \\ 0 & \text{if } x \notin X_i \end{cases}

New constraints can also be defined by combining simpler constraints through a set of operations (a short code sketch of these operations is given after the property list below).

Operations between constraints. Let A and B be two normality constraints defined over the domain P(V_i):

1. The union (A ∪ B) of constraints is a new constraint that is met if and only if A or B is satisfied. The membership function of A ∪ B is defined as μ_{A∪B}(x) = max{μ_A(x), μ_B(x)}.
2. The intersection (A ∩ B) of constraints is a new constraint that is met if and only if A and B are simultaneously satisfied. The membership function of A ∩ B is defined as μ_{A∩B}(x) = min{μ_A(x), μ_B(x)}.
3. The complement (Ā) is a constraint that is met if and only if A is not satisfied. The membership function of Ā is defined as μ_Ā(x) = 1 − μ_A(x).

Properties of normality constraints. The properties of normality constraints are exactly the same as the properties of fuzzy sets. Let A, B and C be normality constraints defined over the domain P(V_i):
1. Idempotency: A ∪ A = A; A ∩ A = A
2. Commutativity: A ∪ B = B ∪ A; A ∩ B = B ∩ A
3. Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C; A ∩ (B ∩ C) = (A ∩ B) ∩ C
4. Distributivity: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
5. Double negation: ¬(¬A) = A
6. Transitivity: if A ⊂ B ∧ B ⊂ C, then A ⊂ C
7. Limit conditions: A ∪ ∅ = A; A ∩ ∅ = ∅; A ∪ P(V_i) = P(V_i); A ∩ P(V_i) = A
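As announced above, the following minimal C++ sketch shows the three constraint operations with the standard fuzzy-set operators. The membership values are arbitrary illustrative numbers.

// Minimal sketch of the constraint operations (standard fuzzy operators).
#include <algorithm>
#include <iostream>

// Degrees of satisfaction of two constraints A and B for the same input x.
double muA = 0.7, muB = 0.4;

double unionAB()        { return std::max(muA, muB); } // A OR B
double intersectionAB() { return std::min(muA, muB); } // A AND B
double complementA()    { return 1.0 - muA; }          // NOT A

int main() {
    std::cout << "mu(A u B) = " << unionAB()        << '\n'   // 0.7
              << "mu(A n B) = " << intersectionAB() << '\n'   // 0.4
              << "mu(not A) = " << complementA()    << '\n';  // 0.3
    return 0;
}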
The concept definition determines the framework that establishes the general rules about how to carry out the associated surveillance task. However, instances of such a concept need to be created for particular environments. For example, in the case of the trajectory concept, the definition establishes the mechanisms required to analyse normal trajectories, but without instantiating them, because the concrete trajectories depend on the particular environment to monitor.
6.3.2.5 Concept Instance
The next step after defining a concept and its constraints in a general way is to make instances of such a concept for particular environments.

Definition 6.5. An instance y of a concept c_i in an environment E_j (E_j ∈ P), denoted as c_{iy}^{j}, is defined as follows:

c_{iy}^{j} = \langle V_i; DDV_i; \tilde{\Phi}_i = \{\tilde{\mu}_{1i}, \tilde{\mu}_{2i}, \ldots, \tilde{\mu}_{zi}\} \rangle \quad (6.5)

where \tilde{\Phi}_i is the set of particularised constraints of the set \Phi_i, that is, each \tilde{\mu}_{xi} ∈ \tilde{\Phi}_i represents the particularisation of \mu_{xi} ∈ \Phi_i. It holds that |\Phi_i| ≥ |\tilde{\Phi}_i|.

Example 6.3. If c_i is the trajectory concept, an instance of c_i represents a normal trajectory within the environment E_j.
6.3.2.6 Normality Constraint Instance
A normality constraint instance is used to adapt the general definition of a kind of analysis based on a concept to a specific environment.

Definition 6.6. A normality constraint instance is a fuzzy set defined over P(DDV_i) with an associated membership function \tilde{\mu}_{X_i}:

\tilde{\mu}_{X_i}: P(DDV_i) \longrightarrow [0, 1] \quad (6.6)
so that if v_{ki} ∈ P(V_i) is employed to define \mu_{X_i}, then the values of v_{ki} defined over DDV_{ki} ∈ P(DDV_i) are used to build the instance \tilde{\mu}_{X_i}.

Example 6.4. If \mu_{X_i} represents a constraint that checks whether a moving object follows the sequence of regions of a trajectory, \tilde{\mu}_{X_i} is the constraint employed to check whether
the moving object follows a particular sequence of regions within the monitored environment.
6.3.2.7 Degree of Normality Associated to an Instance
Each monitored object has an associated degree of normality that establishes how normal its behaviour is according to each concept attached to it, represented by the instances of such concepts deployed within the environment.

Definition 6.7. The degree of normality of an object obj within an environment E_j (E_j ∈ P) regarding an instance y of the concept c_i (c_{iy}^{j}), denoted as N_{c_{iy}^{j}}(obj), is calculated from the values obtained for each \tilde{\mu}_{xi}:

N_{c_{iy}^{j}}(obj) = \bigwedge_{x=1}^{|\tilde{\Phi}_i|} \tilde{\mu}_{xi} \quad (6.7)

where \bigwedge is a t-norm, such as the t-norm that calculates the minimum value.
This particular t-norm, that is, the minimum, is suitable for surveillance systems and, in particular, for the model proposed in this work, since in this kind of system the violation of any of the constraints that define normality in the environment needs to be detected. In other words, if some of the constraints are not satisfied, the normality degree N_{c_{iy}^{j}}(obj) will be low. Using other t-norms implies obtaining values lower than this minimum, since a relevant property of t-norm operators is that any of them always yields a value lower than or equal to the one obtained by the minimum operator. This would therefore imply a stricter surveillance and, as a result, a higher number of alarm activations. For instance, if the product t-norm (T(a, b) = a · b) is used and c_{iy}^{j} is defined through two constraints that are satisfied with a value of 0.6, the degree of normality N_{c_{iy}^{j}}(obj) will be 0.6 · 0.6 = 0.36, which does not make sense since both constraints were satisfied with a higher degree (0.6). On the other hand, with the minimum t-norm this degree would be min(0.6, 0.6) = 0.6, a value that better represents the degree of satisfaction of the constraints that define the instance.

Example 6.5. If c_i is the normal trajectory concept and c_{i1}^{j} is a particular normal trajectory instantiated within the environment E_j from a set of constraints \mu_{X_i} and their particularisations \tilde{\mu}_{X_i}, then N_{c_{i1}^{j}}(obj) determines the degree of normality of the object obj regarding the trajectory c_{i1}^{j}. A high value of N_{c_{i1}^{j}}(obj) means that obj follows the trajectory c_{i1}^{j} in a suitable way. On the other hand, a low value represents the contrary case, or that obj does not follow such a trajectory at all. In other words, a low value of N_{c_{i1}^{j}}(obj) means that one or more constraints of the instance are not met.
6.3.2.8 Degree of Normality Associated to a Concept
After calculating the degree of normality of an object for each instance of a particular concept, the next step is to calculate the normality of the object according to all the defined instances of that concept. In this way, it is possible to study the general behaviour of an object according to a concept.

Definition 6.8. The degree of normality of an object obj within an environment E_j (E_j ∈ P) according to a concept c_i, denoted as N_{c_i^{j}}(obj), is calculated as follows:

N_{c_i^{j}}(obj) = \bigvee_{y=1}^{w} N_{c_{iy}^{j}}(obj) \quad (6.8)

where w is the number of instances of the concept c_i and \bigvee is a t-conorm operator, for instance the maximum t-conorm.
In the same way that min(a, b) represents the maximum value that can be obtained by applying a t-norm, max(a, b) represents the minimum value that can be obtained by applying a t-conorm operator; that is, the value calculated with any other t-conorm operator will be greater than or equal to the one calculated by max. This fact also justifies its use in the proposed model, so that an object obj behaves in a normal way according to a concept c_i within an environment E_j (N_{c_i^{j}}(obj)) if it satisfies the constraints of some of the instances deployed for c_i. Considering the application of the t-norm and t-conorm operators, the normality analysis of an object obj according to c_i in E_j is the result of applying an AND-OR fuzzy network over a set of constraints:

c_{i1}^{j} = \tilde{\mu}_{1i}^{1} \wedge \tilde{\mu}_{2i}^{1} \wedge \ldots \wedge \tilde{\mu}_{ni}^{1} = N_{c_{i1}^{j}}(obj)
  \vee
c_{i2}^{j} = \tilde{\mu}_{1i}^{2} \wedge \tilde{\mu}_{2i}^{2} \wedge \ldots \wedge \tilde{\mu}_{ni}^{2} = N_{c_{i2}^{j}}(obj)
  \vee \ldots \vee
c_{iw}^{j} = \tilde{\mu}_{1i}^{w} \wedge \tilde{\mu}_{2i}^{w} \wedge \ldots \wedge \tilde{\mu}_{ni}^{w} = N_{c_{iw}^{j}}(obj)
  \Rightarrow N_{c_i^{j}}(obj)

Example 6.6. If c_i is the trajectory concept and the c_{ik}^{j} represent the normal trajectories defined through the instances of the variables and constraints in E_j, then the degree of normality N_{c_i^{j}}(obj) of each object according to the concept is calculated by applying a t-conorm operator over the values N_{c_{iy}^{j}}(obj) obtained for each of the instances. Let the following values be given:

c_{i1}^{j}: N_{c_{i1}^{j}}(obj) = 0.2
c_{i2}^{j}: N_{c_{i2}^{j}}(obj) = 0.8
\ldots
c_{iw}^{j}: N_{c_{iw}^{j}}(obj) = 0.0

N_{c_i^{j}}(obj) = 0.8

In this case, the object clearly satisfies the constraints defined for the instance c_{i2}^{j} of the concept c_i, since the value calculated for N_{c_{i2}^{j}}(obj) is 0.8. In other words, the object follows a normal trajectory, namely the one instantiated in c_{i2}^{j}. In short, the degree of normality of an object associated with a concept c_i, N_{c_i^{j}}(obj), is a numerical value in the interval [0, 1] that is representative of the object behaviour regarding a concept or surveillance aspect. High values of this parameter represent normal situations within the monitored environment, while low values represent suspicious or abnormal situations.
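The AND-OR evaluation just described can be sketched in a few lines of C++. The constraint values below are illustrative and only chosen so that the instance minima are 0.2, 0.8 and 0.0, reproducing the outcome of Example 6.6; the code is a sketch of Eqs. 6.7 and 6.8 with the minimum t-norm and maximum t-conorm, not the system's actual implementation.

// Eq. 6.7: minimum t-norm within an instance; Eq. 6.8: maximum over instances.
#include <algorithm>
#include <iostream>
#include <vector>

double instanceNormality(const std::vector<double>& constraints) {
    // Minimum degree of satisfaction among the instantiated constraints.
    return *std::min_element(constraints.begin(), constraints.end());
}

double conceptNormality(const std::vector<std::vector<double>>& instances) {
    // Maximum t-conorm over all instances of the concept.
    double best = 0.0;
    for (const auto& inst : instances)
        best = std::max(best, instanceNormality(inst));
    return best;
}

int main() {
    // Three trajectory instances; their minima are 0.2, 0.8 and 0.0,
    // so the concept-level normality is 0.8, as in Example 6.6.
    std::vector<std::vector<double>> instances = {
        {0.9, 0.2, 1.0},
        {0.8, 0.9, 1.0},
        {0.0, 0.7, 0.5}
    };
    std::cout << "N_ci(obj) = " << conceptNormality(instances) << '\n'; // 0.8
    return 0;
}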
6.3.2.9 Normal Behaviour within an Environment According to a Concept
The final goal, after the analysis of a situation, is to activate a set of alarms or to draw the attention of the security personnel when the situation falls outside the limits of normality. That is, after having calculated the degree of normality N_{c_i^{j}}(obj), the model needs a mechanism to decide whether the object behaves normally or not, depending on the calculated degree. This could be addressed by defining an alpha threshold within the range [0, 1] over the degree of normality, but that would imply that a situation changes abruptly from normal to abnormal. For this reason, normality is instead considered as a linguistic variable V_N that takes a set of values over the definition domain DDV_{V_N} = {AA, PA, SB, PN, AN} (see Figure 6.2). In this way, the object behaviour can be absolutely abnormal (AA), possibly abnormal (PA), suspicious (SB), possibly normal (PN) or absolutely normal (AN), and a behaviour can belong to more than one set at the same time. The definition of each value of the domain DDV_{V_N} depends on the features of the environment to monitor, the desired security level and the criterion of the expert in charge of setting up the configuration parameters, which must be adapted to the system behaviour. Figure 6.2 graphically shows how the values assigned to the sets that represent anomalous situations are not high, which implies that the system is not very strict, avoiding in this way frequent (and possibly unnecessary) alarm activations.

In this way, every time the degree of normality of an object behaviour according to a concept, N_{c_i^{j}}(obj), is calculated, the membership of this value to the fuzzy sets that establish the definition domain DDV_{V_N} of the normality variable V_N is studied, determining the normality of the analysed situation. The alarm activation is delegated to the upper layers, which perform the required actions depending on the normality values received.
Fig. 6.2 Definition domain DDVVN of the normality variable VN .
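As a rough illustration of the linguistic variable V_N, the sketch below evaluates the membership of a normality degree to five trapezoidal fuzzy sets. The breakpoints are assumptions made for this example; Figure 6.2 defines the actual shapes, which depend on the environment and the expert criterion.

// Illustrative sketch of V_N = {AA, PA, SB, PN, AN}; breakpoints are assumed.
#include <iostream>
#include <string>
#include <vector>

// Trapezoidal membership function defined by (a, b, c, d), with a < b <= c < d.
double trapezoid(double x, double a, double b, double c, double d) {
    if (x <= a || x >= d) return 0.0;
    if (x >= b && x <= c) return 1.0;
    if (x < b) return (x - a) / (b - a);
    return (d - x) / (d - c);
}

struct FuzzyValue { std::string label; double a, b, c, d; };

int main() {
    std::vector<FuzzyValue> vn = {
        {"absolutely abnormal", -1.0, 0.00, 0.10, 0.25},
        {"possibly abnormal",    0.10, 0.20, 0.30, 0.45},
        {"suspicious",           0.30, 0.45, 0.55, 0.70},
        {"possibly normal",      0.55, 0.70, 0.80, 0.95},
        {"absolutely normal",    0.80, 0.90, 1.00, 2.00}
    };
    double degree = 0.62;  // normality degree computed by a component
    // With these (assumed) shapes the behaviour belongs to both "suspicious"
    // and "possibly normal", showing that a degree may belong to several sets.
    for (const auto& v : vn)
        std::cout << v.label << ": " << trapezoid(degree, v.a, v.b, v.c, v.d) << '\n';
    return 0;
}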
The next section discusses how to aggregate the output of different normality components to get a global analysis and determine if the behaviour of the objects is normal within a particular environment, taking into account multiple surveillance concepts.
6.3.3 Global Normality Analysis by Aggregating Independent Analysis

The normality of an object does not depend exclusively on a single concept. In fact, a global evaluation of the normality of all the concepts monitored within the environment must be carried out. Therefore, a mechanism for combining multiple analyses is needed to obtain a global value that represents the normality of the object behaviour in a general way. A possible approach consists in combining the output values of the normality components by using t-norms or t-conorms. For instance, the minimum t-norm establishes that the global normality value corresponds to the analysis with the lowest degree of normality; if the maximum t-conorm is used, the result is determined by the component that provides the highest degree of normality. Neither technique reflects well the real process of combining multiple degrees of normality. For this reason, the OWA (Ordered Weighted Averaging) operators [34] are proposed to address this problem, due to their flexibility.
Formally, an OWA operator is represented as a function F: R^n → R associated with a vector of weights W of length n, W = [w_1, w_2, ..., w_n], where each w_i ∈ [0, 1] and \sum_{i=1}^{n} w_i = 1:

OWA(a_1, a_2, \ldots, a_n) = \sum_{j=1}^{n} w_j \cdot b_j \quad (6.9)

or, equivalently, in vector form:

OWA(a_1, a_2, \ldots, a_n) = [w_1, w_2, \ldots, w_n] \cdot [b_1, b_2, \ldots, b_n]^{T}

where (a_1, a_2, ..., a_n) represents the set of initial values or criteria used by the operator to make a decision, and (b_1, b_2, ..., b_n) represents the ordered set associated with (a_1, a_2, ..., a_n), b_j being the j-th highest value of that set. Furthermore, the weights w_i of the vector W are linked to positions and not to particular values of the original set.

In the model devised in this work, the OWA operator is used to aggregate the normality values calculated by each component (N_{c_1}^{j}, N_{c_2}^{j}, ..., N_{c_n}^{j}). One of the key characteristics of this family of aggregation operators is the flexibility to vary their behaviour depending on the values assigned to the vector of weights W. This vector determines the behaviour of the operator, which may tend to behave as a union or as an intersection operator; in fact, this behaviour can be tuned to reproduce the minimum t-norm or the maximum t-conorm. Figure 6.3 graphically shows the relationships between the values obtained by the t-norm, OWA and t-conorm operators, respectively.
Fig. 6.3 Visual comparison between the range of values obtained by t-norm, t-conorm and OWA operators.
In order to analyse how close the behaviour of an OWA operator is to the minimum t-norm or to the maximum t-conorm, R. Yager proposed the following measure [34, 35]:

orness(W) = \frac{1}{n-1} \sum_{i=1}^{n} (n - i) \cdot w_i \quad (6.10)
Once the vector W has been initialised, orness values closer to 1 indicate that the OWA operator behaves like the maximum, while values closer to 0 indicate proximity to the minimum. The particular values of the vector of weights W depend on the desired surveillance level. An orness value close to 1 represents a soft surveillance, so that a situation will be considered normal as long as at least one behaviour is normal. On the other hand, values close to 0 reflect a strict surveillance, so that a single anomalous behaviour makes the situation be considered abnormal. Values close to 0.5 reflect intermediate cases. Table 6.1 shows three configurations of W and how they affect a common monitored situation given by the output of three normality components N_{c_1}^{j}(obj), N_{c_2}^{j}(obj) and N_{c_3}^{j}(obj), reflecting the minimum, the maximum and a weighted average.

Table 6.1 Multiple configurations of the OWA operator.
Criteria            Value   Minimum: Weight  Evaluation   Maximum: Weight  Evaluation   Average: Weight  Evaluation
N_{c_1}^{j}(obj)    0.95             0       0                     1       0.95                  0.33    0.31
N_{c_2}^{j}(obj)    0.80             0       0                     0       0                     0.33    0.264
N_{c_3}^{j}(obj)    0.20             1       0.2                   0       0                     0.33    0.066
TOTAL                                        0.2                           0.95                          0.64
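The following C++ sketch reproduces the three configurations of Table 6.1 for the component outputs 0.95, 0.8 and 0.2, and also computes Yager's orness measure (Eq. 6.10) for each weight vector. It is a sketch of the operators, not the system's implementation; note that the weights 0.33 of the table do not sum exactly to 1, which is why the weighted average yields approximately 0.64.

// OWA aggregation (Eq. 6.9) and orness measure (Eq. 6.10).
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

double owa(std::vector<double> values, const std::vector<double>& weights) {
    // Sort the criteria in decreasing order (b_1 >= ... >= b_n) and apply the
    // position-linked weights.
    std::sort(values.begin(), values.end(), std::greater<double>());
    double sum = 0.0;
    for (std::size_t j = 0; j < values.size(); ++j) sum += weights[j] * values[j];
    return sum;
}

double orness(const std::vector<double>& w) {
    // orness(W) = 1/(n-1) * sum_i (n - i) * w_i, i = 1..n.
    const double n = static_cast<double>(w.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) sum += (n - (i + 1)) * w[i];
    return sum / (n - 1.0);
}

int main() {
    std::vector<double> outputs = {0.95, 0.8, 0.2};
    std::vector<std::vector<double>> configs = {
        {0.0, 0.0, 1.0},    // behaves as the minimum t-norm  -> 0.2,  orness 0
        {1.0, 0.0, 0.0},    // behaves as the maximum t-conorm -> 0.95, orness 1
        {0.33, 0.33, 0.33}  // weighted average as in Table 6.1 -> ~0.64, orness ~0.5
    };
    for (const auto& w : configs)
        std::cout << "OWA = " << owa(outputs, w) << "  orness = " << orness(w) << '\n';
    return 0;
}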
After obtaining the global normality value by applying the OWA operator (OWA(N_{c_1}^{j}, N_{c_2}^{j}, ..., N_{c_n}^{j})), the final normality value N^{j}(obj_k) associated with an object obj_k in a particular environment E_j is given by the degree of membership of this value to the fuzzy sets that define the values possibly normal behaviour (PN) and absolutely normal behaviour (AN) of the definition domain DDV_{V_N} of the normality variable V_N (see Figure 6.2), determining in this way the normality of the analysed situation:

N^{j}(obj_k) = \mu_{PN}(OWA(N_{c_1}^{j}, N_{c_2}^{j}, \ldots, N_{c_n}^{j})) + \mu_{AN}(OWA(N_{c_1}^{j}, N_{c_2}^{j}, \ldots, N_{c_n}^{j})) \quad (6.11)
where \mu_{PN}(OWA(\cdot)) and \mu_{AN}(OWA(\cdot)) establish the membership of the output value of the OWA operator to the sets PN and AN, respectively. The global normality value within an environment E_j considering the activity of all the moving objects, denoted as GN^{j}, is calculated as the minimum of the normality values N^{j} obtained for each object obj_k:

GN^{j} = \bigwedge_{k=1}^{|O|} N^{j}(obj_k) \quad (6.12)
where O is the set of monitored objects at a particular time, |O| is the number of monitored objects and N j (ob jk ) is the normality value calculated for each object ob jk ∈ O. The use of the minimum is justified because when an object does not behave normally, the global normality degree GN j in the environment E j must be a low value.
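Before moving on to the model application, a short sketch of Eqs. 6.11 and 6.12 is given below. The ramp-shaped membership functions for PN and AN are assumptions made only for this example (the actual shapes come from Figure 6.2); the per-object OWA outputs are arbitrary illustrative values.

// Sketch of Eq. 6.11 (per-object normality) and Eq. 6.12 (environment minimum).
#include <algorithm>
#include <iostream>
#include <vector>

double ramp(double x, double lo, double hi) {   // 0 below lo, 1 above hi
    if (x <= lo) return 0.0;
    if (x >= hi) return 1.0;
    return (x - lo) / (hi - lo);
}
// Assumed shapes: AN rises between 0.80 and 0.95; PN rises between 0.55 and
// 0.70 and decreases where AN takes over, so that the two memberships sum to
// at most 1.
double muAN(double x) { return ramp(x, 0.80, 0.95); }
double muPN(double x) { return std::min(ramp(x, 0.55, 0.70), 1.0 - ramp(x, 0.80, 0.95)); }

int main() {
    // OWA-aggregated normality of each monitored object in environment E_j.
    std::vector<double> owaPerObject = {0.92, 0.64, 0.85};
    double gn = 1.0;
    for (double v : owaPerObject) {
        double n = muPN(v) + muAN(v);   // Eq. 6.11
        gn = std::min(gn, n);           // Eq. 6.12 (minimum over objects)
        std::cout << "N(obj) = " << n << '\n';   // 1.0, 0.6, 1.0
    }
    std::cout << "GN^j = " << gn << '\n';        // 0.6
    return 0;
}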
6.4 Model Application: Trajectory Analysis

This section discusses the application of the model to monitor the trajectories and speed of moving objects in a particular scenario. These analyses are performed by two independent normality components, but the global normality is evaluated by taking both of them into account. This aggregation is carried out using the approach described in Section 6.3.3.

The analysis of trajectories and speed using the visual information gathered by security cameras represents two problems that have been widely studied by other researchers. The use of a formal model to deal with them provides several advantages over other approaches. In the case of trajectory analysis, most authors focus their work on analysing the spatial information to recognise the paths of moving objects [17, 22, 23, 25]. However, these definitions might not be enough in the surveillance field, since additional constraints very often need to be checked. Most existing proposals make use of algorithms for learning normal trajectories, which usually correspond to those most frequently followed by moving objects. Within this context, it is important to take into account that a trajectory often followed by a vehicle may not be normal for a pedestrian or, similarly, that trajectories which are normal during a particular time interval may be abnormal during a different one. Therefore, the trajectory analysis based on the proposed model provides richer and more scalable definitions, because new kinds of constraints can be included when they are needed (and previously added constraints can be removed when they become unnecessary).

The second component analyses the speed of moving objects by using the 2D images provided by a security camera, without knowing the camera configuration parameters. The main goal of this component is to classify the real speed of objects as normal or anomalous. Most of the proposed methods estimate the speed and require a previous calibration process that needs relevant information about the camera, such as its height or zoom level; afterwards, they use transformation and projection operations to obtain a real 3D representation of the position and movement of the object [9, 11, 20, 21, 24]. The definition of the speed concept in the proposed model instead takes into account how people behave when they analyse a video: they do not need to know the exact position and speed values of moving objects to infer that they move at a fast speed. The security staff simply analyse the object displacement using the closest static objects as reference. The output of this component is therefore obtained from an analysis that determines whether an object moves at a normal speed with regard to such a displacement. To do that, horizontal and vertical fuzzy partitions of the 2D images are built. This information is used by an inductive learning algorithm to generate a set of constraints that establish how the horizontal, vertical and global displacements of each kind of object are performed in each region of the fuzzy partition. An example of a generated constraint is: the speed of vehicles in regions far from the camera is normal if the horizontal displacement is small.
A more detailed discussion on the definition of this concept is given in [3]. Next, the application of the proposed formal model for the definition and analysis of trajectories is described in depth. Nevertheless, Section 6.5 will discuss the results obtained by the two normality components previously introduced.
6.4.1 Normal Trajectory Concept

A trajectory is defined by means of three different kinds of constraints: i) role constraints, which specify what kinds of objects are allowed to follow a trajectory; ii) spatial constraints, which determine the sequence of regions through which moving objects must move (possibly with an associated order); and iii) temporal constraints, which refer to the maximum period of time or the time interval allowed to follow a trajectory.

In this work, the regions used to establish trajectories are represented by polygons graphically defined over a 2D image by means of a knowledge acquisition tool. In this way, a single sequence of regions may represent a trajectory that comprises multiple paths whose points are close to each other (see Figure 6.4). On the other hand, it is uncommon for two similar paths to correspond to a single trajectory.
Fig. 6.4 (a) Group of trajectories followed by vehicles and marked by means of continuous lines over the scene. (b) Group of trajectories without the background image that represents the scene. (c) Example of sequence of zones or regions used to define the previous group of trajectories.
In order to define trajectories, the following information is required:

• The set of relevant regions of the monitored environment.
• The regions where each kind of object may be located at a particular time.
• The temporally ordered sequence of regions covered by an object since it was first identified.
• The kinds of monitored objects, determined by means of physical features such as height, width, horizontal and vertical displacements, etc.
• The identifiers associated with each moving object by the tracking process.
• A temporal reference to the current moment.
• The time spent to carry out each of the defined trajectories.

This information justifies the choice of variables for the definition of the trajectory concept discussed in the next section.
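As a brief illustration of the information listed above, the following sketch groups it into a plain data structure. The structure and its field names are assumptions made for this example; they are not the chapter's formal notation, which is introduced in the next section.

// Illustrative grouping of the data needed to define a normal trajectory.
#include <optional>
#include <string>
#include <vector>

struct Point { double x, y; };
struct Region {                              // relevant polygonal area a_i
    std::string id;
    std::vector<Point> polygon;              // ordered vertices
};
struct TimeInterval { std::string start, end; };

struct TrajectoryDefinition {
    std::vector<std::string> allowedClasses;   // kinds of objects allowed (role)
    std::vector<std::string> regionSequence;   // ordered regions to cover (spatial)
    std::optional<double> maxDurationSeconds;  // maximum allowed duration (temporal)
    std::optional<TimeInterval> interval;      // allowed time interval (temporal)
};

int main() {
    // A vehicle trajectory through regions a1..a6, to be completed in 150 s
    // (this corresponds to instance c11,1 of Table 6.8 below).
    TrajectoryDefinition t{{"vehicle"},
                           {"a1", "a2", "a3", "a4", "a5", "a6"},
                           150.0, std::nullopt};
    return static_cast<int>(t.regionSequence.size()) == 6 ? 0 : 1;
}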
6.4.2 Set of Variables for the Trajectory Definition

As previously specified in the model formalisation, a concept is defined by means of a 3-tuple c_i = <V_i; DDV_i; Φ_i>, where V_i is the set of variables needed to represent the concept c_i, DDV_i is the definition domain of each variable belonging to V_i and, finally, the concept definition is completed through the association of a set of constraints Φ_i. The variables used in the trajectory component are described next taking this information into account.

(a) The system must be aware of the division of the environment into regions, and of the regions where a moving object may be located. Within this context, the system needs an internal representation of the most relevant regions in order to establish a relation between the object positions and the previously defined regions.

Table 6.2 Set of regions A that represent the regions of activity in a monitored environment.

Variable   DDV_A                  Description
A          {a_1, a_2, ..., a_n}   Set of areas belonging to a monitored environment. Each a_i is represented through a polygon, defined by means of a set of ordered points.
(b) Temporally ordered sequence of areas covered by an object. Table 6.3 completes the set of variables describing the environment regions and the position of each object. Each μ_a(obj) represents the intersection of an object obj with a particular area a. This information is not generated by the component but obtained from the low-level modules, which use the data provided by the segmentation and tracking processes to perform this task. The main reason for calculating the regions where an object is located in these low-level modules is that multiple modules or normality components may need this information. In addition, the system maintains a register (obj) for each object obj, where the set of regions covered by the object is stored in order to establish similarities with normal trajectories.

Table 6.3 Variables used, together with the set of areas A, to represent the object positions on the scene.

Variable (V_i)   DDV_i          Description
μ_a(obj)         [0, 1]         Imprecise information received from the low-level layers about the possibility of an object obj being located over an area a, where 1 represents that the object is totally located over it and 0 the contrary case.
(obj)            {μ_a(obj)}+    List of the values μ_a(obj) | μ_a(obj) > 0, updated until the current time.
(c) Membership of the object to one or multiple classes (see Table 6.4). Each μ_c(obj) represents the degree of membership of an object obj to the class c within the interval [0, 1], where 1 is the maximum membership value and 0 the minimum. The degree of membership of an object to a class is also calculated by the low-level layers, which use the spatial information of the segmentation and tracking processes. Therefore, both the regions where an object is located and its classes are the input of the component that analyses trajectories.

Table 6.4 Membership value of an object obj to a class c.

Variable (V_i)   DDV_i    Description
μ_c(obj)         [0, 1]   Imprecise information about the membership of the object to a class c (∀c ∈ O, ∃ μ_c(obj)).
(d) Temporal references. The system maintains a temporal reference to the current moment in which the analysis is being performed and, at the same time, it allows the maximum duration and the time intervals of the trajectories to be specified (see Table 6.5).

Table 6.5 Set of temporal constraints associated to trajectories.

Variable             DDV                                                                                    Description
t_c                  [1,31]∪{*} × [1,12]∪{*} × [1900,9999]∪{*} × [0,24]∪{*} × [0,59]∪{*} × [0,59]∪{*}       Absolute reference to the current moment. The format is DD/MM/YY - hh:mm:ss. The symbol * is used as a wild card.
t_max                t_max ∈ [0, 9999]                                                                      Maximum allowed duration for a trajectory, measured in seconds. This parameter is often used to define temporal constraints.
Int_j = <t_s, t_e>   DDV_{t_c} × DDV_{t_c}                                                                  Time interval defined through its initial (t_s) and final (t_e) moments. This variable is also used to define temporal constraints.
Finally, Table 6.6 shows the remaining variables, which are mainly used to define constraints.

Table 6.6 Variables ϒ, Ψ and Ψ(obj), used to define role and spatial constraints, respectively.

Variable   DDV              Description
ϒ          O                ϒ represents the set of classes/roles that are allowed to follow a certain trajectory. Each trajectory has an associated ϒ, used to define role constraints.
Ψ          DDV_Ψ ⊆ DDV_A    Ψ represents the sequence of areas that must be covered to perform a certain trajectory. Each trajectory has an associated sequence Ψ, used to define spatial constraints.
Ψ(obj)     DDV_Ψ            Maximum membership value of the object obj to each of the regions of Ψ covered until the current time. If an object does not satisfy the order imposed by Ψ (the object is not located on a region of the sequence Ψ, or the order has been broken), the list of covered regions Ψ(obj) becomes empty and is initialised to a void value: Ψ(obj) = ∅.
6.4.3 Preprocessing Modules

The preprocessing modules are responsible for providing the values of the variables used by the normality components to perform the analysis. Concretely, the normal trajectory component needs three preprocessors whose main functions are: i) tracking of objects, ii) estimation of the object position (the regions where it is located) and iii) object classification. The first one is especially important for the system to know which objects appear on the scene; in fact, the intermediate layers will not be able to analyse the behaviour of objects that were not previously recognised. The design and development of tracking algorithms is not one of the main goals of this work, but it is needed to carry out the proposed analysis. For this reason, OpenCV and the Blob Tracker Facility library [1] have been used to perform this task. The application blobtrack.cpp, implemented in C++, makes use of these libraries to segment and track objects from the video stream gathered by the security cameras; a detailed description of the whole process is given in [10]. The original application has been customised to extend its functionality and obtain more spatial data, providing information about the horizontal and vertical displacement of each object obj between consecutive frames, calculated as the difference between the central points of the ellipse that wraps the monitored object. In addition, the source code has been modified to generate a persistent log used by the preprocessing modules and the normality components.

The second preprocessor makes use of the scene knowledge and the information generated by the first preprocessor to estimate the regions where an object is located. That is, the preprocessor estimates the object location from the following information:

• The set of areas (A) defined over the scenario to monitor. Figure 6.5.b shows an example of scene division into multiple regions.
• The spatial information of each object, provided by the segmentation algorithm; in particular, the coordinates of the central point of the ellipse that frames the object, the object height and the object width.

The membership value of an object to a particular area (μ_a(obj)) is the result of dividing the number of points of the object base that lie over the area a by the total number of points of the object base. The object base corresponds to the lower half of the ellipse that wraps the object; the points that form this base are those that belong to the rectangle whose upper-left vertex is (x − width/2, y) and whose lower-right vertex is (x + width/2, y + height/2), x and y being the coordinates of the central point of the ellipse. To determine whether a point (x, y) belongs to a polygon defined by a set of points, W. Randolph Franklin's algorithm [15] is used, which is based on a previous algorithm [27].

Finally, the third preprocessor establishes the class or classes of a detected object. To do that, the output generated by the first preprocessor and a set of IF-THEN fuzzy rules automatically acquired by the inductive learning algorithm [3] are used. These rules, of the form IF v_0 is ZD_0 ∧ ... ∧ v_n is ZD_n THEN y_j, are
Fig. 6.5 (a) Scene monitored from a security camera. (b) Scene division into areas or regions.
characterised by antecedents composed of a set of variables V_i that take a subset of values ZD_i ⊂ DDV_{v_i}, and a consequent y_j that represents the object class if the rule is fired. Currently, the set of variables V employed to classify objects differs from [3] and is composed of the following variables:

• Horizontal (v_1) and vertical (v_2) location. The position and distance with regard to the camera affect the object dimensions.
• Width/height (W/H) ratio, v_3. This value is obtained by dividing the ellipse width by the ellipse height. Studying the object proportions instead of its size improves the classification results. This is due to the fact that segmentation algorithms are not perfect and, therefore, the estimated ellipse does not completely wrap the identified object most of the time; however, the proportions that characterise the class it belongs to are usually preserved.
• Ratio between horizontal and vertical displacements, v_4. This value is obtained by dividing the horizontal displacement of the object by its vertical displacement. In some scenarios, these variables may be a relevant discriminator between object classes. In the scene of Figure 6.5, the camera location and the routes where the vehicles drive imply that their horizontal displacements are normally larger than the vertical ones.
• Global displacement (MOV), v_5. This value is calculated as the Euclidean distance between the central points of the ellipse that wraps the object in consecutive frames. This variable is also interesting for distinguishing object classes, since some displacements are not realistic for certain kinds of objects. The length of the displacements varies depending on the region of the 2D image where they take place, that is, depending on the distance between the object and the camera: displacements in regions far from the camera are much smaller than those produced in close regions, which does not imply a shorter distance in the real scene.

The preprocessor also maintains a persistent register that stores how each object was classified at every moment. Depending on this information and the fired rules,
the membership values to each of the classes are calculated. Algorithm 6.1 specifies the steps followed to determine the membership to the set of classes O, and Figure 6.6 shows an example that describes this classification process.

Algorithm 6.1. Object classification algorithm
Require: Detected object in the current frame. The values of the variables (v_1, v_2, v_3, v_4, v_5) are (x_1, x_2, x_3, x_4, x_5).
Ensure: Degrees of membership to the classes pedestrian (y_1) and vehicle (y_2).
1. Calculate the degree of satisfaction of each rule.
2. FOR EACH class o_i ∈ O:
   a) Calculate the current membership value without taking previous classifications into account. If multiple rules of the same class are fired, the value used is the highest degree of satisfaction among them. If no rule is fired, the membership value is 0.
   b) The final membership value to each class is calculated as the arithmetic mean of the previous membership values and the current one.
   END FOR
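The sketch below illustrates step 2 of Algorithm 6.1: the membership of an object to each class is kept as the arithmetic mean of the per-frame memberships obtained so far (0 in frames where no rule of that class fires). The rule evaluation itself is omitted, and the class names and data structures are assumptions made for this illustration.

// Sketch of the membership update of Algorithm 6.1 (rule firing omitted).
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct ClassMembershipHistory {
    std::map<std::string, std::vector<double>> perFrame;  // class -> memberships

    // Record, for each tracked class, the best degree among its fired rules.
    void addFrame(const std::map<std::string, double>& bestRuleDegree) {
        for (auto& [cls, history] : perFrame)
            history.push_back(bestRuleDegree.count(cls) ? bestRuleDegree.at(cls) : 0.0);
    }
    double membership(const std::string& cls) const {
        const auto& h = perFrame.at(cls);
        if (h.empty()) return 0.0;
        double sum = 0.0;
        for (double v : h) sum += v;
        return sum / h.size();
    }
};

int main() {
    ClassMembershipHistory obj;
    obj.perFrame["pedestrian"];  // track both classes from the start
    obj.perFrame["vehicle"];
    obj.addFrame({{"vehicle", 0.9}});                       // frame t1
    obj.addFrame({{"vehicle", 0.7}, {"pedestrian", 0.2}});  // frame t2
    obj.addFrame({});                                       // frame t3: no rule fired
    std::cout << "vehicle: "    << obj.membership("vehicle")    << '\n'   // (0.9+0.7+0)/3
              << "pedestrian: " << obj.membership("pedestrian") << '\n';  // (0+0.2+0)/3
    return 0;
}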
Fig. 6.6 Object classification example. The first row represents the time, where t_i < t_{i+1}. The second row shows the shape and size of the ellipse that wraps the object at each t_i. Finally, the last row shows the object classification. If the image that represents the class has a numeric value associated to the right, it means that the object was classified at t_i with such a degree of membership. The final membership value is calculated as the arithmetic mean of the previous membership values and the current one.
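To close this section, the following sketch illustrates the region-estimation preprocessor described above: the fraction of the points of the object base that fall inside a polygonal area gives μ_a(obj). The point-in-polygon test is a simplified crossing-number (ray casting) routine in the spirit of Franklin's algorithm [15]; it is our own illustrative version, not the original implementation, and the grid sampling of the base rectangle is an assumption.

// Sketch of the region-estimation preprocessor (mu_a(obj) by base sampling).
#include <iostream>
#include <vector>

struct Point { double x, y; };

// Classic crossing-number test: true if p lies inside the polygon.
bool insidePolygon(const std::vector<Point>& poly, Point p) {
    bool inside = false;
    for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
        const bool crosses = (poly[i].y > p.y) != (poly[j].y > p.y);
        if (crosses &&
            p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) /
                      (poly[j].y - poly[i].y) + poly[i].x)
            inside = !inside;
    }
    return inside;
}

// mu_a(obj): fraction of the base points (lower half of the bounding ellipse,
// sampled here on a coarse unit grid) that lie inside area a.
double areaMembership(const std::vector<Point>& area,
                      Point centre, double width, double height) {
    int insideCount = 0, total = 0;
    for (double x = centre.x - width / 2; x <= centre.x + width / 2; x += 1.0)
        for (double y = centre.y; y <= centre.y + height / 2; y += 1.0) {
            ++total;
            if (insidePolygon(area, {x, y})) ++insideCount;
        }
    return total == 0 ? 0.0 : static_cast<double>(insideCount) / total;
}

int main() {
    std::vector<Point> area = {{0, 0}, {40, 0}, {40, 40}, {0, 40}};  // square region
    std::cout << areaMembership(area, {35, 30}, 20.0, 20.0) << '\n'; // partial overlap
    return 0;
}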
6.4.4 Constraint Definition

The last step to complete the normality definition of the trajectory concept is the specification of the set of general constraints Φ_i included in the tuple <V_i, DDV_i, Φ_i>. According to the formal model, the satisfaction of these constraints allows the surveillance system to infer whether the moving objects follow trajectories in a correct way, together with the degree of normality associated with the concept. A high value of this parameter implies that a monitored object follows in a suitable
way one of the normal trajectories defined for the monitored environment. The defined constraints are as follows.

Role constraint μ_{11}. Each trajectory is associated with a list ϒ that specifies the kinds of objects that are allowed to follow it. Only those objects that belong to a class of ϒ satisfy this constraint. For instance, objects classified as pedestrians will not meet the role constraint of a trajectory defined for vehicles. The membership function μ_{11} of this constraint is defined as follows:

\mu_{11}(obj) = \mu_{c_k}(obj) \iff \forall c \neq c_k,\; c, c_k \in \Upsilon:\; \mu_{c_k}(obj) \geq \mu_{c}(obj)

where μ_c(obj) represents the membership value of the object obj to the class c, and μ_{c_k}(obj) the maximum membership value to a class in ϒ. For example, if ϒ = {Person, Group of People} and the object obj has been classified as Person = 0.3 and Group of People = 0.7, then the degree of satisfaction of the constraint is μ_{11}(obj) = 0.7, that is, max(μ_{Person}(obj), μ_{GroupOfPeople}(obj)).

Spatial constraint μ_{21}. The spatial constraint included in the trajectory component checks whether an object follows the order of the sequence of regions defined for the trajectory (Ψ). Every time the component receives spatial information from the low-level layers, it calculates the membership value to each of these regions depending on the object position. Then, the register (obj) (see Table 6.3) and Ψ(obj) are updated only if one of the maximum values stored in the sequence is exceeded (see Table 6.6). For each trajectory associated with an object, a list Ψ(obj) is managed in order to store the maximum membership values to the regions of Ψ covered by the object. In other words, if the maximum membership value of an object to an area a_k ∈ Ψ is μ_{a_k}(obj) = 0.8 and the current membership is μ_{a_k}(obj) = 0.3, then the value of μ_{a_k}(obj) in Ψ(obj) is not updated. In this way, extremely low values during transitions between regions are avoided: when an object moves from one region to another, the membership value to the previous region may be really low, which could cause an unwarranted violation of the spatial constraint. The degree of satisfaction of the spatial constraint is calculated as the arithmetic average of all the μ_{a_k}(obj) ∈ Ψ(obj). As shown in Table 6.6, if the object violates the order relation defined in Ψ, the value of Ψ(obj) will be ∅ and, therefore, the degree of satisfaction of the constraint will be 0. The membership function μ_{21} of this constraint is defined as follows:

\mu_{21}(obj) = \begin{cases} \dfrac{\sum_{a_k \in \Psi(obj)} \mu_{a_k}(obj)}{|\Psi(obj)|} & \text{if } |\Psi(obj)| > 0 \\ 0 & \text{otherwise} \end{cases}

Temporal constraints μ_{31} and μ_{41}. The temporal constraints specify when a particular trajectory is allowed to be followed. The first kind of temporal constraint (μ_{31}) is used to check whether a certain time limit t_max is exceeded; that is, an object follows a certain trajectory in a correct way if the trajectory is completed in a time shorter than t_max. The membership function μ_{31} of this constraint is defined as follows:

\mu_{31}(t_{max}, t_c, t_b) = \begin{cases} 1 & \text{if } (t_{max} = \emptyset) \vee ((t_c - t_b) \leq t_{max}) \\ 0 & \text{otherwise} \end{cases}
where t_c represents the current time and t_b the time when the object began the trajectory. The second kind of temporal constraint (μ_{41}) is used to check whether an object follows a certain trajectory within the suitable time interval. To do that, five temporal relationships between moments and intervals have been defined, based on Allen's Interval Algebra [4] and the work developed by Lin [19]. Table 6.7 shows these relationships.

Table 6.7 Temporal relationships between moments and time intervals.

Relationship (t_c, Int_j)   Logic definition
Before                      t_c < start(Int_j)
After                       end(Int_j) < t_c
During                      start(Int_j) < t_c < end(Int_j)
Starts                      start(Int_j) = t_c
Finish                      end(Int_j) = t_c
Formally, the constraint μ_{41} is defined as follows:

\mu_{41}(Int_j, t_c) = \begin{cases} 1 & \text{if } (Int_j = \emptyset) \vee Starts(t_c, Int_j) \vee During(t_c, Int_j) \vee Finish(t_c, Int_j) \\ 0 & \text{otherwise} \end{cases}

An object satisfies a temporal constraint with an associated interval Int_j if the current moment t_c belongs to that interval, which happens when one of the following conditions is met: Starts(t_c, Int_j), During(t_c, Int_j) or Finish(t_c, Int_j).
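Before the instances are presented, the sketch below shows how the constraint families defined above could be evaluated for one trajectory instance. The structures are illustrative assumptions: time is simplified to seconds instead of the DD/MM/YY-hh:mm:ss format of Table 6.5, and the example values correspond to instance c11,1 of Table 6.8.

// Sketch of role (mu_11), spatial (mu_21) and temporal (mu_31, mu_41) constraints.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Instance {
    std::vector<std::string> allowedClasses;       // Upsilon
    std::vector<std::string> regionSequence;       // Psi
    double maxDuration = -1;                       // t_max (< 0 means "not set")
    double intervalStart = -1, intervalEnd = -1;   // Int_j (< 0 means "not set")
};

// mu_11: best membership of the object to any allowed class.
double roleConstraint(const Instance& in, const std::map<std::string, double>& classMembership) {
    double best = 0.0;
    for (const auto& c : in.allowedClasses)
        if (classMembership.count(c)) best = std::max(best, classMembership.at(c));
    return best;
}
// mu_21: average of the stored maximum memberships Psi(obj); 0 if the order was broken.
double spatialConstraint(const std::vector<double>& psiObj) {
    if (psiObj.empty()) return 0.0;
    double sum = 0.0;
    for (double v : psiObj) sum += v;
    return sum / psiObj.size();
}
// mu_31: 1 if no limit is set or the elapsed time does not exceed t_max.
double durationConstraint(const Instance& in, double tBegin, double tCurrent) {
    return (in.maxDuration < 0 || tCurrent - tBegin <= in.maxDuration) ? 1.0 : 0.0;
}
// mu_41: 1 if no interval is set or t_current Starts/During/Finishes Int_j.
double intervalConstraint(const Instance& in, double tCurrent) {
    if (in.intervalStart < 0) return 1.0;
    return (tCurrent >= in.intervalStart && tCurrent <= in.intervalEnd) ? 1.0 : 0.0;
}

int main() {
    Instance c11_1{{"vehicle"}, {"a1", "a2", "a3", "a4", "a5", "a6"}, 150.0, -1, -1};
    std::map<std::string, double> obj = {{"vehicle", 0.8}, {"pedestrian", 0.1}};
    std::vector<double> psiObj = {0.9, 0.7, 0.8};  // max memberships to a1, a2, a3 so far
    std::cout << roleConstraint(c11_1, obj) << ' '
              << spatialConstraint(psiObj) << ' '
              << durationConstraint(c11_1, 10.0, 120.0) << ' '
              << intervalConstraint(c11_1, 120.0) << '\n';  // 0.8 0.8 1 1
    return 0;
}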
The next step after defining the concept in a general way and developing the normality component is to create instances for particular environments. Tables 6.8 and 6.9 show the set of particular instances defined for the scene of Figure 6.5. Each row represents an instance of the normal trajectory concept, while each column reflects the particularisation of one of the defined constraints.

Table 6.8 Set of instances of the normal trajectory concept for vehicles.

Instance   μ̃_11 (ϒ)      μ̃_21 (Ψ)                                               μ̃_31    μ̃_41
c11,1      {vehicle}      {a1, a2, a3, a4, a5, a6}                                150 s   ∅
c11,2      {vehicle}      {a1, a2, a3, a4, a5, a6, a8, a9, a10, a22}              150 s   ∅
c11,3      {vehicle}      {a1, a2, a3, a4, a5, a6, a8, a9, a11, a13, a14, a15}    150 s   ∅
c11,4      {vehicle}      {a7, a8, a9, a10, a22}                                  150 s   ∅
c11,5      {vehicle}      {a7, a8, a9, a11, a13, a14, a15}                        150 s   ∅
c11,6      {vehicle}      {a7, a8, a9, a11, a5, a6}                               150 s   ∅
c11,7      {vehicle}      {a23, a12, a11, a5, a6}                                 150 s   ∅
c11,8      {vehicle}      {a23, a12, a11, a13, a14, a15}                          150 s   ∅
c11,9      {vehicle}      {a23, a12, a9, a11, a5, a6, a8, a9, a10, a22}           150 s   ∅

Table 6.9 Set of instances of the normal trajectory concept for pedestrians. HU is the identifier of the time interval defined for the opening hours of the University.

Instance   μ̃_11 (ϒ)        μ̃_21 (Ψ)                      μ̃_31   μ̃_41
c11,10     {pedestrian}     {a29, a30, a31}                ∅      [HU]
c11,11     {pedestrian}     {a31, a30, a29}                ∅      [HU]
c11,12     {pedestrian}     {a18, a17, a16}                ∅      ∅
c11,13     {pedestrian}     {a16, a17, a18}                ∅      ∅
c11,14     {pedestrian}     {a16, a4, a28, a13, a19}       ∅      ∅
c11,15     {pedestrian}     {a19, a13, a28, a4, a16}       ∅      ∅
c11,16     {pedestrian}     {a19, a20, a21}                ∅      ∅
c11,17     {pedestrian}     {a21, a20, a19}                ∅      ∅
c11,18     {pedestrian}     {a24, a25}                     ∅      ∅
c11,19     {pedestrian}     {a25, a24}                     ∅      ∅
c11,20     {pedestrian}     {a26, a27}                     ∅      ∅
c11,21     {pedestrian}     {a27, a26}                     ∅      ∅
6.5 Experimental Results

The scenario chosen for validating and evaluating the model and the designed normality components is graphically shown in Figure 6.5. It is an urban traffic environment where both vehicles and pedestrians are subject to traffic rules, such as correct trajectories or a suitable speed for each stretch. This environment is monitored by a security camera located at the ORETO research group of the University of Castilla-La Mancha.

The experimental validation consists in the monitoring of ten video scenes with a duration between 2'30" and 3'. One of the premises was to select visual material that covered multiple situations, with different illumination conditions and image resolutions. Figures 6.7 and 6.8 show the degree of illumination of each of the scenes employed to test the system; scenes 7 to 10 have a lower resolution. The main goal of the conducted experiments is to evaluate the proposed methods to classify objects and to analyse the normality of trajectories and speed. For this reason, moving objects that are not detected by the tracking algorithm, which are global errors of the surveillance system, are not reflected in the results discussed in this work; in other words, the normality components cannot analyse the behaviour of undetected objects, and these errors are due to the low-level processing of spatio-temporal information.

Each object detected in a single frame is considered a situation to be analysed. For each of these situations, the system checks whether the classification, the region calculation, the trajectory analysis, the speed analysis and the global analysis are
Fig. 6.7 Screenshots of the scenes employed in the tests.
Fig. 6.8 Set of histograms that represent the brightness and contrast values of each of the test scenes. The higher the values on the Y axis, the higher the illumination of the scene. The first six scenes correspond to a cloudy day in which the objects lack marked shadows, while the remaining four have better illumination conditions.
normal. The matches and errors of the normality components and the global analysis are classified into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), each of them representing the following situations:

• True positive (TP): normal situation correctly classified as normal.
• True negative (TN): anomalous situation correctly classified as anomalous.
• False positive (FP): anomalous situation incorrectly classified as normal.
• False negative (FN): normal situation incorrectly classified as anomalous.
Furthermore, for each of the evaluated scenes, the relationships between these parameters have been established by means of the following coefficients [14]:

• Sensitivity or true positive rate (TPR): TPR = TP/(TP + FN), that is, the hit rate.
• False positive rate (FPR): FPR = FP/(FP + TN).
• Precision or accuracy (ACC): ACC = (TP + TN)/(P + N), where P is the number of positive cases and N the number of negative cases.
• Specificity (SPC) or true negative rate: SPC = TN/N = TN/(FP + TN) = 1 − FPR.
• Positive predictive value (PPV): PPV = TP/(TP + FP).
• Negative predictive value (NPV): NPV = TN/(TN + FN).
• False discovery rate (FDR): FDR = FP/(FP + TP).

Next, the results obtained according to the previous criteria are discussed. In particular, Table 6.10 shows the success rate of the object classification process for each of the ten test scenes. The first three causes of error are due to the classification algorithm developed in this work, while the fourth one is due to the computer vision algorithms used in the low-level layers and can hardly be avoided by the classification process. As can be appreciated, the classification method obtains good results: approximately a 95% success rate on average, and 91% in the worst case.
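The following short sketch computes the coefficients defined above, using the scene 1 counts of Table 6.13 (TP = 6928, TN = 165, FP = 71, FN = 335) as a worked example; the printed values match the corresponding row of that table up to rounding.

// Evaluation coefficients computed from TP/TN/FP/FN (scene 1 of Table 6.13).
#include <iostream>

int main() {
    const double TP = 6928, TN = 165, FP = 71, FN = 335;
    const double P = TP + FN, N = FP + TN;
    std::cout << "TPR = " << TP / (TP + FN)     << '\n'   // ~0.95
              << "FPR = " << FP / (FP + TN)     << '\n'   // ~0.30
              << "ACC = " << (TP + TN) / (P + N) << '\n'  // ~0.94
              << "SPC = " << TN / (FP + TN)     << '\n'   // ~0.70
              << "PPV = " << TP / (TP + FP)     << '\n'   // ~0.99
              << "NPV = " << TN / (TN + FN)     << '\n'   // ~0.33
              << "FDR = " << FP / (FP + TP)     << '\n';  // ~0.01
    return 0;
}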
Table 6.10 Results obtained in the object classification process for pedestrians and vehicles. The most common error causes are: a) wrong fired rule; b) no rule fired, the system being unable to classify the object; c) historic, the system takes previous classifications of the object into account, making errors in current classifications if the previous ones were incorrect; d) ellipse change between two objects that cross at a point (critical when the objects belong to different classes).

Object classification
Scene     Success        Errors: Wrong rule   Rules not fired   Historic   Ellipse change
1         7176 (95%)     287                  1                 35         0
2         6760 (96%)     188                  1                 56         0
3         6657 (95%)     227                  28                49         30
4         11065 (99%)    84                   4                 0          19
5         1397 (99%)     0                    1                 0          0
6         2242 (93%)     153                  1                 0          0
7         7498 (95%)     345                  0                 15         8
8         2866 (97%)     19                   4                 0          64
9         6115 (91%)     359                  0                 123        92
10        1887 (98%)     3                    1                 0          27
AVERAGE   95.8%
Table 6.11 shows the results obtained in the process of estimating the regions where the detected objects are located, taking into account the intersection between their support (base) points and the polygons that represent the regions. This process mainly depends on the robustness of the tracking algorithm and on the association of the ellipses that wrap the objects: if the ellipses do not wrap most of the support points of the objects, the error rate of this process will be high. This class of errors implies the violation of spatial constraints, generating false positives and false negatives.
Table 6.11 Results obtained in the estimation of regions where the detected objects are located.
Estimation of object location (region calculation)
Scene     Success         Errors: Wrong ellipse   Object with multiple ellipses
1         7069 (94%)      238                     192
2         7001 (99%)      4                       0
3         6920 (98%)      71                      0
4         11130 (99%)     23                      16
5         1398 (100%)     0                       0
6         2396 (100%)     0                       0
7         7732 (98%)      137                     0
8         2910 (98%)      26                      17
9         6654 (99%)      35                      0
10        1906 (99%)      12                      0
AVERAGE   98.4%
Table 6.12 shows the results obtained in the normality analysis of the trajectory concept. The average success rate in this case is 96.4%. Some of the errors made are due to errors in previous processes, such as object misclassification or wrong region estimation. These errors occasionally cause the violation of the constraints of the trajectory concept, so that the monitored object has no associated normal trajectory, generating false positives or false negatives. The errors made by the preprocessors that provide the information to the normality components do not always imply errors in those components. For instance, a vehicle misclassified as a pedestrian while located on a pedestrian crossing, or a pedestrian located on an allowed area who is estimated to be on a different, but also allowed, area: in this last situation, although the region estimation was not correct, the system still infers a correct result. For this reason, the number of errors when classifying objects or estimating regions may be higher than in the trajectory analysis.

The output of the normality components for each analysed frame is a numeric value used to study the membership to the different normality intervals: absolutely normal, possibly normal, suspicious, possibly abnormal and absolutely abnormal. Any normal situation with an exclusive membership to the absolutely normal or possibly normal intervals is considered a success of the trajectory component and, in the same way, any anomalous situation with an exclusive membership to the absolutely abnormal or possibly abnormal intervals is also considered a success. The contrary cases imply false positives and false negatives, which are considered errors. Regarding the suspicious interval, an error is counted in the following cases: i) a normal situation where the membership to this interval is higher than the membership to the possibly normal interval, and ii) an anomalous situation where the membership to the suspicious interval is higher than the membership to the possibly abnormal interval. Whether suspicious situations are considered critical or not should rely on the layers responsible for the decision-making process, depending on the desired security level.
6
Normality Components for Intelligent Surveillance
135
making process, depending on the desired security level. The more strict the system, the higher the probability of alarm activation and the lower the probability of undetected anomalous situations. Table 6.12 Results obtained by the trajectory analysis component. Labels C1 (object misclassification), C2 (wrong position of the ellipse) and C3 (ellipse swap between objects) refer to error causes. Number Scene of frames
Trajectory analysis (Component 1) Number of Success situations Normal
1 2 3 4 5 6 7 8 9 10
3943 3000 3000 3250 995 1471 5233 1906 2165 772
7224 7142 6751 11051 1398 2343 7831 2738 6390 1843
Abnormal
275 137 240 121 0 53 35 215 299 75 TOTAL
Errors C1
7093 (94%) 6847 (97%) 6731 (96%) 11026 (99%) 1397 (99%) 2242 (93%) 7590 (96%) 2920 (98%) 6420 (95%) 1875 (97%) 96,4 %
184 154 209 88 1 154 134 7 234 4
C2
C3
222 0 4 0 21 30 39 19 0 0 0 0 134 8 26 0 35 0 12 27
Table 6.13 shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for the trajectory component and the relationships between them. These parameters are used to measure the quality of the methods proposed in this work, showing the number of normal situations classified as normal, the number of detected anomalous situations, and the number of wrong alarm activations. The results shown in this table are satisfactory, since the number of false positives and false negatives is low in relation to the total number of analysed situations. The most critical cases are the anomalous situations classified as normal, that is, the false positives (FP); the table shows that this parameter is very low in the ten test scenes. Table 6.14 shows the results obtained in the speed classification process from the normality point of view, calculated by the second normality component developed in this work. In this case, the average success rate is 98.7%. In the conducted experiments, both vehicles and pedestrians usually move at a normal speed; when they did not, most of the anomalous behaviours were detected. Since this normality component is based on the displacement of the ellipse that wraps the monitored object, partial occlusions must be taken into account, because they might cause abrupt changes in the ellipse position and, therefore, in the speed estimation. In other words, while a partial occlusion is taking place,
Table 6.13 Existing relationships between true/false positives and true/false negatives in the trajectory component.

Coefficient calculation for the trajectory component
Scene | TP | TN | FP | FN | TPR | FPR | ACC | SPC | PPV | NPV | FDR
1 | 6928 | 165 | 71 | 335 | 0.95 | 0.30 | 0.94 | 0.699 | 0.98 | 0.330 | 0.01
2 | 6847 | 0 | 0 | 158 | 0.98 | - | 0.977 | - | 1 | 0 | 0
3 | 6684 | 47 | 5 | 255 | 0.963 | 0.096 | 0.963 | 0.904 | 0.999 | 0.156 | 0.001
4 | 11026 | 0 | 4 | 142 | 0.987 | 1 | 0.987 | 0 | 1 | 0 | 0
5 | 1397 | 0 | 0 | 1 | 0.99 | - | 0.99 | - | 1 | 0 | 0
6 | 2228 | 14 | 0 | 154 | 0.935 | 0 | 0.936 | 1 | 1 | 0.083 | 0
7 | 7590 | 0 | 0 | 276 | 0.965 | - | 0.965 | - | 1 | 0 | 0
8 | 2766 | 154 | 26 | 7 | 0.997 | 0.144 | 0.989 | 0.856 | 0.991 | 0.957 | 0.009
9 | 6181 | 239 | 0 | 269 | 0.958 | 0 | 0.96 | 1 | 1 | 0.470 | 0
10 | 1875 | 0 | 0 | 43 | 0.978 | - | 0.978 | - | 1 | 0 | 0
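For reference, the coefficients reported in Tables 6.13, 6.15 and 6.17 follow the standard definitions, where a positive corresponds to a normal situation. A minimal sketch (the rounding to three decimals is ours):

```python
def coefficients(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard detection coefficients; None where the denominator is zero
    (printed as '-' in the tables)."""
    def ratio(num, den):
        return round(num / den, 3) if den else None
    return {
        "TPR": ratio(tp, tp + fn),                 # normal situations kept as normal
        "FPR": ratio(fp, fp + tn),                 # anomalies missed, relative to all anomalies
        "ACC": ratio(tp + tn, tp + tn + fp + fn),  # overall accuracy
        "SPC": ratio(tn, fp + tn),                 # anomalies correctly detected
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "FDR": ratio(fp, fp + tp),
    }

# Counts of scene 1 in Table 6.13:
print(coefficients(tp=6928, tn=165, fp=71, fn=335))
```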
Table 6.14 Results obtained by the normality component of speed analysis. Labels C1 (object misclassification), C2 (abrupt position change or wrong size of the ellipse), C3 (wrong fired rules), and C4 (ellipse swap between objects of different class) refer to error causes.

Speed analysis (Component 2)
Scene | Number of frames | Normal situations | Abnormal situations | Success | C1 | C2 | C3 | C4
1 | 3943 | 7224 | 275 | 7478 (99%) | 0 | 0 | 21 | 0
2 | 3000 | 7142 | 137 | 6923 (98%) | 40 | 0 | 25 | 17
3 | 3000 | 6751 | 240 | 6842 (97%) | 0 | 25 | 124 | 0
4 | 3250 | 11051 | 121 | 11095 (99%) | 0 | 0 | 77 | 0
5 | 995 | 1398 | 0 | 1398 (100%) | 0 | 0 | 0 | 0
6 | 1471 | 2343 | 53 | 2344 (97%) | 13 | 39 | 0 | 0
7 | 5233 | 7831 | 35 | 7813 (99%) | 0 | 18 | 35 | 0
8 | 1906 | 2738 | 215 | 2943 (99%) | 0 | 0 | 10 | 0
9 | 2165 | 6390 | 299 | 6639 (99%) | 0 | 15 | 10 | 25
10 | 772 | 1843 | 75 | 1918 (100%) | 0 | 0 | 0 | 0
TOTAL | | | | 98.7% | | | |
the ellipse wraps only the visible part of the object and might change its position abruptly when the previously invisible part becomes visible again. In the test environment, this phenomenon happens in three places (see Figure 6.5): i) the lower left part, occluded by tree branches; ii) the upper central part, occluded by the bushes of the roundabout; and iii) the right central part, occluded by the fence of the building. The learning algorithm deals with this problem, since it learns that displacements in this kind of region may be larger, so the object speed is still considered normal.
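The speed component itself is defined through learned fuzzy rules; the simplified, crisp sketch below only illustrates the underlying idea of measuring the displacement of the wrapping ellipse and tolerating larger displacements in regions where occlusions were observed during training. All names and the linear decay are illustrative assumptions, not the authors' rule base.

```python
import math

def ellipse_speed(prev_centre, curr_centre, dt):
    """Displacement of the wrapping ellipse's centre per unit time."""
    dx = curr_centre[0] - prev_centre[0]
    dy = curr_centre[1] - prev_centre[1]
    return math.hypot(dx, dy) / dt

def speed_normality(speed, region, learned_max):
    """Degree of normality in [0, 1] for a measured speed.
    learned_max[region] is the largest displacement observed as normal in that
    region during training, so occlusion-prone regions tolerate larger jumps."""
    limit = learned_max.get(region, 1.0)
    if speed <= limit:
        return 1.0
    # The degree decays linearly once the learned limit is exceeded.
    return max(0.0, 1.0 - (speed - limit) / limit)
```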
Table 6.15 shows the existing relationship between TP, TN, FP, and FN. The good results are a consequence of the reduced number of false positives and false negatives and the high number of true positives and true negatives.

Table 6.15 Existing relationships between true/false positives and true/false negatives in the normality component of speed analysis.

Coefficient calculation for component 2
Scene | TP | TN | FP | FN | TPR | FPR | ACC | SPC | PPV | NPV | FDR
1 | 7460 | 18 | 21 | 0 | 1 | 0.53 | 0.99 | 0.462 | 0.99 | 1 | 0.002
2 | 6786 | 137 | 0 | 82 | 0.99 | 0 | 0.988 | 1 | 1 | 0.626 | 0
3 | 6641 | 201 | 39 | 110 | 0.984 | 0.163 | 0.979 | 0.838 | 0.994 | 0.646 | 0.005
4 | 10993 | 102 | 15 | 62 | 0.994 | 0.128 | 0.993 | 0.872 | 0.999 | 0.622 | 0.001
5 | 1398 | 0 | 0 | 0 | 1 | - | 1 | - | 1 | - | 0
6 | 2344 | 0 | 39 | 13 | 0.994 | 1 | 0.978 | 0 | 0.984 | 0 | 0.02
7 | 7788 | 25 | 10 | 43 | 0.995 | 0.286 | 0.995 | 0.714 | 0.999 | 0.368 | 0.001
8 | 2918 | 25 | 10 | 0 | 1 | 0.286 | 0.997 | 0.714 | 0.997 | 1 | 0.003
9 | 6589 | 50 | 10 | 40 | 0.994 | 0.167 | 0.99 | 0.83 | 0.998 | 0.556 | 0.002
10 | 1843 | 75 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0
Tables 6.16 and 6.17 show the results of the global analysis process, which uses OWA aggregation operators to combine the output of the two normality components and obtain a global degree of normality. The errors in this process can be due to the first component, the second one, or both at the same time (this last situation is counted as a single error in the global analysis). Occasionally, the errors caused by one of the components do not produce an error in the global analysis, thanks to the weight vector used by the aggregation operator. For instance, a situation considered as suspicious by the first component and absolutely normal by the second one might be considered as possibly normal by the global analysis module. On the other hand, Figure 6.9 graphically shows the relationship between two of the parameters related to anomalous situations, TN and FP. The number of true negatives (TN) represents the number of anomalous situations detected and correctly understood by the system, while the number of false positives (FP) refers to the anomalous situations that go unnoticed. The higher the number of false positives and the lower the number of true negatives, the lower the quality and efficiency of the methods employed to understand events and situations. The diagrams of Figure 6.9 reflect that the methods proposed in this work not only detect the normal situations correctly but also the anomalous situations, which rarely take place. Tables 6.18 and 6.19 show the total and average times spent by each of the processes that compose the surveillance system. The estimation of the region where an object is located is the most time-consuming task. This is due to the fact that the algorithm checks whether every one of the support points of the object belongs to one
Table 6.16 Results obtained by the global module combining the analysis of the two normality components. Labels C1 and C2 refer to errors caused by the first and second components, respectively.

Global analysis: trajectories and speed
Scene | Number of frames | Normal situations | Abnormal situations | Success | C1 | C2 | Both
1 | 3943 | 7224 | 275 | 7072 (94%) | 406 | 21 | 0
2 | 3000 | 7142 | 137 | 6805 (97%) | 118 | 42 | 40
3 | 3000 | 6751 | 240 | 6504 (93%) | 260 | 149 | 0
4 | 3250 | 11051 | 121 | 10949 (98%) | 146 | 77 | 0
5 | 995 | 1398 | 0 | 1397 (99%) | 1 | 0 | 0
6 | 1471 | 2343 | 53 | 2190 (91%) | 154 | 39 | 13
7 | 5233 | 7831 | 35 | 7537 (95%) | 276 | 53 | 0
8 | 1906 | 2738 | 215 | 2910 (98%) | 33 | 10 | 0
9 | 2165 | 6390 | 299 | 6370 (95%) | 269 | 50 | 0
10 | 772 | 1843 | 75 | 1875 (97%) | 43 | 0 | 0
TOTAL | | | | 95.7% | | |
Table 6.17 Existing relationships between true/false positives and true/false negatives in the global normality analysis process.

Coefficient calculation for the global analysis
Scene | TP | TN | FP | FN | TPR | FPR | ACC | SPC | PPV | NPV | FDR
1 | 6889 | 183 | 92 | 335 | 0.95 | 0.33 | 0.94 | 0.67 | 0.99 | 0.353 | 0.01
2 | 6668 | 137 | 0 | 200 | 0.97 | 0 | 0.971 | 1 | 1 | 0.407 | 0
3 | 6334 | 248 | 44 | 365 | 0.946 | 0.151 | 0.941 | 0.849 | 0.993 | 0.405 | 0.01
4 | 10847 | 102 | 19 | 204 | 0.982 | 0.157 | 0.98 | 0.843 | 0.998 | 0.333 | 0.001
5 | 1397 | 0 | 0 | 1 | 0.99 | - | 0.99 | - | 1 | 0 | 0
6 | 2189 | 14 | 39 | 154 | 0.934 | 0.736 | 0.919 | 0.264 | 0.982 | 0.083 | 0.02
7 | 7512 | 25 | 10 | 319 | 0.959 | 0.286 | 0.958 | 0.714 | 0.999 | 0.073 | 0.001
8 | 2731 | 179 | 36 | 7 | 0.997 | 0.167 | 0.985 | 0.833 | 0.987 | 0.962 | 0.01
9 | 6081 | 289 | 10 | 309 | 0.952 | 0.033 | 0.95 | 0.967 | 0.998 | 0.483 | 0.001
10 | 1830 | 75 | 0 | 43 | 0.977 | 0 | 0.978 | 1 | 1 | 0.636 | 0
of the predefined regions. Even so, the time spent in the worst case is approximately 63 milliseconds, which allows the system to respond quickly to any event. New versions of the algorithms will be proposed in future work to reduce this time. The results obtained prove that the design of components by means of the formal model discussed in this work is feasible, gives high performance, offers very short response times and, finally, allows knowledge to be represented with high interpretability. Furthermore, the two developed components have been successfully combined, thanks to the use of OWA operators, to obtain a global evaluation of the objects' behaviour, which is normal if they follow a normal trajectory at a suitable speed.
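The OWA operators of Yager [34, 35] apply the weight vector to the component outputs after sorting them, which is what allows a suspicious score from one component to be softened by a normal score from the other. A minimal sketch with an illustrative weight vector (the actual weights used by the system are not reproduced here):

```python
def owa(values, weights):
    """Ordered Weighted Averaging (OWA): the weights are applied to the values
    after sorting them in descending order, not to fixed components."""
    assert len(values) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

# Illustrative weight vector biased towards the lower (more pessimistic) degree:
trajectory_degree, speed_degree = 0.45, 0.95   # e.g. suspicious trajectory, normal speed
global_degree = owa([trajectory_degree, speed_degree], weights=[0.3, 0.7])
```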
Fig. 6.9 Relationships between the anomalous situations detected by the normality components and the global analysis. True negatives (TN) represent detected anomalous situations, while false positives (FP) represent anomalous situations understood as normal.

Table 6.18 Time spent by the system processes, measured in seconds. Elem. is the number of elements to analyse, T is the total time spent, and M is the average time.

Scene | Elem. | Object classification T | Object classification M | Object location T | Object location M | Trajectory analysis T | Trajectory analysis M
1 | 7499 | 0.256 | 3×10⁻⁵ | 290.980 | 0.039 | 0.113 | 2×10⁻⁵
2 | 7005 | 0.199 | 2×10⁻⁵ | 154.996 | 0.022 | 0.142 | 2×10⁻⁵
3 | 6991 | 0.215 | 3×10⁻⁵ | 441.485 | 0.063 | 0.235 | 3×10⁻⁵
4 | 11172 | 0.376 | 3×10⁻⁵ | 371.468 | 0.033 | 0.272 | 2×10⁻⁵
5 | 1398 | 0.055 | 3×10⁻⁵ | 67.501 | 0.048 | 0.074 | 5×10⁻⁵
6 | 2396 | 0.069 | 2×10⁻⁵ | 88.928 | 0.037 | 0.125 | 5×10⁻⁵
7 | 7866 | 0.288 | 3×10⁻⁵ | 263.230 | 0.033 | 0.183 | 2×10⁻⁵
8 | 2953 | 0.080 | 2×10⁻⁵ | 178.272 | 0.060 | 0.134 | 4×10⁻⁵
9 | 6689 | 0.221 | 3×10⁻⁵ | 283.696 | 0.038 | 0.176 | 2×10⁻⁵
10 | 1918 | 0.070 | 3×10⁻⁵ | 58.321 | 0.028 | 0.064 | 3×10⁻⁵

For the speed analysis and the global analysis, the total times per scene range between 0.015 and 0.099 seconds, and the average times per analysed element between 7×10⁻⁶ and 3×10⁻⁵ seconds.
Table 6.19 Time spent by the learning process of fuzzy rules for object and speed classification.

Process | Examples of the training set | Time spent in learning (seconds) | Number of fired rules
Learning of rules for object classification | 4880 | 1.921 | 56
Learning of rules for speed classification | 2680 | 1.610 | 15
Although the success rates are high (between 91% and 99%), it is important to take into account that the duration of the test videos ranges from 2'30'' to 3'. This implies that future modifications are still needed to reduce the number of false alarm activations, above all for longer analysis times. Another issue to be addressed is the scalability of the system when the complexity of the monitored environment increases. The typical case study is analyzing a
high number of moving objects simultaneously. To deal with this kind of situation, the discussed components can be managed by software agents, which are deployed through the agent platform discussed in [32]. This alternative involves two main advantages regarding scalability: i) the knowledge of the surveillance components can be managed in an autonomous way by the agents, and ii) an agent responsible for a particular surveillance component can be replicated on multiple computers to distribute the workload when monitoring becomes complex. For instance, an agent in charge of analyzing trajectories can be replicated n times on m different computers in a way that is transparent to the rest of the agents. To quantitatively prove the scalability of the proposed system in this respect, Table 6.20 shows the time (measured in seconds) spent when monitoring scenario 1 of Figure 6.1 with different replication schemes and varying numbers of analysed situations per second (Ev/s). Column PATP involves the preprocessing agent (PA) and represents the processing time related to the object classification and location tasks; TATC and SATC represent the communication times between PA and the agents that monitor trajectories (TA) and speed (SA), respectively; TATP and SATP represent the processing times spent by the agents that analyse such surveillance components (trajectory analysis and speed analysis); ENATC and ENATP represent the communication and processing times, respectively, spent by the agent responsible for the global analysis. The experiments of Table 6.20 were executed over a network composed of three different computers so that the workload of analysing the two surveillance concepts could be distributed. The main conclusion obtained from the results shown in this table is that the system scales well when the rate of analysed video events per second increases, that is, when the number of monitored moving objects per second is higher. These results can be extrapolated to using an

Table 6.20 Set of conducted experiments varying the rate of fps (video events analysed per second) and the location of the computing nodes (L = localhost; LR = localhost with replication; D = distributed).

Test | Location | Ev/s | PATP | TATC | TATP | SATC | SATP | ENATC | ENATP
1 | L | 10 | 242.12 | 4.09 | 1.53 | 4.07 | 0.92 | 6.89 | 3.41
2 | L | 15 | 234.62 | 4.84 | 1.47 | 4.93 | 0.95 | 6.90 | 3.16
3 | L | 20 | 234.04 | 5.72 | 1.34 | 5.70 | 0.88 | 6.82 | 3.20
4 | L | 25 | 253.56 | 7.29 | 1.22 | 7.11 | 0.73 | 6.77 | 3.22
5 | LR | 10 | 340.90 | 5.80 | 1.34 | 5.83 | 1.50 | 7.38 | 3.89
6 | LR | 15 | 371.67 | 8.05 | 1.88 | 6.71 | 1.69 | 8.13 | 4.05
7 | LR | 20 | 394.70 | 13.10 | 1.79 | 12.35 | 14.81 | 8.37 | 4.21
8 | LR | 25 | 427.88 | 15.86 | 2.65 | 16.81 | 19.01 | 8.33 | 3.92
9 | D | 10 | 209.19 | 16.751 | 5.939 | 14.449 | 5.817 | 23.623 | 13.557
10 | D | 15 | 192.994 | 16.429 | 6.521 | 13.632 | 6.177 | 24.975 | 14.199
11 | D | 20 | 185.819 | 16.053 | 6.62 | 14.154 | 6.786 | 17.775 | 13.657
12 | D | 25 | 189.011 | 20.903 | 6.731 | 18.284 | 6.293 | 22.730 | 13.484
increasing number of normality components, since the agents themselves are replicated even though they monitor a common event of interest (e.g. trajectories). On the other hand, the goodness of the obtained results is also due to the robustness of the system against errors made by the low-level layers. To conclude this section, some of these errors and the solutions adopted to reduce the number of false positives and false negatives in event and behaviour understanding are discussed:

• The object classifier maintains a persistent register of the different classifications of each object from the moment it is first detected until it disappears; that is, the classification of an object does not depend only on the last analysed situation. In this way, if an object was correctly classified during a long period of time, a misclassification at particular moments does not cause classification errors (a minimal sketch of such a register is given after this list).
• The object classification is based on the object dimensions, movements and the regions where it is located, without depending on the exact size of the wrapping ellipse. Therefore, although the ellipse that wraps an object in the tracking process is not perfect, the classification algorithm will work correctly.
• The developed surveillance system also manages a persistent register of the regions covered by each object, together with the maximum degree of membership to each of them. In this way, occasional segmentation errors and wrong ellipses do not affect the degree of satisfaction of the spatial constraints. In other words, the system believes that the object follows the sequence of regions that form the normal trajectory, even if there are certain deviations at particular moments.
• Abrupt errors in the ellipse positioning caused by partial occlusions produced by static environmental elements do not cause errors in the speed estimation, because the proposed learning algorithm is able to learn the regions where this phenomenon takes place.
• The regions that compose the environment are classified as entry/exit areas, where objects appear and disappear, and intermediate areas. Occlusions might cause the system to lose the reference of an object and assign a new identifier to a previously detected object. The trajectory definition through the proposed formal model allows trajectories whose sequence of regions is composed only of intermediate areas. In this way, when an object is detected again on an intermediate area after an occlusion, the system is able to keep assigning normal trajectories to it.
• The global normality value for a particular object depends on the combination of the analyses carried out by each of the normality components. In this way, even if the evaluation of one component is not right, the final normality degree might be correct.
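As an illustration of the first point above, a persistent classification register can be as simple as a vote count over the object's classification history. This sketch is only indicative, since the chapter does not detail the exact mechanism:

```python
from collections import Counter

class ClassificationRegister:
    """Keeps the classification history of a tracked object so that an
    occasional misclassification does not change its reported class."""
    def __init__(self):
        self.votes = Counter()

    def update(self, frame_label: str) -> str:
        self.votes[frame_label] += 1
        # The reported class is the one supported by most frames so far.
        return self.votes.most_common(1)[0][0]

register = ClassificationRegister()
for label in ["vehicle", "vehicle", "pedestrian", "vehicle"]:
    current = register.update(label)   # remains "vehicle" despite one wrong frame
```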
6.6 Conclusions

Since their first developments, traditional video surveillance systems have been designed to monitor environments. However, these systems have several limitations to
satisfy the security demands posed by society. This need, together with the arrival of new technologies and the price reduction of dedicated security hardware, is among the relevant reasons why intelligent surveillance is currently one of the hot research topics. One of the main challenges in this field is to provide security expert systems with the autonomy and ability required to automatically understand events and analyse behaviours, in order to improve the productivity and effectiveness of surveillance tasks. In complex environments where multiple situations take place at the same time, human operators cannot deal with all of them. Artificial expert systems, on the other hand, do not have these limitations thanks to their processing capabilities. Furthermore, artificial systems are not affected by factors such as fatigue or tiredness, and they can be more effective than people when recognising certain kinds of events, such as the detection of suspicious or unattended objects. In the last few years, multiple second- and third-generation surveillance systems have been developed, both commercial and academic. Most of them have been designed to solve particular problems in specific scenarios. These systems provide several advantages over the traditional ones, but two goals need to be reached in order to maintain this progress: i) the design of scalable surveillance systems that allow new modules to be included, in order to increase the analysis capacity and to aggregate their outputs into a global view of the environment state; and ii) a higher flexibility to vary the behaviour of the artificial system depending on the requirements of the analysed environment and the kinds of monitored objects. The way of monitoring the behaviour of a particular object may change in relation to the rest of the objects. Depending on the kinds of objects that appear in the scenario and the established security level, the system must provide the flexibility to configure the existing analysis modules, so that a system designed in a global way can be adapted to particular environments. This work proposes a formal model for the design of scalable and flexible surveillance systems. The model is based on the definition of the normality of events and behaviours according to a set of independent components, which can be instantiated in particular monitored environments. These modules, denoted normality components, specify how each kind of object should ideally behave with respect to the aspect under surveillance, such as the trajectory it follows or its speed. The model also allows the components employed for each kind of object in each environment to be specified, increasing the flexibility. In addition, the outputs of these components are combined by means of aggregation operators to obtain a global view of the behaviour of each single object and of the whole environment. The integration of new normality components does not imply modifications to the rest, increasing the analysis capacity of the artificial system. Two normality components, trajectory and speed analysis, have been defined by means of this model. Both make use of the spatial information gathered by the security cameras, previously processed by the modules of the low-level layers. Although the analysis of trajectories and speed has been widely studied by other researchers, most of the existing approaches are based on the analysis of spatial and temporal information.
This information may be enough to recognise the route followed by
objects or to estimate their speed, but it is not enough for surveillance systems that need to deal with a larger number of factors. The proposed model defines the normality of a concept using a set of fuzzy constraints, which allows more complete and suitable definitions to be developed for surveillance systems. In this way, the system also provides flexibility in the internal analysis performed by the normality components, since multiple constraints can easily be added or removed. The use of fuzzy logic not only deals with the uncertainty and vagueness of low-level information, but also allows the expert knowledge used by the surveillance system to be represented with high interpretability by means of linguistic labels, providing it with the expressiveness required to justify the decision making. Precisely in order to facilitate the knowledge acquisition and the deployment of component instances in particular environments, different knowledge acquisition tools and machine learning algorithms have been associated with the normality components. Finally, the experimental results prove the feasibility of the proposed model and the designed normality components, both in relation to event and behaviour understanding and to the response time needed to provide such results, two key aspects in surveillance systems. However, the system still needs to be improved by reducing the number of false alarms in order to minimise the dependence on human operators. One of these improvements may involve the use of fusion techniques in the low-level layers, in addition to the output of the normality components. In this way, a component may deal with the information gathered by multiple sensors about a particular object. The management of redundant information may improve the classification and understanding results when the data provided by a single source is not reliable. Another future research line is to develop new normality components that extend the analysis capabilities of the system. Finally, the development of components that analyse and detect specific anomalous situations will also be addressed, so that their output can be combined with that of the normality modules to improve and complete the analysis of the environment state.
Acknowledgements. This work has been funded by the Regional Government of Castilla-La Mancha under the Research Project PII1C09-0137-6488 and by the Ministry of Science and Innovation under the Research Project TIN2009-14538-C02-02 (FEDER).
References
[1] Opencv videosurveillance, http://opencv.willowgarage.com/wiki/VideoSurveillance
[2] Aguilera, J., Wildernauer, H., Kampel, M., Borg, M., Thirde, D., Ferryman, J.: Evaluation of motion segmentation quality for aircraft activity surveillance. In: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 293–300 (2005)
[3] Albusac, J., Castro-Schez, J.J., López-López, L.M., Vallejo, D., Jimenez, L.: A Supervised Learning Approach to Automate the Acquisition of Knowledge in Surveillance Systems. Signal Processing, Special issue on Visual Information Analysis for Security 89(12), 2400–2414 (2009)
[4] Allen, J., Ferguson, G.: Actions and Events in Interval Temporal Logic. Journal of Logic and Computation 4(5), 531–579 (1994)
[5] Blauensteiner, P., Kampel, M.: Visual Surveillance of an Airport's Apron – An Overview of the AVITRACK Project. In: Annual Workshop of AAPR, Digital Imaging in Media and Education, pp. 1–8 (2004)
[6] Bloisi, D., Iocchi, L., Remagnino, P., Monekosso, D.N.: ARGOS – A Video Surveillance System for Boat Traffic Monitoring in Venice. International Journal of Pattern Recognition and Artificial Intelligence 23(7), 1477–1502 (2009)
[7] Buxton, H.: Learning and understanding dynamic scene activity: a review. Image and Vision Computing 21(1), 125–136 (2003)
[8] Carter, N., Young, D., Ferryman, J.: A combined Bayesian Markovian approach for behaviour recognition. In: Proceedings of the 18th International Conference on Pattern Recognition, pp. 761–764 (2006)
[9] Cathey, F., Dailey, D.: A novel technique to dynamically measure vehicle speed using uncalibrated roadway cameras. In: IEEE Intelligent Vehicles Symposium, pp. 777–782 (2005)
[10] Chen, T., Haussecker, H., Bovyrin, A., Belenov, R., Rodyushkin, K., Kuranov, A., Eruhimov, V.: Computer vision workload analysis: Case study of video surveillance systems. Intel Technology Journal 9(2), 109–118 (2005)
[11] Cho, Y., Rice, J.: Estimating velocity fields on a freeway from low-resolution videos. IEEE Transactions on Intelligent Transportation Systems 7(4), 463–469 (2006)
[12] Collins, R., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A System for video surveillance and monitoring (Technical report CMU-RI-TR-00-12). Tech. rep., Robotics Institute, Carnegie Mellon University (2000)
[13] Dee, H., Velastin, S.: How close are we to solving the problem of automated visual surveillance? Machine Vision and Applications 19(5), 329–343 (2008)
[14] Fawcett, T.: ROC graphs: Notes and practical considerations for data mining researchers (Technical report hpl-2003-4). Tech. rep., HP Laboratories, Palo Alto, CA, USA (2003)
[15] Franklin, W.: Pnpoly – point inclusion in polygon test (2006), http://www.ecse.rpi.edu/Homepages/wrf/Research/ShortNotes/pnpoly.html
[16] Haritaoglu, I., Harwood, D., Davis, L.: W4: Real-Time Surveillance of People and Their Activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 809–830 (2000)
[17] Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. Image and Vision Computing 14(8), 609–615 (1996)
[18] Khoudour, L., Deparis, J., Bruyelle, J., Cabestaing, F., Aubert, D., Bouchafa, S., Velastin, S., Vincencio-Silva, M., Wherett, M.: Project CROMATICA. In: Del Bimbo, A. (ed.) ICIAP 1997. LNCS, vol. 1311, pp. 757–764. Springer, Heidelberg (1997)
[19] Lin, L., Gong, H., Li, L., Wang, L.: Semantic event representation and recognition using syntactic attribute graph grammar. Pattern Recognition Letters 30(2), 180–186 (2009)
[20] Maduro, C., Batista, K., Peixoto, P., Batista, J.: Estimation of vehicle velocity and traffic intensity using rectified images. In: Proceedings of the 15th IEEE International Conference on Image Processing (ICIP 2008), pp. 777–780 (2008)
[21] Magee, D.: Tracking multiple vehicles using foreground, background and motion models. Image and Vision Computing 22(2), 143–155 (2004)
[22] Makris, D., Ellis, T.: Learning semantic scene models from observing activity in visual surveillance. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35(3), 397–408 (2005)
[23] Morris, B., Trivedi, M.: A survey of vision-based trajectory learning and analysis for surveillance. IEEE Transactions on Circuits and Systems for Video Technology 18(8), 1114–1127 (2008)
[24] Palaio, H., Maduro, C., Batista, K., Batista, J.: Ground plane velocity estimation embedding rectification on a particle filter multi-target tracking. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation, pp. 2717–2722 (2009)
[25] Piciarelli, C., Foresti, G.: On-line trajectory clustering for anomalous events detection. Pattern Recognition Letters 27(15), 1835–1842 (2006)
[26] Remagnino, P., Velastin, S., Foresti, G., Trivedi, M.: Novel concepts and challenges for the next generation of video surveillance systems. Machine Vision and Applications 18(3), 135–137 (2007)
[27] Shimrat, M.: Algorithm 112: Position of point relative to polygon. Communications of the ACM 5(8), 434 (1962)
[28] Siebel, N., Maybank, S.: Ground plane velocity estimation embedding rectification on a particle filter multi-target tracking. In: Proceedings of the ECCV Workshop Applications of Computer Vision, pp. 103–111 (2004)
[29] Smith, G.: Behind the screens: Examining constructions of deviance and informal practices among CCTV control room operators in the UK. Communications of the ACM 2(2), 376–395 (2004)
[30] Sridhar, M., Cohn, A., Hogg, D.: Unsupervised Learning of Event Classes from Video. In: Proc. AAAI. AAAI Press, Menlo Park (to appear, 2010)
[31] Valera, M., Velastin, S.: Intelligent distributed surveillance systems: a review. IEE Proceedings – Vision, Image and Signal Processing 152(2), 192–204 (2005)
[32] Vallejo, D., Albusac, J., Mateos, J., Glez-Morcillo, C., Jimenez, L.: A modern approach to multiagent development. Journal of Systems and Software 83(3), 467–484 (2009)
[33] Velastin, S., Khoudour, L., Lo, B., Sun, J., Vicencio-Silva, M.: PRISMATICA: a multisensor surveillance system for public transport networks. In: 12th IEE International Conference on Road Transport Information and Control, pp. 19–25 (2004)
[34] Yager, R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics 18(1), 183–190 (1988)
[35] Yager, R.: Families of OWA operators. Fuzzy Sets and Systems 59(2), 125–148 (1993)
[36] Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)
[37] Zadeh, L.: From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. Circuits and Systems I: Fundamental Theory and Applications 46(1), 105–119 (1999)
Chapter 7
Distributed Camera Overlap Estimation – Enabling Large Scale Surveillance Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, Christopher Madden, and Rhys Hill
Abstract. A key enabler for the construction of large-scale intelligent surveillance systems is the accurate estimation of activity topology graphs. An activity topology graph describes the relationships between the fields of view of the cameras in a surveillance network. An accurate activity topology estimate allows higher-level processing such as network-wide tracking to be localised within neighbourhoods defined by the topology, and thus to scale. The camera overlap graph is an important special case of the general activity topology, in which edges represent overlap between cameras’ fields of view. We describe a family of pairwise occupancy overlap estimators, which are the only approaches proven to scale to networks with thousands of cameras. A distributed implementation is described, which enables the estimator to scale beyond the limits achievable by centralised implementations, and supports growth of the network whilst it remains online. Formulae are derived to describe the memory and network bandwidth requirements of the distributed implementation, which are verified by empirical results. Finally, the efficacy of the overlap estimators is demonstrated using results from their application in higher-level processing, specifically to network-wide tracking, which becomes feasible within the topology oriented architecture.
Anton van den Hengel · Anthony Dick · Henry Detmold · Alex Cichowski · Christopher Madden · Rhys Hill
University of Adelaide, Australian Centre for Visual Technologies, Adelaide SA 5007, Australia
E-mail: {anton,ard,henry,alexc,cmadden,rhys}@cs.adelaide.edu.au
http://www.acvt.com.au/research/surveillance/

7.1 Introduction

Video surveillance networks are increasing in scale: installations of 50,000-camera surveillance networks are now being reported [6], the Singapore public transport authority operates a network of over 6,000 cameras [13], and networks of more than 100 cameras are commonplace. Even for networks of less than one hundred
cameras, human operators require assistance from software to make sense of the vast amounts of data in video streams, whether to monitor ongoing activity or to search through archives to analyse a specific event. Computer vision research has made significant progress in automating processing on the very small scale (see [11] for a survey), but there has been less progress in scaling these techniques to the much larger networks now being deployed. In particular, there has been little progress in transforming single-system (centralised) approaches into scalable approaches based on distributed processing.

One promising approach to tackling large surveillance networks is to identify a core set of network-wide common services, which are required by many visual processing approaches, and to focus research effort on providing these services on large networks. Prominent among such services is the estimation of activity topology. The activity topology of a surveillance network is a graph describing the spatial and temporal relationships between the fields of view of the network's cameras. An important special case of activity topology is the camera overlap graph, where edges link cameras having commonality in their fields of view. An accurate and up-to-date activity topology estimate supports reasoning about events that span multiple cameras. In particular, the camera overlap special case of activity topology supports efficient solution of the camera handover problem, which is concerned with the continuation of visual processing (e.g. tracking) when a target leaves one camera's field of view and needs to be resumed using data from other cameras (i.e. those adjacent in the topology). Furthermore, the identification of connected subgraphs within the overall topology provides a means of partitioning high-level visual processing, such that processing relating to a given set of inter-related cameras takes place within a partition dedicated to that set, without dependence on the rest of the network.

Several estimators of camera overlap have been developed based on pairwise occupancy. In fact there is a family of approaches [14], of which the exclusion approach [15] is the first example. These approaches have two desirable properties as implementations of camera overlap estimation. First, they produce overlap estimates of sufficient accuracy (precision and recall) to be useful in tracking [5]. Secondly, they have a proven ability to provide on-line estimation for networks of up to 1,000 cameras [7], whereas no other approach has demonstrated the ability to scale beyond 100 cameras. Initial implementations of these approaches, including exclusion [7], require a central server component. The scale of the surveillance systems these implementations can support is limited by the physical memory of this central server. The practical consequence is that systems with more than a few thousand cameras require specialised equipment for the central server, making implementation prohibitively expensive.

This chapter describes a distributed implementation of activity topology estimation by exclusion [8], and shows how it generalises to the entire family of pairwise occupancy estimators of camera overlap. The distributed implementation overcomes previous implementations' dependence on a single central server and thus enables the use of a cluster of commodity servers as a much more cost-effective approach to the estimation of camera overlap on large surveillance networks. Results comparing
partitioned and non-partitioned exclusion demonstrate that the advantages of partitioning outweigh the costs. The partitioning scheme used in the distributed implementation enables partitions to execute independently. This both enhances performance (through increased parallelism) and, more importantly, permits new partitions to be added without affecting existing partitions. As a result, the camera overlap estimation sub-system can grow in capacity as the number of cameras increases, whilst remaining on-line 24 × 7. Formulae are derived to model quantitative aspects of the distributed implementation, namely its memory and network bandwidth requirements, and these are verified by empirical results. Finally, to demonstrate the utility of the approach in real applications, this chapter reports precision-recall results for a tracking application built on top of a pairwise occupancy estimator of camera overlap. These results demonstrate that pairwise occupancy estimators produce camera overlap estimates of sufficient accuracy to support efficient camera handover, which is critical to network-wide tracking of targets in general. Furthermore, handover supports localisation of tracking processes around targets' current loci of activity, enabling tracking computations to be partitioned and thus to scale with the system size. Such tracking of targets across cameras forms the basis of many surveillance tasks, as it allows the system to build up a broader analysis of target behaviour, so that aberrant behaviours or target motions can be identified, or a target can be followed through to its current location within the area under surveillance.
7.2 Previous Work

Activity topology has typically been learned by tracking people as they appear in and disappear from camera fields of view (FOVs) over a long period of time. For example, in [18] the delay between the disappearance of each person from one camera and their appearance in another is stored to form a set of histograms describing the transit time between each camera pair. The system is demonstrated on a network of 3 cameras, but does not scale easily as it requires that correspondences between tracks are given during the training phase when topology is learned. Previous work by one of the authors [9] suggests an alternative approach whereby activity topology is represented by a Markov model. This does not require correspondences, but does need to learn O(n²) transition matrix elements during a training phase, and so does not scale well with the number of cameras n, due to the number of observations required for the Markov model. The training phase required in this and similar work is problematic in large networks, chiefly because the camera configuration, and thus the activity topology, changes with surprising frequency as cameras are added, removed, moved and fail. Approaches requiring a training phase to complete before operation would have to cease operation each time there is a change, and only resume once re-training has completed. This is an intolerable restriction on the availability of a surveillance network. Instead, on-line automatic approaches, where topology is estimated concurrently with the operation of surveillance, are desirable.
Ellis et al. [10] do not require correspondences or a training phase, instead observing motion over a long period of time and accumulating appearance and disappearance information in a histogram. Instead of recording known correspondences, it records every possible disappearance that could relate to an appearance. Over time, actual transitions are reinforced and extracted from the histogram with a threshold. A variation on this approach is presented in [19], and has been extended by Stauffer [24] and Tieu et al. [26] to include a more rigorous definition of a transition based on statistical significance, and by Gilbert et al. [12] to incorporate a coarse to fine topology estimation. These methods rely on correctly analysing enough data to distinguish true correspondences, and have only been demonstrated on networks of less than 10 cameras. Rahimi et al. [22, 21] perform experiments on configurations of several cameras involving non-overlapping FOVs. One experiment [21] involves calculating the 2D position and 1D orientation of a set of overhead cameras viewing a common ground plane by manually recording the paths followed by people in each camera’s FOV. It is shown that a simple smoothness prior is enough to locate the cameras and reconstruct a path where it was not visible to any camera. In another experiment [22], pre-calibrated cameras are mounted on the walls of a room so that they face horizontally. In this case, the 2D trajectory of a person is recovered as they move around in the room, even when the person is not visible to any camera. On a larger scale, Brand et al. [3] consider the problem of locating hundreds of cameras distributed about an urban landscape. Their method relies on having internal camera calibration and accurate orientation information for each camera, and enough cameras viewing common features to constrain the solution. Given this information the method can localise the cameras accurately due to the constraints imposed by viewing common scene points from different viewpoints. The patented approach of Buehler [4] appears to scale to networks of about 100 cameras. It is based on statistical correlation operators (lift and correlation coefficient) in pairwise occupancy data, and as such is quite close to instantiations of our approach based on an inclusion loss function (i.e. a subset of the possible instantiations). The use of the lift operator within our framework is discussed subsequently. Finally, it should be noted that instantiations of our approach based on an exclusion loss function are complementary to most other approaches. Other approaches (except [19]), accumulate positive evidence of overlap, whereas exclusion (and [19]) accumulates negative evidence ruling out overlap. The two can be composed, for example, exclusion can be used to prune the search space, enabling one or more positive approaches to operate efficiently on that part of the space that is not ruled out.
7.3 Activity Topology and Camera Overlap

An estimate of the activity topology of a surveillance network makes feasible a number of processes critical within on-line video surveillance. The nodes of the activity topology graph are the fields of view of individual cameras, or alternatively regions within those fields of view. Each such region is labelled a cell and denoted ci, with an
example of a 12x9 cell division of a camera view given in Figure 7.1. The edges of the graph represent the connections between cells; hence dividing the camera views into smaller cells allows for a finer spatial resolution of the connections within the activity graph. These connections may be used to represent the overlap of cells between cameras or, by including timing information, to describe the movement of targets through the graph. Overlap is an important special case of the more general notion of topology, as it provides a method to efficiently support camera handover and thus subsequent tasks such as tracking targets across multiple cameras; hence this work focuses upon this special case.
Fig. 7.1 A sample camera view split into 12x9 rectangular cell regions.
7.3.1 Formulation

The activity topology graph is defined as follows:

1. Edges are directed, such that (ci, cj) represents the flow from ci to cj whereas (cj, ci) represents the (distinct) flow from cj to ci. Directed edges can be converted to undirected edges if required, but the exclusion algorithm estimates each direction independently and thus we retain this information.
2. Each edge has a set of labels p_{i,j}^{[a,b]} for various time delay intervals [a, b], each giving the probability that activity leaving ci arrives at cj after a delay between a and b. In this work, each edge has exactly one such label, that for [−ε, ε], where ε is some small value large enough to account for clock skew between cameras. Thus p_{i,j}^{[−ε,ε]} describes overlap between cameras.

Actual activity topologies are constrained by building layout, camera placement and other factors. Typical topologies contain sub-graphs with many edges between the
nodes within the same sub-graph and few edges between nodes within different sub-graphs. For some non-overlapping cameras, links in the activity topology may only occur with a significant time delay, and these are not considered here. Instead, this work focuses on the subset of overlap topologies, where time padding allows for clock skew within the system. These nearly isolated sub-graphs are termed zones within the activity topology. Figure 7.2 shows a recovered activity topology for a network of over a hundred cameras, with zones represented by circles.
Fig. 7.2 Estimated activity topology for a real camera network. Edges linking cameras are shown as coloured lines, while zones of highly connected cameras are pictured as circles. Singletons are omitted.
7.4 Estimating Camera Overlap

Consider a set of n cameras that generates n images at time t. By applying foreground detection [25] to all images we obtain a set of foreground blobs, each of which can be summarised by an image position and camera index. Each image is partitioned into cells, and each cell can be labelled "occupied" or "unoccupied" depending on whether it contains a foreground object. Our framework for pairwise occupancy based estimation of camera overlap processes data in a pipeline with three stages, as follows (a minimal sketch of the pipeline is given after the list):

1. Joint sampling
2. Measurement
3. Discrimination
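The following sketch, written in Python, only illustrates how the three stages fit together; the names frames, joint_sampler, measure and threshold are hypothetical placeholders rather than the authors' implementation, and the direction of the threshold comparison in the discrimination stage depends on the chosen measure and loss function.

```python
def overlap_estimation_pipeline(frames, joint_sampler, measure, threshold):
    """Three-stage pipeline: joint sampling -> measurement -> discrimination.
    Runs for as long as new video input is available, revising the estimate."""
    for frame_batch in frames:                      # one set of n images per time step
        joint_sampler.update(frame_batch)           # stage 1: joint occupancy sampling
        overlapping = set()
        for cell_i, cell_j in joint_sampler.candidate_pairs():
            score = measure(joint_sampler, cell_i, cell_j)   # stage 2: measurement
            if score >= threshold:                  # stage 3: discrimination
                overlapping.add((cell_i, cell_j))
        yield overlapping                           # current camera overlap estimate
```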
Since the processing takes the form of a pipeline, it continues indefinitely as long as new video inputs are available, leading to the revision of results produced in the discrimination stage. In the joint sampling stage, occupancy states are sampled over time, jointly for possible cell pairs, enabling the estimation of a number of joint probabilities of interest, such as Pr(cell i occupied, cell j occupied). The configuration of the joint sampling stage for a given overlap estimator is fully defined by the sample streams used for the left hand side (X) and right hand side (Y) of each joint sample. These sample streams are termed cell systems in our approach, and since there must be two (perhaps the same), we define the configuration of the joint sampling stage as a cell system pair. In the second stage, measurement, various measures are applied based on the sampled information to determine which cell pairs are likely to be overlapping. The measures presented here include:

• I(X; Y) – mutual information [23].
• H(X|Y) – conditional entropy [23].
• The lift operator used in the patented approach of Intellivid [4].
• Pr(X = 1|Y = 1) – conditional probability, used within the exclusion approach [15].
There are, of course, various other possibilities. The configuration of the measurement stage is fully defined by the measure chosen. In the final stage, discrimination, we define a decision threshold, which in combination with the measure in the previous stage constitutes a discriminant function [2]. That is, each cell pair can be classified as overlapping or not by comparing the value of the measure to the threshold. The value of the threshold is critical. For a given loss function, it is possible to derive an appropriate threshold based on assumptions of the likely occupancy rate of each cell and the error rate of the occupancy detector. However, in practice we have found these values to be so variable that it is futile to attempt to estimate them a priori. Instead we consider a range of thresholds when evaluating each measure in different scenarios. Specifically, we consider 3 types of loss function and the corresponding range of thresholds that apply:

• minimisation – a loss function that minimises the probability of misclassification.
• inclusion – in which the penalty for concluding that two cells overlap when they do not is higher than the reverse, and we effectively ignore all overlaps until we have sufficient evidence for inclusion in the overlap graph.
• exclusion – where the penalty for deciding on non-overlap in the case where there is overlap is higher, and we effectively assume all overlaps until we have sufficient evidence for exclusion from the overlap graph.

The configuration of the discrimination stage is fully defined by the loss function.
To summarise, any pairwise occupancy overlap estimator can be defined within our framework by a three-tuple:

⟨CellSystemPair, Measure, LossFunction⟩

In this work, the focus is on the performance of techniques to detect camera overlap in large surveillance networks. Note, however, that the technique is not limited to this special case of activity-based camera overlap, but can also be applied to the general case (connections between non-overlapping cameras) through the use of varying time offsets in the operands to the estimation techniques. Future work will evaluate this scenario.
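As an illustration of this configuration space, the sketch below encodes the three-tuple as a small Python data structure; the class and field names are ours, not part of the chapter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass(frozen=True)
class EstimatorConfig:
    """The <CellSystemPair, Measure, LossFunction> three-tuple."""
    cell_system_pair: Tuple[Any, Any]     # (LHS cell system, RHS cell system)
    measure: Callable[..., float]         # e.g. I(X;Y), H(X|Y), lift, Pr(X=1 | Y=1)
    loss_function: str                    # "minimisation", "inclusion" or "exclusion"
```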
7.4.1 Joint Sampling

The joint sampling stage includes sampling individual occupancies, storage of sufficient data to calculate the probabilities needed for measurement, and the calculation of those probabilities.

7.4.1.1 Sampling Cell Occupancy
A cell is defined as an arbitrary region of image-space within a camera's field of view. A cell can be in one of three states at any given time: occupied (O), unoccupied (U), or unknown (X). A fourth pseudo-state of being present (P) is defined as the cell being in either the occupied or the unoccupied state, but not the unknown state. A cell is considered occupied when a foreground target is currently within it, and unoccupied if no target is within it. A cell enters the unknown state when no data from the relevant camera is available. Thus, for a given camera all cells are always either unknown or present simultaneously. To limit the processing required for comparisons, this small set of states does not include other object or background feature information. By keeping track of unknown cell states, overlap estimation can be made robust to camera outages, with periods of camera inactivity not adversely affecting the information that can be derived from the active cells. Next we define a cell system to be a set of cells in a surveillance network. A simple cell system could be formed, for example, by dividing the image space of each camera into a regular grid, and taking the cell system to be the collection of all cells of every such grid. Cell systems are an important abstraction within our implementation, as they enable flexibility in specifying which image-space regions are of interest and the detail to which overlap is to be discerned. The choice of cell systems involves trade-offs between the accuracy, performance, and memory requirements of the resulting system. The number of cells affects the memory requirements of the system, and the pattern and density of cells can affect the accuracy of the information in the derived overlap topology. Now, joint sampling involves the coordinated consideration of two occupancy signals, the left hand signal and the right hand signal, at the same time point. These
two signals are sampled from separate and possibly different cell systems, thus the joint sampling element of the three-tuple describing the overall estimator is itself a pair:

⟨LHSCellSystem, RHSCellSystem⟩

Further, the class of grid-based cell systems can be specified as five-tuples ⟨rx, ry, px, py, pt⟩ with:

• rx, ry – the number of cells per camera in the x and y dimensions.
• px, py – the spatial padding in cells: a cell is considered occupied if there is an occupied cell within the same camera that lies within this allowed tolerance of cells.
• pt – the temporal padding in seconds: a cell is considered occupied at a given frame if it is occupied within this time allowance of the specified frame.

In most of our work (i.e. the classic exclusion approach), we have used the following cell system pair:

⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩⟩

There is a myriad of interesting possibilities. For example, setting the x and y resolutions both to 1 in one of the two cell systems can dramatically reduce the memory required for implementation whilst only slightly reducing the resolution of the overlap estimate. At the other extreme, setting both resolutions to the pixel resolution of the camera increases the overlap resolution, but at a cost of greater memory and CPU requirements.
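A hedged sketch of the five-tuple and the classic exclusion cell system pair, using hypothetical Python names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GridCellSystem:
    """The <rx, ry, px, py, pt> five-tuple describing a grid-based cell system."""
    rx: int     # cells per camera in the x dimension
    ry: int     # cells per camera in the y dimension
    px: int     # spatial padding, in cells
    py: int
    pt: float   # temporal padding, in seconds

# Cell system pair used in the classic exclusion approach:
classic_exclusion_pair = (GridCellSystem(12, 9, 0, 0, 0),
                          GridCellSystem(12, 9, 1, 1, 1))
```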
7.4.1.2 Storage of Sampled Data
As we are developing a system to operate over large, continuously operating camera networks, it is important to be able to calculate cell occupancy probabilities efficiently and in a scalable fashion. We define c, the total number of cells in the system:

c = n × (rx + ry)    (7.1)

where n is the number of cameras, rx is the number of cells per camera in the left hand side cell system, and ry is the number of cells per camera in the right hand side cell system. Also we define cx, the number of cells in the left hand side cell system:

cx = n × rx    (7.2)

and cy, the number of cells in the right hand side cell system:

cy = n × ry    (7.3)

and of course:

c = cx + cy    (7.4)
Then, all the required quantities can be calculated by maintaining the following counters:

• T – the number of frames for which the network has been operating;
• Xi – the number of frames at which the camera of ci is missing: an n element vector;
• Oi – the number of frames at which cell ci is occupied: a c element vector;
• Ui – the number of frames at which cell ci is unoccupied: a c element vector;
• XXij – the number of frames at which the camera of ci is unavailable and the camera of cj is unavailable: an n × n matrix;
• XOij – the number of frames at which the camera of ci is unavailable and cell cj is occupied: an n × c matrix;
• OXij – the number of frames at which cell ci is occupied and the camera of cj is unavailable: a c × n matrix;
• OOij – the number of frames at which cell ci is occupied and cell cj is occupied: a cx × cy matrix.

Of these, OOij requires by far the largest amount of storage, though XOij and XXij are also O(n²) in space.

7.4.1.3 Calculation of Probabilities
Using this sampled data, probabilities can be estimated as in the following examples:

Pr(Oi = 1, Oj = 1) ≈ OOij / PPij    (7.5)

Pr(Oj = 1 | Oi = 1) ≈ OOij / OPij    (7.6)

where PPij denotes the number of times ci and cj have been simultaneously present, and OPij denotes the number of times ci has been occupied and cj present. Note that although these counters are not explicitly stored, they can be reconstructed as in the following examples:

UPij = Ui − UXij    (7.7)

OPij = Oi − OXij    (7.8)

PPij = UPij + OPij    (7.9)
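The sketch below maintains a simplified subset of these counters for one cell system pair and evaluates the probabilities of eqs. (7.5) and (7.6); it deliberately ignores the unknown state (camera outages), so PPij reduces to T and OPij to Oi, and all names are ours.

```python
import numpy as np

class JointOccupancyCounters:
    """Simplified counters for one LHS/RHS cell system pair.
    Camera outages (the unknown state) are not handled here."""
    def __init__(self, cx: int, cy: int):
        self.T = 0                                    # frames observed
        self.O_lhs = np.zeros(cx, dtype=np.int64)     # Oi for the LHS cell system
        self.OO = np.zeros((cx, cy), dtype=np.int64)  # OOij joint occupancy counts

    def update(self, occ_lhs: np.ndarray, occ_rhs: np.ndarray) -> None:
        """occ_lhs (length cx) and occ_rhs (length cy) are 0/1 occupancy vectors."""
        self.T += 1
        self.O_lhs += occ_lhs
        self.OO += np.outer(occ_lhs, occ_rhs)         # OOij += occ_i * occ_j

    def p_joint(self, i: int, j: int) -> float:       # Pr(Oi = 1, Oj = 1), cf. (7.5)
        return self.OO[i, j] / self.T if self.T else 0.0

    def p_cond(self, i: int, j: int) -> float:        # Pr(Oj = 1 | Oi = 1), cf. (7.6)
        return self.OO[i, j] / self.O_lhs[i] if self.O_lhs[i] else 0.0
```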
7.4.2 Measurement and Discrimination

A wide variety of functions of two binary random variables can be used for the measurement stage. This subsection describes several of the more useful such functions. The discrimination stage is dependent on the measurement stage (and in particular on the choice of measure), so we describe appropriate discriminant functions with each measure.
7.4.2.1 Mutual Information Measure
If two cells ci and cj are at least partially overlapping, then Oi and Oj are not independent. Conversely, if cells ci and cj are not overlapping or close to overlapping, we assume that Oi and Oj at each timestep are independent. We can test for independence by calculating the mutual information between these variables:

I(Oi; Oj) = Σ_{oi∈Oi, oj∈Oj} Pr(oi, oj) log [ Pr(oi, oj) / (Pr(oi) Pr(oj)) ]    (7.10)

I(Oi; Oj) ranges between 0, indicating independence, and H(Oi), indicating that Oj is completely determined by Oi, where H(Oi) is the entropy of Oi:

H(Oi) = − Σ_{oi∈Oi} Pr(oi) log Pr(oi)    (7.11)
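A minimal sketch of eq. (7.10) for two binary occupancy variables, computed from the marginal probabilities Pr(Oi = 1), Pr(Oj = 1) and the joint probability Pr(Oi = 1, Oj = 1); this is our own illustration, not the authors' code.

```python
import math

def mutual_information(p_i: float, p_j: float, p_ij: float) -> float:
    """I(Oi; Oj) for binary occupancy variables, from marginals and the joint."""
    mi = 0.0
    for oi in (0, 1):
        for oj in (0, 1):
            # Joint probability of the outcome (oi, oj), derived from p_i, p_j, p_ij.
            if oi and oj:
                p = p_ij
            elif oi:
                p = p_i - p_ij
            elif oj:
                p = p_j - p_ij
            else:
                p = 1.0 - p_i - p_j + p_ij
            marginal = (p_i if oi else 1.0 - p_i) * (p_j if oj else 1.0 - p_j)
            if p > 0.0 and marginal > 0.0:
                mi += p * math.log(p / marginal)
    return mi

# Independent cells give a value close to 0; strongly coupled cells a larger value.
print(mutual_information(p_i=0.1, p_j=0.1, p_ij=0.01))   # ~0.0
print(mutual_information(p_i=0.1, p_j=0.1, p_ij=0.09))   # clearly > 0
```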
The hypothesis corresponding to no overlap is I(Oi; Oj) = 0, while the alternative hypothesis, indicating some degree of overlap, is I(Oi; Oj) > 0. Using an inclusion discriminant function, the penalty for false positives (labelling a non-overlapping cell pair as overlapping) is high and therefore an appropriate threshold on I(Oi; Oj) is one that is closer to H(Oi) than to 0. Conversely, for exclusion, the cost is greater for a false negative, and thus the appropriate threshold is closer to 0.
7.4.2.2 Conditional Entropy Measure
Dependency between cell pairs is not necessarily symmetric. For example, if a wide angle camera and a zoomed camera are viewing the same area, then a cell ci in the zoomed camera may occupy a fraction of the area covered by a cell cj in the wide angle camera. In this case, Oi = 1 implies Oj = 1, but the converse is not true. Similarly, Oj = 0 implies Oi = 0, but again the converse is not true. To this end we can measure the conditional entropy of each variable:

H(Oi | Oj) = − Σ_{oi ∈ Oi, oj ∈ Oj} Pr(oi, oj) log Pr(oi | oj)    (7.12)

H(Oj | Oi) = − Σ_{oi ∈ Oi, oj ∈ Oj} Pr(oi, oj) log Pr(oj | oi)    (7.13)
H(Oi | Oj) ranges between H(Oi), indicating independence, and 0, indicating that Oi is completely determined by Oj. For inclusion, the decision threshold in a discriminant function based on H(Oi | Oj) is set close to H(Oi), requiring strong evidence for overlap to decide in its favour. For exclusion, the decision threshold is set closer to 0.
7.4.2.3 Lift Operator Measure
In some scenarios, occupancy data can be very unbalanced; for example in low traffic areas, O = 1 is far less probable than O = 0. This means that the joint
observation (0, 0) is not in fact strong evidence that this cell pair is related. This can make decisions based on calculation of the mutual information difficult, as the entropy of each variable is already low, and I(Oi; Oj) ranges only between 0 (independence) and min(H(Oi), H(Oj)) (complete dependence). One solution is to measure the independence of the events Oi = 1, Oj = 1 rather than the independence of the variables Oi, Oj. This leads to a measure known as lift [4]:

lift(Oi, Oj) = Pr(Oi = 1, Oj = 1) / (Pr(Oi = 1) Pr(Oj = 1))    (7.14)

Lift ranges between 1, indicating independence (non-overlap), and 1/Pr(Oj = 1), indicating that cells i and j are identical. For inclusion, the decision threshold in a discriminant function based on lift is set closer to 1/Pr(Oj = 1) than to 1, to reduce the risk of false overlap detection. For exclusion, the decision threshold is set near to 1, indicating that a cell pair must be nearly independent to be considered non-overlapping.

7.4.2.4 Conditional Probability Measure
Combining the ideas of non-symmetry between cell pairs (from the H(Oi | Oj) measure) with that of measuring dependence only of events rather than variables (from the lift measure), we arrive at a conditional probability measure for cell overlap:

Pr(Oj = 1 | Oi = 1) = Pr(Oi = 1, Oj = 1) / Pr(Oi = 1)    (7.15)

which can be seen to be a non-symmetric version of lift, and also analogous to conditional entropy (based on the same quantities). Pr(Oj = 1 | Oi = 1) ranges between Pr(Oj = 1), indicating independence, and 1, indicating complete dependence. The conditional probability measure is equivalent to the overlap measure used in the exclusion approach presented in [15]. For inclusion, the decision threshold in the discriminant function should be set close to 1, so that strong evidence is required to label a cell pair as overlapping. For exclusion, the decision threshold is set close to Pr(Oj = 1).
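A compact sketch of the two event-based measures (lift and conditional probability) and of a conditional-probability discriminant follows. The interpolation parameter used to place the threshold between the measure's bounds is an assumption made for illustration.

def lift(p_ij, p_i, p_j):
    """lift(Oi, Oj) = Pr(Oi=1, Oj=1) / (Pr(Oi=1) Pr(Oj=1))  (Equation 7.14)."""
    return p_ij / (p_i * p_j)

def cond_prob(p_ij, p_i):
    """Pr(Oj = 1 | Oi = 1) = Pr(Oi=1, Oj=1) / Pr(Oi=1)  (Equation 7.15)."""
    return p_ij / p_i

def cond_prob_discriminant(p_ij, p_i, p_j, loss="Exclusion", alpha=0.9):
    """Place the threshold between the measure's bounds Pr(Oj=1) and 1.
    alpha controls how close to the relevant bound the threshold sits; this
    particular interpolation is an illustrative assumption."""
    lo, hi = p_j, 1.0
    if loss == "Inclusion":
        threshold = lo + alpha * (hi - lo)            # close to 1: strong evidence needed
    else:
        threshold = lo + (1.0 - alpha) * (hi - lo)    # close to Pr(Oj = 1)
    return cond_prob(p_ij, p_i) > threshold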
7.4.2.5 Definition of Measurement and Discrimination Stage Configurations
Recall that the overall configuration of an overlap estimator in the framework is defined by a three-tuple:

⟨CellSystemPair, Measure, LossFunction⟩

The measurement and discrimination stages are defined by the second and third elements respectively. For the Measure element, we specify a function of two abstracted binary random variables X and Y, so a configuration with a mutual information measure would take the form:

⟨CellSystemPair, I(X; Y), LossFunction⟩
For the LossFunction element, we simply choose between the three symbols Inclusion, Exclusion and Minimisation; thus an exclusion estimator using a mutual information measure is:

⟨CellSystemPair, I(X; Y), Exclusion⟩
7.4.3 The Original Exclusion Estimator

The original exclusion approach [15] was implemented [16, 7] prior to the formulation of this framework, but can easily be expressed within it. Specifically, this estimator uses:

• An LHS cell system with 12 × 9 cells per camera, no spatial padding, and one frame of temporal padding.
• An RHS cell system with 12 × 9 cells per camera, one cell of spatial padding in each direction, and no temporal padding.
• The Pr(X = 1, Y = 1) measure.
• The Exclusion loss function.

This is expressed in the framework notation thus:

⟨⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩⟩, Pr(X = 1, Y = 1), Exclusion⟩    (7.16)
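Such configurations are naturally represented as small value objects. The following is one possible encoding, shown only as a sketch; the class and field names are ours, not part of the framework's published interface.

from dataclasses import dataclass

@dataclass(frozen=True)
class CellSystem:
    cells_x: int      # cells per camera, horizontally
    cells_y: int      # cells per camera, vertically
    pad_x: int        # spatial padding, in cells, horizontally
    pad_y: int        # spatial padding, in cells, vertically
    pad_t: int        # temporal padding, in frames

@dataclass(frozen=True)
class EstimatorConfig:
    lhs: CellSystem
    rhs: CellSystem
    measure: str      # e.g. "Pr(X=1,Y=1)", "I(X;Y)", "lift(X,Y)", "H(X|Y)"
    loss: str         # "Inclusion", "Exclusion" or "Minimisation"

# The original exclusion estimator of Equation 7.16, in this hypothetical encoding:
ORIGINAL_EXCLUSION = EstimatorConfig(
    lhs=CellSystem(12, 9, 0, 0, 0),
    rhs=CellSystem(12, 9, 1, 1, 1),
    measure="Pr(X=1,Y=1)",
    loss="Exclusion",
)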
7.4.4 Accuracy of Pairwise Occupancy Overlap Estimators

The accuracy of estimators created using our framework is evaluated in terms of precision-recall of the identified overlap edges, when compared to ground truth. The dataset consists of 26 surveillance cameras placed around an office and laboratory environment as shown in Figure 7.3. Each camera has some degree of overlap with at least one other camera's field of view; ground truth has been obtained by manual inspection. The data were obtained at ten frames per second over a period of approximately four hours.

Results are reported for estimators generated from four different configurations of the framework, as follows:

1. An estimator based on the lift measure. This is expressed in the framework as:
   ⟨⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩, lift(X, Y), ∗⟩
2. An estimator similar to the original exclusion approach, based on the conditional probability measure and using asymmetric time padding. This is expressed in the framework as:
   ⟨⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 0, 0, 1⟩⟩, Pr(X = 1, Y = 1), ∗⟩
Fig. 7.3 A floor plan showing the 26 camera dataset. The camera positions are shown by circles, with coloured triangles designating each camera's field of view. White areas are open spaces, whilst light blue regions show areas that are opaque to cameras.
3. An estimator based on the mutual information measure. This is expressed in the framework as:
   ⟨⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩, I(X; Y), ∗⟩
4. An estimator based on the conditional entropy measure. This is expressed in the framework as:
   ⟨⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩, H(X|Y), ∗⟩

For the LossFunction component of these configurations (denoted ∗), we vary the thresholds between their bounds, and hence vary between extreme inclusion (emphasising precision) and extreme exclusion (emphasising recall).

Figure 7.4 shows precision-recall results for the four different estimators applied to the 26 camera dataset as P-R curves. These results demonstrate that using some estimators, a reasonable level of precision can be obtained, even for a relatively high level of recall. The lift and conditional entropy estimators provide very poor precision across the range of recall values. The mutual information-based estimator provides a considerable improvement: it provides overlap information of sufficient accuracy to enable subsequent processes. The conditional probability estimator provides even higher precision up to the point of 80% recall of the ground truth information, though it is outperformed by the mutual information estimator for very high levels of recall. Section 7.6 evaluates the utility of the overlap estimates as input to a tracking system.

Fig. 7.4 The accuracy of the overlap topology results
7.5 Distribution

The key to distribution of the framework is in the partitioning of the data. The distributed implementation is illustrated and evaluated in terms of the original exclusion approach, for which extensive CPU, memory and network bandwidth measurements are available, and which is expressed in the framework as follows:

⟨⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩⟩, Pr(X = 1, Y = 1), Exclusion⟩
Certain optimisations are possible in cases that have similar cell structure in both the LHS and RHS cell systems. Specifically, we can define r, the number of cells per camera, and then redefine c, the total number of cells, to be:

c = n × r    (7.17)
which is smaller than the general value given in Equation 7.1.

The role of the overlap estimation component within an overall surveillance system is to derive activity topology from occupancy. We adopt a layered approach, with an overlap estimation layer that consumes occupancy information from a lower layer (detection pipelines), and produces activity topology information to be consumed by higher layers. This system model is shown in Figure 7.5.

The operation of the detection pipeline layer is straightforward. Video data is captured from cameras or from archival storage. This data is processed by background subtraction to obtain foreground masks. The foreground masks are converted into blobs by a connected components algorithm. Finally, occupancy is detected at the midpoint of the bottom edge of each blob, which is taken to be the lowest visible extent of the foreground object.
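A hypothetical version of such a detection pipeline stage, written with OpenCV, is sketched below. The choice of background-subtraction model, the noise-suppression step and the minimum blob area are assumptions; only the 12 × 9 grid and the bottom-edge midpoint rule are taken from the text.

import cv2
import numpy as np

def occupancy_from_frame(frame, bg_subtractor, grid_w=12, grid_h=9, min_area=50):
    """Return a grid_h x grid_w boolean occupancy grid for one camera frame.
    Occupancy is marked at the midpoint of each blob's bottom edge, taken as the
    lowest visible extent of the foreground object."""
    fg_mask = bg_subtractor.apply(frame)                       # background subtraction
    fg_mask = cv2.medianBlur(fg_mask, 5)                       # cheap noise suppression
    num, _, stats, _ = cv2.connectedComponentsWithStats(
        (fg_mask > 0).astype(np.uint8))                        # blobs from the mask
    h, w = fg_mask.shape
    occupancy = np.zeros((grid_h, grid_w), dtype=bool)
    for label in range(1, num):                                # label 0 is the background
        x, y, bw, bh, area = stats[label]
        if area < min_area:
            continue
        foot_x = x + bw // 2                                   # midpoint of bottom edge
        foot_y = y + bh - 1
        cell_x = min(foot_x * grid_w // w, grid_w - 1)
        cell_y = min(foot_y * grid_h // h, grid_h - 1)
        occupancy[cell_y, cell_x] = True
    return occupancy

# Typical use: subtractor = cv2.createBackgroundSubtractorMOG2(), then call
# occupancy_from_frame(frame, subtractor) on each captured frame.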
Fig. 7.5 Architecture of partitioned system model: cameras feed detection servers, each of which runs detection pipelines for several cameras; each detection server sends occupancy data to a subset of the overlap estimation servers; and each estimation server estimates overlap for a region of an adjacency matrix representing the camera overlap graph.
Our aim is to partition the overlap estimation layer across multiple computers. Each such computer is termed an estimation partition. A detection pipeline for a given camera forwards occupancy data to any estimation partitions requiring data for that camera. For a given cell pair, joint sampling, measurement and discrimination are then all performed on a single estimation partition. In designing the estimation layer, we wish to:

1. Distribute the memory required to store data for cell pairs across multiple (affordable) computers.
2. Avoid communication (and in particular, synchronisation) between estimation partitions, in order to permit processing within each partition to proceed in parallel.
3. Permit the system to grow, through addition of estimation partitions, whilst the existing partitions continue processing and hence the extant system remains online.
4. Quantify the volume of communication required between the estimation partitions and the rest of the surveillance system, and ensure that this requirement remains within acceptable bounds.
5. Keep the size of partitions uniform.

Below the overlap estimation layer we have detection pipelines, each performing foreground detection and cell occupancy determination for a single camera. The detection pipelines are independent and can be trivially distributed in any desired fashion. Above the estimation layer, the resulting topology estimates must be made available to higher levels of the surveillance system. The possibilities include:

1. As estimation partitions derive topology estimates they forward significant changes in those estimates to a central database. These changes include both increases in likelihood of an edge in the topology and decreases in likelihood. In the extreme, this includes edges disappearing completely, reflecting changes in activity topology over time, and hence those edges being removed from the central topology database.
2. Option 1, but with the central database replaced with a distributed database.
3. Higher layers obtain topology information by querying the estimation layer partition(s) holding it. In effect, the estimation partitions act as a distributed database.

However, the experiments reported here concern only the estimation layer, and not the detection pipelines or any further use of the estimated topology within a surveillance system.
7.5.1 The Exclusion Approach

The original exclusion approach requires only the following joint sampling data to operate:

• OUij – the number of frames at which cell ci is occupied and cj is unoccupied, i.e. the exclusion count: a c × c matrix.
• OPij – the number of frames at which cell ci is occupied and cj is present (i.e. not unknown), i.e. the exclusion opportunity count: a c × c matrix, compressible to a c × n matrix since all cells within a camera share the same present/unknown state at any given time.

Based on this data, exclusion estimates camera overlap through the evaluation of the Pr(Oj = 1 | Oi = 1) measure, which can be approximated as follows:

Pr(Oj = 1 | Oi = 1) ≈ OOij / OPij = (OPij − OUij) / OPij    (7.18)
The overlap estimate is further strengthened by exploitation of the bi-directional nature of overlap; we consider cells ci and cj to overlap only when the following Boolean function is true:

Xij = (Pr(Oi = 1 | Oj = 1) > P∗) ∧ (Pr(Oj = 1 | Oi = 1) > P∗)    (7.19)

with P∗ a threshold value. The effect of varying this threshold, in terms of the precision and recall achieved by the estimator, is extensively evaluated in [17].
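In code, the exclusion test of Equations 7.18 and 7.19 might look as follows; this is a sketch that assumes both counters are held as dense c × c NumPy arrays.

import numpy as np

def overlap_matrix(OU, OP, p_star):
    """Boolean overlap estimate Xij of Equation 7.19 from the exclusion counts.
    OU[i, j]: frames with ci occupied and cj unoccupied; OP[i, j]: frames with
    ci occupied and cj present."""
    with np.errstate(divide="ignore", invalid="ignore"):
        p_j_given_i = np.where(OP > 0, (OP - OU) / OP, 0.0)    # Equation 7.18
    return (p_j_given_i > p_star) & (p_j_given_i.T > p_star)   # bidirectional test

Calling overlap_matrix(OU, OP, 0.5), for example, yields a symmetric Boolean adjacency matrix of estimated overlap edges; the example threshold is arbitrary, since the effect of P∗ is precisely what [17] evaluates.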
7.5.2 Partitioning of the Joint Sampling Matrices

In centralised overlap estimation implementations, all joint sampling data and matrices are maintained in the memory of a centralised processing node. Because these matrices are large and dense (at least in the case of OUij), the memory available on this central node places an overall limit on the size of network that can be supported. For example, an instantiation of exclusion with 108 (12 × 9) cells per camera, 1000 cameras, and 16-bit (2 byte) OUij counts will require:

(108 × 1000)² × 2 = 23,328,000,000 bytes

(or approximately 24GB) to represent OUij. Some optimisation is possible; for example, a previously reported implementation [7] used byte-sized counts and a selective reset procedure (division by two of sections of OUij and OPij, so as to maintain approximately correct Pr(Oj = 1 | Oi = 1) ratios) to support 1000 cameras within 12GB. Nevertheless, there are two obstacles to further increases in the scale of systems that can be built:

1. The requirement that all joint sampling matrices be stored in a single server means that the memory (and processing) capacity of that server limits the maximum size of network that is feasible. For example, the current limit for easily affordable server hardware is less than 100GB, and only incremental improvements can be expected, so centralised implementations are limited to supporting networks of a few thousand cameras.
2. The requirement for O(n²) memory (however distributed) means that even if it is possible to increase system scale by the addition of more hardware, it becomes increasingly expensive to do so, and at some point it ceases to be feasible.

Both of these challenges need to be overcome; in this work we focus on the first.

7.5.2.1 A Partitioning Scheme
Observe from Equation 7.19 that calculation of overlap (Xij ) for given i and j requires both Pr(Oj = 1|Oi = 1) and Pr(Oi = 1|Oj = 1), and hence (from Equation 7.18) each of OUij , OUji , OPij and OPji . Whilst it would be possible to perform the final overlap calculation separately from the calculation (and storage) of OUij and OPij , we assume that it is not practically useful to do so, which implies that for given i and j, each of OUij , OUji , OPij and OPji must reside in the same partition (so as to avoid inter-partition communication). This constraint drives our partitioning scheme, along with the aims identified previously.
Fig. 7.6 The partitioning scheme for 200 partitions
Figure 7.6 shows partitioning across 200 estimation partitions; each partition contains two distinct square regions of the OUij matrix, such that the required symmetry is obtained. These square regions are termed half partitions. Note that with some measures, such as Mutual Information, the values in OUij and OUji are always the same. Hence in these cases only one half partition need be stored (in the case of half partitions on the diagonal of the matrix, only one half of each half partition need be stored). Thus the overall storage requirement in these cases is halved. Note however that this optimisation is not possible with asymmetric measures such as Conditional Probability, and so we always store the whole matrix to maintain generality.

The OPij matrix can be partitioned in the same way. However, given that the OPij matrix contains significant redundancy (the OPij values for all j in a given camera are the same), some optimisation is possible.

Each of the square regions within Figure 7.6 contains sufficient rows and columns for several whole cameras' worth of data. Table 7.1 defines the system parameters for partitioned estimation, with Figure 7.7 illustrating the r and R parameters. Note also that:

R = n / √(2N)    (7.20)
relates n, N and R, where N ∈ 2ℕ² and N ≥ 2. Now, for a given (cell) co-ordinate pair (i, j) within OUij we can determine the partition co-ordinates, (I, J), of the half partition containing the data for (i, j), as follows:

(I, J) = (⌊i / (rR)⌋, ⌊j / (rR)⌋)    (7.21)

From the partition co-ordinates of a given half partition, (I, J), we determine the partition number of the (whole) partition to which that half-partition belongs, using the following recursively defined partition numbering function:

PN(I, J) = PN(J, I)             if J > I
           PN(I − 1, J − 1)     if I = J ∧ I mod 2 = 1
           ⌈I² / 2⌉ + J         otherwise    (7.22)
This recursive function gives the partition numbers shown in Figure 7.6. More importantly, it is used within the distributed estimation implementation to locate the partition responsible for a given region of OUij. Detection pipelines producing occupancy data use Equation 7.22 to determine the partitions to which they should send that occupancy data, and clients querying the activity topology may use it to locate the partition that holds the information they seek. The inverse relation maps partition numbers to a set of two half partition co-ordinate pairs. This set, P, for a given partition is:

P = { (I, J), (I′, J′) : J < J′ }    (7.23)
Table 7.1 Partitioned Estimator System Parameters

Parameter  Definition
n          the number of cameras.
N          the number of partitions.
r          the number of cells into which each camera's field of view is divided.
R          the length of a half partition, in terms of the number of distinct whole cameras for which that half partition contains data.
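The addressing scheme of Equations 7.21 and 7.22 can be sketched directly in code, using the parameters of Table 7.1. Note that the "otherwise" case below uses the ceiling of I²/2; the floor/ceiling brackets are not fully recoverable from the source, and the ceiling is the reading that makes the numbering unique given the diagonal rule.

def partition_coords(i, j, r, R):
    """Half-partition co-ordinates (I, J) of cell pair (i, j), Equation 7.21."""
    return i // (r * R), j // (r * R)

def partition_number(I, J):
    """Recursive partition numbering function of Equation 7.22 (as reconstructed here)."""
    if J > I:
        return partition_number(J, I)
    if I == J and I % 2 == 1:
        return partition_number(I - 1, J - 1)
    return (I * I + 1) // 2 + J      # ceil(I^2 / 2) + J, kept in integer arithmetic

For a 2 × 2 partition grid, for instance, this assigns the half partitions (0, 0) and (1, 1) to partition 0, and (1, 0) and (0, 1) to partition 1, matching the smallest configuration shown in Figure 7.8.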
Fig. 7.7 Partitioned system parameter detail: r is the number of cells per camera and R is the number of cameras per half partition.
The co-ordinate pair (I, J) is termed the upper half-partition, and the pair (I′, J′) is termed the lower half-partition; they are distinguished based on the y axis co-ordinate, as shown. Now, the x co-ordinate of the upper half-partition, I, is a function of the partition number, P:

I = ⌊√(2P)⌋    (7.24)

and the y co-ordinate of the upper half-partition, J, is a function of the partition number and the x co-ordinate:
J = P − ⌈I² / 2⌉    (7.25)

Combining Equations 7.24 and 7.25 yields the following upper half-partition address function:

UHPA(P) = ( ⌊√(2P)⌋, P − ⌈⌊√(2P)⌋² / 2⌉ )    (7.26)

Next, the lower half-partition co-ordinate pair, (I′, J′), is a function of the upper half-partition co-ordinate pair, (I, J), as follows:

(I′, J′) = (I + 1, J + 1)    if I = J
           (J, I)            otherwise    (7.27)

Combining Equations 7.24, 7.25 and 7.27 yields the following lower half-partition address function:

LHPA(P) = ( ⌊√(2P)⌋ + 1, P − ⌈⌊√(2P)⌋² / 2⌉ + 1 )    if ⌊√(2P)⌋ = P − ⌈⌊√(2P)⌋² / 2⌉
          ( P − ⌈⌊√(2P)⌋² / 2⌉, ⌊√(2P)⌋ )            otherwise    (7.28)

Equations 7.26 and 7.28 define the partition co-ordinates of the two half-partitions corresponding to a given partition number. This is exploited in an implementation strategy whereby partition creation is parameterised by partition number, and this mapping is used to determine the two rectangular regions of OUij to be stored in the partition.
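A sketch of the two address functions, under the same reconstruction of the floor/ceiling brackets as above:

import math

def uhpa(P):
    """Upper half-partition co-ordinates for partition number P (Equation 7.26)."""
    I = math.isqrt(2 * P)              # floor(sqrt(2P))
    J = P - (I * I + 1) // 2           # P - ceil(I^2 / 2)
    return I, J

def lhpa(P):
    """Lower half-partition co-ordinates for partition number P (Equations 7.27/7.28)."""
    I, J = uhpa(P)
    if I == J:
        return I + 1, J + 1
    return J, I

# Round trip: for every partition number P, partition_number(*uhpa(P)) == P and
# partition_number(*lhpa(P)) == P (with partition_number as sketched earlier),
# which is what routing of occupancy data and topology queries relies on.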
7.5.2.2 Incremental Expansion
A key property of the partitioning scheme is support for incremental expansion of OUij and hence of the system. As shown in Figure 7.8, new partitions, with higher partition numbers, can be added on the right and bottom borders of the matrix, leaving the existing partitions unchanged in both partition number and content. Since the addition of new partitions is entirely independent of the existing partitions, expansion can occur whilst the system (i.e. the existing partitions) remains on-line. Figure 7.8 shows expansion by two (partition) rows and (partition) columns each time. The matrix must remain square, and hence must grow by the same amount
Fig. 7.8 Expansion from 2 to 8 and then to 18 partitions
in each direction. The implication is that when growth is necessary, a large number of new partitions must be added (not just one at a time). Growth by two rows and columns at a time is the smallest increment that ensures all partitions are exactly the same size. Growth by one row and column can lead to the latest partition on the diagonal being half the size of all the rest (it has only one half partition instead of two). At worst, this leads to under-utilisation of one computing node, and full utilisation will be restored at the next growth increment. It is also worth noting that whilst partitions have to be of fixed size, the mapping between partitions and computing nodes can be virtualised, allowing, for example, more recently added nodes, which are likely to have greater capacity, to be assigned more than a single partition.

Now, the number of partitions, N, is expressed in terms of the length, L, of the (square) partition grid:

N = L² / 2    (7.29)

Now suppose that at some point in its lifetime, a surveillance system has N partitions. Growth by two partitions in each direction results in partition grid length L + 2, and hence the number of partitions in the system after growth, N′, is:

N′ = (L + 2)² / 2 = (L² + 4L + 4) / 2 = N + 2L + 2    (7.30)

The growth in the number of partitions is then:

N′ − N = 2L + 2 = 2n/R + 2    (7.31)

i.e. it is linear in the number of cameras in the system prior to growth.
7.5.3 Analysis of Distributed Exclusion

Here we describe the expected properties for an implementation of distributed exclusion based on our partitioning approach. Section 7.5.4 evaluates the properties measured for a real implementation against the predictions made here. The properties of interest are:

• Network bandwidth – the input bandwidth for each estimation partition and the aggregate bandwidth between the detection and estimation layers.
• Memory – memory required within each partition.

Specifically, our aim is to relate these properties to the system parameters defined in Tables 7.1 and 7.2. Such relationships enable those engineering a surveillance system to provision enough memory and network hardware to prevent degradation of system performance, due to paging (or worse, memory exhaustion) and contention respectively, thus increasing the probability of the system maintaining continuous availability.

Table 7.2 System Implementation Parameters

Parameter  Definition
f          the number of frames per unit time processed by the estimation partitions.
d          the maximum time for which occupancy data may be buffered prior to processing by the estimation partitions.
b          the size of each OUij count, in bytes.
7.5.3.1 Network Bandwidth Requirements
The input bandwidth required by a partition is determined by the occupancy and padded occupancy data needed in the two half partitions constituting the partition. The occupancy data required in a half partition is that in the x co-ordinate range of OUij which the half-partition represents. Similarly, the padded occupancy data required corresponds to the y-axis range. The size of the range in each case is the length (in cells) of a half-partition, that is:

l = Rr = nr / √(2N)    (7.32)

As with R in Equation 7.20, this is defined only where N ∈ 2ℕ² and N ≥ 2. Now, observe from Figure 7.6 that there are only two configurations of partitions:

1. For partitions on the diagonal of the partition grid, each of the two half-partitions occupies the same co-ordinate range in both x and y axes.
2. For all other partitions, the x co-ordinate range of one half partition is the y co-ordinate range of the other half-partition, and vice versa.
In both cases, a (whole) partition occupies a given (possibly non-contiguous) range in the x dimension and the same range in the y dimension; the total size of these ranges is 2l. Given that padded occupancy, pi, is computed from occupancy, ok, for k in the set of cells including i and its immediate neighbours (within the same camera), all the padded occupancy data needed in a partition can be computed from the occupancy data that is also needed. Therefore the amount of data per frame needed as input into a partition is simply the amount of occupancy data, that is 2l bits. The unpartitioned case (N = 1) is handled separately: the number of inputs per frame is simply nr. With f frames per second this gives the partition's input bandwidth per second, βP, in bits per second:

βP = nrf                   if N = 1
     2lf = 2nrf / √(2N)    if N ∈ 2ℕ² ∧ N ≥ 2    (7.33)

Now, in practice, the information sent over the network will need to be encoded in some structured form, so the actual bandwidth requirement will be some constant multiple of βP. For N partitions, each on a separate host, the aggregate bandwidth per unit time, βT, in bits per second, is:

βT = nrf           if N = 1
     n√(2N) rf     if N ∈ 2ℕ² ∧ N ≥ 2    (7.34)
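Plugging numbers into Equations 7.33 and 7.34 is straightforward; the following sketch does so, with the closing comment giving one example worked with the experimental parameter values reported later in the chapter.

import math

def partition_input_bandwidth(n, r, f, N):
    """Per-partition input bandwidth beta_P in bits per second (Equation 7.33)."""
    if N == 1:
        return n * r * f
    return 2 * n * r * f / math.sqrt(2 * N)

def aggregate_bandwidth(n, r, f, N):
    """Aggregate detection-to-estimation bandwidth beta_T in bits per second (Equation 7.34)."""
    if N == 1:
        return n * r * f
    return n * math.sqrt(2 * N) * r * f

# With the experimental values n = 1400, r = 108, f = 10 and N = 32, beta_P is
# about 378 kbit/s per partition before encoding overhead; the chapter reports a
# constant factor of roughly 8 for the structured encoding actually used.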
7.5.3.2 Memory Requirements
Each estimation partition requires memory for:

• Representation of two half partitions of the OUij matrix.
• Representation of two half partitions of the OPij matrix.
• Buffering occupancy data received from detection servers.
• Other miscellaneous purposes, such as parsing the XML data received from the detection servers.
Globally, the OUij matrix requires one integer count for each pair of camera cells, (i, j). The number of cell pairs is r²n², so the storage required for OUij in each partition is:

μOU = r²n²b / N    (7.35)

The OPij matrix is the same size as OUij. However, for given i, OPij for all j identifying cells within a given camera has the same value (as the cells identified by j are either all available or all unavailable at a given point in time), so the storage required for OPij in each partition is:

μOP = μOU / r = rn²b / N    (7.36)
The occupancy data processed within a given partition is produced by several detection servers. Hence it may be the case that different occupancy data pertaining to a given time point arrives at an estimation partition at different times, and in fact some fraction of data (typically very small) may not arrive at all. To cope with this, occupancy data is buffered in estimation partitions prior to processing. Double buffering is required to permit data to continue to arrive in parallel with processing of buffered data. Each data item requires at least two bits (to represent the absence of data as well as the occupied and unoccupied states). Using one byte per item, the storage required for buffering within a partition is:

μB = 2dβP = 2dnrf             if N = 1
             4dnrf / √(2N)    if N ∈ 2ℕ² ∧ N ≥ 2    (7.37)

Finally, estimation partitions require memory for parsing and connection management and for code and other fixed requirements. The storage required for parsing and connection management is proportional to the number of cameras processed by the partition, whereas the remaining memory is constant, so:

μM = nμP + μC                 if N = 1
     (2n / √(2N)) μP + μC     if N ∈ 2ℕ² ∧ N ≥ 2    (7.38)

where μP and μC are constants determined empirically from a given implementation.
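Similarly, the memory formulae can be combined into a small provisioning estimate; the decimal interpretation of KB and MB for μP and μC in the example below is an assumption.

import math

def partition_memory_bytes(n, r, f, N, d, b, mu_p, mu_c):
    """Per-partition memory requirement, summing Equations 7.35-7.38."""
    mu_ou = r * r * n * n * b / N                      # Equation 7.35
    mu_op = mu_ou / r                                  # Equation 7.36
    if N == 1:
        mu_b = 2 * d * n * r * f                       # Equation 7.37
        mu_m = n * mu_p + mu_c                         # Equation 7.38
    else:
        mu_b = 4 * d * n * r * f / math.sqrt(2 * N)
        mu_m = (2 * n / math.sqrt(2 * N)) * mu_p + mu_c
    return mu_ou + mu_op + mu_b + mu_m

# With the Table 7.3 values (n = 1400, r = 108, f = 10, N = 32, d = 2, b = 2,
# mu_p = 250,000 bytes, mu_c = 120,000,000 bytes) this gives roughly 1.65 GB per
# partition, dominated by the OUij counts and comfortably inside the up-to-2 GB
# partitions instantiated in the experiments.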
7.5.4 Evaluation of Distributed Exclusion

Results are reported for running distributed exclusion for surveillance networks of between 100 and 1,400 cameras and between 1 and 32 partitions. The occupancy data is derived by running detection pipelines on 2 hours of footage from a real network of 132 cameras, then duplicating occupancy data as necessary to synthesise 1,400 input files, each with 2 hours of occupancy data. These files are then used as input for the estimation partitions, enabling us to repeat experiments. The use of synthesis to generate a sufficiently large number of inputs for the larger tests results in input that contains an artificially high incidence of overlap, since there is complete overlap within each set of input replicas. The likely consequence of this is that the time performance of estimation computations is slightly worse than it would be in a real network.

Our experimental platform is a cluster of 16 servers, each with two 2.0 GHz dual-core Opteron CPUs and 4 gigabytes of memory. We instantiate up to 32 estimation partitions (of size up to 2GB) on this platform.
7.5.4.1 Verification
Results are verified against the previous, non-partitioned implementation of exclusion. It is shown in [17] that this previous implementation exhibits sufficient
precision and recall of the ground truth overlap to support tracking. The partitioned implementation achieves very similar results for the same data. The differences arise because the distributed implementation chooses a simpler approach to dealing with clock skew. Adopting the more sophisticated previous approach would be feasible in the partitioned implementation, and would not affect memory or network requirements, but would increase CPU time requirements.
7.5.4.2 Performance Results
The parameters for our experiments are shown in Table 7.3. Parameters n, N and (by implication) R are variables, whereas the remaining parameters have the constant values shown. The μP and μC constants have been determined empirically.

Figure 7.9 shows measurements of the arithmetic mean memory usage within each partition for the various configurations tested. Also shown are curves computed from the memory requirement formulae derived in Section 7.5.3.2. As can be seen, these closely match the measured results. The standard deviation in these results is at most 1.0 × 10⁻³ of the mean, for the 8 partition/200 camera case.

Figure 7.10 shows measurements of the arithmetic mean input bandwidth into each partition for the various configurations tested. Also shown are curves computed from the network bandwidth requirement formulae derived in Section 7.5.3.1, scaled by a constant multiple (as discussed earlier), which turns out to be 8. As one would expect, these closely match the measured results. Notice that the bandwidth requirements of the two partition case are the same as for the unpartitioned case: both partitions require input from all cameras. The standard deviation in these results is at most 3.7 × 10⁻² of the mean, for the 32 partition/1200 camera case. The explanation for this variance (in fact any variance at all) is that we use a compressed format (sending only occupied cells) in the occupancy data.

Table 7.3 Experimental Parameters

Parameter       Value
N               either 1, 2, 8 or 32 partitions.
n (for N = 1)   100, 200 or 300 cameras.
n (for N = 2)   100, 200 or 400 cameras.
n (for N = 8)   200 to 800 cameras in steps of 200.
n (for N = 32)  200 to 1400 cameras in steps of 200.
r               each camera's field of view is divided into 12 × 9 = 108 cells.
R               determined from N and n.
f               10 frames per second.
d               at most 2 seconds buffering delay.
b               2 bytes per OUij count.
μP              250 KB storage overhead per camera.
μC              120 MB storage overhead per partition.
Fig. 7.9 Memory used within each partition
Fig. 7.10 Bandwidth into each partition
Fig. 7.11 CPU time used by each partition
Figure 7.11 shows measurements of the arithmetic mean CPU time within each partition for the various configurations tested. Recall that the footage used for experimentation is two hours (7,200 seconds), so all configurations shown are significantly faster than real time. The standard deviation in these results is at most 7.4 × 10⁻² of the mean, for the 32 partition/200 camera case. Partitions in this case require a mean of 77 seconds of CPU time for 7,200 seconds of real time, with the consequence that CPU time sampling effects contribute much of the variance. In contrast, the 32 partition/1400 camera case requires a mean of 1,435 seconds of CPU time for 7,200 seconds of real time, and has a standard deviation of 1.6 × 10⁻² of the mean.

At each time step, the estimation executes O(n²) joint sampling operations (one for each pair of cells). Thus, we fit quadratic curves (least squares) to the measured data to obtain the co-efficients of quadratic formulae predicting the time performance for each distinct number of partitions. These formulae are shown as the predicted curves in Figure 7.11.
7.5.4.3 Discussion
Observe from Figures 7.9 and 7.11 that partitioning over just two nodes delivers significant increases in supportable scale over the unpartitioned implementation, at the cost of a second, commodity-level, server. There is some additional network cost arising from partitioning; for example, the total bandwidth required in the two partition case is twice that for the unpartitioned case with the same number of cameras. However, the total bandwidth required is relatively small in any case and thus is not expected to be
problematic in practice. The total memory required for a given number of cameras is almost independent of the number of partitions, and the cost of this additional memory is more than outweighed by the ability for the memory to be distributed across multiple machines, thus avoiding any requirement for expensive machines capable of supporting unusually large amounts of memory.

The experiments validate the memory requirement formulae from Section 7.5.3.2 and (trivially) the network bandwidth formulae from Section 7.5.3.1. Using the memory formulae together with the empirically derived quadratic formulae fitted to the curves in Figure 7.11, it is possible to determine the current scale limit for the estimation to operate in real-time on typical commodity server hardware. We take as typical a server with 16 GB memory and 2 CPUs: the current cost of such a server is less than the cost of ten cameras (including camera installation). We instantiate two 8 GB partitions onto each such server. Figure 7.12 shows the predicted memory and CPU time curves for the 32 partition case extended up to 3,500 cameras. As can be seen, the memory curve crosses the 8 GB requirement at about 3,200 cameras, with the CPU time curve crossing 7,200 seconds at about 3,400 cameras, leading to the conclusion that a 16 server system can support over 3,000 cameras; significantly larger scale than any previously reported results.
Fig. 7.12 Scale limit for 32 partition system
7.6 Enabling Network Tracking

We are now interested in investigating the use of pairwise camera overlap estimates for supporting target tracking across large networks of surveillance cameras. This is
achieved by comparing the use of camera overlap topology information with a method based on matching target appearance histograms, and by evaluating the effect of combining both methods. We use standard methods for target segmentation and single camera tracking, and instead focus on the task of implementing hand-off by joining together the tracks that have been extracted from individual views, as illustrated in Figure 7.13.
Fig. 7.13 Example showing a hand-off link (dashed purple line) between two single camera tracks (solid blue lines), as well as the summarisation of individuals to 12x9 grid cells (green shading) as employed in the topology estimation method used
The two main approaches to implementing tracking hand-off considered here are: target appearance, and camera overlap topology. Target appearance is represented by an RGB colour histogram, which has the advantage that it is already used for single camera tracking, and is straightforward to extend to multiple camera tracking. However it has the disadvantage that target appearance can change significantly between viewpoints and due to errors in segmenting the target from the background. The camera overlap topology describes the relationships between the cells that form the camera views, and can be obtained automatically using the process defined previously.

The appearance of a person is defined by the pixel colours representing that person in an image, and is widely used in video surveillance. Measures based on appearance include correlation of the patch itself, correlating image gradients, and matching Gaussian distributions in colour space. We use an RGB histogram to represent the appearance of each target. Histograms are chosen because they allow for some distortion of appearance: they count only the frequency of each colour rather than its location, and they quantise colours to allow for small variations. They are also compact and easy to store, update and match. Here we use an 8x8x8 RGB histogram that is equally spaced in all three dimensions, totalling 512 bins. Histogram matching is based upon the Bhattacharyya coefficient [1] to determine the similarity of object appearances. If we let i sequentially reference histogram bins, then the similarity of normalised histograms A and B is given by:
Similarity = Σ_{i=1}^{512} √(Ai · Bi)    (7.39)
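For illustration, the histogram construction and Bhattacharyya similarity might be computed as follows; the match threshold in the last function is an arbitrary placeholder, since the chapter evaluates a range of thresholds rather than fixing one.

import numpy as np

def rgb_histogram(pixels, bins=8):
    """Normalised 8x8x8 RGB histogram (512 bins) from an N x 3 array of RGB pixels."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / max(hist.sum(), 1)

def bhattacharyya_similarity(a, b):
    """Similarity of two normalised histograms (Equation 7.39)."""
    return float(np.sum(np.sqrt(a * b)))

def same_target(pixels_a, pixels_b, threshold=0.7):
    """Decide whether two detections match; the threshold is an illustrative assumption."""
    return bhattacharyya_similarity(rgb_histogram(pixels_a),
                                    rgb_histogram(pixels_b)) > threshold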
A decision on whether two targets match can then be reached by thresholding this similarity measure. Other similarity measures, such as Kullback-Leibler divergence [20], produced similar results.

For large camera networks, the number of false matches resulting from the use of appearance matching alone will generally increase with the number of cameras in the system. In practice, for large networks this will need to be mitigated by applying at least a limited form of camera topology information, and only searching for appearance matches in the same cluster of cameras, such as within the same building. Hand-off based on appearance matching across the entire network or within clusters may then be further refined by combining it with the use of overlap topology and searching only overlapping regions for targets of matching appearance.

To perform the evaluation, we draw a random sample of detections of people at particular times. For each sample, we manually identify all other simultaneous detections of the same person, in order to obtain a set of ground truth links between detections in different cameras. No restriction is placed on the detections that are sampled in this way; they may occur anywhere within tracks that have been found within a single camera. This reflects the fact that hand-off does not just link the end of one track with the start of another, but rather needs to link arbitrary points from track pairs depending on when a person becomes visible or disappears from another camera. This is quite a stringent test because it is based solely on instantaneous “snapshots” of the network: no temporal information is used. In reality, information from temporally adjacent frames may be available to assist with deciding on camera hand-off.

A set of 500 object detections were randomly chosen from the many hours of footage across a set of 24 overlapping cameras. These provided 160 usable test cases where segmentation errors were not extreme, and the individual was not significantly occluded. It was found that one camera provided unreliable time stamping, providing an opportunity to compare results to the more reliable 23 camera set.

Figure 7.14 presents the tracking hand-off results in terms of precision and recall, for appearance matching and using each of three overlap topology estimators: exclusion, mutual information, and lift. The results show that using the appearance model alone has much lower precision when detecting tracking hand-offs than using any of the topology estimates alone, but it performs better than randomly “guessing”. This low precision could be due to a number of factors that influence appearance measures, such as segmentation errors and illumination effects. Additionally, some cameras in the dataset were behind glass windows, which can introduce reflections. This had a minor effect on segmentation, but influenced object appearance. Regardless of these effects, distinguishing between individuals wearing similar clothing is difficult using appearance features alone. Estimates of tracking hand-off links based on the automatically generated camera-to-camera topology are similar to those based on the camera-to-camera ground truth
Fig. 7.14 P-R curves demonstrating tracking hand-off results
topology, when the unreliable camera was removed. The precision was considerably worse with this camera included, as it introduced a number of false links in the topology. Using overlap at a camera level alone is problematic, because objects may be observed in non-overlapping regions of overlapping views and erroneously be considered to be the same object. The increased spatial resolution of overlap using 12x9 cells per camera outperformed the tracking hand-off achieved using even the ground truth camera-to-camera topology. Thus, more cases where different objects are seen in non-overlapping portions of the camera views are correctly excluded in the hand-off search process. Combining appearance and topology did not significantly increase the precision for a given level of recall, suggesting that incorrect links which fit the topology model are not correctly excluded by using appearance. A more complex appearance model or improvements in the accuracy of object segmentations may improve the accuracy of appearance; however unless the appearance model or extraction technique can compensate for illumination changing the perceived object appearance, these are still likely to be very difficult cases in real surveillance environments. The accuracy of determining appearance similarity also depends significantly upon the individual clothing that is worn. In practice, many people wear similar clothing, often with a significant amount of black or dark colours. If the appearance model can only capture large differences in appearance and clothing, then it may be difficult to
accurately discriminate between individuals; however capturing nuances in appearance can lead to large data structures that can be even more sensitive to illumination and segmentation errors. By contrast, topology information is derived solely from object detection in each cell, and thus does not depend on the appearance of people in the video. It is less sensitive to these issues, and able to obtain a topology that is accurate enough to be useful for tracking even in environments where the camera struggles to accurately capture the appearance of each person.

The effect of removing the poor quality wireless camera is also indicated in the graph. It is clear that the precision of the topology is reduced by including such cameras, as the time delay allows for evidence to arise supporting overlaps that do not occur. The precision difference for appearance-only tracking was much less and is not shown in the graph. This is because the appearance is not as significantly affected by time delays, so removing the less reliable camera does not have much of an effect.
7.7 Conclusion

This chapter reports recent research exploring the automatic determination of activity topologies in large scale surveillance systems. The focus is specifically on the efficient and scalable determination of overlap between the fields of view, so as to facilitate higher level tasks such as the tracking of people through the system. A framework is described that utilises joint sampling of cell occupancy information for pairs of camera views to estimate the camera overlap topology, a subset of the full activity topology. This framework has been developed to implement a range of camera overlap estimators: results are reported for estimators based on mutual information, conditional entropy, lift and conditional probability measures.

Partitioning and distributed processing using the framework provides a more cost effective approach to activity topology estimation for large surveillance networks, as this permits the aggregation of the resources of a large number of commodity servers into a much larger system than is possible with a single high-end server/supercomputer. In particular, the distributed implementation can obtain the memory it requires at acceptable cost. Results comparing partitioned and non-partitioned implementations demonstrate that the advantages of partitioning outweigh the costs. The partitioning scheme enables partitions to execute independently, enhancing both performance (through increased parallelism) and, just as importantly, permitting partitions to be added without affecting existing partitions. Formulae are derived for the network and memory requirements of the partitioned implementation. These formulae, verified by experimental results, enable engineers seeking to use the distributed topology estimation framework to determine the resources required from the implementation platform.

A further detailed investigation is conducted into the accuracy of the topologies estimated by the occupancy-based framework. This assesses support for tracking of people across multiple cameras. These results demonstrate the utility of the topologies produced by the estimation framework.
The camera overlap topology estimation framework, the distributed implementation, and the use of the estimated topology for higher level functions (tracking), together demonstrate the system’s ability to support intelligent video surveillance on large scale systems.
References

[1] Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society 35, 99–109 (1943)
[2] Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
[3] Brand, M., Antone, M., Teller, S.: Spectral solution of large-scale extrinsic camera calibration as a graph embedding problem. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 262–273. Springer, Heidelberg (2004)
[4] Buehler, C.: Computerized method and apparatus for determining field-of-view relationships among multiple image sensors. United States Patent 7286157 (2007)
[5] Cichowski, A., Madden, C., Detmold, H., Dick, A.R., van den Hengel, A., Hill, R.: Tracking hand-off in large surveillance networks. In: Proceedings of Image and Visual Computing, New Zealand (2009)
[6] Floreani, D., manufacturer of large surveillance systems: Personal Communication to Anton van den Hengel (November 2007)
[7] Detmold, H., van den Hengel, A., Dick, A.R., Cichowski, A., Hill, R., Kocadag, E., Falkner, K., Munro, D.S.: Topology estimation for thousand-camera surveillance networks. In: Proceedings of International Conference on Distributed Smart Cameras, pp. 195–202 (2007)
[8] Detmold, H., van den Hengel, A., Dick, A.R., Cichowski, A., Hill, R., Kocadag, E., Yarom, Y., Falkner, K., Munro, D.: Estimating camera overlap in large and growing networks. In: 2nd IEEE/ACM International Conference on Distributed Smart Cameras (2008)
[9] Dick, A., Brooks, M.J.: A stochastic approach to tracking objects across multiple cameras. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 160–170. Springer, Heidelberg (2004)
[10] Ellis, T.J., Makris, D., Black, J.: Learning a multi-camera topology. In: Proceedings of Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 165–171 (2003)
[11] Espina, M.V., Velastin, S.A.: Intelligent distributed surveillance systems: A review. IEEE Proceedings - Vision, Image and Signal Processing 152, 192–204 (2005)
[12] Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
[13] Griffin, J.: Singapore deploys March Networks VMS solution (2009), http://www.ipsecuritywatch.com/web/online/IPSW-News/Singapore-deploys-March-Networks-VMS-solution/512$4948, IP Security Watch
[14] van den Hengel, A., Detmold, H., Madden, C., Dick, A.R., Cichowski, A., Hill, R.: A framework for determining overlap in large scale networks. In: Proceedings of International Conference on Distributed Smart Cameras (2009)
[15] van den Hengel, A., Dick, A., Hill, R.: Activity topology estimation for large networks of cameras. In: AVSS 2006: Proc. IEEE International Conference on Video and Signal Based Surveillance, pp. 44–49 (2006)
[16] van den Hengel, A., Dick, A.R., Detmold, H., Cichowski, A., Hill, R.: Finding camera overlap in large surveillance networks. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 375–384. Springer, Heidelberg (2007)
[17] Hill, R., van den Hengel, A., Dick, A.R., Cichowski, A., Detmold, H.: Empirical evaluation of the exclusion approach to estimating camera overlap. In: Proceedings of the International Conference on Distributed Smart Cameras (2008)
[18] Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proceedings of International Conference on Computer Vision, pp. 952–957 (2003)
[19] Ko, T.H., Berry, N.M.: On scaling distributed low-power wireless image sensors. In: Proceedings 39th Annual Hawaii International Conference on System Sciences (2006)
[20] Kullback, S.: The Kullback-Leibler distance, vol. 41. The American Statistical Association (1987)
[21] Rahimi, A., Dunagan, B., Darrell, T.: Simultaneous calibration and tracking with a network of non-overlapping sensors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 187–194 (2004)
[22] Rahimi, A., Dunagan, B., Darrell, T.: Tracking people with a sparse network of bearing sensors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 507–518. Springer, Heidelberg (2004)
[23] Shannon, C.E., Weaver, W.: The mathematical theory of communication. University of Illinois Press, Urbana (1949)
[24] Stauffer, C.: Learning to track objects through unobserved regions. In: IEEE Computer Society Workshop on Motion and Video Computing, vol. II, pp. 96–102 (2005)
[25] Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000)
[26] Tieu, K., Dalley, G., Grimson, W.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Tenth IEEE International Conference on Computer Vision, vol. II, pp. 1842–1849 (2005)
Chapter 8
Multi-robot Teams for Environmental Monitoring

Maria Valera Espina, Raphael Grech, Deon De Jager, Paolo Remagnino, Luca Iocchi, Luca Marchetti, Daniele Nardi, Dorothy Monekosso, Mircea Nicolescu, and Christopher King
Abstract. In this chapter we target the problem of monitoring an environment with a team of mobile robots having on-board video cameras, together with fixed stereo cameras available within the environment. Current research regards homogeneous robots, whereas in this chapter we study highly heterogeneous systems and consider the problem of patrolling an area with a dynamic set of agents. The system presented in the chapter provides enhanced multi-robot coordination and vision-based activity monitoring techniques. The main objective is the integration and development of coordination techniques for multi-robot environment coverage, with the goal of maximizing the quality of information gathered from a given area, thus implementing a heterogeneous, mobile and reconfigurable multi-camera video-surveillance system.
Maria Valera Espina · Raphael Grech · Deon De Jager · Paolo Remagnino
Digital Imaging Research Centre, Kingston University, London, UK

Luca Iocchi · Luca Marchetti · Daniele Nardi
Department of Computer and System Sciences, University of Rome “La Sapienza”, Italy

Dorothy Monekosso
Computer Science Research Institute, University of Ulster, UK

Mircea Nicolescu · Christopher King
Department of Computer Science and Engineering, University of Nevada, Reno

8.1 Introduction

Monitoring a large area is a challenging task for an autonomous system. During recent years, there has been increasing attention on using robotic technologies for security and defense applications, in order to enhance their performance and reduce the danger for the people involved. Moreover, the use of multi-robot systems allows for a better deployment, increased flexibility and reduced costs of the system. A significant amount of research in multi-agent systems has been dedicated to the development of and experimentation on methods, algorithms and evaluation methodologies for multi-robot patrolling in different scenarios. This chapter shows
that using multi-robots in environmental monitoring is both effective and efficient. We provide a distributed, multi-robot solution to environment monitoring, in order to detect or prevent defined, undesired events, such as intrusions, luggage being left unattended, and high temperatures (such as a fire). The problem of detecting and responding to threats through surveillance techniques is particularly well suited to a robotic solution comprising a team of multiple robots. For large environments, the distributed nature of the multi-robot team provides robustness and increased performance of the surveillance system. Here we develop and test an integrated multi-robot system as a mobile, reconfigurable, multi-camera video-surveillance system. The system goal is to monitor an environment by collectively executing the most effective strategies for gathering the best quality information from it.

Using a group of mobile robots equipped with cameras has several significant advantages over a fixed surveillance camera system. Firstly, our solution can be used in environments that have previously not been equipped with a camera-based monitoring system: the robot team can be deployed quickly to obtain information about an unknown environment. Secondly, the cameras are attached to the robots, which will be positioning themselves within the environment in order to best acquire the necessary information. This is in contrast with a static camera, which can only perform observations from a fixed view point. Thirdly, the robots in the team have the power to collaborate on the monitoring task and are able to pre-empt a potential threat. Fourthly, the robots could be equipped with additional, specialized sensors, which could be delivered at the appropriate place in the environment to detect, for example, the presence of high temperatures, such as in the case of a fire. Lastly, the robot team can communicate with a human operator and receive commands about the goals and potential changes in the mission, allowing for a dynamic, adaptive solution. Therefore, these enhanced multi-robot coordination and vision-based activity monitoring techniques advance the state-of-the-art in surveillance applications.

In this chapter, we focus on monitoring a large area by using a system with the following characteristics:

1. The system is composed of a number of agents, some of them having mobile capabilities (mobile robots) whilst others are fixed (video cameras).
2. The system is required to monitor and detect different kinds of predefined events at the same time.
3. Each agent has a set of sensors that are useful to detect some events. Sensors are of different types within the entire system.
4. The system is required to operate in two modes:
   a) patrolling mode
   b) response mode

These requirements make the problem significantly different from previous work. First of all, we consider a highly heterogeneous system, where robots and cameras inter-operate. Second, we consider different events and different sensors, and we will therefore consider different sensor models for each kind of event. Third, we will study the dynamic evolution of the monitoring problem, where at each time a subset of the agents will be in response mode, while the rest of them will be in patrolling mode.
The main objectives of the developed system are to:

1. develop environment monitoring techniques through behavior analysis based on stereo cameras,
2. develop distributed multi-robot coverage techniques for security and surveillance,
3. validate our solution by constructing a technological demonstrator showing the capabilities of a multi-robot system to effectively deploy itself in the environment and monitor it.

In our previous work, we already developed and successfully implemented new dynamic distributed task assignment algorithms for teams of mobile robots, applied to robotic soccer [27] and to foraging-like tasks [20]. More specifically, in [27] we proposed a greedy algorithm to effectively solve the multi-agent dynamic and distributed task assignment problem, that is very effective in situations where the different tasks to be achieved have different priorities. In [20] we also proposed a distributed algorithm for dynamic task assignment based on token passing that is applicable when tasks are not known a priori, but are discovered during the mission. The problem considered here requires both finding an optimal allocation of tasks among the robots and taking into account tasks that are discovered at runtime. Therefore it is necessary to integrate the two approaches. As a result, we do not only specialize these solutions to the multi-robot surveillance and monitoring task, but also study and develop extensions to these techniques in order to improve the optimality of the solutions and the adaptivity to an open team of agents, taking into account the physical constraints of the environment and of the task.

The use of stereo cameras in video-surveillance opens several research issues, such as the study of segmentation algorithms based on depth information provided by the stereo-vision, tracking algorithms that take into account 3-D information about the moving objects, and techniques of behavior analysis that integrate and fuse 3-D information gathered from several stereo sensors. Furthermore, the application of multi-robot coverage to security and surveillance tasks provides new opportunities for studying multi-robot distributed coordination techniques with dynamic perception of the tasks and methods for optimal coverage of the environment in order to maximize the quality of the information gathered from it. These aspects will be considered in more detail in the coming sections.

The rest of the chapter is organized as follows: Section 8.2 provides an overview of our proposed system. In Section 8.3 previous related work is presented. The representation formalism is explained in Section 8.4 and the event-driven multi-sensor monitoring algorithm is presented in Section 8.5. In Section 8.6 the system implementation and experimental results are illustrated. Finally, conclusions are drawn in Section 8.7.
8.2 Overview of the System The overall surveillance system developed is presented below. The system is mainly composed of two sub-systems: a video-surveillance sub-system operating with static cameras and a multi-robot system for environment monitoring and threat response.
8.2.1 Video-Surveillance with Static Cameras One objective of the visual surveillance system was to identify when people leave, pick up or exchange objects. Two scenarios were used to test the capabilities of the video-surveillance system. In the first scenario (i.e. the unattended baggage event), the system was designed to send a report if a person was observed leaving a bag. In the second scenario (i.e. object manipulation), the system should send a report if a person manipulated an unauthorized object in the environment. Once a report is sent, a patrol robot is commissioned to go and take a high-resolution picture of the scene. Recognizing these types of actions can be done without sophisticated algorithms, so for this demonstration we use simple rule-sets based only on proximity and trajectories, as sketched in the example after this list.
For the first scenario:
• If a bag appears in the environment, models will be generated for that bag and for the nearest person. If the associated person is observed moving away from the bag, the event will be considered a "left bag", and a report of the incident will be generated.
• If a bag is associated with one person, and a second person is observed moving away with the bag, the event will be considered a "bag taken" and a report will be generated and sent to the multi-robot system.
For the second scenario:
• If a person is observed manipulating an object that was either present at the beginning of the sequence or left by another person (i.e. an unauthorized object), the incident will be considered an "alert" and a report will be generated and sent to the multi-robot system.
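To make the rule-set concrete, the following sketch shows how the first-scenario rules could be expressed in code. It is an illustration only, not the deployed implementation: the distance thresholds, the function name and the assumption that ground-plane positions of the bag, its owner and other people are already available from the tracker are all ours.

# Illustrative sketch of the proximity-based rule-set described above.
# Thresholds, names and the tracker interface are assumptions.
from math import hypot

LEAVE_DIST = 2.0   # metres: owner farther than this => "left bag"
TAKE_DIST = 0.5    # metres: another person this close to the bag => "bag taken"

def check_bag_events(bag_pos, owner_pos, others):
    """bag_pos, owner_pos: (x, y) ground-plane coordinates; others: list of (x, y)."""
    reports = []
    if hypot(bag_pos[0] - owner_pos[0], bag_pos[1] - owner_pos[1]) > LEAVE_DIST:
        reports.append(("left bag", bag_pos))
    for person in others:
        if hypot(bag_pos[0] - person[0], bag_pos[1] - person[1]) < TAKE_DIST:
            reports.append(("bag taken", bag_pos))
    return reports  # each report would be forwarded to the multi-robot system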
8.2.2 Multi-robot Monitoring of the Environment Our approach to multi-robot monitoring is to extend the work done in multi-robot patrolling, adding the capability for the robots to respond, in a coordinated way, to events detected by visual and other sensors. Therefore two problems are considered and solved:
1. identify global tasks associated with events detected by local sensors on-board the robots or by the vision components of the system;
2. coordinate the response to these events among the multiple robots.
These problems have been solved by developing a general algorithm for event-driven distributed monitoring (see Section 8.5).
8.2.3 Experimental Scenario The experimental validation was carried out on the campus of the Department of Computer and System Science (DIS) of Sapienza University in Rome, Italy (www.dis.uniroma1.it). The selected scenario, shown in Figure 8.1, was an indoor corridor to
simulate the unattended baggage event and a lab room to simulate the object manipulation. A team of robots carrying video cameras is deployed in the environment, where they cooperate to optimize the surveillance task by maximizing the amount and quality of information gathered from the environment using the on-board cameras. When the robots reach the desired target poses, the cameras mounted on them can act as a network of surveillance cameras, and video-surveillance algorithms may run on their inputs. Moreover, another system based on fixed stereo cameras, capable of providing depth information, is available within the environment. This can eventually also be integrated on the robot platforms.
Fig. 8.1 Experimental scenario at DIS
8.3 Related Work In this chapter, we define the problem of Environmental Monitoring by extending the classical problem of Multi-Robot Patrolling to also include Threat Response, i.e. the response to a threat event detected within the environment by an agent (either a robot or a computer vision sub-system) monitoring that environment. Examples of threat responses range from intercepting an intruder to examining a specific area or a specific object left by somebody. The main components to be integrated for the effective monitoring of an environment are: Multi-Robot Patrolling, Multi-Robot Coverage, Dynamic Task Assignment and Automatic Video Surveillance. The current state of the art on these topics is presented in the following sections.
8.3.1 Multi-robot Patrolling The patrolling task refers to the act of walking around an area, with some regularity, in order to protect or supervise it [39]. Usually, it is carried out by a team of multiple robots to increase the efficiency of the system. Given a map representation of the environment, the first step is to decide whether the map should be partitioned into smaller sections. In order to maximize efficiency, a multi-robot team should assign different areas to be patrolled to each robot. This means that a division of the global map has to be computed and the sub-maps assigned to the robots. As analyzed in [39], in most cases it is sufficient to
adopt a static strategy, wherein the whole environment is already given as a collection of smaller areas; in this case, a partitioning step is not really necessary. However, more interesting approaches deal with dynamic and non-deterministic environments, resulting in a more challenging domain that requires dynamic partitioning. This implies that the robots should coordinate among themselves to decide who has to patrol which area. The subsequent step involves how to sweep the assigned area. Basically, this is performed by transforming the environment map into a graph to be traversed. This aspect of patrolling is the one most addressed by the current state of the art. In fact, given a topological graph of the map, most of the algorithms and techniques used for dealing with graphs can be adopted. Major approaches use the Traveling Salesman Problem and its variants to define an optimal (or sub-optimal) path for a given graph. The work in [38] defines a steering-control-based mechanism that takes into account the constraints imposed by a real platform to define the path. First, a rectangle partitioning scheme is applied to the map. Then, each rectangle is covered by the circle defining the sweep area (which depends on the platform and sensors used). Finally, the path is the result of connecting the covering circles. Another possibility is to use a Hamiltonian cycle to define the path. The work in [37] defines an algorithm to transform an occupancy-grid-based map into a topological graph, to which this kind of partitioning strategy is applied to perform the patrol task. More advanced techniques apply Game Theory [6] and Reinforcement Learning [4] methods to include the behavior of intruders in the sequencing strategies. A comparison of these techniques and preliminary results are presented in [4]. The last aspect involved in Multi-Robot Patrolling is task reallocation. When dynamic domains are used as test beds, the assigned areas can change over time. This implies that the patrolling team needs to reshape its strategy to take the modification into account. Usually, this involves rebuilding the topological graph and resetting the current configuration. A more efficient approach, however, requires coordination among the robots to minimize task hopping. A basic approach, involving reallocation over team formation, is presented in [1].
8.3.2 Multi-robot Coverage The goal of the coverage task is to build an efficient path that ensures the whole area is crossed by the robot. When using a team of robots, the goal requires building efficient paths that jointly ensure the coverage of the area. Therefore, an important issue in mobile multi-robot teams is the application of coordination techniques to the area coverage problem. Multi-robot environment coverage has recently been studied by solving the problem of generating patrol paths. Elmaliach et al. [18] introduced an algorithm that guarantees the maximal uniform frequency for visiting places by the robots. Their algorithm detects circular paths that visit all points in the area, while taking into account terrain directionality and velocity constraints. The approach in [2] also considers the case in which the adversary knows the patrol scheme of the robots. Correll and Martinoli [17] consider the combination of probabilistic and
deterministic algorithms for the multi-robot coverage problem. They apply their method to a swarm-robotic inspection system at different levels of wheel-slip. They conclude that the combination of probabilistic and deterministic methods leads to higher accuracy, particularly when real-world factors become significant. Ziparo et al. [48] considered the problem of deploying large robot teams within Urban Search And Rescue (USAR)-like environments for victim search. They used RFIDs for coordinating the robots by local search, and extended the approach with a global planner for synchronizing routes in configuration time-space. These approaches are mainly focused on the problem of computing optimal team trajectories in terms of path length and terrain coverage frequency, while only little attention has been paid to team robustness. Within real-world scenarios, however, dynamic changes and system failures are crucial factors for any performance metric.
8.3.3 Task Assignment Cooperation based on Task Assignment has been intensively studied and can typically be considered a strongly coordinated/distributed approach [19]. In Reactive Task Assignment (e.g., [36]), each member of the team decides whether to engage itself in a task without re-organizing the other members' activities, drastically reducing the requirements on communication but limiting the level of cooperation that can be supported. Iterative Task Assignment, such as [27, 46], allocates all tasks present in the system at each time step. In this way, the system can adapt to environmental conditions ensuring a robust allocation, but it generally requires knowing in advance the tasks that have to be allocated. Sequential Task Assignment methods [23, 49] allocate tasks to robots sequentially as they enter the system; therefore the tasks to be allocated do not need to be known before the allocation process begins. Such techniques generally suffer from a large bandwidth requirement, due to the large number of messages exchanged in order to assign tasks. Hybrid solutions that merge characteristics of different types of task allocation have also been investigated. For example, in [22] the authors provide an emotion-based solution for multi-robot recruitment. Such an approach can be considered intermediate between sequential and reactive task assignment. Like the previous approaches, this work does not explicitly take into account conflicts due to dynamic task perception. Conflicts arising in Task Assignment are specifically addressed, for example, by [3, 28]. However, the conflicts described in those approaches are only related to the use of shared resources (i.e. space), while other approaches can address a more general class of conflicts, such as the ones that arise when task properties change over time due to dynamic on-line perception [20].
8.3.4 Automatic Video Surveillance Semi-automated visual surveillance systems deal with the real-time monitoring of persistent and transient objects within a specific environment. The primary aims of these systems are to provide an automatic interpretation of scenes to understand and
predict the actions and the interactions of the observed objects based on the information acquired by sensors. As mentioned in [45], the main stages of the pipeline in an automatic visual surveillance system are moving object detection and recognition, tracking, and behavioral analysis (or activity recognition). One of the most critical and challenging components of a semi-automated video surveillance system is the low-level detection and tracking phase. Even small detection errors can significantly alter the performance of routines further down the pipeline, and subsequent routines are usually unable to correct errors without using cumbersome, ad-hoc techniques. Adding to this challenge, low-level functions must process huge amounts of data, in real time, over extended periods. These data are frequently corrupted by the camera's sensor (e.g. CCD noise, poor resolution and motion blur), the environment (e.g. illumination irregularities, camera movement, shadows and reflections), and the objects of interest (e.g. transformation, deformation or occlusion). Therefore, to cope with the challenges of building accurate detection and tracking systems, researchers are usually forced to simplify the problem. It is common to introduce certain assumptions or constraints, which may include: fixing the camera [44], constraining the background [43], constraining object movement, or applying prior knowledge regarding object appearance or location [41]. Relaxing any of these constraints often requires the system to be highly application-domain oriented. There are two main approaches to object detection: "temporal difference" and "background subtraction". The first approach consists in the subtraction of two consecutive frames followed by thresholding. The second approach is based on the subtraction of a background (or reference) model from the current image, followed by a labeling process. The temporal difference approach has good throughput in dynamic environments, as it is very adaptive; however, its performance in extracting all the relevant object pixels is poor. On the other hand, the background subtraction approach performs well in object extraction, although it is sensitive to dynamic changes in the environment. To overcome this issue, adaptive background techniques are applied, which involve creating a background model and continuously updating it to avoid poor detection in dynamic environments. There are different background modeling techniques, commonly related to the application, such as active contour techniques used to track non-rigid objects against homogeneous backgrounds [7], primitive geometric shapes for certain simple rigid objects [16], or articulated shapes for humans in high-resolution images [35]. Background modeling and background updating techniques are based on pixel-based or region-based approaches. In this chapter, an updating technique based on a pixel-based background model, the Gaussian Mixture Model (GMM) [40, 5], is presented for foreground detection in Scenario 1. In Scenario 2, an updating technique based on a region-based background model, the Maximally Stable Extremal Region (MSER) [31], is applied. Moving down the pipeline of the system, after the foreground extraction comes the tracking. Yilmaz et al. [47] reviewed several algorithms, listing the strengths and weaknesses of each of them, and emphasizing that each tracking algorithm inevitably fails under a certain set of conditions.
Therefore, different tracking techniques are used in each scenario, as the environment conditions are different in each of them: in Scenario 1 Kalman Filters [26] are implemented, and in Scenario 2 an optimized, multi-phased, kd-tree-based [24] tracking algorithm is
used. Finally, a survey of activity recognition algorithms is presented in [10], where well-known probabilistic approaches, such as Bayesian Networks [11] or Hidden Markov Models [8], are used. In this video surveillance system, HMMs are used to generalize the object interactions and therefore recognize a predefined activity.
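As an illustration of the adaptive background-subtraction approach discussed above, the sketch below runs OpenCV's stock Gaussian-mixture subtractor (MOG2) over a video stream. This is only a stand-in for the models used later in this chapter: it works on colour frames alone, whereas the system described in Section 8.6.1 augments the mixture with a depth channel, and the file name is hypothetical.

# Minimal adaptive background-subtraction sketch using OpenCV's MOG2 model.
import cv2

cap = cv2.VideoCapture("corridor.avi")          # hypothetical input sequence
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)              # 255 = foreground, 127 = shadow
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    # connected components give the candidate foreground objects for tracking
    n_labels, labels = cv2.connectedComponents((mask == 255).astype("uint8"))
cap.release()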
8.4 Representation Formalism One of our main contributions is the study of multi-robot patrolling and threat response with a heterogeneous team of agents including both mobile robots and static cameras. The heterogeneity is given not only by the different mobility capabilities of the agents, but also by their different sensing abilities. This study is motivated by the fact that the integration of many technologies, such as mobile robotics, artificial vision and sensor networks, can significantly increase the effectiveness of surveillance applications. In such a heterogeneous team, one important issue is to devise a common formalism for representing the knowledge about the environment for the entire system. Our approach to solve the problem of multi-robot monitoring is composed of three components:
1. a map representation of the events occurring in the environment;
2. a generated list of tasks, to handle the events;
3. a coordination protocol, to distribute the tasks among the agents.
The most interesting component is the map representation. Inspired by [29], a Gaussian process models the map to be covered in terms of wide areas and hot spots; that is, the map is partitioned into these two categories. The objective of the single agent is then to cover the assigned areas, prioritizing the hot-spot areas while keeping the wide-area coverage. In this approach we introduce two novel concepts:
1. we consider different types of events that can occur in the domain at the same time, each one represented with a probabilistic function;
2. we consider the decay of information certainty over time.
Moreover, our system is highly heterogeneous, since it is constituted by both mobile robots carrying different sensors and static cameras. The proposed formalism thus allows for a unified representation of heterogeneous multi-source exploration of different types of events or threats.
8.4.1 Problem Formulation Let X denote a finite set of locations (or cells) into which the environment is divided. This decomposition depends on the actual environment, robot size, sensor capabilities and event to be detected. For example, in our experimental setup, we monitor an indoor environment looking for events related to people moving around and unattended luggage, and we use a discretization of the ground plane of 20 × 20 cm. Let E denote a finite set of events that the system is required to detect and monitor. Let Z denote a finite set of sensors included in the system: they can be either fixed or
mounted on mobile platforms. For each event e ∈ E there is a probability (or belief) that the event is currently occurring at a location x ∈ X; this probability is denoted by P_e(x). This probability distribution sums to 1 (i.e., Σ_x P_e(x) = 1). This means that we assume that an event e is occurring (even if this is not the case), and that the team of agents performs a continuous monitoring of the environment. In other words, when a portion of the environment is examined and considered to be clear (i.e., low values of P_e(x)), then in another part of the environment that is not examined this probability increases, and it will thus become the objective of a subsequent search. It should also be noted that this representation is adequate when the sensors cover only a part of the environment at any time, as in our setting, while it is not suitable in cases where the sensors cover the entire environment. The computation of this probability distribution is performed by using the sensors in Z. Given a sensor z ∈ Z and a set of sensor readings z_{0:t} from time 0 to the current time t, the probability that event e occurs in location x at time t can be expressed as

p_e(x_t | z_{0:t}) = η p_e(z_t | x_t) ∫_{x_{t−1}} p_e(x_t | x_{t−1}) p_e(x_{t−1} | z_{0:t−1}) dx_{t−1}    (8.1)
Equation 8.1 is derived from Bayes' theorem (see for example [42]). The set of probability distributions p_e(x_t | z_{0:t}), one for each event e, represents a common formalism for the heterogeneous team considered in this work and allows both for driving the patrolling task and for evaluating different strategies. This representation has an important feature: it allows a sensor model to be explicitly defined for each ⟨sensor, event⟩ pair. In fact, p_e(z_t | x_t) represents the model of sensor z in detecting event e. In this way, it is possible to accurately model heterogeneous sensors within a coherent formalism. Also, the motion model p(x_t | x_{t−1}) can be effectively used to model both static objects (e.g., bags) and dynamic objects (e.g., persons). It is important to notice also that the sensor model p_e(z_t | x_t) contributes both to the cells x_t that are actually observed by the sensor and to the cells that are not within its field of view. In this latter case, the probability of presence of the event is set to the nominal value λ, and thus P_e(x_t) tends to λ (i.e., no knowledge) if no sensor examines cell x_t for some time. This mechanism implements a form of information decay over time and requires the agents to continuously monitor the environment in order to assess that no threats are present. The idleness [14] is normally used in evaluating multi-agent patrolling approaches. This concept can be extended to our formalization as follows. Given a minimum probability value γ, which can be defined according to the sensor models for all the events, the idleness I_e(x, t) for an event e at a location x at time t is defined as the time elapsed since the location had a low (i.e. < γ) probability of hosting the event. More formally,

I_e(x, t) = t − t̂  such that  p_e(x_{t̂}) < γ ∧ ∀τ > t̂, p_e(x_τ) ≥ γ

Then the worst idleness WI_e(t) for an event e at time t is defined as the largest value of the idleness over all the locations. Formally,

WI_e(t) = max_x I_e(x, t)
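A minimal sketch of Equation 8.1 on the discretised grid is given below, including the drift toward the nominal value λ for cells outside the current field of view. The grid size, the decay rate and the assumption that the motion model is passed in as a callable are illustrative choices, not the exact implementation.

# Sketch of the per-event belief update (Eq. 8.1) on a discretised grid.
# Grid resolution, decay rate and the motion-model interface are assumptions.
import numpy as np

LAM = 0.5      # nominal "no knowledge" value lambda
DECAY = 0.02   # how fast unobserved cells drift back toward LAM

def update_event_belief(belief, observed_mask, likelihood, motion=None):
    """belief: P_e(x) over the grid; observed_mask: True where the sensor sees x;
    likelihood: p_e(z_t | x_t) evaluated on the grid; motion: optional transition step."""
    if motion is not None:
        belief = motion(belief)                  # prediction step (identity for static objects)
    # cells outside the field of view decay toward LAM (information decay)
    belief = np.where(observed_mask, belief, belief + DECAY * (LAM - belief))
    # Bayesian correction on the observed cells only
    belief = np.where(observed_mask, likelihood * belief, belief)
    return belief / belief.sum()                 # normalisation (the eta factor)

belief = np.full((50, 50), 1.0 / 2500)           # e.g. a 10 m x 10 m area at 20 cm resolution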
8.5 Event-Driven Distributed Monitoring As stated before, we consider two different classes of tasks. Patrolling tasks define areas of the environment that should be traveled regularly by the agents. The shape of these areas is not constrained to be any specific one. We assume, however, that a decomposition can be performed in order to apply standard approaches to the coverage problem [15]. Threat response tasks specify restricted portions of the map where potentially dangerous events are currently occurring. The kind of threat is left unspecified, since it depends on the application domain. Examples of considered events are: an intruder detected in a restricted area, a bag or an unknown object left in a clear zone, and a non-authorized access to a controlled room. The appropriate response should then be specified per application. We assume that the basic response for all these events requires the agent to reach the corresponding location on the map. In this sense, they are the hot-spots specified in Section 8.4.1.
8.5.1 Layered Map for Event Detection Figure 8.2 shows a diagram of the proposed solution. Data acquired by the sensors of the system are first locally filtered by each agent and then globally fused in order to determine a global response to perceived events. A finite set of event maps models the event space E. For each sensor in the set S, it is possible to define a sensor model. Each sensor model defines a probability distribution function (pdf) that describes how the sensor perceives an event, its uncertainty, and how to update the event map. A sensor can update different event maps, and hence it becomes important to define how heterogeneous sensors update the maps. A Layered Map for Event Detection defines a multi-level Bayesian filter. Each level describes a probability distribution related to a specific sensor: the combination of several levels results in
Fig. 8.2 The data flow in the Layered Map for Event Detection approach.
a probability distribution for the event of interest. However, the importance of an event decays as time goes by: to reflect the temporal constraints in the event handling, we introduced an aging update step. This step acts before updating the filter, given the observations from the sensors (as in the Predict step of recursive Bayesian filters). The pdf associated with each sensor level has the following meaning:

p(x) = 1 if the sensor filter has converged on a hot-spot
p(x) = 0 if there is no relevant information given by the sensor in x

Thus p(x) = 0.5 means that, at that point, the sensor has complete ignorance about the environment it can perceive. Given these assumptions, at every time frame the pdf of a sensor level smooths towards complete ignorance:

p(x) ← p(x) + δ_increase if p(x) < 0.5
p(x) ← p(x) − δ_decrease if p(x) > 0.5

The combination of the sensor levels is delegated to the Event Detection layer. This layer has a bank of filters, each one delegated to detect a specific event. Each filter uses the belief from a subset of the sensor levels to build a joint belief, representing the pdf of the associated event. The characterization of the event depends on the behavior an agent can perform in response to it. In this sense, the event is a threat, and the response to it depends on the coordination step presented in Section 8.5.2. This process is formalized in Algorithm 8.1.

Algorithm 8.1. MSEventDetect
input: u = action performed by the agent; z_s = set of sensor readings from a specific sensor; BF = set of Bayesian sensor filters
output: E = set of pdfs associated with events of interest
// initialize the event belief Bel
Bel ← ∅
foreach bf in BF do
  // apply aging
  p_bf(x | z_s) = p_bf(x) + δ_increase if p_bf(x) < 0.5, p_bf(x) − δ_decrease if p_bf(x) > 0.5
  // perform Bayesian filtering
  Predict_bf(u)
  Update_bf(z_s)
  Bel ← Bel ∪ p_bf(x)
end
// build the joint belief in the event detection layer D
E ← ∅
foreach d in D do
  p_d(x | Bel) = ∏_i γ_i bel_i
  E ← E ∪ p_d(x)
end
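The following sketch renders the core of Algorithm 8.1 in code: each sensor level is aged toward complete ignorance (0.5), updated with its own readings, and the event-detection layer combines the levels through a weighted product. The simplified update rule, the δ values and the data layout are assumptions made for illustration; the real system uses full Bayesian filters per level.

# Illustrative multi-sensor event detection: aging plus weighted product of levels.
import numpy as np

D_INC, D_DEC = 0.01, 0.01            # aging steps (assumed values)

def age(level):
    # smooth the level toward complete ignorance (0.5)
    level = np.where(level < 0.5, np.minimum(level + D_INC, 0.5), level)
    level = np.where(level > 0.5, np.maximum(level - D_DEC, 0.5), level)
    return level

def ms_event_detect(sensor_levels, readings, weights):
    """sensor_levels: dict name -> grid in [0, 1]; readings: dict name -> (mask, value);
    weights: per-level exponents gamma_i used in the joint belief."""
    bel = {}
    for name, level in sensor_levels.items():
        level = age(level)                           # aging (predict-like step)
        mask, value = readings.get(name, (None, None))
        if mask is not None:                         # simplified per-level update
            level = np.where(mask, value, level)
        bel[name] = sensor_levels[name] = level
    joint = np.ones_like(next(iter(bel.values())))
    for name, level in bel.items():                  # event-detection layer
        joint *= level ** weights.get(name, 1.0)
    return joint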
8.5.2 From Events to Tasks for Threat Response The output of the Event Detection is a distribution over the space of the environment, describing the probability of an event occurring in specific areas. The team of agents needs to translate this information into tasks to perform. First of all, a clustering is performed to extract the high-probability peaks of the distribution. The clustering uses a grid-based decomposition of the map to give a coarse approximation of the distribution itself. If the distribution is multi-modal, a cluster is associated with each peak. Each cluster c_e is then defined in terms of its center position and occupancy area: this information will be used by the task association. After the list of clusters is generated, a corresponding task list is built. In principle, one task is associated with each cluster. Two categories of tasks are then considered: patrolling and threat response. The Threat-Response super-class could comprise different behaviors: explore the given area with a camera, verify the presence of an intruder or of an unexpected object, and so on. However, the basic behavior associated with these tasks requires the agent to reach the location, or its surroundings, and take some kind of action. This means that the Threat-Response category defines a whole class of behaviors, distinguished by the last step. Therefore, in our experimental setup, we consider them as simple behaviors that reach the location, avoiding the specification of other specific actions.
Algorithm 8.2. Event2Task
input: E = set of pdfs associated with events of interest
output: T = set of tasks to perform
// clustering of the event set
C ← ∅
foreach e in E do
  c_e = Clusterize(e)
  C ← C ∪ c_e
end
// associate each event to a task
T ← ∅
foreach c in C do
  T ← T ∪ {patrolling} if c ∈ E_p
  T ← T ∪ {threat} if c ∈ E_t
end
Algorithm 8.2 illustrates the steps performed to transform the pdf of an event into a task list. Here, E_p is the class of events that requires a patrolling task, while E_t is the class of events requiring a threat-response task. Figure 8.3 shows an example where two pdfs (represented as sets of samples) are processed to obtain two tasks.
Fig. 8.3 Event to task transformation.
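A possible rendering of Algorithm 8.2 is sketched below: cells whose event probability exceeds a threshold are grouped into connected clusters, and each cluster becomes either a patrolling or a threat-response task. The threshold, the rule used to pick the task category and the task record layout are assumptions.

# Sketch of the event-to-task transformation via grid-based clustering.
import numpy as np
from scipy import ndimage

THRESH = 0.7   # assumed probability threshold for a "peak"

def event_to_tasks(event_grids, cell_size=0.2):
    """event_grids: dict event_name -> probability grid (20 cm cells assumed)."""
    tasks = []
    for name, grid in event_grids.items():
        labels, n = ndimage.label(grid > THRESH)          # connected high-probability clusters
        centres = ndimage.center_of_mass(grid, labels, range(1, n + 1))
        for cy, cx in centres:
            kind = "patrolling" if name.startswith("patrol") else "threat"
            tasks.append({"type": kind,
                          "event": name,
                          "target": (cx * cell_size, cy * cell_size)})
    return tasks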
8.5.3 Strategy for Event-Driven Distributed Monitoring We can now describe the strategy developed for Event-Driven Distributed Monitoring. Algorithm 8.3 incorporates the previously illustrated algorithms for event detection and for the event-to-task transformation. The agent starts with a uniform knowledge of the map: no events have been detected yet. In normal conditions, the default behavior is to patrol the whole area of the map. At time t, the agent a receives information from the sensors. A sensor can model a real device, as well as a virtual one used to describe other types of information (a priori knowledge, constraints on the environment and so on). Algorithm 8.1 is then used to detect clusters of events on the map. These clusters are then passed to Algorithm 8.2, and a list of tasks is generated. The tasks are spread over the network to wake up the coordination protocol, and the Task Assignment step is performed. Each agent selects the task that is most appropriate to its skills and signals its selection to the other agents. The remaining tasks are relayed to the other agents, which, in the meantime, select the most appropriate task. If the number of tasks is larger than the number of available agents, the non-assigned tasks are put in a queue. When an agent completes its task, the task is removed from the pool of tasks of every agent.
Algorithm 8.3. Event-Driven Monitoring
input: BF = set of Bayesian sensor filters; Z = set of sensor readings from S; S = set of sensors; M = map of the environment
// initialize the sensor filters
foreach bf in BF do
  p_bf(x) = U(M)
end
// retrieve the agent's actions
u = actions
// retrieve sensor readings
Z ← ∅
foreach s in S do
  Z ← Z ∪ z_s
end
// detect events
E = MSEventDetect(u, Z, BF)
// generate the task set
T = Event2Task(E)
// assign a task to the agent a
task = TaskAssignment(a, T)
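The task-assignment step of Algorithm 8.3 can be illustrated with the greedy, utility-based sketch below, in the spirit of the distributed coordination of [27]. The inverse-distance utility, the agent and task records, and the central simulation of what each agent would decide locally are simplifying assumptions.

# Greedy utility-based task assignment (illustrative only).
from math import hypot

def utility(agent, task):
    ax, ay = agent["pose"]
    tx, ty = task["target"]
    return 1.0 / (1.0 + hypot(ax - tx, ay - ty))   # assumed utility: inverse distance

def greedy_assignment(agents, tasks):
    assignment, queue = {}, []
    for task_id, task in enumerate(tasks):
        # in the distributed version each agent evaluates this locally and
        # broadcasts its claim; here we simulate the outcome centrally
        free = [a for a in agents if a["id"] not in assignment.values()]
        if not free:
            queue.append(task_id)                  # more tasks than agents: queue the rest
            continue
        best = max(free, key=lambda a: utility(a, task))
        assignment[task_id] = best["id"]
    return assignment, queue

agents = [{"id": "robot1", "pose": (0.0, 0.0)}, {"id": "robot2", "pose": (5.0, 2.0)}]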
8.6 Implementation and Results As mentioned in Section 8.2, two scenarios are considered for our system, and for each of them different vision algorithms were implemented. In the first scenario, a bag is left unattended and a robot goes to check the suspected area. In the second scenario, the video-surveillance system deals with the manipulation of unauthorized objects placed in specific positions (the laptop in the top-left corner of Figure 8.1). The implementation of the computer vision and robotic components that deal with these scenarios and the realization of a full demonstrator to validate the approaches are described in the following.
8.6.1 A Real-Time Multi-tracking Object System for a Stereo Camera – Scenario 1 In this scenario a multi-object tracking algorithm based on a ground-plane projection of real-time 3D data coming from stereo imagery is implemented, giving a distinct separation of occluded and closely-interacting objects. Our approach, based on the research activity completed in [25, 26, 33, 34], consists of tracking, using Kalman Filters, fixed templates that are created by combining the height and the statistical pixel occupancy of the objects in the scene. These objects are extracted from the background using a Gaussian Mixture Model [40, 5] with four channels: three
colour channels (YUV colour space) and a depth channel obtained from the stereo devices [25]. The mixture model is adapted over time and is used to create a background model that is also updated using an adaptive learning-rate parameter set according to the scene activity level on a per-pixel basis (the value is experimentally obtained). The use of depth information (3D data) helps to solve difficult challenges normally faced when detecting and tracking objects: it improves the foreground segmentation thanks to its relative robustness against lighting effects such as shadows, and the shape feature it provides gives information to discern between people and other foreground objects such as bags. Moreover, the use of a third dimension can help to resolve the uncertainty of predictions in the tracking process when an occlusion occurs between a foreground and a background object. The 3D foreground cloud data is then rendered as if it were viewed from an overhead, orthographic camera (see Figure 8.4); this reduces the amount of information and therefore increases the computational performance, since the tracking is done on plan-view projection data rather than on the 3D data directly. The projection of the 3D data onto a ground plane is chosen because people usually do not overlap in the direction normal to the ground plane. Therefore, this 3D projection allows us to separate and solve some occlusions that are more difficult to solve using the original camera view. The data association implemented in the tracking process is also based on the work presented in [26, 34]. The Gaussian and linear dynamic prediction filters used to track the occupancy and height statistics of the plan-view maps are the well-known Kalman Filters [9]. Figure 8.5 shows the tracking of different types of objects (robot, person and bag), including occlusion. Each plan-view map has been synchronized with its raw frame pair and back-projected to the real map of the scene.
Fig. 8.4 Process for creation of a plan-view
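The plan-view construction of Figure 8.4 can be sketched as follows: foreground 3-D points are binned onto a ground-plane grid to produce the occupancy and height maps that the Kalman trackers operate on. Cell size, area extent and the coordinate convention (z pointing up from the ground plane, metres) are assumed values.

# Sketch of the plan-view (occupancy and height) map construction.
import numpy as np

CELL = 0.05                     # assumed 5 cm plan-view cells
EXTENT = (8.0, 8.0)             # assumed monitored area in metres

def plan_view_maps(points):
    """points: (N, 3) array of foreground 3-D points [x, y, z] in metres."""
    nx, ny = int(EXTENT[0] / CELL), int(EXTENT[1] / CELL)
    occupancy = np.zeros((ny, nx))
    height = np.zeros((ny, nx))
    ix = np.clip((points[:, 0] / CELL).astype(int), 0, nx - 1)
    iy = np.clip((points[:, 1] / CELL).astype(int), 0, ny - 1)
    np.add.at(occupancy, (iy, ix), 1.0)              # statistical pixel-occupancy map
    np.maximum.at(height, (iy, ix), points[:, 2])    # tallest point per cell (height map)
    return occupancy, height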
8.6.2 Maximally Stable Segmentation and Tracking for Real-Time Automated Surveillance – Scenario 2 In this section we present a novel real-time, color-based MSER detection and tracking algorithm for detecting object manipulation events, based on the work carried out in [21]. Our algorithm synergistically combines MSER evolution with image segmentation to produce a maximally-stable segmentation. Our MSER algorithm clusters pixels into a hierarchy of detected regions using an efficient line-constrained evolution process. The resulting regions are used to seed a second clustering
Fig. 8.5 Seven frames of a sequence showing the tracking of different types of objects (robot, person and bag), including an occlusion. Each plan-view map has been synchronized with its raw frame pair and back-projected to the real plan of the scene (right side of each image).
process to achieve image segmentation. The resulting region-set maintains desirable properties from each process and offers several unique advantages, including fast operation, dense coverage, descriptive features, temporal stability, and low-level tracking. Regions that are not automatically tracked during segmentation can be tracked at a higher level using MSER and line features. We supplement low-level tracking with an algorithm that matches features using a multi-phased, kd-search algorithm. Regions are modeled and identified using transformation-invariant features that allow identification to be achieved using a constant-time hash-table. To demonstrate the capabilities of our algorithm, we apply it to a variety of real-world activity-recognition scenarios. The MSER algorithm is used to reduce unimportant data, following the conclusions of the comparison of the most promising feature-detection techniques by Mikolajczyk et al. [32]. The MSER algorithm was originally developed by Matas et al. [31] to identify stable areas of light-on-dark, or dark-on-light, in greyscale images. The algorithm is implemented by applying a series of binary thresholds to an image. As the threshold value iterates, areas of connected pixels grow and merge, until every pixel in the image has become a single region. During this process, the regions are monitored, and those that display a relatively stable size through a wide range of thresholds are recorded. This process produces a hierarchical tree of nested MSERs. The tree root contains the MSER node that comprises
every pixel in the image, with incrementally smaller nested sub-regions occurring at every tree branch. The leaves of the tree contain the first-formed and smallest groups of pixels. Unlike other detection algorithms, the MSER identifies comparatively few regions of interest. However, our algorithm returns either a nested set of regions (traditional MSER-hierarchy formation) or a non-nested, non-overlapping set of regions (typical of image segmentation). Using non-nested regions significantly improves tracking speed and accuracy. To increase the number of detections and improve coverage, Forssén [21] redesigned the algorithm to incorporate color information. Instead of grouping pixels based on a global threshold, Forssén incrementally clustered pixels using the local color gradient (i.e. for every pixel p in the image, the color gradient is measured against the adjacent pixels p[+] and p[-]). This process identifies regions of similarly-colored pixels that are surrounded by dissimilar pixels. In our approach we take advantage of the increased detection offered by Forssén's color-based approach, although we constrain the region growth using detected lines, improving segmentation results on objects with high-curvature gradients. To detect lines, the Canny filter is used rather than MSER, as it is more effective at identifying a continuous border between objects, since it considers a larger section of the gradient. Therefore, our system processes each frame with the Canny algorithm. Canny edges are converted to line segments, and the pixels corresponding to each line segment are used to constrain MSER growth. Simply speaking, MSER evolution operates as usual, but is not permitted to cross any Canny lines. An example of detected lines is shown in Figure 8.6 (right); detected lines are displayed in green.
Fig. 8.6 Left: An example of the feed-forward process. Dark-gray pixels are preserved; light-gray pixels are re-clustered. Center: MSERs are modeled and displayed using ellipses and average color values. Right: An example of MSER image segmentation. Regions are filled with their average color, detected lines are shown in green, and the path of the tracked hand is represented as a red line.
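For orientation, the sketch below runs OpenCV's stock greyscale MSER detector together with a Canny/Hough line stage on a single frame. It does not reproduce the colour-based, line-constrained evolution described above (which requires modifying the region-growth loop itself); it only shows the two kinds of input that this evolution combines, and the file name is hypothetical.

# Stock MSER regions plus Canny/Hough line segments on one frame (illustrative).
import math
import cv2

img = cv2.imread("frame.png")                      # hypothetical frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()                           # parameters left at defaults
regions, boxes = mser.detectRegions(gray)          # pixel lists of stable regions

edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, 1, math.pi / 180, threshold=50,
                        minLineLength=30, maxLineGap=5)
# In the chapter's algorithm, the pixels of each detected line segment would be
# marked as barriers that the MSER evolution is not allowed to cross.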
To improve performance when tracking large, textureless objects that are slow-moving or stationary, we apply a feed-forward algorithm, which is a relatively simple addition to our MSER algorithm. After every iteration of MSER generation, we identify pixels in the current frame that are nearly identical (RGB values within 1) to the pixel in the same location of the following frame. If the majority of pixels in any given MSER remain unchanged in the following video image, the matching pixels are pre-grouped into a region for the next iteration. This pixel cluster is then used to seed growth for the next iteration of MSER evolution. Using our feed-forward
approach, any region that cannot continually maintain its boundaries will be assimilated into similarly-colored adjacent regions. After several iterations of region competition, many unstable regions are eliminated automatically without any additional processing (see also Figure 8.6). Once the regions using MSER features and line-corner features are obtained, the tracking algorithm is applied to them. Our tracking algorithm applies four different phases, each handling a specific type of tracking problem: "Feed-Forward Tracking", "MSER-Tracking", "Line-Tracking" and "Secondary MSER-Tracking". If an object can be tracked in an early phase, later tracking phases are not applied to it. By executing the fastest trackers first, we can further reduce resource requirements. In the "Feed-Forward Tracking" phase, using our pixel feed-forward algorithm, tracking becomes a trivial matter of matching the pixel's donor region with the recipient region. In the "MSER-Tracking" phase, as mentioned before, by eliminating the problem of nesting through the reduction of the hierarchy of MSERs to a non-hierarchical image segmentation, the representation becomes a one-to-one correspondence, and matches are identified using a greedy approach. The purpose of this phase is to match only those regions that have maintained a consistent size and color between successive frames. Each image region is represented by the following features: centroid (x, y) image coordinates, height and width (second-order moments of pixel positions) and, finally, color values. Matching is only attempted on regions that remained un-matched after the "Feed-Forward Tracking" phase. Matches are only assigned when regions have similarity measures beyond a predefined similarity threshold. In the "Line-Tracking" phase, line-corners are matched based on their positions, the angles of the associated lines, and the colors of the associated regions. It should be mentioned that, even if a line separates (and is therefore associated with) two regions, that line will have different properties for each region. Specifically, the line angle will be rotated by 180 degrees from one region to the other, and the left and right endpoints will be reversed. Each line-end is represented by the following features: position (x, y) image coordinates, angle of the corresponding line, RGB color values of the corresponding region, and left/right handedness of the endpoint (from the perspective of looking out from the center of the region). Line-corner matching is only attempted on regions that remained un-matched after the "MSER-Tracking" phase. Finally, in the "Secondary MSER-Tracking" phase a greedy approach is used to match established regions (regions that were being tracked but were lost) to unassigned regions in more recent frames. Unlike the first three phases, which only consider matches between successive frames, the fourth phase matches regions within an n-frame window. Although there may be several ways to achieve foreground detection using our tracking algorithms, we feel it is appropriate to simply follow the traditional pipeline. To this effect, the first several frames in a video sequence are committed to building a region-based model of the background. Here, MSERs are identified and tracked until a reasonable estimation of robustness and motion can be obtained. Stable regions are stored in the background model using the same set of features listed in the tracking section.
Bear in mind that since background features are continually tracked, the system is equipped to identify unexpected changes to the background. The remainder of the video is considered the operation phase. Here, similarity measurements are made
between regions in the background model and regions found in the current video frame. Regions considered sufficiently dissimilar to the background are tracked as foreground regions; matching regions are tracked as background regions. Once the foreground is segmented from the background, a color- and shape-based model is generated from the set of foreground MSER features. Our technique uses many of the principles presented by Chum and Matas [13], but our feature vectors were selected to provide improved robustness in scenes where deformable or unreliable contours are an issue. Therefore, we propose an algorithm that represents objects using an array of features that can be classified into three types: MSER-pairs (a 4-dimensional feature vector), MSER-individuals (a 3-dimensional feature vector) and, finally, a size-position measure (a 2-dimensional feature vector), a feature set only used for computing the vote tally. The recognition of activities is done without sophisticated algorithms (a Hidden Markov Model was used to generalize object interactions), so for this surveillance system we use simple rule-sets based only on proximity and trajectories. Figure 8.7 shows an example of activity recognition using MSER. Each object is associated with a color bar at the right of the image. The apparent height of the bar corresponds to the computed probability that the person's hand is interacting with that object. In the scenario shown on the left, a person engaged in typical homework-type behaviors, including typing on a laptop, turning pages in a book, moving a mouse, and drinking from a bottle. In the scenario on the right, a person reached into a bag of chips multiple times, and extinguished a trash fire with a fire extinguisher.
Fig. 8.7 An example of activity recognition using MSER
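The kd-tree matching used in the later tracking phases can be sketched as follows: region descriptors from the previous frame are indexed in a kd-tree and each current-frame descriptor is matched to its nearest neighbour, greedily enforcing one-to-one assignments. The descriptor layout and the distance threshold are assumptions, not the exact multi-phased scheme described above.

# kd-tree based matching of region descriptors between successive frames (sketch).
import numpy as np
from scipy.spatial import cKDTree

def match_regions(prev_feats, curr_feats, max_dist=25.0):
    """prev_feats, curr_feats: (N, d) arrays of region feature vectors
    (e.g. centroid, second-order moments, mean colour)."""
    if len(prev_feats) == 0 or len(curr_feats) == 0:
        return []
    tree = cKDTree(prev_feats)
    dists, idx = tree.query(curr_feats, k=1)
    matches, used = [], set()
    for c, (d, p) in enumerate(zip(dists, idx)):   # greedy one-to-one assignment
        if d <= max_dist and p not in used:
            matches.append((p, c))
            used.add(p)
    return matches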
8.6.3 Multi-robot Environmental Monitoring The implementation of the Multi-Robot Environmental Monitoring described in Section 8.5 has been implemented on a robotic framework and tested both on 2 Erratic robots2 and on many simulated robots in the Player/stage environment3. 2 3
www.videre.com playerstage.sourceforge.net
Fig. 8.8 Block diagram of proposed architecture.
Figure 8.8 shows the block diagram of the overall system and the interactions among the developed modules. In particular, the team of robots monitors the environment while waiting to receive event messages from the vision sub-system. As previously described, we use a Bayesian filtering method to achieve the sensor data fusion. In particular, we use a Particle Filter for the sensor filters and the event detection layer. In this way, the pdfs describing the belief of the system about the events to be detected are represented as sets of samples, providing a good compromise between flexibility in the representation and computational effort. The implementation of the basic robotic functionalities and of the services needed for multi-robot coordination is realized using the OpenRDK toolkit (openrdk.sf.net) [12]. The mobile robots used in the demonstrator have the following features:
• Navigation and Motion Control based on a two-level approach: a motion planner using a fine representation of the environment and a topological path-planner that operates on a less detailed map and reduces the search space; probabilistic roadmaps and rapidly-exploring random trees are used to implement these two levels [13].
• Localization and Mapping based on a standard particle filter localization method and the well-known GMapping implementation (openslam.org/gmapping.html), which has been successfully used on our robots in other applications as well [30].
• Task Assignment based on a distributed coordination paradigm using utility functions [27], already developed and successfully used in other projects.
Moreover, to test the validity of the approach, we replicated the scenarios in the Player/Stage simulator, defining a map of the real environment used for the experiments and several software agents with the same characteristics as the real robots. The combination of OpenRDK and Player/Stage is very well suited to developing and
experimenting with multi-robot applications, since they provide a powerful yet flexible and easy-to-use robot programming environment.
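The sample-based belief representation mentioned above can be sketched as a small particle set with weights and resampling; the particle count, the uniform initialisation and the interface are assumptions rather than the OpenRDK implementation.

# Sketch of a sample-based (particle) belief over ground-plane positions.
import numpy as np

class ParticleBelief:
    def __init__(self, extent, n=500, rng=None):
        self.rng = rng or np.random.default_rng()
        self.samples = self.rng.uniform([0.0, 0.0], extent, size=(n, 2))
        self.weights = np.full(n, 1.0 / n)

    def reweight(self, likelihood):
        """likelihood: callable mapping an (n, 2) array of samples to per-sample p(z | x)."""
        self.weights *= likelihood(self.samples)
        self.weights /= self.weights.sum()

    def resample(self):
        idx = self.rng.choice(len(self.samples), size=len(self.samples), p=self.weights)
        self.samples = self.samples[idx]
        self.weights[:] = 1.0 / len(self.samples)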
8.6.4 System Execution In this section we show the behavior of the overall surveillance system developed in this project. As stated, the recognition of the predefined activities for the scenarios illustrated in this chapter is done without sophisticated algorithms: simple rule-sets based only on proximity and trajectories are applied. The communication between the video-surveillance system and the multi-robot system is implemented through a TCP client-server interface. Each static stereo camera is attached to a PC, and the PCs communicate among themselves and with the robots via a private wireless network. Each static camera and its PC act as a client, and one of these PCs also acts as a server. The PC video-server is the only PC that communicates directly with the robots; however, when it communicates with the multi-robot system, the PC video-server becomes a client and the robots act as servers.
Fig. 8.9 This figure illustrates a sequence of what may happen in Scenario 1. Person A walks through the corridor with the bag and leaves it in the middle of the corridor. Person B approaches the bag and takes it, raising an alarm in the system and causing the patrolling robot to go and inspect the area.
Fig. 8.10 This figure illustrates a sequence of what may happen in Scenario 2. Person B places a book (black) and a bottle (green) on the table and manipulates them under the surveillance of the system, until Person B decides to touch an unauthorized object (i.e. the laptop, grey), raising an alarm in the system and causing the patrolling robot to go and inspect the area.
Once the video-surveillance system has recognized an event, e.g. in Scenario 1 the person associated with the bag abandons the object (see Figure 8.9), the PC client camera sends the event name and the 3D coordinates to the PC video-server. The video-server then constructs a string with this information (first transforming the 3D coordinates into a common coordinate system) and sends the message via the wireless network to the robots. One of the robots is then assigned to go and patrol the area and take a high-resolution picture if the detected event is "bag taken". Figures 8.9 and 8.10 show the results in Scenarios 1 and 2, respectively. Figure 8.9 illustrates a sequence of what may happen in Scenario 1. In the top-left image of the figure a person with an object (a bag) is walking through the corridor. In the top-right image, the video system detects that the person has left the bag, and a "left bag" message is sent. In the bottom-left image another person walks very close to the bag. In the bottom-right image the visual surveillance system detects that the person is taking the bag; a "bag taken" message is sent to the robots and, as can be seen, one of the robots is sent to inspect the raised event. Figure 8.10 illustrates a sequence of what may happen in Scenario 2. In the top-left image, a laptop is placed
on the table and one of the robots can be seen patrolling. In the top-right and bottom-left images of the figure a person is allowed to manipulate different objects. In the bottom-right image the person touches the only object which is not allowed, and therefore an "alert" alarm is raised.
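For illustration, the event message exchanged between the PC video-server and a robot could be sent with a few lines of socket code, as sketched below. The host, port and the semicolon-separated message layout are assumptions, not the protocol actually deployed in the demonstrator.

# Sending an event report (name plus transformed 3D coordinates) over TCP (sketch).
import socket

def send_event(event_name, x, y, z, host="192.168.0.10", port=9000):
    # assumed plain-text message layout: "<event>;<x>;<y>;<z>\n"
    msg = f"{event_name};{x:.2f};{y:.2f};{z:.2f}\n".encode("ascii")
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(msg)

# e.g. send_event("bag taken", 3.10, 1.45, 0.0)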
8.7 Conclusion In recent years, there has been an increased interest in using robotic technologies for security and defence applications, in order to increase their performance and reduce the danger for the people involved. The research proposed in this chapter aims to provide a distributed, multi-robot solution to the problem of environment monitoring, in order to detect or prevent undesired events, such as intrusions, unattended baggage events or, in future applications, fire detection. The problem of detecting and responding to threats through surveillance techniques is particularly well suited to a robotic solution comprising a team of multiple robots. For large environments, the distributed nature of the multi-robot team provides robustness and increases the performance of the surveillance system. In the future, extending the system with a group of mobile robots equipped with on-board processing cameras may have several significant advantages over a fixed surveillance camera system. First, the solution could be used in environments that have not previously been engineered with a camera-based monitoring system: the robot team could be deployed quickly to obtain information about a previously unknown environment. Second, the cameras attached to the robots could move through the environment in order to best acquire the necessary information, in contrast with a static camera, which can only perform observations from a fixed view point. Third, the robots in the team have the power to collaborate on the monitoring task and are also able to take actions that could pre-empt a potential threat. Fourth, the robots could be equipped with additional specialized sensors, which could be delivered to the appropriate place in the environment to detect the presence of chemical or biological agents. Last, the robot team could communicate with a human operator and receive commands about the goals and potential changes in the mission, allowing for a dynamic, adaptive solution.
Acknowledgements This publication was developed under Department of Homeland Security (DHS) Science and Technology Assistance Agreement No. 2009-ST-108-000012 awarded by the U.S. Department of Homeland Security. It has not been formally reviewed by DHS. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication.
References [1] Agmon, N.: Multi-robot patrolling and other multi-robot cooperative tasks: An algorithmic approach. Ph.D. thesis, Bar-Ilan University (2009) [2] Agmon, N., Kraus, S., Kaminka, G.: Multi-robot perimeter patrol in adversarial settings. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 2339–2345 (2008) [3] Alami, R., Fleury, S., Herrb, M., Ingrand, F., Robert, F.: Multi robot cooperation in the MARTHA project. IEEE Robotics and Automation Magazine 5(1), 36–47 (1998) [4] Almeida, A., Ramalho, G., Santana, H., Tedesco, P.A., Menezes, T., Corruble, V., Chevaleyre, Y.: Recent advances on multi-agent patrolling. In: Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 474–483. Springer, Heidelberg (2004) [5] Bahadori, S., Iocchi, L., Leone, G.R., Nardi, D., Scozzafava, L.: Real-time people localization and tracking through fixed stereo vision. Applied Intelligence 26, 83–97 (2007) [6] Basilico, N., Gatti, N., Rossi, T., Ceppi, S., Amigoni, F.: Extending algorithms for mobile robot patrolling in the presence of adversaries to more realistic settings. In: WI-IAT 2009: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pp. 557–564. IEEE Computer Society Press, Washington (2009) [7] Baumberg, A., Hogg, D.C.: Learning deformable models for tracking the human body. In: Shah, M., Jain, R. (eds.) Motion-Based Recognition, pp. 39–60 (1996) [8] Brand, M., Oliver, N., Pentland, A.: Coupled hidden Markov models for complex action recognition. In: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR 1997), p. 994. IEEE Computer Society Press, Washington (1997) [9] Brown, R., Hwang, P.: Introduction to Random Signals and Applied Kalman Filtering. John Wiley & Sons, Chichester (1997) [10] Buxton, H.: Generative models for learning and understanding dynamic scene activity. In: ECCV Workshop on Generative Model Based Vision, pp. 71–81 (2002) [11] Buxton, H., Gong, S.: Advanced visual surveillance using Bayesian networks. In: International Conference on Computer Vision, pp. 111–123 (1995) [12] Calisi, D., Censi, A., Iocchi, L., Nardi, D.: OpenRDK: a modular framework for robotic software development. In: Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pp. 1872–1877 (2008) [13] Calisi, D., Farinelli, A., Iocchi, L., Nardi, D.: Autonomous navigation and exploration in a rescue environment. In: Proceedings of the 2nd European Conference on Mobile Robotics (ECMR), pp. 110–115 (2005) [14] Chevaleyre, Y.: Theoretical analysis of the multi-agent patrolling problem. In: IAT 2004: Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pp. 302–308. IEEE Computer Society, Washington (2004) [15] Choset, H.: Coverage for robotics – a survey of recent results. Ann. Math. Artif. Intell. 31(1-4), 113–126 (2001) [16] Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–575 (2003) [17] Correll, N., Martinoli, A.: Robust distributed coverage using a swarm of miniature robots. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 379–384 (2007)
[18] Elmaliach, Y., Agmon, N., Kaminka, G.A.: Multi-robot area patrol under frequency constraints. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 385–390 (2007) [19] Farinelli, A., Iocchi, L., Nardi, D.: Multi robot systems: A classification focused on coordination. IEEE Transactions on Systems, Man and Cybernetics, part B 34(5), 2015–2028 (2004) [20] Farinelli, A., Iocchi, L., Nardi, D., Ziparo, V.A.: Assignment of dynamically perceived tasks by token passing in multi-robot systems. Proceedings of the IEEE 94(7), 1271–1288 (2006) [21] Forssén, P.-E.: Maximally stable colour regions for recognition and matching. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Minneapolis, USA (2007) [22] Gage, A., Murphy, R.R.: Affective recruitment of distributed heterogeneous agents. In: Proc. of Nineteenth National Conference on Artificial Intelligence, pp. 14–19 (2004) [23] Gerkey, B., Mataric, M.J.: Principled communication for dynamic multi-robot task allocation. In: Proceedings of the Int. Symposium on Experimental Robotics, pp. 353–362 (2000) [24] Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O'Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, 2nd edn., ch. 39. CRC Press (2004) [25] Harville, M., Gordon, G., Woodfill, J.: Foreground segmentation using adaptive mixture models in color and depth. In: IEEE Workshop on Detection and Recognition of Events in Video, vol. 0, p. 3 (2001) [26] Harville, M., Li, D.: Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 398–405 (2004) [27] Iocchi, L., Nardi, D., Piaggio, M., Sgorbissa, A.: Distributed coordination in heterogeneous multi-robot systems. Autonomous Robots 15(2), 155–168 (2003) [28] Jung, D., Zelinsky, A.: An architecture for distributed cooperative planning in a behaviour-based multi-robot system. Journal of Robotics and Autonomous Systems 26, 149–174 (1999) [29] Low, K.H., Dolan, J., Khosla, P.: Adaptive multi-robot wide-area exploration and mapping. In: Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), pp. 23–30 (2008) [30] Marchetti, L., Grisetti, G., Iocchi, L.: A comparative analysis of particle filter based localization methods. In: Lakemeyer, G., Sklar, E., Sorrenti, D.G., Takahashi, T. (eds.) RoboCup 2006: Robot Soccer World Cup X. LNCS (LNAI), vol. 4434, pp. 442–449. Springer, Heidelberg (2007) [31] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. of British Machine Vision Conference, vol. 1, pp. 384–393 (2002) [32] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65(1-2), 43–72 (2005) [33] Muñoz-Salinas, R., Aguirre, E., García-Silvente, M.: People detection and tracking using stereo vision and color. Image Vision Comput. 25(6), 995–1007 (2007) [34] Muñoz-Salinas, R., Medina-Carnicer, R., Madrid-Cuevas, F.J., Carmona-Poyato, A.: People detection and tracking with multiple stereo cameras using particle filters. J. Vis. Comun. Image Represent. 20(5), 339–350 (2009)
[35] Ning, H.Z., Wang, L., Hu, W.M., Tan, T.N.: Articulated model based people tracking using motion models. In: Proc. Int. Conf. Multimodal Interfaces, pp. 115–120 (2002) [36] Parker, L.E.: ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation 14(2), 220–240 (1998) [37] Portugal, D., Rocha, R.: MSP algorithm: multi-robot patrolling based on territory allocation using balanced graph partitioning. In: SAC 2010: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1271–1276. ACM, New York (2010) [38] Guo, Y., Qu, Z.: Coverage control for a mobile robot patrolling a dynamic and uncertain environment. In: WCICA 2004: Proceedings of the 5th World Congress on Intelligent Control and Automation, pp. 4899–4903 (2004) [39] Sak, T., Wainer, J., Goldenstein, S.K.: Probabilistic multiagent patrolling. In: Zaverucha, G., da Costa, A.L. (eds.) SBIA 2008. LNCS (LNAI), vol. 5249, pp. 124–133. Springer, Heidelberg (2008) [40] Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999) [41] Tan, T.N., Sullivan, G.D., Baker, K.D.: Model-based localization and recognition of road vehicles. International Journal of Computer Vision 29(1), 22–25 (1998) [42] Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. The MIT Press, Cambridge (2005) [43] Tian, T., Tomasi, C.: Comparison of approaches to egomotion computation. In: Computer Vision and Pattern Recognition, pp. 315–320 (1996) [44] Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: IEEE International Conference on Computer Vision, vol. 1, p. 255 (1999) [45] Valera, M., Velastin, S.A.: A review of the state-of-the-art in distributed surveillance systems. In: Velastin, S.A., Remagnino, P. (eds.) Intelligent Distributed Video Surveillance Systems. IEE Professional Applications of Computing Series, vol. 5, ch. 1, pp. 1–25 (2006) [46] Werger, B.B., Mataric, M.J.: Broadcast of local eligibility for multi-target observation. In: DARS 2000, pp. 347–356 (2000) [47] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38, 13 (2006) [48] Ziparo, V., Kleiner, A., Nebel, B., Nardi, D.: RFID-based exploration for large robot teams. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Rome, Italy (2007) [49] Zlot, R., Stentz, A., Dias, M.B., Thayer, S.: Multi robot exploration controlled by a market economy. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 3016–3023 (2002)